Automated data collection tool for real-world cohort studies of chronic hepatitis B: Leveraging OCR and NLP technologies for improved efficiency

New Microbes New Infect. 2024 Aug 28:62:101469. doi: 10.1016/j.nmni.2024.101469. eCollection 2024 Dec.

Abstract

Background: Collecting and standardizing clinical research data is tedious and time-consuming. This study aimed to develop CHB-EDC, an intelligent data collection tool for real-world cohort studies of chronic hepatitis B (CHB) that supports standardized and efficient data collection.

Methods: CHB-EDC automatically processes data in various formats, including raw data in image format, using internationally recognized data standards, optical character recognition (OCR), and natural language processing (NLP) models. It populates the extracted data into electronic case report forms (eCRFs) designed in the REDCap system and supports the integration of patient data from electronic medical record systems through commonly used web application interfaces. The tool enables intelligent extraction and aggregation of data, as well as secure and anonymous data sharing.
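
As a rough illustration of the flow described in the Methods (text recognized from an image, structured fields extracted from it, fields populated into a REDCap eCRF), the sketch below may help. It is not the CHB-EDC implementation: simple regular expressions stand in for the NLP models, the OCR step is assumed to have already produced `sample_text`, and the field names (`alt_u_l`, `hbv_dna_iu_ml`), the record ID, the token, and the sample values are all hypothetical. The payload keys follow the general shape of a REDCap "Import Records" API request; the code only builds the payload and does not send it.

```python
import json
import re

# Hypothetical mapping from eCRF field names to patterns that locate the
# corresponding lab value in OCR output. The real tool uses NLP models;
# regular expressions are a simplified stand-in for illustration only.
FIELD_PATTERNS = {
    "alt_u_l": re.compile(r"\bALT\b\s*[:=]?\s*(\d+(?:\.\d+)?)"),
    "hbv_dna_iu_ml": re.compile(
        r"\bHBV[ -]?DNA\b\s*[:=]?\s*(\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)"
    ),
}


def extract_fields(ocr_text):
    """Return a dict of field name -> numeric value found in the OCR text."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(ocr_text)
        if match:
            fields[name] = float(match.group(1))
    return fields


def build_redcap_payload(record_id, fields, token):
    """Build (but do not send) form data for a REDCap record import."""
    record = {"record_id": record_id}
    record.update({name: str(value) for name, value in fields.items()})
    return {
        "token": token,          # project API token (placeholder here)
        "content": "record",
        "format": "json",
        "type": "flat",
        "data": json.dumps([record]),
    }


# Made-up OCR output for a lab report; not data from the study.
sample_text = "ALT: 45.0 U/L\nHBV DNA: 2.3E+05 IU/mL"
fields = extract_fields(sample_text)
payload = build_redcap_payload("P001", fields, token="PLACEHOLDER_TOKEN")
```

In a real deployment, the payload would be posted to the project's REDCap API endpoint over HTTPS, and any extracted value that fails validation would be routed to manual review rather than imported directly.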

Results: For non-electronic data collection, manual collection achieved an average accuracy of 98.65% and took an average of 63.64 min per patient. CHB-EDC achieved an average accuracy of 98.66% and took an average of 3.57 min per patient. On the same data collection task, the accuracy of CHB-EDC was therefore comparable to that of manual collection, while its collection time was significantly shorter (p < 0.05). The tool substantially reduced the time required and lowered the cost of data collection while maintaining accuracy.
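
To put the reported averages in relative terms, the short calculation below derives the speed-up factor and the percentage reduction in per-patient collection time. It uses only the figures stated in the Results; the variable names are illustrative, and the derived ratios are simple arithmetic on the means, not additional findings of the study.

```python
# Average per-patient figures as reported in the Results.
manual_minutes = 63.64
tool_minutes = 3.57
manual_accuracy_pct = 98.65
tool_accuracy_pct = 98.66

# How many times faster the tool is, on average.
speedup = manual_minutes / tool_minutes

# Share of the manual collection time that is saved, as a percentage.
reduction_pct = (1 - tool_minutes / manual_minutes) * 100

# Accuracy difference in percentage points (tool minus manual).
accuracy_gap_points = tool_accuracy_pct - manual_accuracy_pct

print(f"Speed-up: {speedup:.1f}x")
print(f"Time reduction: {reduction_pct:.1f}%")
print(f"Accuracy difference: {accuracy_gap_points:.2f} percentage points")
```

On these averages, the tool is roughly 17.8 times faster, a reduction of about 94.4% in collection time, with an accuracy difference of 0.01 percentage points.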

Conclusion: CHB-EDC significantly improved the efficiency of data collection while maintaining accuracy, enabling standardized collection of real-world data.

Keywords: Data collection; Hepatitis B; Natural language processing; Optical character recognition.