Semi-Automated Data Curation from Biomedical Literature

Protiva Rahman; Daniel Fabbri

AMIA Annu Symp Proc. 2023 Apr 29:2022:884-891. eCollection 2022.

Authors

Protiva Rahman¹, Daniel Fabbri¹

Affiliation

¹ Vanderbilt University Medical Center, Nashville, TN.

PMID: 37128469
PMCID: PMC10148326

Abstract

Data curation is a bottleneck for many informatics pipelines. One example is aggregating data from preclinical studies to identify novel genetic pathways for atherosclerosis in humans. This requires extracting data from published mouse studies, such as the perturbed gene and its impact on lesion sizes and plaque inflammation, which is non-trivial. Curation efforts are resource-heavy, with curators manually extracting data from hundreds of publications. In this work, we describe the development of a semi-automated curation tool to accelerate data extraction. We use natural language processing (NLP) methods to auto-populate a web-based form, which is then reviewed by a curator. We conducted a controlled user study to evaluate the curation tool. Our NLP model achieves 70% accuracy on categorical fields, and our curation tool accelerates task completion by 49% compared to manual curation.

MeSH terms

Animals
Data Curation* / methods
Humans
Mice
Natural Language Processing*
Publications