Development and validation of a rheumatoid arthritis case definition: a machine learning approach using data from primary care electronic medical records

BMC Med Inform Decis Mak. 2024 Nov 27;24(1):360. doi: 10.1186/s12911-024-02776-w.

Abstract

Background: Rheumatoid arthritis (RA) is a chronic inflammatory disease that is primarily diagnosed and managed by rheumatologists; however, primary care providers are often the first to encounter RA-related symptoms. This study developed and validated a case definition for RA using national surveillance data from primary care settings.

Methods: This cross-sectional validation study used structured electronic medical record (EMR) data from the Canadian Primary Care Sentinel Surveillance Network (CPCSSN). Based on a reference set generated from EMR reviews by five experts, three machine learning steps were applied: a 'bag-of-words' approach to feature generation; feature reduction using a feature importance measure coupled with recursive feature elimination and clustering; and classification using tree-based methods (Decision Tree, Random Forest, and Extreme Gradient Boosting [XGBoost]). The three tree-based algorithms were compared to identify the procedure that generated the optimal evaluation metrics. Nested cross-validation was used so that models could be tuned, evaluated, and compared simultaneously.
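As an illustration of the 'bag-of-words' feature generation step only, the following is a minimal sketch in plain Python. The function names, the tokenization rule (lowercase runs of letters), and the example chart text are assumptions made for illustration; they are not taken from the study and do not represent CPCSSN data or the authors' pipeline.

```python
import re
from collections import Counter


def bag_of_words(chart_entries):
    """Count lowercase word tokens across all text entries of one chart."""
    counts = Counter()
    for entry in chart_entries:
        # Assumed tokenization: runs of letters after lowercasing.
        counts.update(re.findall(r"[a-z]+", entry.lower()))
    return counts


def to_feature_matrix(charts):
    """Turn a list of charts into (vocabulary, rows) of token counts."""
    bags = [bag_of_words(chart) for chart in charts]
    vocabulary = sorted(set().union(*bags))
    # One row per chart, one column per vocabulary token.
    rows = [[bag[token] for token in vocabulary] for bag in bags]
    return vocabulary, rows


# Hypothetical example charts (invented text, not real patient data).
charts = [
    ["Rheumatoid arthritis", "rheumatoid arthritis follow-up"],
    ["Hypertension", "osteoarthritis of knee"],
]
vocabulary, rows = to_feature_matrix(charts)
```

Each row of the resulting matrix could then serve as input to feature reduction and a tree-based classifier.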

Results: Of 1.3 million patients from seven Canadian provinces, 5,600 people aged 19+ were randomly selected. The XGBoost classification method produced the optimal algorithm for identifying RA cases. Based on the feature importance scores in the XGBoost output, a human-readable case definition was created: a patient is identified as an RA case when the text "rheumatoid" occurs at least 2 times in any billing, encounter diagnosis, or health condition table of the patient chart. The final case definition had a sensitivity of 81.6% (95% CI, 75.6-86.4), specificity of 98.0% (95% CI, 97.4-98.5), positive predictive value of 76.3% (95% CI, 70.1-81.5), and negative predictive value of 98.6% (95% CI, 98.0-98.6).
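The case definition and the four validity metrics can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' code: the chart layout (a dictionary of text lists keyed by hypothetical table names), the case-insensitive matching, and the choice to total occurrences across the three tables are all assumptions, and the confusion-matrix counts in the example are invented rather than study results.

```python
# Hypothetical table names; the real CPCSSN schema may differ.
TABLES = ("billing", "encounter_diagnosis", "health_condition")


def count_rheumatoid(chart):
    """Total case-insensitive occurrences of 'rheumatoid' in the three tables."""
    return sum(
        text.lower().count("rheumatoid")
        for table in TABLES
        for text in chart.get(table, [])
    )


def is_ra_case(chart, min_occurrences=2):
    """Apply the rule: at least `min_occurrences` mentions of 'rheumatoid'."""
    return count_rheumatoid(chart) >= min_occurrences


def validity_metrics(tp, fp, fn, tn):
    """Compute sensitivity, specificity, PPV, and NPV from a 2x2 table."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Invented example chart and counts, for illustration only.
chart = {
    "billing": ["Rheumatoid arthritis"],
    "health_condition": ["seropositive rheumatoid arthritis"],
    "medication": ["rheumatoid"],  # not one of the three tables, so ignored
}
flagged = is_ra_case(chart)
metrics = validity_metrics(tp=8, fp=2, fn=2, tn=88)
```

In a validation setting, the confusion-matrix counts would come from comparing the rule's output against the expert-reviewed reference set.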

Conclusion: A case definition for RA using primary care EMR data was developed based on the XGBoost algorithm. Given its high validity metrics, this case definition is expected to be a reliable tool for future epidemiological research and surveillance investigating the management of RA in the CPCSSN dataset.

Keywords: Case definition; EMR phenotyping; Electronic medical records; Machine learning; Rheumatoid arthritis.

Publication types

  • Validation Study

MeSH terms

  • Adult
  • Aged
  • Arthritis, Rheumatoid* / diagnosis
  • Canada
  • Cross-Sectional Studies
  • Electronic Health Records*
  • Female
  • Humans
  • Machine Learning*
  • Male
  • Middle Aged
  • Primary Health Care*