Development of a novel machine learning model to predict presence of nonalcoholic steatohepatitis

Matt Docherty; Stephane A Regnier; Gorana Capkun; Maria-Magdalena Balp; Qin Ye; Nico Janssens; Andreas Tietz; Jürgen Löffler; Jennifer Cai; Marcos C Pedrosa; Jörn M Schattenberg

doi:10.1093/jamia/ocab003

J Am Med Inform Assoc. 2021 Jun 12;28(6):1235-1241. doi: 10.1093/jamia/ocab003.

Affiliations

¹ ZS, Princeton, New Jersey, USA.
² Novartis Pharma AG, Basel, Switzerland.
³ Novartis Pharmaceuticals Inc, East Hanover, USA.
⁴ Metabolic Liver Research Program, I. Department of Medicine, University Medical Center, Mainz, Germany.

Abstract

Objective: To develop a machine learning (ML) model that predicts which patients have nonalcoholic steatohepatitis (NASH).

Materials and methods: This retrospective study used two databases: (a) the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) nonalcoholic fatty liver disease (NAFLD) adult database (2004-2009), and (b) the Optum® de-identified Electronic Health Record dataset (2007-2018), a real-world dataset representative of common electronic health records in the United States. We developed an ML model to predict NASH and trained it on patients in the NIDDK dataset whose NASH or non-NASH status was confirmed by liver histology.

Results: Models were trained and tested on NIDDK NAFLD data (704 patients), and the best-performing models were evaluated on Optum data (~3,000,000 patients). An eXtreme Gradient Boosting (XGBoost) model using 14 features performed well in predicting NASH, with an area under the curve (AUC) of 0.82, sensitivity of 81%, and precision of 81%. Performance was slightly lower with an abbreviated set of 5 features (AUC 0.79, sensitivity 80%, precision 80%). The full model also performed well in predicting NASH in Optum data (AUC 0.76).
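
For readers unfamiliar with the three metrics reported above, the following is a minimal Python sketch of how area under the curve, sensitivity, and precision can be computed for a binary NASH/non-NASH classifier. It is an illustration only and not the authors' code: the function names, the 0.5 decision threshold, and the six toy patients are assumptions made for this example, and none of the values come from the study.

```python
from typing import Sequence, Tuple


def sensitivity_precision(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (sensitivity, precision) for binary labels where 1 = NASH."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # Sensitivity: share of true NASH cases that the model flags.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Precision: share of flagged cases that are truly NASH.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, precision


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random NASH case outscores a random
    non-NASH case (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): biopsy-confirmed labels and model scores.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
threshold = 0.5  # assumed cut-off, not taken from the study
y_pred = [1 if s >= threshold else 0 for s in scores]

sens, prec = sensitivity_precision(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(f"AUC={auc:.2f} sensitivity={sens:.2f} precision={prec:.2f}")
```

Note that AUC is computed from the continuous scores and does not depend on the threshold, whereas sensitivity and precision change with the chosen cut-off.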

Discussion: The proposed model, named NASHmap, is the first ML model developed with confirmed NASH and non-NASH cases as determined through liver biopsy and validated on a large, real-world patient dataset. Both the 14- and 5-feature versions exhibit high performance.

Conclusion: The NASHmap model is a convenient and high-performing tool that could be used to identify patients likely to have NASH in clinical settings, allowing better patient management and optimal allocation of clinical resources.

Keywords: NAFLD; NASH; artificial intelligence; machine learning; non-alcoholic fatty liver disease.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Adult
Biopsy
Humans
Machine Learning
Non-alcoholic Fatty Liver Disease* / complications
Non-alcoholic Fatty Liver Disease* / diagnosis
Non-alcoholic Fatty Liver Disease* / epidemiology
Retrospective Studies
United States / epidemiology