Using machine learning to identify risk factors for pancreatic cancer: a retrospective cohort study of real-world data

Front Pharmacol. 2024 Nov 21:15:1510220. doi: 10.3389/fphar.2024.1510220. eCollection 2024.

Abstract

Objectives: This study aimed to identify the risk factors for pancreatic cancer through machine learning.

Methods: We investigated the relationships between different risk factors and pancreatic cancer in a real-world retrospective cohort study conducted at West China Hospital of Sichuan University. Multivariable logistic regression, with pancreatic cancer as the outcome, was used to identify covariates associated with pancreatic cancer. Extreme gradient boosting (XGBoost) was adopted as the final machine learning model because of its high performance. SHapley Additive exPlanations (SHAP) were used to visualize the relationships between these potential risk factors and pancreatic cancer.
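To illustrate the idea behind Shapley-based explanations, the following is a minimal sketch that computes exact Shapley values for a toy risk score. It is not the study's pipeline: the feature names, the weights, and the interaction term are all hypothetical, and the exhaustive enumeration shown here is a simple stand-in for the tree-based approximation that SHAP tooling typically applies to an XGBoost model.

```python
from itertools import combinations
from math import factorial

# Hypothetical binary features; names and weights are illustrative only
# and are NOT estimates from the study.
FEATURES = ["kras_mutation", "pancreatitis", "hyperlipidaemia"]


def toy_risk_score(x):
    """Toy additive score with one made-up interaction term."""
    score = (2.0 * x["kras_mutation"]
             + 1.0 * x["pancreatitis"]
             + 0.5 * x["hyperlipidaemia"])
    # Assumed interaction: mutation and pancreatitis present together.
    score += 1.0 * x["kras_mutation"] * x["pancreatitis"]
    return score


def shapley_values(model, instance, baseline):
    """Exact Shapley values by enumerating every feature coalition.

    Features outside a coalition are set to their baseline value.
    """
    n = len(FEATURES)

    def value(coalition):
        mixed = {f: (instance[f] if f in coalition else baseline[f])
                 for f in FEATURES}
        return model(mixed)

    phi = {}
    for f in FEATURES:
        others = [g for g in FEATURES if g != f]
        total = 0.0
        for k in range(len(others) + 1):
            # Shapley weight for coalitions of size k.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


instance = {"kras_mutation": 1, "pancreatitis": 1, "hyperlipidaemia": 1}
baseline = {"kras_mutation": 0, "pancreatitis": 0, "hyperlipidaemia": 0}
phi = shapley_values(toy_risk_score, instance, baseline)
```

In this toy setting the contributions sum to the difference between the score for the instance and the score for the baseline, and the interaction term is split equally between the two features that take part in it.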

Results: The cohort included 1,982 patients. The median ages of the pancreatic cancer and non-pancreatic cancer groups were 58.1 years (IQR: 51.3-64.4) and 57.5 years (IQR: 49.5-64.9), respectively. Multivariable logistic regression indicated that Kirsten rat sarcoma viral oncogene homolog (KRAS) gene mutation, hyperlipidaemia, pancreatitis, and pancreatic cysts were significantly associated with an increased risk of pancreatic cancer. The five most highly ranked features in the XGBoost model were KRAS gene mutation status, age, alcohol consumption status, pancreatitis status, and hyperlipidaemia status.

Conclusion: Machine learning algorithms confirmed that KRAS gene mutation, hyperlipidaemia, and pancreatitis are potential risk factors for pancreatic cancer. Additionally, the coexistence of KRAS gene mutation with pancreatitis, and of KRAS gene mutation with pancreatic cysts, is associated with an increased risk of pancreatic cancer. Our findings offer valuable implications for public health strategies targeting the prevention and early detection of pancreatic cancer.

Keywords: KRAS gene mutation; machine learning; multivariable logistic regression; pancreatic cancer; risk factors.

Grants and funding

The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. NS was supported by grants from the Sichuan Province Science and Technology Support Program (grant number 2023JDR0243) and the Health Commission Program (grant number 2020-111). This research was supported by the National Key Clinical Specialties Construction Program.