Larger sample sizes are needed when developing a clinical prediction model using machine learning in oncology: methodological systematic review

J Clin Epidemiol. 2025 Jan 13:111675. doi: 10.1016/j.jclinepi.2025.111675. Online ahead of print.

Abstract

Background: A sufficient sample size is crucial when developing a clinical prediction model. We reviewed the details of sample size in studies developing prediction models for binary outcomes using machine learning (ML) methods within oncology, and compared the sample size used to develop the models with the minimum sample size required when developing a regression-based model (Nmin).

Methods: We searched the Medline (via OVID) database for studies developing a prediction model using ML methods published in December 2022. We reviewed how sample size was justified. We calculated Nmin, the minimum sample size required when developing a regression-based model, and compared it with the sample size used to develop the models.

Results: Only one of the 36 included studies justified its sample size. We were able to calculate Nmin for 17 (47%) studies. Five of these 17 studies met Nmin, allowing the overall risk to be estimated precisely and overfitting to be minimised. There was a median deficit of 302 participants with the event (n = 17; range: -21331 to 2298) when developing the ML models. A further three of the 17 studies met the sample size required to precisely estimate the overall risk only.
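As an illustration of the two criteria named above (precise estimation of the overall risk and minimising overfitting), the following is a minimal sketch of how a regression-based minimum sample size might be computed. It assumes the commonly used closed-form expressions for binary outcomes: a confidence-interval margin around the outcome proportion, and a target global shrinkage factor driven by the anticipated Cox-Snell R-squared. The abstract does not state the exact formulas or inputs the authors used, so the function names, the defaults (margin 0.05, shrinkage 0.9) and the example values are all assumptions.

```python
import math


def n_for_overall_risk(prevalence, margin=0.05, z=1.96):
    """Sample size to estimate the overall outcome risk within +/- margin.

    Assumed formula: n = (z / margin)^2 * prevalence * (1 - prevalence).
    """
    if not 0 < prevalence < 1:
        raise ValueError("prevalence must lie strictly between 0 and 1")
    return math.ceil((z / margin) ** 2 * prevalence * (1 - prevalence))


def n_for_shrinkage(n_params, r2_cs, shrinkage=0.9):
    """Sample size targeting a global shrinkage factor (limits overfitting).

    Assumed formula: n = p / ((S - 1) * ln(1 - R2_cs / S)), where p is the
    number of candidate predictor parameters and R2_cs is the anticipated
    Cox-Snell R-squared of the model.
    """
    if not 0 < shrinkage < 1:
        raise ValueError("shrinkage must lie strictly between 0 and 1")
    if not 0 < r2_cs < shrinkage:
        raise ValueError("r2_cs must lie strictly between 0 and shrinkage")
    return math.ceil(n_params / ((shrinkage - 1) * math.log(1 - r2_cs / shrinkage)))


def n_min(prevalence, n_params, r2_cs, margin=0.05, shrinkage=0.9):
    """Illustrative Nmin: the larger of the two criteria above."""
    return max(
        n_for_overall_risk(prevalence, margin),
        n_for_shrinkage(n_params, r2_cs, shrinkage),
    )


if __name__ == "__main__":
    # Hypothetical inputs, not taken from any reviewed study.
    required = n_min(prevalence=0.2, n_params=20, r2_cs=0.1)
    events_required = math.ceil(required * 0.2)
    print("Minimum participants:", required)
    print("Minimum participants with the event:", events_required)
```

Under these assumptions, a study that satisfies only the first function's threshold would correspond to the studies described as meeting the requirement for the overall risk only.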

Conclusion: Studies developing a prediction model using ML in oncology seldom justified their sample size, and sample sizes were often smaller than Nmin. As ML models almost certainly require a larger sample size than regression models, the true deficit is likely larger than estimated here. We recommend that researchers consider and report their sample size and, at a minimum, meet the sample size required when developing a regression-based model.

Keywords: machine learning; methodology; oncology; prediction model; sample size; systematic review.