Predicting few disinfection byproducts in the water distribution systems using machine learning models

Environ Sci Pollut Res Int. 2025 Jan 20. doi: 10.1007/s11356-025-35933-3. Online ahead of print.

Abstract

Concerns regarding disinfection byproducts (DBPs) in drinking water persist. DBPs are easier to measure in water treatment plants (WTPs) than in water distribution systems (WDSs), where sampling sites are harder to access, especially during adverse weather. Machine learning (ML) models offer a way to improve predictions of DBPs in WDSs. This study developed multiple ML models to predict trihalomethanes (THMs), haloacetic acids (HAAs), dichloroacetonitrile (DCAN), and N-nitrosodimethylamine (NDMA) in WDSs using data collected over 13 years (2008-2020) from 113 water supply systems (WSSs) in Ontario. Data were collected quarterly (four times per year), following Ontario's regulatory requirements. Four common types of ML model were trained and tested using different datasets: linear regressor (LR), random forest regressor (RFR), support vector regressor (SVR), and artificial neural networks, the last with either multi-fold cross-validation (ANN-MV) or single-fold validation (ANN-SV). R² values for the training datasets of the THMs, HAAs, DCAN, and NDMA models ranged from 0.533 to 0.976, 0.560 to 0.980, 0.602 to 0.993, and 0.449 to 0.858, respectively. For the testing datasets, R² ranged from 0.517 to 0.939, 0.437 to 0.945, 0.565 to 0.973, and 0.517 to 0.718, respectively. For THMs, HAAs, and DCAN, the ANN-SV models performed best, followed by the RFR model, whereas for NDMA, the SVR model performed best, followed by the LR model. Some models reliably predicted DBPs, suggesting that they could replace costly sampling and experimental analysis for DBPs in WDSs, thereby improving DBP control in WDSs and reducing human exposure and associated risks.
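
As an illustration of how training and testing R² values like those reported above are computed, the following is a minimal sketch, not the study's method or data. It assumes synthetic data, a single hypothetical predictor, an ordinary least-squares line standing in for the LR model, and an 80/20 train/test split; the helper names `r_squared` and `fit_line` are introduced here for illustration only.

```python
import random


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (slope, intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic data (assumption, NOT from the study): one hypothetical
# predictor x and a noisy linear response y.
rng = random.Random(0)
x = [rng.uniform(10, 100) for _ in range(200)]
y = [1.3 * xi + 5 + rng.gauss(0, 8) for xi in x]

# Hold out the last 20% of samples for testing (assumed split ratio).
split = int(0.8 * len(x))
x_train, y_train = x[:split], y[:split]
x_test, y_test = x[split:], y[split:]

# Fit on the training set only, then score both sets.
slope, intercept = fit_line(x_train, y_train)


def predict(xs):
    return [slope * xi + intercept for xi in xs]


r2_train = r_squared(y_train, predict(x_train))
r2_test = r_squared(y_test, predict(x_test))
print(f"train R2 = {r2_train:.3f}, test R2 = {r2_test:.3f}")
```

A testing R² well below the training R² would suggest that a model fits the training data more closely than it generalizes, which is one reason to report both ranges.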

Keywords: Disinfection byproducts; Drinking water; Machine learning models; Model training and testing; Risk reduction; Water distribution system.