Identification of driving factors for heavy metals and polycyclic aromatic hydrocarbons pollution in agricultural soils using interpretable machine learning

Sci Total Environ. 2025 Jan 15:960:178384. doi: 10.1016/j.scitotenv.2025.178384. Epub 2025 Jan 9.

Abstract

This study integrated data-driven interpretable machine learning (ML) with statistical methods, complemented by knowledge-driven discrimination diagrams, to identify the primary driving factors of heavy metal (HM) and polycyclic aromatic hydrocarbon (PAH) contamination in agricultural soils influenced by complex sources in a rapidly industrializing region of a megacity in southern China. First, the statistical characteristics of HM and PAH concentrations and their correlations with environmental covariates were explored. Three ML models and one statistical model, each using multiple environmental variables as predictors, were then developed and assessed for predicting HM concentrations in the agricultural soils. The Shapley Additive Explanations (SHAP) tool was applied to reveal how the main driving factors influenced pollutant concentrations. In addition, knowledge-based discrimination diagrams were used to discriminate the potential sources of the PAHs. Our findings indicated that Cd, Hg and Cu could be effectively predicted by the LightGBM and random forest (RF) models. The identification of pollution drivers revealed that traffic emissions, industrial activity and irrigation contributed significantly to the pollution by Cd, Hg, Cu and high-ring PAHs in the study area, while natural soil properties, including soil organic matter (SOM) and pH, also played crucial roles in influencing HM and PAH concentrations. This work introduced an innovative approach that leverages ML to understand complex urban soil pollution, thereby setting a precedent for data-driven environmental protection strategies to mitigate HM and PAH pollution. Future research is encouraged to optimize the models, enhance prediction accuracy, and incorporate a broader range of influential parameters.

Keywords: Heavy metals; Machine learning; Pollution driving factors; Polycyclic aromatic hydrocarbons; Shapley additive explanations; Soil pollution.
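
The abstract names SHAP as the tool used to attribute predicted pollutant concentrations to driving factors. As a rough illustration of the underlying idea only, and not the authors' pipeline, the sketch below computes exact Shapley values for a single sample of a small hand-written model. Everything in it is an assumption made for illustration: the function `toy_model`, its coefficients, the three features (SOM, pH, a road-density proxy for traffic) and the baseline sample are hypothetical and are not taken from the study. It also simplifies SHAP by replacing "absent" features with a single baseline sample, whereas SHAP tools typically average over a background dataset and use faster tree-specific algorithms for models such as LightGBM or RF.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample x relative to a baseline sample.

    Features outside a coalition are set to their baseline value
    (a simplification of how SHAP handles "missing" features).
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size: |S|! (n-|S|-1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without_i = [x[j] if j in subset else baseline[j] for j in range(n)]
                with_i = list(without_i)
                with_i[i] = x[i]
                # Marginal contribution of feature i to this coalition
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi


def toy_model(features):
    """Hypothetical concentration model (made-up coefficients, not from the paper)."""
    som, ph, road_density = features
    return (0.10 + 0.02 * som - 0.03 * (ph - 7.0)
            + 0.05 * road_density + 0.01 * som * road_density)


feature_names = ["SOM", "pH", "road_density"]
baseline = [2.0, 7.0, 0.0]   # hypothetical reference soil sample
sample = [4.0, 5.5, 3.0]     # hypothetical sample to explain

phi = shapley_values(toy_model, sample, baseline)
for name, value in zip(feature_names, phi):
    print(f"{name}: {value:+.3f}")
print(f"sum of contributions: {sum(phi):.3f}")
print(f"prediction minus baseline: {toy_model(sample) - toy_model(baseline):.3f}")
```

The contributions sum to the difference between the prediction for the sample and the prediction for the baseline, which is the additivity property that makes this kind of attribution usable for ranking driving factors. The interaction term between SOM and road density is split between the two features rather than assigned to either one alone.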