A hybrid data-driven framework for diagnosing contributing factors for soil heavy metal contaminations using machine learning and spatial clustering analysis

J Hazard Mater. 2022 Sep 5;437:129324. doi: 10.1016/j.jhazmat.2022.129324. Epub 2022 Jun 9.

Abstract

The efficacy of source apportionment is often limited by a lack of information on the natural and anthropogenic factors that contribute to soil heavy metal (HM) contamination. To overcome this limitation and advance data mining methods, a novel hybrid data-driven framework was proposed to diagnose the contributing factors in an industrialized region of Guangdong Province, China. The framework draws on multi-source big data and mainly combines naive Bayes (NB), random forest (RF), and bivariate local Moran's I (BLMI). The medium industry types of enterprises in the freely available Baidu point of interest data were successfully classified, and 250 contaminating enterprises were then identified as a contributing factor by the optimized NB classifier. The quantitative contributions of nine contributing factors to the As, Cd, and Hg concentrations were determined by the optimized RF. Twelve spatial clustering maps between the three HM concentrations and four key contributing factors were generated by BLMI, explicitly revealing their mutual interactions and internal effects and intuitively showing the "high-high" areas and their distributions. This framework yields rich information on contributing factors, such as medium industry types, contribution rates, spatial clusters, and spatial distributions.
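To make the "high-high" terminology concrete, the following is a minimal sketch of a bivariate local Moran's I calculation. It is not the authors' implementation. It assumes row-standardized weights built from a neighbor list, uses hypothetical values for six sampling sites arranged in a chain, and omits the permutation-based significance test that a full analysis would normally include. The names `hm`, `factor`, and `neighbors` are illustrative only.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize values to zero mean and unit (population) variance."""
    m = mean(values)
    s = pstdev(values)
    return [(v - m) / s for v in values]


def bivariate_local_moran(x, y, neighbors):
    """Return (statistic, cluster label) for each site.

    The statistic for site i is z_x[i] times the average of z_y over
    the neighbors of i (row-standardized weights). No significance
    test is performed in this sketch.
    """
    zx = zscores(x)
    zy = zscores(y)
    out = []
    for i, nbrs in enumerate(neighbors):
        # Spatial lag of the second variable around site i.
        lag = sum(zy[j] for j in nbrs) / len(nbrs) if nbrs else 0.0
        stat = zx[i] * lag
        if zx[i] > 0 and lag > 0:
            label = "high-high"
        elif zx[i] < 0 and lag < 0:
            label = "low-low"
        elif zx[i] > 0 and lag < 0:
            label = "high-low"
        elif zx[i] < 0 and lag > 0:
            label = "low-high"
        else:
            label = "undefined"
        out.append((stat, label))
    return out


# Hypothetical data: HM concentration and one contributing factor
# (for example, density of contaminating enterprises) at six sites.
hm = [1, 1, 1, 5, 5, 5]
factor = [0, 0, 0, 3, 3, 3]
# Assumed chain adjacency: each site neighbors the adjacent sites.
neighbors = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]

results = bivariate_local_moran(hm, factor, neighbors)
for site, (stat, label) in enumerate(results):
    print(site, round(stat, 3), label)
```

In this toy layout, sites with a high HM value surrounded by high factor values are labeled "high-high", which is the kind of cluster the maps in the abstract highlight. Sites at the boundary between the two groups have a zero spatial lag and are left unlabeled.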

Keywords: Bivariate local Moran’s I; Data mining; Medium industries; Naive Bayes; Random forest.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Bayes Theorem
  • China
  • Cluster Analysis
  • Environmental Monitoring / methods
  • Machine Learning
  • Metals, Heavy* / analysis
  • Risk Assessment
  • Soil
  • Soil Pollutants* / analysis

Substances

  • Metals, Heavy
  • Soil
  • Soil Pollutants