Comparison of principal component analysis algorithms for imputation in agrometeorological data in high dimension and reduced sample size

PLoS One. 2024 Dec 31;19(12):e0315574. doi: 10.1371/journal.pone.0315574. eCollection 2024.

Abstract

Precise, reliable, high-quality meteorological data are crucial in many areas of agronomy, especially in studies of reference evapotranspiration (ETo). ETo plays a fundamental role in the hydrological cycle, irrigation system planning and management, water demand modeling, water stress monitoring, and water balance estimation, as well as in hydrological and environmental studies. However, time series records often have missing measurements. The aim of this study was to evaluate the performance of alternative multivariate procedures for principal component analysis (PCA), based on the Nonlinear Iterative Partial Least Squares (NIPALS) and Expectation-Maximization (EM) algorithms, for imputing missing data in time series of meteorological variables. The evaluation used high-dimensional, small-sample databases with different percentages of missing data. The data, collected between 2011 and 2021 at 45 automatic weather stations in the São Paulo region, Brazil, were used to create a daily time series of ETo. Five missing-data scenarios (10%, 20%, 30%, 40%, and 50%) were simulated by randomly removing data from the ETo database. Imputation was then performed using the NIPALS-PCA, EM-PCA, and simple mean imputation (IM) procedures. This cycle was repeated 100 times, and average performance indicators were calculated. Performance was evaluated using the following indicators: correlation coefficient (r), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), Mean Square Error (MSE), Normalized Root Mean Square Error (nRMSE), Willmott Index (d), and performance index (c). In the scenario with 10% missing data, NIPALS-PCA achieved the lowest MAPE (15.4%), followed by EM-PCA (17.0%), while IM recorded a MAPE of 24.7%. In the scenario with 50% missing data, the ranking reversed: EM-PCA showed the lowest MAPE (19.1%), followed by NIPALS-PCA (19.9%). Both NIPALS-PCA and EM-PCA gave good imputation results (10% ≤ nRMSE < 20%), with NIPALS-PCA performing best in the 10%, 20%, and 30% scenarios, and EM-PCA in the 40% and 50% scenarios. Based on the statistical evaluation, the NIPALS-PCA, EM-PCA, and IM imputation models proved suitable for estimating missing ETo data, with the PCA-based models using the NIPALS and EM algorithms showing the most promise. Future research should explore the effectiveness of various imputation methods in diverse climatic and geographical contexts, and develop new techniques that account for the temporal and spatial structure of meteorological data, to advance climate understanding and prediction.
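
The procedure described above (randomly remove values, impute, compare against the withheld values) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the data are synthetic, the function names (em_pca_impute, mape, nrmse) are hypothetical, the number of components is chosen arbitrarily, and nRMSE is assumed here to be normalized by the mean of the observed values, which the abstract does not specify. The EM-style step shown is the common iterative scheme of filling missing cells, fitting a low-rank PCA by SVD, and refilling the missing cells from the reconstruction.

```python
import numpy as np

def em_pca_impute(X, n_components=2, max_iter=200, tol=1e-6):
    """EM-style iterative PCA imputation (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Initialize missing cells with column means (the mean-imputation baseline).
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(max_iter):
        mu = filled.mean(axis=0)
        # Low-rank PCA reconstruction of the currently completed matrix.
        U, s, Vt = np.linalg.svd(filled - mu, full_matrices=False)
        recon = (U[:, :n_components] * s[:n_components]) @ Vt[:n_components] + mu
        # Overwrite only the missing cells; observed values are never changed.
        new_filled = np.where(missing, recon, X)
        if missing.any():
            change = np.sqrt(np.mean((new_filled - filled)[missing] ** 2))
        else:
            change = 0.0
        filled = new_filled
        if change < tol:
            break
    return filled

def mape(obs, pred):
    """Mean Absolute Percentage Error, in percent."""
    return 100.0 * np.mean(np.abs((obs - pred) / obs))

def nrmse(obs, pred):
    """RMSE divided by the observed mean, in percent (assumed normalization)."""
    return 100.0 * np.sqrt(np.mean((obs - pred) ** 2)) / np.mean(obs)

# Synthetic stand-in for a daily ETo matrix: few days (rows), many stations (columns).
rng = np.random.default_rng(0)
n_days, n_stations = 30, 45
regional = 4.0 + 1.5 * np.sin(np.linspace(0, 3 * np.pi, n_days))  # shared signal
gain = rng.uniform(0.8, 1.2, n_stations)                          # station scaling
truth = np.outer(regional, gain) + rng.normal(0, 0.1, (n_days, n_stations))

# Simulate the 10% missing-data scenario by randomly removing values.
mask = rng.random(truth.shape) < 0.10
X = truth.copy()
X[mask] = np.nan

pca_filled = em_pca_impute(X, n_components=2)
mean_filled = np.where(mask, np.nanmean(X, axis=0), X)

for name, est in [("EM-PCA", pca_filled), ("mean", mean_filled)]:
    print(name, "MAPE:", mape(truth[mask], est[mask]),
          "nRMSE:", nrmse(truth[mask], est[mask]))
```

Repeating the masking and imputation steps with different random seeds and averaging the indicators would mirror the 100-repetition design described in the abstract. Because the synthetic matrix is close to low rank by construction, the gap between the PCA-based and mean-based results here says nothing about the magnitudes reported for real ETo data.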

Publication types

  • Comparative Study

MeSH terms

  • Algorithms*
  • Brazil
  • Least-Squares Analysis
  • Principal Component Analysis*
  • Sample Size

Grants and funding

This work was carried out with the support of the Coordination for the Improvement of Higher Education Personnel - Brazil (CAPES) - Funding Code 001. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.