A novel importance scores based variable selection approach and validation using a MIR and NIR dataset

Spectrochim Acta A Mol Biomol Spectrosc. 2025 Jan 4:330:125701. doi: 10.1016/j.saa.2025.125701. Online ahead of print.

Abstract

Variable selection is important in spectral analysis for improving interpretation quality and accuracy. This study introduces a novel variable selection process, named "VMHBSC", which consists of six steps, with each letter representing one step. To demonstrate the process and its advantages, two datasets were employed: a mid-infrared (MIR) spectral dataset (234 × 7468, samples × variables) of Chenpi samples (a traditional Chinese medicinal material derived from the dried peel of mature tangerines) and a near-infrared (NIR) spectral dataset (16000 × 256) from a modeling competition. In the MIR dataset, VMHBSC selected 3 important variables from all 7468 variables, and models established using Decision Trees (DT), Gradient Boosting Decision Tree (GBDT), and Extreme Gradient Boosting (XGBoost) achieved higher accuracy than models built with other variable selection methods. For the NIR dataset, VMHBSC selected 24 important variables from all 256 variables. Based on these 24 common variables, three hybrid models (VMHBSC-DT, VMHBSC-GBDT and VMHBSC-XGBoost) were also established and showed stable performance. These findings indicate the effectiveness of the VMHBSC process in enhancing model performance and robustness.
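The abstract does not spell out the six VMHBSC steps, so the following is only a minimal sketch of the general idea behind importance-score-based variable selection, not the authors' procedure. It assumes a simple stand-in importance score (the largest Gini impurity decrease from a single threshold split on one variable) and keeps the variables ranked in the top k by every scorer, which is one plausible reading of "common variables". All function names, the toy data, and the extra score lists are hypothetical.

```python
from typing import List, Sequence, Set


def gini(labels: Sequence[int]) -> float:
    """Gini impurity of a collection of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def stump_importance(column: Sequence[float], labels: Sequence[int]) -> float:
    """Largest Gini decrease from one threshold split on this variable.

    Assumption: a stand-in score, not the importance measure of the paper.
    """
    n = len(labels)
    parent = gini(labels)
    pairs = sorted(zip(column, labels))
    best = 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - child)
    return best


def top_k(scores: Sequence[float], k: int) -> Set[int]:
    """Indices of the k highest-scoring variables."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return set(order[:k])


def common_variables(score_lists: List[Sequence[float]], k: int) -> List[int]:
    """Variables that appear in the top k of every score list."""
    selected = [top_k(s, k) for s in score_lists]
    return sorted(set.intersection(*selected))


# Toy data (hypothetical): 4 samples x 3 variables, two classes.
X = [
    [0.1, 5.0, 1.0],
    [0.2, 1.0, 2.0],
    [0.9, 4.0, 1.5],
    [0.8, 2.0, 3.0],
]
y = [0, 0, 1, 1]

columns = list(zip(*X))  # one tuple of values per variable
stump_scores = [stump_importance(col, y) for col in columns]

# Hypothetical importance scores, as if reported by three different models.
model_scores = [
    [0.5, 0.2, 0.1],
    [0.7, 0.05, 0.25],
    [0.6, 0.3, 0.2],
]
shared = common_variables(model_scores, k=2)
```

In this toy case variable 0 separates the two classes perfectly, so it receives the highest stump score and is also the only variable in the top two of all three hypothetical score lists. In practice the score lists would come from fitted models such as DT, GBDT, or XGBoost rather than hand-written values.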

Keywords: Discriminant analysis; Machine learning; VMHBSC; Variable selection.