The long satellite aerosol data record enables assessments of historical PM2.5 level in regions where routine PM2.5 monitoring began only recently. However, most previous models reported decreased prediction accuracy when predicting PM2.5 levels outside the model-training period. In this study, we proposed an ensemble machine learning approach that provided reliable PM2.5 hindcast capabilities. The missing satellite data were first filled by multiple imputation. Then the modeling domain, China, was divided into seven regions using a spatial clustering method to control for unobserved spatial heterogeneity. A set of machine learning models including random forest, generalized additive model, and extreme gradient boosting were trained in each region separately. Finally, a generalized additive ensemble model was developed to combine predictions from different algorithms. The ensemble prediction characterized the spatiotemporal distribution of daily PM2.5 well with the cross-validation (CV) R2 (RMSE) of 0.79 (21 μg/m3). The cluster-based subregion models outperformed national models and improved the CV R2 by ∼0.05. Compared with previous studies, our model provided more accurate out-of-range predictions at the daily level ( R2 = 0.58, RMSE = 29 μg/m3) and monthly level ( R2 = 0.76, RMSE = 16 μg/m3). Our hindcast modeling system allows for the construction of unbiased historical PM2.5 levels.