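Yao Hongyan, Shi Runhe. Research on hourly PM2.5 concentration prediction of random forest based on optimal selection of surrounding stations [J]. Acta Scientiae Circumstantiae, 2021, 41(4): 1565-1573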
Hourly PM2.5 concentration prediction with random forest based on optimized selection of surrounding stations
- Research on hourly PM2.5 concentration prediction of random forest based on optimal selection of surrounding stations
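- Funding: National Key Research and Development Program of China (No.2016YFC1302602); Major Project of Philosophy and Social Sciences Research, Ministry of Education (No.19JZD023); Science and Technology Innovation Action Plan of the Shanghai Science and Technology Commission (No.19DZ1201505); Fundamental Research Funds for the Central Universities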
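- Yao Hongyan (姚红岩)
- 1. Key Laboratory of Geographic Information Science (Ministry of Education), East China Normal University, Shanghai 200241; 2. School of Geographic Sciences, East China Normal University, Shanghai 200241; 3. Joint Laboratory for Environmental Remote Sensing and Data Assimilation, East China Normal University, Shanghai 200241
- Shi Runhe (施润和)
- 1. Key Laboratory of Geographic Information Science (Ministry of Education), East China Normal University, Shanghai 200241; 2. School of Geographic Sciences, East China Normal University, Shanghai 200241; 3. Joint Laboratory for Environmental Remote Sensing and Data Assimilation, East China Normal University, Shanghai 200241; 4. Joint Research Institute for Resources and Environment, East China Normal University, Shanghai 200062; 5. Institute of Eco-Chongming, East China Normal University, Shanghai 202162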
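- Abstract (translated from the Chinese): Airborne PM2.5 is a major atmospheric pollutant that threatens human health, so predicting it effectively and issuing timely warnings matters. A large body of work shows that random forest models incorporating information from surrounding stations perform well for single-station PM2.5 prediction, but how to select those surrounding stations has not been studied specifically, and some selection methods are subjective. This paper proposes a method for optimizing the selection of surrounding stations based on time-lag cross-correlation analysis. Using the Shanghai Shiwuchang air quality monitoring station (a national control station) as an example, it builds a set of random forest regression models that predict the station's PM2.5 concentration 1 to 24 h ahead and compares the importance of each input factor. The current PM2.5 concentration at the target station is the most important input for predictions 1 to 16 h ahead, whereas wind direction, among the meteorological variables, is the most important for predictions 17 to 24 h ahead. The importance ranking of PM2.5 information from surrounding stations rises markedly as the prediction horizon lengthens, and different stations affect different horizons in significantly different ways, so they should be treated separately and selected optimally when modeling. Compared with the alternatives, models built with surrounding stations selected by this method have some advantage in accuracy metrics such as RMSE (RMSE for the 12 h and 24 h forecasts falls by 11.8% and 13.3%, respectively) and a clearly lower false alarm ratio for pollution events, which is of practical value (the false alarm ratio for the 12 h and 24 h forecasts falls by 16.1% and 25.6%, respectively). The method has potential for operational use.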
- Abstract: PM2.5 is a major air pollutant that threatens human health, so effective prediction and prompt warning are important. Many studies have shown that the Random Forest (RF) model performs well in predicting PM2.5 concentration at a single station when it incorporates information from surrounding stations. However, the selection of surrounding stations has received little targeted research, and some existing selection methods are subjective. This study proposes a method for optimizing the selection of surrounding stations based on Time-Lag Cross-Correlation (TLCC) analysis. Taking the Shanghai Shiwuchang air quality monitoring station (a national-level station) as an example, a set of RF regression models was constructed to predict the PM2.5 concentration at the station over the next 1 to 24 hours, and the importance of each input factor in the prediction models was compared and analyzed. The current PM2.5 concentration at the prediction station was the most important input for predictions of the next 1 to 16 hours, while wind direction was the most important for predictions of the next 17 to 24 hours. As the forecast horizon increased, the PM2.5 concentrations of the surrounding stations rose markedly in the importance ranking, and the influence of different stations differed significantly across forecast horizons. Surrounding stations should therefore be treated differently and selected optimally when modeling. The comparison showed that the prediction models built with surrounding stations selected by the proposed method not only had some advantage in accuracy metrics such as RMSE (12-hour and 24-hour forecast RMSE decreased by 11.8% and 13.3%, respectively), but also had a clearly lower false alarm ratio for pollution events (the false alarm ratio for the 12-hour and 24-hour forecasts dropped by 16.1% and 25.6%, respectively). The method has practical value and potential for operational application in air pollution prediction and early warning.
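
The abstract names time-lag cross-correlation (TLCC) as the basis for choosing surrounding stations but does not give the selection rule. The following is a minimal illustrative sketch, not the authors' implementation. It assumes hourly series of equal length with no gaps, and it assumes a simple rule: for a forecast horizon of `horizon` hours, rank each neighbouring station by the Pearson correlation between its PM2.5 at hour t and the target station's PM2.5 at hour t + `horizon`, then keep the top `top_k`. The function names, the `top_k` parameter, and the synthetic station names are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (NaN if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def lagged_corr(neighbor, target, lag):
    """Correlate neighbor[t] with target[t + lag] for a non-negative lag in hours."""
    if lag == 0:
        return pearson(neighbor, target)
    return pearson(neighbor[:-lag], target[lag:])


def select_stations(neighbors, target, horizon, top_k=3):
    """Assumed rule: keep the top_k neighbours with the highest correlation
    at a lag equal to the forecast horizon."""
    scores = {name: lagged_corr(series, target, horizon)
              for name, series in neighbors.items()}
    valid = [(name, r) for name, r in scores.items() if not math.isnan(r)]
    return sorted(valid, key=lambda item: item[1], reverse=True)[:top_k]


# Synthetic data: the target station repeats the "upwind" series 3 hours later.
hours = range(200)
upwind = [math.sin(0.3 * t) + 0.1 * math.sin(1.7 * t) for t in hours]
unrelated = [math.cos(0.9 * t) for t in hours]
target = [upwind[max(t - 3, 0)] for t in hours]

neighbors = {"upwind": upwind, "unrelated": unrelated}
best = select_stations(neighbors, target, horizon=3, top_k=1)
print(best[0][0])
```

Under this assumed rule, the station set can differ for each of the 24 horizon-specific models, which matches the abstract's point that different stations matter at different forecast horizons. Real monitoring data would also need handling of missing hours before computing the correlations.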