Research Report

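  • Li Yi, Hua Jing, Liu Baoshuang, Zhang Yufen, Feng Yinchang. Study on the outlier identification approaches for atmospheric pollutant monitoring data [J]. Acta Scientiae Circumstantiae (环境科学学报), 2022, 42(12): 341-352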

  • Study on the outlier identification approaches for atmospheric pollutant monitoring data
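  • Funding: China Postdoctoral Science Foundation (No. 2019M660986); Chinese Academy of Engineering Academy-Local Cooperation Project (No. 2020C0-0002); National Research Program for Key Issues in Air Pollution Control (No. DQGG2021301)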
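  • Authors and affiliations
  • Li Yi (李艺): College of Environmental Science and Engineering, Nankai University, State Environmental Protection Key Laboratory of Urban Ambient Air Particulate Matter Pollution Prevention and Control, Tianjin 300350; CMA-NKU Cooperative Laboratory for Atmospheric Environment-Health Research, Tianjin 300350
  • Hua Jing (华静): Tianjin Municipal Bureau of Ecology and Environment, Tianjin 300191
  • Liu Baoshuang (刘保双): College of Environmental Science and Engineering, Nankai University, State Environmental Protection Key Laboratory of Urban Ambient Air Particulate Matter Pollution Prevention and Control, Tianjin 300350; CMA-NKU Cooperative Laboratory for Atmospheric Environment-Health Research, Tianjin 300350
  • Zhang Yufen (张裕芬): College of Environmental Science and Engineering, Nankai University, State Environmental Protection Key Laboratory of Urban Ambient Air Particulate Matter Pollution Prevention and Control, Tianjin 300350; CMA-NKU Cooperative Laboratory for Atmospheric Environment-Health Research, Tianjin 300350
  • Feng Yinchang (冯银厂): College of Environmental Science and Engineering, Nankai University, State Environmental Protection Key Laboratory of Urban Ambient Air Particulate Matter Pollution Prevention and Control, Tianjin 300350; CMA-NKU Cooperative Laboratory for Atmospheric Environment-Health Research, Tianjin 300350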
  • Abstract: Quality control of atmospheric monitoring data, and in particular the accurate identification of outliers, is an important prerequisite for correctly analyzing the causes of air pollution. At present, outliers are identified mainly on the basis of manual experience, which makes it very difficult to remove them quickly and effectively from massive environmental datasets and thus to ensure the accuracy of the data being analyzed. Drawing on the time-series fluctuation characteristics of atmospheric pollutant monitoring data, this study constructed three rapid outlier identification methods based on a sliding-window mechanism and statistical indicators: the sliding quartile, sliding quartile gap, and sliding standard deviation methods. The effectiveness of the three methods was then tested and evaluated using time series of routinely monitored air pollutants (PM2.5, PM10, SO2, NO2, CO, and O3) on clean days and polluted days that contained real outliers, in order to obtain the optimal identification method and associated parameters for each pollutant. The results showed that, on both clean and polluted days, the sliding quartile method performed best at identifying outliers in the PM2.5, PM10, SO2, NO2, CO, and O3 concentration time series. On clean days, the optimal sliding window lengths were 10~16, 14~16, 12~16, 38~40, 6~38, and 6~8, respectively, and the optimal tolerance constants were 1.6~1.7, 1.6~2.6, 1.7~2.0, 2.3~2.5, 1.6~4.5, and 3.7~3.8, respectively. On polluted days, the optimal sliding window lengths were 10~44, 10~14, 10~32, 14~48, 10~48, and 14~20, respectively, and the optimal tolerance constants were 2.7~4.5, 1.4~2.8, 2.8~4.5, 2.7~4.5, 1.5~4.5, and 2.5~3.8, respectively. The time-series fluctuation characteristics of the different pollutants differed between clean and polluted days, so the optimal parameters of the applicable method differed significantly. The outlier identification methods constructed in this study are intended to provide technical support for the rapid identification of outliers in environmental big data and for more accurate analysis of the causes of air pollution.
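
The abstract names the methods but does not give their formulas, so the following Python sketch is only one plausible reading, not the authors' implementation. It assumes that the sliding quartile method flags a point lying outside [Q1 - k·IQR, Q3 + k·IQR] within a window around it, and that the sliding standard deviation method flags a point more than k standard deviations from the window mean. The function names, the window placement, the quartile interpolation, and the sample data are all hypothetical. The "sliding quartile gap" method is omitted because the abstract gives no hint of its definition.

```python
from statistics import mean, quantiles, stdev


def _window(series, i, length):
    """Return `length` consecutive values around index i (fewer if the series is shorter).

    Assumption: the window is roughly centred on i and shifted inward at the
    ends of the series so that it always keeps its full length.
    """
    n = len(series)
    lo = min(max(0, i - length // 2), max(0, n - length))
    return series[lo:min(n, lo + length)]


def sliding_quartile_outliers(series, window=12, k=1.7):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] of their window (assumed rule)."""
    if window < 2 or len(series) < 2:
        raise ValueError("need a window and a series of at least 2 points")
    flags = []
    for i, x in enumerate(series):
        q1, _, q3 = quantiles(_window(series, i, window), n=4, method="inclusive")
        iqr = q3 - q1
        flags.append(x < q1 - k * iqr or x > q3 + k * iqr)
    return flags


def sliding_std_outliers(series, window=12, k=2.5):
    """Flag points more than k sample standard deviations from the window mean (assumed rule)."""
    if window < 2 or len(series) < 2:
        raise ValueError("need a window and a series of at least 2 points")
    flags = []
    for i, x in enumerate(series):
        w = _window(series, i, window)
        flags.append(abs(x - mean(w)) > k * stdev(w))
    return flags


# Made-up hourly concentrations with one injected spike at index 8.
hourly = [10, 11, 12, 11, 10, 12, 11, 10, 95, 11, 12, 10, 11, 12, 11, 10]
quartile_hits = [i for i, f in enumerate(sliding_quartile_outliers(hourly)) if f]
std_hits = [i for i, f in enumerate(sliding_std_outliers(hourly)) if f]
print(quartile_hits, std_hits)
```

Here `window` plays the role of the sliding window length and `k` the tolerance constant. The defaults of 12 and 1.7 are simply values inside the clean-day ranges reported above, and the `k=2.5` default for the standard deviation variant is an arbitrary illustrative choice. One design point worth noting: a spike inflates the window's own standard deviation, so the standard deviation rule can miss outliers when `k` is large relative to the window length, whereas quartiles are much less affected by a single extreme value.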
