• Construction and analysis of a machine learning-based prediction model for cerium anomalies in rivers
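  • Funding: National Natural Science Foundation of China, Regional Program (No. 21966013/B060303); Jiangxi Province "Thousand Talents Plan" for High-Level Innovation and Entrepreneurship Talents, Long-Term Innovative Talent Project (Youth category, No. 205201000006); High-Level Introduced Talent Project of Jiangxi University of Science and Technology (No. 3401223393)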
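  • Authors and affiliations
  • Zhang Pengze (张鹏泽): School of Resources and Environmental Engineering, Jiangxi University of Science and Technology
  • Ye Li (叶丽): School of Resources and Environmental Engineering, Jiangxi University of Science and Technology
  • Yao Junqiang (姚军强): School of Resources and Environmental Engineering, Jiangxi University of Science and Technology
  • Wang Zaosheng (王灶生): School of Resources and Environmental Engineering, Jiangxi University of Science and Technology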
  • Abstract: Clarifying how cerium (Ce) anomalies in rivers relate to water quality parameters, and building an effective model to predict them, is important for understanding the geochemical behavior of Ce in water. Based on a compiled dataset of rare earth elements in rivers, eight key water quality parameters were selected as input variables, and three machine learning algorithms, Support Vector Regression (SVR), Random Forest (RF), and eXtreme Gradient Boosting (XGBoost), were used to build Ce anomaly prediction models. Feature importance analysis and Shapley Additive Explanations (SHAP) were used to assess the importance of each input variable and the extent of its influence. The SVR model performed poorly on both the training and test sets. In contrast, the ensemble methods RF and XGBoost showed good fitting and predictive performance, with only a small gap between them. Because predictive ability on unseen data matters most, the two were compared on the test set, where RF outperformed XGBoost with a coefficient of determination (R2) of 0.8051, a Root Mean Square Error (RMSE) of 0.1250, and a Mean Absolute Error (MAE) of 0.08966. RF was therefore the best of the models built for predicting river Ce anomalies. Feature importance analysis and SHAP results for the RF model showed that Fe, Al, pH, and Mn are the important factors in predicting river Ce anomalies, and that pH has a significant negative correlation with the Ce anomaly. These results provide a scientific reference for interpreting the relationship between river water quality parameters and Ce anomalies.
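
The abstract ranks the models by three test-set metrics: R2, RMSE, and MAE. As a minimal sketch of how these are conventionally defined, the function below computes all three from paired observed and predicted values. The function name `regression_metrics` and the sample numbers are hypothetical and chosen for illustration only; they are not taken from the study's data or code.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (R2, RMSE, MAE) for paired observed and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    mean_true = sum(y_true) / n
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)               # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all observed values are identical")
    r2 = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(r) for r in residuals) / n
    return r2, rmse, mae


# Hypothetical observed and predicted Ce anomaly values (illustrative only).
observed = [0.5, 0.75, 1.0, 1.25]
predicted = [0.5, 1.0, 1.0, 1.0]
r2, rmse, mae = regression_metrics(observed, predicted)
print(f"R2={r2:.4f}  RMSE={rmse:.4f}  MAE={mae:.4f}")
```

A higher R2 and lower RMSE and MAE on held-out data indicate better generalization, which is the basis on which the abstract prefers RF over XGBoost.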
