THE APPLICATION OF MULTIPLE IMPUTATION METHOD BASED ON HYBRID MULTI-STRATEGY IN HANDLING MISSING AIR QUALITY MONITORING DATA

Authors

  • ZhiQuan Zheng (Corresponding Author) School of Data Science and Information Engineering, Guizhou Minzu University, Guiyang 550025, Guizhou, China
  • WenYong Zhang School of Data Science and Information Engineering, Guizhou Minzu University, Guiyang 550025, Guizhou, China; School of Computer Science and Engineering, South China University of Technology, Guangzhou 510641, Guangdong, China
  • ZhongChen Luo School of Nursing, Guizhou Medical University, Guiyang 550001, Guizhou, China

Keywords:

Air quality monitoring, Missing mechanism, Data imputation, Confidence interval, Multiple imputation, Hybrid multi-strategy imputation

Abstract

Air quality monitoring data are a crucial basis for assessing air pollution levels and formulating control measures. However, missing values are prevalent because of instrument malfunctions, human factors, and other causes, and they significantly compromise data integrity and usability. To address this problem, this study collected nearly 1 million air quality monitoring records from 12 monitoring stations covering 2015 to 2023, and summarized and analyzed the mechanisms and characteristics of missing data in such datasets. Data imputation experiments were conducted in R. By controlling the missing mechanism and designing the imputation experiments accordingly, the imputation performance of the algorithms was evaluated by mean absolute error (MAE), root mean square error (RMSE), and weighted mean absolute percentage error (WMAPE) under the missing completely at random (MCAR) mechanism. Specifically, the imputation experiment for each missing scenario was repeated N times, and the mean of each criterion, reported with its 95% confidence interval, was used to compare four multiple imputation algorithms. The experimental results show that: (1) the hybrid multi-strategy imputation method MNPRF shows significant advantages on all datasets, with the smallest confidence limits and the narrowest intervals; (2) this method not only inherits the strengths of its parent algorithms and substantially improves data quality, but also mitigates, to some extent, their weaknesses.
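
The evaluation protocol summarized in the abstract (delete values completely at random, impute them, score the result, repeat N times, report the mean with a 95% confidence interval) can be illustrated with a minimal sketch. It is written in Python for illustration, whereas the study itself used R, and it is not the authors' implementation. Several parts are assumptions: the WMAPE formula (total absolute error divided by the total absolute true value), the normal-approximation confidence interval, the synthetic readings, and the placeholder mean imputer, which stands in for the multiple imputation algorithms and is not MNPRF. All function names are hypothetical.

```python
import math
import random
import statistics


def mae(truth, imputed):
    """Mean absolute error over the deleted cells."""
    return sum(abs(t - p) for t, p in zip(truth, imputed)) / len(truth)


def rmse(truth, imputed):
    """Root mean square error over the deleted cells."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, imputed)) / len(truth))


def wmape(truth, imputed):
    """Weighted MAPE, assumed here to be sum|t - p| / sum|t|."""
    return sum(abs(t - p) for t, p in zip(truth, imputed)) / sum(abs(t) for t in truth)


def mcar_indices(n, miss_rate, rng):
    """Choose positions to delete completely at random (0 < miss_rate < 1)."""
    k = max(1, round(n * miss_rate))
    return rng.sample(range(n), k)


def mean_impute(values, missing_idx):
    """Placeholder imputer (assumption): fill each gap with the observed mean."""
    missing = set(missing_idx)
    observed = [v for i, v in enumerate(values) if i not in missing]
    fill = statistics.fmean(observed)
    return [fill] * len(missing_idx)


def evaluate(values, miss_rate, n_repeats, seed=0):
    """Repeat the delete/impute/score cycle; needs n_repeats >= 2.

    Returns {criterion: (mean, lower, upper)} with a 95% interval based on
    the normal approximation mean +/- z * sd / sqrt(n_repeats).
    """
    rng = random.Random(seed)
    metrics = {"MAE": mae, "RMSE": rmse, "WMAPE": wmape}
    scores = {name: [] for name in metrics}
    for _ in range(n_repeats):
        idx = mcar_indices(len(values), miss_rate, rng)
        truth = [values[i] for i in idx]
        imputed = mean_impute(values, idx)
        for name, fn in metrics.items():
            scores[name].append(fn(truth, imputed))

    z = statistics.NormalDist().inv_cdf(0.975)  # two-sided 95% quantile
    summary = {}
    for name, xs in scores.items():
        mean = statistics.fmean(xs)
        half = z * statistics.stdev(xs) / math.sqrt(len(xs))
        summary[name] = (mean, mean - half, mean + half)
    return summary


# Synthetic stand-in for one pollutant series (not the study's data).
data_rng = random.Random(42)
readings = [abs(data_rng.gauss(50.0, 10.0)) + 1.0 for _ in range(500)]
for name, (mean, low, high) in evaluate(readings, miss_rate=0.1, n_repeats=30).items():
    print(f"{name}: mean={mean:.4f}, 95% CI=({low:.4f}, {high:.4f})")
```

Comparing algorithms in this setup means swapping `mean_impute` for each candidate imputer and running `evaluate` with the same seed, so that every method sees the same deleted cells. A narrower interval indicates more stable performance across repetitions, which is the sense in which the abstract reports interval widths.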

Published

2025-02-26

How to Cite

Zheng, Z., Zhang, W., Luo, Z. (2025). The Application Of Multiple Imputation Method Based On Hybrid Multi-Strategy In Handling Missing Air Quality Monitoring Data. Eurasia Journal of Science and Technology, 7(1), 52-63. https://doi.org/10.61784/jcsee3037