THE APPLICATION OF MULTIPLE IMPUTATION METHOD BASED ON HYBRID MULTI-STRATEGY IN HANDLING MISSING AIR QUALITY MONITORING DATA
Volume 7, Issue 1, Pp 52-63, 2025
DOI: https://doi.org/10.61784/jcsee3037
Author(s)
ZhiQuan Zheng1*, WenYong Zhang1,2, ZhongChen Luo3
Affiliation(s)
1School of Data Science and Information Engineering, Guizhou Minzu University, Guiyang 550025, Guizhou, China.
2School of Computer Science and Engineering, South China University of Technology, Guangzhou 510641, Guangdong, China.
3School of Nursing, Guizhou Medical University, Guiyang 550001, Guizhou, China.
Corresponding Author
ZhiQuan Zheng
ABSTRACT
Air quality monitoring data are a crucial basis for assessing air pollution levels and formulating control measures. However, values are frequently missing because of instrument malfunctions, human factors, and other causes, which significantly compromises data integrity and usability. To address this problem, this study collected nearly 1 million air quality monitoring records from 12 monitoring stations between 2015 and 2023, and summarized and analyzed the mechanisms and characteristics of missing data in such datasets. Data imputation experiments were conducted in R. By controlling the missingness mechanism and the experimental design, the imputation performance of the algorithms was evaluated under the missing completely at random (MCAR) mechanism using three criteria: mean absolute error (MAE), root mean square error (RMSE), and weighted mean absolute percentage error (WMAPE). Specifically, the imputation experiments under each missing-data scenario were repeated N times, and the mean values of the criteria, together with 95% confidence intervals, were used to evaluate four multiple imputation algorithms. The experimental results show that: (1) the hybrid multi-strategy imputation method MNPRF demonstrates significant advantages across all datasets, with the smallest confidence limits and the narrowest interval widths; (2) this method not only inherits the strengths of its parent algorithms, substantially improving data quality, but also mitigates their weaknesses to some extent.
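The evaluation procedure summarized in the abstract can be illustrated with a minimal sketch. It is written in Python rather than the R used in the study, and it rests on several assumptions that are not taken from the paper: WMAPE is taken as the sum of absolute errors divided by the sum of absolute true values, the 95% confidence interval uses a normal approximation, and `mean_impute` is a hypothetical placeholder imputer standing in for the four algorithms compared (it is not MNPRF). The function names, the missing rate, and the number of repetitions are likewise illustrative only.

```python
import numpy as np

def mae(y, yhat):
    """Mean absolute error."""
    return float(np.mean(np.abs(y - yhat)))

def rmse(y, yhat):
    """Root mean square error."""
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

def wmape(y, yhat):
    """Weighted MAPE, assumed here to be sum|y - yhat| / sum|y|."""
    return float(np.sum(np.abs(y - yhat)) / np.sum(np.abs(y)))

def mean_impute(x_missing):
    """Hypothetical placeholder imputer: fill NaNs with column means."""
    col_means = np.nanmean(x_missing, axis=0)
    filled = x_missing.copy()
    rows, cols = np.where(np.isnan(filled))
    filled[rows, cols] = col_means[cols]
    return filled

def evaluate(x_complete, impute, miss_rate=0.1, repeats=30, seed=0):
    """Repeat MCAR masking and imputation; return mean and 95% CI per metric."""
    rng = np.random.default_rng(seed)
    metrics = {"MAE": mae, "RMSE": rmse, "WMAPE": wmape}
    scores = {name: [] for name in metrics}
    for _ in range(repeats):
        # MCAR: every cell is removed with the same probability,
        # independently of its own value and of all other values.
        mask = rng.random(x_complete.shape) < miss_rate
        x_missing = x_complete.copy()
        x_missing[mask] = np.nan
        x_filled = impute(x_missing)
        # Score only the cells that were artificially removed.
        y, yhat = x_complete[mask], x_filled[mask]
        for name, fn in metrics.items():
            scores[name].append(fn(y, yhat))
    summary = {}
    for name, values in scores.items():
        v = np.asarray(values)
        mean = v.mean()
        # Normal-approximation half-width (an assumption of this sketch).
        half = 1.96 * v.std(ddof=1) / np.sqrt(len(v))
        summary[name] = (float(mean), float(mean - half), float(mean + half))
    return summary

# Synthetic stand-in for a complete block of monitoring data.
demo_rng = np.random.default_rng(42)
demo_data = demo_rng.normal(loc=50.0, scale=10.0, size=(200, 4))
demo_summary = evaluate(demo_data, mean_impute)
for metric, (mean, lower, upper) in demo_summary.items():
    print(f"{metric}: mean={mean:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

Under this setup, a narrower interval for one imputer than for another indicates more stable performance across repetitions, which is the sense in which the abstract compares confidence limits and interval widths.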
KEYWORDS
Air quality monitoring; Missing mechanism; Data imputation; Confidence interval; Multiple imputation; Hybrid multi-strategy imputation
CITE THIS PAPER
ZhiQuan Zheng, WenYong Zhang, ZhongChen Luo. The application of multiple imputation method based on hybrid multi-strategy in handling missing air quality monitoring data. Journal of Computer Science and Electrical Engineering. 2025, 7(1): 52-63. DOI: https://doi.org/10.61784/jcsee3037.