Hybrid Analytic Method for Missing Data Imputation in Medical Big Data

Compared with traditional datasets, medical data poses several hidden challenges. In particular, missing values for certain attributes make it difficult for data mining researchers to support correct medical decisions. This paper proposes a hybrid scheme that combines the k-means method with regression analysis. Combining these two analytical methods makes it possible to find the best distributional model of the numerical data in space and helps to predict the missing values. Applied to medical data (a diabetes dataset), the proposed model predicts the missing values with a low error rate, which is considered very satisfactory.
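
The abstract does not specify how clustering and regression are combined, so the following is only a minimal sketch of one plausible reading: group records with k-means on an observed attribute, then fit a separate least-squares line inside each cluster and use it to predict the missing attribute. The function names (`kmeans_1d`, `fit_line`, `impute_missing`), the single-predictor setup, the quantile-based initialisation, and the toy data are all assumptions made for illustration; they are not taken from the paper.

```python
from statistics import mean


def kmeans_1d(values, k, iters=50):
    """Lloyd's k-means on scalar values; returns one cluster label per value.

    Assumption: centroids start at evenly spaced quantiles so that the
    result is deterministic.
    """
    s = sorted(values)
    centroids = [s[(2 * i + 1) * len(s) // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: nearest centroid for every value.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c]))
                  for v in values]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(mean(members) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return labels


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:  # no spread in x: fall back to predicting the mean of y
        return my, 0.0
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope


def impute_missing(rows, k=2):
    """rows: list of (x, y) pairs where y may be None. Returns filled rows."""
    labels = kmeans_1d([x for x, _ in rows], k)
    observed = [(x, y) for x, y in rows if y is not None]
    # Global model, used only when a cluster has no observed y values.
    fallback = fit_line([x for x, _ in observed], [y for _, y in observed])
    models = {}
    for c in set(labels):
        obs = [(x, y) for (x, y), lab in zip(rows, labels)
               if lab == c and y is not None]
        models[c] = (fit_line([x for x, _ in obs], [y for _, y in obs])
                     if obs else fallback)
    filled = []
    for (x, y), lab in zip(rows, labels):
        if y is None:
            intercept, slope = models[lab]
            y = intercept + slope * x
        filled.append((x, y))
    return filled


# Toy data (hypothetical): two groups with different linear trends.
rows = [(1, 2), (2, 4), (3, None), (4, 8), (5, 10),
        (101, 199), (102, 198), (103, None), (104, 196), (105, 195)]
completed = impute_missing(rows, k=2)
```

Fitting one regression per cluster lets each group keep its own trend, which a single global regression over this toy data would blur. An actual study would need several predictors, feature scaling, and an error measure computed on held-out values.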
