The Effect of Data Preprocessing Steps on the Performance of Machine Learning Methods in the Analysis of Health Data

With the rapid growth of data, data analysis with machine learning methods has become popular in many fields. Missing values and imbalanced classes are common problems in real-world datasets. These problems degrade the performance of machine learning methods and lead models to produce erroneous results. Imputing missing values and correcting class imbalance are therefore important parts of the data preprocessing stage. In health data in particular, class balance and the accuracy and completeness of the data are very important because they affect the performance of machine learning methods. In this article, methods that have been successful in the literature are compared on the PIMA diabetes dataset to investigate the problems of classifying imbalanced data with missing values. According to the results, using the SMOTEENN algorithm, which combines undersampling and oversampling, to handle class imbalance together with multiple imputation by chained equations to handle missing values achieved an F-score of 91% in classifying patients and healthy individuals, approximately 9% better than the other best methods.
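
The resampling idea named in the abstract, oversampling the minority class with SMOTE and then cleaning the result with edited nearest neighbours (ENN), can be illustrated with a minimal sketch. This is a simplified pure-Python stand-in, not the implementation used in the study and not the imbalanced-learn API. The helper names (`k_nearest`, `smote`, `enn`), the choice of k = 3, the majority-vote ENN rule, and the two-dimensional toy data are all assumptions made for illustration. The PIMA dataset and the chained-equations imputation step are not reproduced here.

```python
import math
import random
from collections import Counter

def k_nearest(idx, X, k):
    """Indices of the k points closest to X[idx] (Euclidean), excluding itself."""
    others = (j for j in range(len(X)) if j != idx)
    return sorted(others, key=lambda j: math.dist(X[idx], X[j]))[:k]

def smote(X, y, minority, k=3, seed=0):
    """Oversample a binary problem: add interpolated minority points until balanced."""
    rng = random.Random(seed)
    mino = [x for x, label in zip(X, y) if label == minority]
    n_new = (len(X) - len(mino)) - len(mino)  # majority count minus minority count
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(mino))
        j = rng.choice(k_nearest(i, mino, k))  # a random minority neighbour
        gap = rng.random()                     # position on the segment between them
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(mino[i], mino[j])))
    return X + synthetic, y + [minority] * len(synthetic)

def enn(X, y, k=3):
    """Undersample: drop points whose label disagrees with the majority of k neighbours."""
    keep = []
    for i in range(len(X)):
        votes = Counter(y[j] for j in k_nearest(i, X, k))
        if votes.most_common(1)[0][0] == y[i]:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]

# Hypothetical toy data: 40 majority points and 10 minority points in 2-D.
data_rng = random.Random(42)
majority_pts = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(40)]
minority_pts = [(data_rng.gauss(2, 1), data_rng.gauss(2, 1)) for _ in range(10)]
X = majority_pts + minority_pts
y = [0] * 40 + [1] * 10

X_bal, y_bal = smote(X, y, minority=1)   # oversampling step
X_clean, y_clean = enn(X_bal, y_bal)     # cleaning (undersampling) step

print("original:  ", Counter(y))
print("after SMOTE:", Counter(y_bal))
print("after ENN:  ", Counter(y_clean))
```

In this sketch, oversampling first equalises the class counts, and the cleaning step then removes points from either class that sit among neighbours of the other class, so the final set may be slightly smaller than the balanced one.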
