A Salp Swarm-Based Under-Sampling Approach for Imbalanced Medical Data Classification

Data imbalance is the unequal distribution of classes within a dataset, and it directly affects the performance of machine learning algorithms. Although researchers have proposed many resampling techniques, learning from imbalanced data is still regarded as an open challenge. The class imbalance problem becomes more complex because many existing techniques do not handle the similarity relationships between minority and majority classes well. Moreover, because of the complex relationships among classes, many existing techniques cannot focus on properly retaining the valuable samples of the majority class(es) in the dataset. In this article, an under-sampling technique based on salp swarm optimization (SSBUT) is proposed to address the class imbalance problem. The proposed SSBUT analyzes the similarity relationships among the samples of the majority class and removes from the majority class those samples that do not affect the accuracy of the classification algorithm. The performance of SSBUT was tested on imbalanced medical datasets, and the results were compared with state-of-the-art under-sampling techniques. According to the experimental results, SSBUT outperformed the state-of-the-art under-sampling techniques on many evaluation criteria.

Keywords:

Under-sampling, Machine learning, Salp swarm optimization, Classification, Imbalanced medical data classification

A Salp Swarm-Based Under-Sampling Approach for Medical Imbalanced Data Classification

Data imbalance refers to the unequal distribution of classes within a dataset, which directly affects the accuracy of machine learning classification algorithms. Although researchers have proposed many resampling techniques, learning from imbalanced data is still considered an open challenge. The class imbalance problem is further complicated because most existing techniques do not handle the similarity relationships between minority and majority classes well. In addition, because of the complex relationships among classes, most existing techniques do not focus on properly retaining valuable samples in the majority class(es). In this article, a salp swarm optimization-based under-sampling technique (SSBUT) is proposed to address the class imbalance problem. SSBUT analyzes the similarity relationships among the samples of the majority class and eliminates from the majority class those samples that do not affect the accuracy of the classification algorithm. The performance of SSBUT was tested on benchmark imbalanced medical datasets, and the results were compared with state-of-the-art under-sampling techniques. The experimental results show that SSBUT consistently outperformed the state-of-the-art under-sampling techniques across various evaluation criteria.

Keywords:

Classification, Machine learning, Medical imbalanced data classification, Salp swarm optimization, Under-sampling
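
The abstract does not specify how SSBUT encodes a candidate subset or what objective it optimizes, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' method. It assumes each salp is a vector in [0, 1] with one entry per majority sample (a sample is kept when its entry is at least 0.5), and it uses a hypothetical objective: the accuracy of a 1-nearest-neighbour classifier trained on the reduced set and evaluated on the original data, minus a penalty for the remaining imbalance. The leader and follower updates follow the standard salp swarm algorithm, with a single leader. The names `fitness`, `ssa_undersample`, `penalty`, and the toy data are illustrative.

```python
import math
import random


def nearest_label(point, train):
    """Label of the training sample closest to `point` (1-NN, Euclidean)."""
    return min(train, key=lambda s: math.dist(point, s[0]))[1]


def fitness(position, majority, minority, penalty=0.5):
    """Hypothetical objective (an assumption, not taken from the paper):
    accuracy on the original data of a 1-NN model trained on the reduced
    set, minus a penalty for the imbalance that remains."""
    kept = [m for m, x in zip(majority, position) if x >= 0.5]
    if not kept:
        return -1.0  # removing the whole majority class is never acceptable
    train = [(p, 0) for p in kept] + [(p, 1) for p in minority]
    data = [(p, 0) for p in majority] + [(p, 1) for p in minority]
    accuracy = sum(nearest_label(p, train) == y for p, y in data) / len(data)
    imbalance = abs(len(kept) - len(minority)) / len(majority)
    return accuracy - penalty * imbalance


def ssa_undersample(majority, minority, n_salps=15, n_iter=40, seed=0):
    """Search for a majority subset with the standard salp swarm updates."""
    rng = random.Random(seed)
    dim = len(majority)
    lb, ub = 0.0, 1.0
    salps = [[rng.random() for _ in range(dim)] for _ in range(n_salps)]
    scores = [fitness(s, majority, minority) for s in salps]
    best = max(range(n_salps), key=scores.__getitem__)
    food, food_score = salps[best][:], scores[best]  # best solution so far

    for it in range(1, n_iter + 1):
        # c1 shrinks over time, shifting from exploration to exploitation.
        c1 = 2 * math.exp(-((4 * it / n_iter) ** 2))
        for i in range(n_salps):
            if i == 0:
                # Leader: move around the food source, clipped to [lb, ub].
                for j in range(dim):
                    step = c1 * ((ub - lb) * rng.random() + lb)
                    x = food[j] + step if rng.random() >= 0.5 else food[j] - step
                    salps[i][j] = min(ub, max(lb, x))
            else:
                # Follower: midpoint between itself and the preceding salp.
                salps[i] = [(a + b) / 2 for a, b in zip(salps[i], salps[i - 1])]
            score = fitness(salps[i], majority, minority)
            if score > food_score:
                food, food_score = salps[i][:], score

    kept = [m for m, x in zip(majority, food) if x >= 0.5]
    return kept, food_score


# Toy 2-D data standing in for an imbalanced dataset (30 vs. 6 samples).
data_rng = random.Random(1)
majority = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(30)]
minority = [(data_rng.gauss(3, 0.5), data_rng.gauss(3, 0.5)) for _ in range(6)]

kept, score = ssa_undersample(majority, minority)
print(len(majority), "->", len(kept), "majority samples; fitness", round(score, 3))
```

In this sketch the penalty term is what pushes the search toward removing majority samples, while the accuracy term discourages removing samples whose absence would change the classifier's predictions. Any real use would need a held-out evaluation and an imbalance-aware metric rather than plain accuracy.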

