Serkan GÜLDAL

Improving Machine Learning Performance of Imbalanced Data by Resampling: DBSCAN and Weighted Arithmetic Mean

Advances in digital technology have caused the volume of collected data to grow at an accelerating rate. This growth brings new problems, one of which is imbalanced data. In an imbalanced dataset, the classes are not equally represented. Because classification algorithms are designed on the assumption that datasets are balanced, classifying imbalanced data leads to performance losses: the classifier favors the majority class, and the minority class is often misclassified. Various studies in recent years have aimed to reduce the imbalance ratio. Broadly, these studies balance datasets by undersampling, oversampling, or a combination of both. This study proposes an oversampling method that combines distance-based resampling with averaging to produce synthetic samples for the minority class. For the resampling process, pairwise distances between minority class members are calculated with the Euclidean distance metric. Based on these distances, dense zones are identified around every data point in the sense of DBSCAN. New synthetic samples are then formed between each central point and the points inside its zone by using the Weighted Arithmetic Mean. With this procedure, the dataset was rebalanced to 500 majority samples and 535 minority samples (derived from the original 268). Random Forest (RF) and Support Vector Machine (SVM) algorithms were used to classify the raw and balanced datasets. The results showed that the proposed method achieved the best machine learning performance among all the listed methods.
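The resampling idea described in the abstract can be illustrated with a minimal sketch. This is not the author's implementation: the function names, the radius `eps`, the density threshold `min_pts` (counted here as neighbors excluding the point itself), and the weights `w_center` and `w_neighbor` are all assumptions introduced for illustration, since the abstract does not state the actual parameter values or the exact rule for selecting dense zones.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def weighted_mean(a, b, w_a, w_b):
    """Weighted arithmetic mean of two points, computed feature by feature."""
    total = w_a + w_b
    return tuple((w_a * x + w_b * y) / total for x, y in zip(a, b))


def oversample_minority(minority, eps, min_pts=2, w_center=1.0, w_neighbor=1.0):
    """Generate synthetic minority samples (illustrative sketch only).

    Each minority point is treated as the center of a zone of radius eps.
    If the zone holds at least min_pts other minority points (an assumed,
    DBSCAN-style density check), one synthetic sample is created between
    the center and each neighbor as their weighted arithmetic mean.
    """
    synthetic = []
    for i, center in enumerate(minority):
        # Neighbors: other minority points within distance eps of the center.
        neighbors = [
            q for j, q in enumerate(minority)
            if j != i and euclidean(center, q) <= eps
        ]
        # Skip sparse points; only dense zones produce new samples.
        if len(neighbors) < min_pts:
            continue
        for q in neighbors:
            synthetic.append(weighted_mean(center, q, w_center, w_neighbor))
    return synthetic


# Toy minority class: three nearby points and one isolated point.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
new_points = oversample_minority(
    minority, eps=1.5, min_pts=2, w_center=2.0, w_neighbor=1.0
)
```

In this toy example the three nearby points each have two neighbors within `eps`, so each contributes two synthetic samples, while the isolated point at (10, 10) contributes none. Giving the center a larger weight pulls each synthetic sample closer to the center than to the neighbor.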
