Serkan GÜLDAL

Improving Machine Learning Performance of Imbalanced Data by Resampling: DBSCAN and Weighted Arithmetic Mean

The advance of digital technology has caused the volume of collected data to grow at an increasing rate. This growth brings new problems with it, such as imbalanced data. In an imbalanced dataset, the classes are not equally distributed. Because classification algorithms are designed as if datasets were balanced, classifying such data leads to performance losses: the classifier favors the majority class, while the minority class is often misclassified. Most collected datasets, medical datasets in particular, suffer from imbalanced class distributions. Various studies have been carried out in recent years to reduce dataset imbalance. Broadly, these studies balance datasets by undersampling, oversampling, or both. This study proposes an oversampling method that uses distance- and mean-based resampling to generate synthetic samples. For resampling, the distances between pairs in the minority class are computed with the Euclidean distance. The computed distances are evaluated in the manner of DBSCAN to obtain a sufficient number of pairs. New synthetic samples are then created between the listed pairs using the Weighted Arithmetic Mean. In this way, the dataset was rebuilt as 500 (majority) and 535 (from 268 minority samples). The Random Forest (RF) and Support Vector Machine (SVM) algorithms were used to classify the raw and balanced datasets, and the results were compared with each other and with other methods (ROS, RUS, and SMOTE). The results showed that the proposed method performs best among all the listed methods. The accuracy of RF is 0.751 and 0.798 for the raw and resampled data, respectively. Likewise, the accuracy of SVM is 0.762 and 0.781 for the raw and resampled data, respectively.

Keywords:

Machine Learning, Random Forest, Support Vector Machine, Synthetic Data, Medical Data

Balanced DATA by DBSCAN and Weighted Arithmetic Mean to Improve Performance of Machine Learning Algorithms

Advances in digital technology have caused the size of collected data to increase at an accelerating rate. This increase brings new problems, such as imbalanced data. In an imbalanced dataset, the classes are not equally distributed. Classifying such data therefore causes performance losses, because classification algorithms treat datasets as if they were balanced: the classifier favors the majority class, and the minority class is often misclassified. The majority of collected datasets, especially medical datasets, have an imbalanced class distribution. Various studies have been carried out in recent years to reduce dataset imbalance. In general terms, these studies balance the datasets by undersampling, oversampling, or both. In this study, an oversampling method is proposed that uses distance- and mean-based resampling to produce synthetic samples. For the resampling process, the distances between pairs in the minority class are calculated with the Euclidean distance. The calculated distances are considered in the sense of DBSCAN to obtain a sufficient number of pairs. New synthetic samples are then formed between the listed pairs by using the Weighted Arithmetic Mean. Thus, the dataset was rebalanced to 500 (majority) and 535 (derived from 268 minority samples) instances. The Random Forest (RF) and Support Vector Machine (SVM) algorithms were used to classify the raw and balanced datasets, and the results were compared with each other and with other well-known methods: Random Over Sampling (ROS), Random Under Sampling (RUS), and Synthetic Minority Oversampling Technique (SMOTE). The results showed that the proposed method has the best performance among all the listed methods. The accuracy of RF is 0.751 and 0.798 for the raw and resampled data, respectively. Likewise, the accuracy of SVM is 0.762 and 0.781 for the raw and resampled data, respectively.

Keywords:

Machine Learning, Random Forest, Support Vector Machine, Synthetic Data, Medical Data
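
The resampling idea summarized in the abstract (Euclidean distances within the minority class, a DBSCAN-style distance threshold to select pairs, and a weighted arithmetic mean to place a synthetic sample between each pair) can be sketched roughly as follows. This is only an illustrative sketch, not the author's implementation: the function names `neighbor_pairs` and `weighted_mean_oversample`, the radius `eps`, the `weights` pair, the optional `n_new` cap, and the toy data are all assumptions, since the abstract does not state how the threshold or the weights are chosen.

```python
import numpy as np


def neighbor_pairs(X_min, eps):
    """Return index pairs (i, j), i < j, whose Euclidean distance is <= eps.

    eps plays the role of a DBSCAN-style neighborhood radius (assumed here).
    """
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))      # full pairwise distance matrix
    i, j = np.triu_indices(len(X_min), k=1)       # each unordered pair once
    keep = dist[i, j] <= eps
    return np.column_stack((i[keep], j[keep]))


def weighted_mean_oversample(X_min, eps, weights=(0.5, 0.5), n_new=None, seed=0):
    """Create one synthetic sample per selected pair as a weighted arithmetic mean.

    weights and n_new are hypothetical knobs, not values taken from the paper.
    """
    pairs = neighbor_pairs(X_min, eps)
    if n_new is not None and n_new < len(pairs):
        # Optionally keep only a random subset of the eligible pairs.
        rng = np.random.default_rng(seed)
        pairs = pairs[rng.choice(len(pairs), size=n_new, replace=False)]
    w1, w2 = weights
    return (w1 * X_min[pairs[:, 0]] + w2 * X_min[pairs[:, 1]]) / (w1 + w2)


# Toy minority class: three nearby points and one distant outlier.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
synthetic = weighted_mean_oversample(X_min, eps=1.5, weights=(2.0, 1.0))
print(synthetic)
```

In this toy run the distant point forms no pair within `eps`, so no synthetic sample is generated near it; only the three close points contribute. With equal weights each synthetic sample is the midpoint of its pair, while unequal weights pull it toward one of the two originals.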

