COMPARISON OF SAMPLING TECHNIQUES FOR IMBALANCED LEARNING

In recent years, the sharp increase in the number of Internet users has been accompanied by massive amounts of human- and machine-generated data, now called Big Data, and handling this data efficiently is a challenging task. At the same time, the valuable information that can be extracted from this data to support data-driven decision making has attracted increasing attention from both industry and academia. Classification is one of the important tasks in knowledge extraction. However, in some real-world applications, the dataset is either inherently skewed or the collected data has an imbalanced class distribution. An imbalanced class distribution degrades the performance of many classification algorithms, which generally expect balanced class distributions and assume that the cost of misclassifying an instance is the same for every class. To tackle this so-called imbalanced learning problem, several sampling algorithms have been proposed in the literature. In this study, we compare sampling algorithms with respect to their running times and the classification accuracies of classifiers trained on the sampled datasets. We find that the over-sampling methods yield higher classification accuracies than the under-sampling methods. Sampling times are similar, whereas classification is more efficient with under-sampling methods. Among the sampling algorithms compared, ADASYN should be the preferred choice when execution time, increase in data size, and classification performance are considered together.

Keywords: Imbalanced Learning, Sampling Methods, Data Mining, Big Data
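To make the over-sampling versus under-sampling trade-off concrete, the following is a minimal sketch that is not taken from the study. It uses plain random over-sampling and random under-sampling as stand-ins for the algorithms compared (it does not implement ADASYN), and the toy data, the function names `random_oversample`, `random_undersample`, and `timed`, and the 950/50 class split are all assumptions chosen for illustration. It reports the sampling time and the resulting class counts, which are two of the quantities the comparison considers.

```python
import random
import time
from collections import Counter


def random_oversample(rows, labels, rng):
    """Duplicate random rows of smaller classes until all classes match the largest."""
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def random_undersample(rows, labels, rng):
    """Keep a random subset of every class, sized to the smallest class."""
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for r in rng.sample(pool, target):
            out_rows.append(r)
            out_labels.append(cls)
    return out_rows, out_labels


def timed(sampler, rows, labels, seed=0):
    """Run a sampler and return its output together with the elapsed seconds."""
    rng = random.Random(seed)
    start = time.perf_counter()
    new_rows, new_labels = sampler(rows, labels, rng)
    elapsed = time.perf_counter() - start
    return new_rows, new_labels, elapsed


# Hypothetical toy dataset: 950 majority (label 0) and 50 minority (label 1) points.
data_rng = random.Random(42)
rows = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(950)]
rows += [(data_rng.gauss(2, 1), data_rng.gauss(2, 1)) for _ in range(50)]
labels = [0] * 950 + [1] * 50

for name, sampler in (("over-sampling", random_oversample),
                      ("under-sampling", random_undersample)):
    _, new_labels, elapsed = timed(sampler, rows, labels)
    # Over-sampling yields 950 rows per class; under-sampling yields 50 per class.
    print(name, dict(Counter(new_labels)), f"{elapsed:.6f}s")
```

In this sketch the over-sampled training set is 19 times larger than the under-sampled one (1900 rows versus 100), which suggests why training a classifier on under-sampled data can be cheaper even when the sampling step itself takes a similar amount of time.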

___

Yönetim Bilişim Sistemleri Dergisi (Journal of Management Information Systems)
  • ISSN: 2630-550X
  • Founded: 2015
  • Publisher: Vahap TECİM
COMPARISON OF SAMPLING TECHNIQUES FOR IMBALANCED LEARNING

Ahmet Onur DURAHİM
