Improving classification performance for an imbalanced educational dataset example using SMOTE

Advances in technology generate large amounts of data in digital environments, and education is one of the most data-intensive areas. By analyzing educational data sets, students' situations can be predicted in advance. In this way, students can be supported by anticipating outcomes such as drop-out due to failure, and educational institutions can take measures to reduce drop-out. Financial losses for both students and institutions can thus be prevented. This study used data from students enrolled in five separate associate degree programs at the Amasya University Distance Education Center in 2016-2017: child development, medical documentation and secretarial, electricity, mechatronics, and internet and network technologies. Whether students would graduate at the end of the fourth semester was predicted from their first- and second-semester course grades. The data were analyzed with the k-nearest neighbor (K-NN) and KStar algorithms. Because of the low number of students, some of the data obtained from the distance education center were imbalanced. In educational data mining, researchers often overlook the balance of the class distribution in a dataset, yet imbalanced data can seriously affect classification success. The raw data were first analyzed with the K-NN and KStar classifiers. The synthetic minority oversampling technique (SMOTE) was then applied to the imbalanced data, and its effect on classification success was examined. The results for the five programs are presented comparatively in tables. The study found that SMOTE oversampling increases classification success. In areas where imbalanced data may occur, such as educational data mining, higher classification accuracy can be achieved with the help of different oversampling methods.
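The core idea of SMOTE followed by a nearest-neighbor classifier can be illustrated with a minimal sketch. It is not the pipeline used in the study: the grade pairs are made-up toy data, the function names `smote` and `knn_predict` are hypothetical, and KStar is not covered. The sketch assumes Python 3.8 or later and uses only the standard library.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other points
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of base, excluding base itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def knn_predict(train, labels, query, k=3):
    """Majority vote among the k training points closest to query."""
    ranked = sorted(range(len(train)), key=lambda i: math.dist(train[i], query))
    votes = [labels[i] for i in ranked[:k]]
    return max(set(votes), key=votes.count)


# Hypothetical toy data: (semester I grade, semester II grade)
graduated = [(70, 75), (80, 72), (65, 68), (90, 85),
             (75, 80), (85, 78), (60, 70), (78, 82)]
not_graduated = [(30, 35), (40, 32), (35, 45)]  # minority class

# Oversample the minority class until both classes have the same size
extra = smote(not_graduated, n_new=len(graduated) - len(not_graduated), k=2)

X = graduated + not_graduated + extra
y = (["graduated"] * len(graduated)
     + ["not graduated"] * (len(not_graduated) + len(extra)))

print(len(extra), knn_predict(X, y, (38, 36)))
```

Each synthetic point lies on the line segment between two existing minority samples, so oversampling fills in the minority region rather than duplicating records. In practice the oversampling should be applied only to the training split, so that synthetic points do not leak into the evaluation data.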
