The Effectiveness of Feature Selection Methods in Imbalanced Text Classification

The distribution of text data across classes is generally not equal. This situation is reflected negatively in the performance of classifiers in the text classification process. Many studies have been carried out on imbalanced text classification. The feature selection stage, one of the important stages of the text classification process, is also of critical importance in the imbalanced text problem. In this study, the effect of feature selection methods on the classification of imbalanced texts was investigated in detail. To this end, many experiments were carried out on two different data sets with three different classifiers and nine different feature selection methods. In addition, the performance of the feature selection methods was also observed at different numbers of features. Nine feature selection methods, namely NDM, DFSS, PFS, POISSON, CHI2, IG, GINI, DFS, and MDFS, were evaluated. Experimental results were obtained with Support Vector Machine (SVM), Decision Tree (DTREE), and Multinomial Naïve Bayes (MNB) classifiers. On the Reuters-21578 data set, the DFS and CHI2 feature selection methods achieved the highest Macro-F1 scores of approximately 80, while on the SPAM SMS data set DFS obtained the highest score of 95 and CHI2 obtained 94. Among the feature selection methods, DFS and CHI2 are seen to be more successful in imbalanced text classification.

The Effectiveness of Feature Selection Methods for Imbalanced Text Classification

The distribution of text data across classes is often imbalanced. This has a negative impact on the performance of classifiers in the text classification process. Many studies have addressed imbalanced text classification. The feature selection stage, one of the important stages of the text classification process, is also critical in the imbalanced text classification problem. In this study, the effect of feature selection methods on the classification of imbalanced texts is investigated thoroughly. To this end, extensive experiments were carried out with three different classifiers and nine different feature selection methods on two different data sets. In addition, the performance of the feature selection methods was observed at different numbers of features. Nine feature selection methods, namely NDM, DFSS, PFS, POISSON, CHI2, IG, GINI, DFS, and MDFS, were evaluated. Experimental results were obtained with Support Vector Machine (SVM), Decision Tree (DTREE), and Multinomial Naïve Bayes (MNB) classifiers. On the Reuters-21578 data set, the DFS and CHI2 feature selection methods achieved the highest Macro-F1 scores of approximately 80, while on the SPAM SMS data set DFS achieved the highest score of 95 and CHI2 achieved 94. These results indicate that DFS and CHI2 are more successful than the other feature selection methods for imbalanced text classification.
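
The abstract describes the experimental design but not its implementation. The sketch below is a minimal, illustrative reconstruction of such a pipeline, assuming a scikit-learn environment; it uses a publicly available corpus as a stand-in for Reuters-21578 and SPAM SMS, shows only two of the nine scoring functions (CHI2 via scikit-learn's built-in chi2, and a DFS scorer written to Uysal and Gunal's 2012 definition), and none of its parameter choices are taken from the paper.

```python
# Illustrative sketch only: the corpus, vectorizer settings, and classifier
# parameters below are placeholders, not the paper's exact configuration.
import numpy as np
from sklearn.datasets import fetch_20newsgroups          # stand-in corpus (paper: Reuters-21578, SPAM SMS)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2  # CHI2 term scoring
from sklearn.naive_bayes import MultinomialNB            # MNB
from sklearn.svm import LinearSVC                        # SVM (linear kernel)
from sklearn.tree import DecisionTreeClassifier          # DTREE
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def dfs_scores(X, y):
    """DFS term scores, following the definition given by Uysal & Gunal (2012):
    DFS(t) = sum_c P(c | t) / (P(not-t | c) + P(t | not-c) + 1)."""
    X_bin = (X > 0)                            # term presence/absence per document
    scores = np.zeros(X.shape[1])
    for c in np.unique(y):
        in_c = (y == c)
        df_c = np.asarray(X_bin[in_c].sum(axis=0), dtype=float).ravel()     # docs in c containing t
        df_rest = np.asarray(X_bin[~in_c].sum(axis=0), dtype=float).ravel() # docs outside c containing t
        p_c_given_t = df_c / np.maximum(df_c + df_rest, 1.0)
        p_absent_given_c = 1.0 - df_c / in_c.sum()
        p_t_given_not_c = df_rest / (~in_c).sum()
        scores += p_c_given_t / (p_absent_given_c + p_t_given_not_c + 1.0)
    return scores


data = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))
X_train_text, X_test_text, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, stratify=data.target, random_state=42)

vectorizer = CountVectorizer(stop_words="english")        # bag-of-words document-term matrix
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

scorers = {"CHI2": chi2, "DFS": dfs_scores}
classifiers = {"MNB": MultinomialNB(),
               "SVM": LinearSVC(),
               "DTREE": DecisionTreeClassifier(random_state=42)}

# Compare the scoring functions at several feature-set sizes using Macro-F1,
# mirroring the "different numbers of features" comparison in the abstract.
for fs_name, score_func in scorers.items():
    for k in (100, 500, 1000, 2000):
        selector = SelectKBest(score_func, k=min(k, X_train.shape[1]))
        X_tr = selector.fit_transform(X_train, y_train)
        X_te = selector.transform(X_test)
        for clf_name, clf in classifiers.items():
            clf.fit(X_tr, y_train)
            macro_f1 = f1_score(y_test, clf.predict(X_te), average="macro")
            print(f"{fs_name:5s} k={k:4d} {clf_name:5s} Macro-F1={macro_f1:.3f}")
```

Ranking terms with each scoring function and retraining the classifiers at several feature-set sizes mirrors how the abstract reports performance at different numbers of features; Macro-F1 is a natural choice here because it averages per-class F1 scores equally, so minority classes are not drowned out by the majority class.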
