The Effects of Preprocessing on Turkish and English News Data

In a standard text classification (TC) study, preprocessing is one of the key components for improving performance. This study examines how preprocessing affects TC with respect to news text, text language, and feature selection. To this end, all possible combinations of commonly used preprocessing techniques are compared on a single domain, news data, using two different news datasets. This setup makes it possible to evaluate the contribution of each preprocessing technique to classification performance at multiple feature sizes, the possible interactions among the techniques, and the dependency of each technique on the language of the text. Experimental studies on public datasets reveal that choosing the best combination of preprocessing techniques, rather than applying all of them or none of them, can improve classification accuracy significantly.
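The idea of comparing every combination of preprocessing techniques can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the abstract does not name the techniques, so the three toggles (`lowercase`, `remove_stopwords`, `stem`), the tiny stop-word set, and the `toy_stem` helper are all hypothetical placeholders chosen for illustration. A real study would plug in language-specific resources and train a classifier for each configuration.

```python
from itertools import product

# Hypothetical toggles; the abstract does not list the techniques compared.
STEPS = ("lowercase", "remove_stopwords", "stem")

# Toy stop-word list, a placeholder for a real language-specific list.
STOPWORDS = {"the", "a", "of", "on", "in"}


def toy_stem(token):
    """Strip a trailing 's' as a stand-in for a real stemmer."""
    return token[:-1] if len(token) > 3 and token.endswith("s") else token


def preprocess(text, lowercase, remove_stopwords, stem):
    """Apply the enabled steps, in a fixed order, to whitespace tokens."""
    tokens = text.split()
    if lowercase:
        tokens = [t.lower() for t in tokens]
    if remove_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    if stem:
        tokens = [toy_stem(t) for t in tokens]
    return tokens


def all_combinations():
    """Yield one on/off setting per combination: 2 ** len(STEPS) in total."""
    for flags in product((False, True), repeat=len(STEPS)):
        yield dict(zip(STEPS, flags))


if __name__ == "__main__":
    sample = "The Effects of Preprocessing on News Articles"
    for config in all_combinations():
        print(config, preprocess(sample, **config))
```

With three on/off techniques this produces eight configurations. In an experiment of the kind described, each configuration would be used to build features and evaluated separately, so that the best combination can be identified per dataset and feature size.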
