Deniz KILINÇ

The Effect of Ensemble Learning Models on Turkish Text Classification

Due to the rapid development of the Internet and related technologies, the amount of text-based content generated through Internet applications is increasing day by day. Since text-based content is unstructured, accessing and managing this data is almost impossible. Consequently, there is a need for an automatic text classification process. Text mining is a discipline within the data mining field that offers algorithms for performing text classification. The main objective of text classification is to build a learning model from a training data set with predefined categories and to assign data with unknown categories to the correct categories. Different text classification algorithms exist in the literature, such as decision trees, Bayesian classifiers, rule-based classifiers, neural networks, k-nearest neighbor classifiers, support vector machines, and ensemble learning methods. In this study, the effect of ensemble learning models on Turkish text classification was evaluated. A publicly available data set named TTC-3600, which consists of 3600 news articles collected from 6 news portals, was selected. Text classification was performed on the TTC-3600 data set using 4 base classification algorithms (Naïve Bayes, Support Vector Machine, K-Nearest Neighbor, and J48 decision tree) and their Boosting, Bagging, and Rotation Forest ensemble learning models. The experimental results show that ensemble learning models generally give more accurate results than the base classifiers they are built on.
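
To illustrate the idea of wrapping a base classifier in an ensemble, the following is a minimal Python sketch of Bagging over a Naïve Bayes text classifier. It is not the implementation used in the study: the class names `NaiveBayes` and `Bagging`, the whitespace tokenizer, the add-one smoothing, and the toy English headlines with the labels `sports` and `economy` are all assumptions made for illustration, and Boosting and Rotation Forest are not shown.

```python
import math
import random
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens (add-one smoothing)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # word frequencies per class
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # log prior + sum of log likelihoods of known words
            score = math.log(self.class_counts[label] / total_docs)
            for word in doc.lower().split():
                if word in self.vocab:            # ignore unseen words
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


class Bagging:
    """Train each base model on a bootstrap sample, then take a majority vote."""

    def __init__(self, n_models=15, seed=0):
        self.n_models = n_models
        self.rng = random.Random(seed)

    def fit(self, docs, labels):
        n = len(docs)
        self.models = []
        for _ in range(self.n_models):
            # bootstrap: draw n training indices with replacement
            idx = [self.rng.randrange(n) for _ in range(n)]
            model = NaiveBayes().fit([docs[i] for i in idx],
                                     [labels[i] for i in idx])
            self.models.append(model)
        return self

    def predict(self, doc):
        votes = Counter(m.predict(doc) for m in self.models)
        return votes.most_common(1)[0][0]


# Hypothetical toy data, not taken from TTC-3600.
train_docs = [
    "team wins match goal",
    "player scores goal in final match",
    "coach praises team after match",
    "bank raises interest rates",
    "market falls as inflation rises",
    "central bank inflation report",
]
train_labels = ["sports", "sports", "sports", "economy", "economy", "economy"]

base = NaiveBayes().fit(train_docs, train_labels)
ensemble = Bagging(n_models=15, seed=0).fit(train_docs, train_labels)
print(base.predict("goal in the match"))
print(ensemble.predict("goal in the match"))
```

Because each bootstrap sample omits some training documents, the individual models differ slightly, and the vote averages out their disagreements. On a six-document toy set this says nothing about accuracy gains; it only shows the mechanics.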

Celal Bayar Üniversitesi Fen Bilimleri Dergisi

ISSN: 1305-130X
Publication frequency: 4
First published: 2005
Publisher: Manisa Celal Bayar Üniversitesi Fen Bilimleri Enstitüsü
