A New N-gram Based Classification (Ng-ind) for Turkish Documents: Author, Genre and Gender

In this study, we attempt to determine the genre, the author, and the author's gender of a Turkish document using an n-gram model of Turkish. The model uses 2-, 3-, and 4-grams, and a total of six feature vectors were built over three different data sets. Tests were run with our Ng-ind method alongside classifiers such as Naive Bayes (NB), Support Vector Machine (SVM), Random Forest (RF), and K-Nearest Neighbor (K-NN), and their success rates were compared. Ng-ind gave better results than the other methods in gender and genre determination. In addition, Ng-ind outperformed the combined classifiers in genre determination.
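To make the feature-extraction step concrete, the following is a minimal sketch of how 2-, 3-, and 4-gram counts could be collected from a document. It assumes character-level n-grams, which the abstract does not specify, and the names `char_ngrams` and `ngram_profile` are hypothetical. It is not the Ng-ind method itself, which is not described here.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text` (assumed character-level)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_profile(text, sizes=(2, 3, 4)):
    """Count the n-grams of each requested size in one document.

    Whitespace runs are collapsed to a single space. Case folding is
    deliberately left out because Turkish dotted/dotless i needs
    locale-aware handling that str.lower() does not provide.
    """
    normalized = " ".join(text.split())
    counts = Counter()
    for n in sizes:
        counts.update(char_ngrams(normalized, n))
    return counts


# Hypothetical usage: build a profile for a short Turkish phrase.
profile = ngram_profile("Bu bir deneme")
top_five = profile.most_common(5)
```

Counts of this kind could then be turned into a fixed-length feature vector, for example by keeping only the most frequent n-grams across a training set, before being passed to a classifier.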
