A New N-gram-Based Classification for Turkish Documents (Ng-ind): Author, Genre and Gender

This study attempts to determine the genre, the author, and the author's gender of a Turkish document using an n-gram model of Turkish. The model uses 2-, 3-, and 4-grams, and a total of six feature vectors were built on three different data sets. Tests were run with Ng-ind, the method developed in this work, alongside classifiers such as Naive Bayes (NB), Support Vector Machine (SVM), Random Forest (RF), and K-Nearest Neighbor (K-NN), and their classification performance was compared. Ng-ind gave better results than the other methods in gender and genre determination. In genre determination it also outperformed the combined classifiers.
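As an illustration of the kind of 2-, 3-, and 4-gram features mentioned above, the following minimal sketch counts overlapping character n-grams in a text and turns them into relative frequencies. It rests on assumptions: the abstract does not say whether the n-grams are character-level or word-level, how the six feature vectors were constructed, or how Ng-ind itself works, so the function names (`turkish_lower`, `char_ngrams`, `ngram_profile`) and the normalization steps are hypothetical and not the authors' implementation.

```python
from collections import Counter


def turkish_lower(text: str) -> str:
    """Lowercase with Turkish casing rules: I -> ı and İ -> i."""
    # Python's str.lower() maps "I" to "i", which is wrong for Turkish,
    # so the two special letters are replaced before lowercasing.
    return text.replace("I", "ı").replace("İ", "i").lower()


def char_ngrams(text: str, n: int) -> Counter:
    """Count overlapping character n-grams (assumed character-level)."""
    # Collapse runs of whitespace so line breaks do not create spurious n-grams.
    normalized = " ".join(turkish_lower(text).split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def ngram_profile(text: str, sizes=(2, 3, 4)) -> dict:
    """Relative frequency of every n-gram, keyed by (n, gram)."""
    profile = {}
    for n in sizes:
        counts = char_ngrams(text, n)
        total = sum(counts.values())
        for gram, count in counts.items():
            profile[(n, gram)] = count / total
    return profile


if __name__ == "__main__":
    sample = "Bu bir deneme metnidir."
    # Print the most frequent 2-grams of the sample sentence.
    for gram, count in char_ngrams(sample, 2).most_common(5):
        print(repr(gram), count)
```

A profile of this kind could be fed to any of the classifiers named in the abstract. Which n-grams are kept and how they are weighted is a design choice that the abstract leaves open.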
