THE IMPACT OF TEXT REPRESENTATION AND PREPROCESSING ON AUTHOR IDENTIFICATION

Author identification, one of the popular topics in text classification and natural language processing, basically aims to determine the author of a given text through various analyses. In the literature, different text representation approaches and use of preprocessing steps are considered for author identification problem. This paper aims to comprehensively examine the impact of text representation and preprocessing steps on author identification specifically for Turkish language. For this purpose, the contributions of all possible combinations of different text representation approaches, namely unigram and bigram, together with the preprocessing tasks, including stemming and stop-word removal, to the performance of author identification are investigated. For the experimental evaluation, a brand new dataset is constituted. Also, two different classification algorithms, namely Multinomial Naive Bayes and Sequential Minimal Optimization, are employed. The results of the experimental analysis reveal that using bigram features alone should be avoided. Besides, it is shown that stop-words should be kept inside the text while stemming can be preferred depending on the classification algorithm so that higher performance can be achieved for author identification.

___

  • Aslantürk O. Turkish authorship analysis with an incremental and adaptive model. MSc Dissertation, Hacettepe University, Ankara, Turkey, 2014.
  • Diri B, Amasyalı MF. Automatic author detection for Turkish texts. Artificial Neural Networks and Neural Information Processing 2003, 138-141.
  • Amasyalı MF, Diri B. Automatic Turkish text categorization in terms of author, genre and gender. In: NLDB 11th International Conference on Applications of Natural Language to Information Systems; 2006; Klagenfurt, Austria. pp. 221-226.
  • Amasyalı MF, Diri B, Türkoğlu F. Farklı özellik vektörleri ile Türkçe dokümanların yazarlarının belirlenmesi. In: The 15th Turkish Symposium on Artificial Intelligence and Neural Networks (TAINN); 21-24 June 2006; Muğla, Turkey.
  • Türkoğlu F, Diri B, Amasyalı MF. Author attribution of Turkish texts by feature mining. In: The 3rd International Conference on Intelligent Computing: Advanced Intelligent Computing Theories and Applications with Aspects of Artificial Intelligence; 2007; Qingdao, China. pp. 1086–1093.
  • Kaban Z, Diri B. Genre and author detection in Turkish texts using artificial immune recognition systems. In: IEEE 16th Signal Processing, Communication and Applications Conference; April 2008. pp. 1-4.
  • Orucu F. Turkish Language Characteristics and Author Identification. MSc. Dissertation, Dokuz Eylül University, İzmir, 2009.
  • Bay Y, Çelebi E, Feature Selection for Enhanced Author Identification of Turkish Text. In: the 30th International Symposium on Computer and Information Sciences, 2015. pp. 371-379.
  • Stamatatos E. A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology 2009; 60(3): 538-556.
  • Joachims T. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. Carnegie-Mellon Univ. Pittsburgh PA Dept. of Computer Science 1996.
  • Gunal S. Hybrid feature selection for text classification, Turkish Journal of Electrical Engineering & Computer Sciences 2012; 20(sup.2): 1296-1311.
  • Uysal AK, Gunal S, Ergin S, Sora Gunal E. The impact of feature extraction and selection on SMS spam filtering. Elektronika ir Elektrotechnika 2013; 19(5): 67-72.
  • Pak MY, Gunal S. Sentiment classification based on domain prediction, Elektronika ir Elektrotechnika 2016; 22(2): 96-99.
  • Manning CD, Raghavan P, Schtze H. Introduction to Information Retrieval. New York, USA: Cambridge University Press, 2008
  • Uysal AK, Gunal S. The impact of preprocessing on text classification. Information Processing & Management 2014; 50(1): 104-112.
  • Can F, Kocberber S, Balcik E, Kaynak C, Ocalan HC, Vursavas OM. Information retrieval on Turkish texts. Journal of the American Society for Information Science and Technology 2008, 59: 407–421.
  • Zemberek. (Accessed October 2016).
  • Gunal S, Edizkan R. Subspace based feature selection for pattern recognition. Information Sciences 2008; 178(19): 3716-3726.
  • McCallum A, Nigam K. A comparison of event models for naïve Bayes text classification. AAAI-98 Workshop on Learning for Text Categorization 1998; 752: 41-48.
  • Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter 2009; 11(1): 10-18.
  • Platt JC. Fast training of support vector machines using sequential minimal optimization. Advances in Kernel Methods 1999; 185-208.