Author Identification for Turkish Texts

The main concern of author identification is to define an appropriate characterization of documents that captures the writing style of authors. The most important approaches to computer-based author identification are exclusively based on lexical measures. In this paper we present a fully automated approach to identifying the authorship of unrestricted text by adapting a set of style markers to the analysis of the text. In this study, 35 style markers were computed for each author. With this method, the author of a text is identified from the style markers that characterize a group of authors. The author group consists of 20 different writers. Author features, including the style markers, were combined with several machine learning algorithms. With this method we obtained an average success rate of 80%.
Keywords:

Author, Identification, Turkish
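
As an illustration of what a lexical style marker can look like, the following Python sketch computes four simple measures for a single text. The abstract does not list the 35 style markers used in the study, so the function name `style_markers` and the four measures chosen here are assumptions made only for illustration, not the authors' feature set.

```python
import re
import string


def style_markers(text: str) -> dict:
    """Compute a few illustrative lexical style markers for one text.

    These four measures are hypothetical examples; they are not the
    35 style markers used in the study.
    """
    # Crude sentence split on runs of '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # \w+ is Unicode-aware in Python 3, so Turkish letters (ı, ş, ğ, ...) match.
    # Note: str.lower() does not apply the Turkish dotted/dotless I rules.
    words = re.findall(r"\w+", text.lower())
    n_words = len(words) or 1          # avoid division by zero on empty input
    n_sentences = len(sentences) or 1
    # string.punctuation covers ASCII punctuation only.
    n_punct = sum(1 for ch in text if ch in string.punctuation)
    return {
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "avg_sentence_length": len(words) / n_sentences,
        "type_token_ratio": len(set(words)) / n_words,
        "punctuation_per_word": n_punct / n_words,
    }


if __name__ == "__main__":
    sample = "Bu bir deneme. Bu kısa bir metin!"
    for name, value in style_markers(sample).items():
        print(f"{name}: {value:.3f}")
```

A vector of such values per text could then be passed to a classifier of one's choice; which classifiers and which markers work best for Turkish is the subject of the study itself and is not reproduced here.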

