Turkish lexicon expansion by using finite state automata

Turkish is an agglutinative language with rich morphology. A Turkish verb can have thousands of different word forms. Therefore, sparsity becomes an issue in many Turkish natural language processing (NLP) applications. This article presents a model for Turkish lexicon expansion. We aim to expand the lexicon by reversing the task of a morphological segmentation system, turning segmentation into generation. Our model uses finite-state automata (FSA) to incorporate orthographic features and morphotactic rules. We extract orthographic features by capturing the phonological operations that are applied to a word whenever a suffix is added. Each FSA state corresponds to either a stem category or a suffix category. Stems are clustered based on their parts of speech (i.e. noun, verb, or adjective), and suffixes are clustered based on their allomorphic features. We generated approximately 1 million word forms from only a few thousand Turkish stems with an accuracy of 82.36%, which will help to reduce the number of out-of-vocabulary words in other NLP applications. Although our experiments are performed on Turkish, the same model is also applicable to other agglutinative languages such as Hungarian and Finnish.
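
The generation idea described above can be illustrated with a minimal Python sketch. It is not the automaton used in the article: the state names (`NOUN`, `CASE`, `END`), the suffix templates (`lAr`, `DA`, `DAn`), and the two phonological rules (two-way vowel harmony for `A`, consonant voicing for `D`) are assumptions chosen for illustration, and other operations such as stem-final consonant softening are left out.

```python
# Illustrative sketch only: states, suffix templates, and rules are assumed,
# not taken from the article's actual automaton.
BACK_VOWELS = set("aıou")
FRONT_VOWELS = set("eiöü")
VOICELESS = set("çfhkpsşt")


def realize(word, suffix):
    """Attach an abstract suffix template to word.

    'A' becomes 'a' or 'e' by vowel harmony with the last vowel of word.
    'D' becomes 't' after a voiceless consonant, otherwise 'd'.
    """
    vowels = BACK_VOWELS | FRONT_VOWELS
    last_vowel = next((c for c in reversed(word) if c in vowels), "e")
    out = []
    for ch in suffix:
        if ch == "A":
            out.append("a" if last_vowel in BACK_VOWELS else "e")
        elif ch == "D":
            out.append("t" if word[-1] in VOICELESS else "d")
        else:
            out.append(ch)
    return word + "".join(out)


# state -> list of (suffix template, next state); "" is an empty transition
TRANSITIONS = {
    "NOUN": [("", "CASE"), ("lAr", "CASE")],           # optional plural
    "CASE": [("", "END"), ("DA", "END"), ("DAn", "END")],  # locative, ablative
}


def generate(stem, state="NOUN"):
    """Enumerate every word form reachable from state for the given stem."""
    if state == "END":
        return [stem]
    forms = []
    for suffix, next_state in TRANSITIONS[state]:
        forms.extend(generate(realize(stem, suffix), next_state))
    return forms


if __name__ == "__main__":
    print(generate("ev"))
    print(generate("kitap"))
```

Even this toy automaton yields six forms per noun stem from two suffix slots, which hints at how a few thousand stems and a fuller set of suffix categories can grow into a lexicon of around a million forms.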
