TOWARDS A DATA-DRIVEN MORPHOLOGICAL ANALYSIS OF KAZAKH LANGUAGE

We propose a method for complete morphological analysis of the Kazakh language that accounts for both inflectional and derivational morphology. The method is data-driven and requires no manually written rules, which makes it convenient for analyzing agglutinative languages. The intuition behind our approach is to label morphemes with so-called transition labels, i.e. labels that encode the grammatical function of a morpheme as a transition between the corresponding parts of speech (POS), and to exploit the transitivity of these labels to simplify the analysis. We evaluate the method on a fair-sized sample of real data and report encouraging results.
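The idea of transition labels can be illustrated with a minimal sketch. It assumes that a label is a pair of source and target POS and that labels chain left to right along the word. The tag names (`V`, `N`), the function `compose`, and the segmentation of the example word are hypothetical illustrations and are not taken from the method's actual tag set or data.

```python
from typing import List, Tuple

# A transition label is modeled as a (source POS, target POS) pair.
# This representation is an assumption made for illustration only.
Transition = Tuple[str, str]


def compose(root_pos: str, transitions: List[Transition]) -> str:
    """Chain transition labels left to right and return the POS of the whole word."""
    current = root_pos
    for source, target in transitions:
        # A morpheme can attach only if its source POS matches the current POS.
        if source != current:
            raise ValueError(f"label {source}->{target} cannot attach to {current}")
        current = target
    return current


# Illustrative segmentation of "жазушылар" ("writers") with a verbal root "жаз".
# The labels below are hypothetical, not the paper's annotation.
analysis = [
    ("у", ("V", "N")),    # derivational: verb -> noun
    ("шы", ("N", "N")),   # derivational: noun -> noun
    ("лар", ("N", "N")),  # inflectional: plural, noun stays noun
]

final_pos = compose("V", [label for _, label in analysis])
print(final_pos)
```

Because each label names both its source and its target POS, a chain of labels collapses into a single transition from the POS of the root to the POS of the full word, and an ill-formed chain is rejected as soon as two adjacent labels fail to match.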
