Probabilistic dependency parsing of Turkish

This paper presents the results of the first statistical dependency parser developed for Turkish. Turkish is an agglutinative language with free constituent order and complex inflectional and derivational morphology, and these properties pose interesting challenges for statistical parsing. In Turkish, dependency relations hold between sub-word units called inflectional groups. To find these dependencies, one must investigate how the complex structure of Turkish should be modeled during parsing. This study examines probabilistic models that use different representational units for parsing. As a starting point, three baseline models were developed, one of which is a rule-based parser. The three probabilistic models that we implemented were then evaluated against the baseline models and against one another. The METU-Sabancı Turkish Treebank was used to train and test the parser; this is also the first study tested on this treebank with reported results. In this initial exploration, we focus on parsing the subset of the treebank that contains only head-final (heads located to the right of their dependents) and non-crossing dependencies. Owing to the size of the available treebank, our experiments are restricted to models that use no lexical information (that is, the full word form or stem as a feature in the representation of parsing units) and that recover only unlabeled dependencies between units. Our results show that, in terms of accuracy in finding the correct dependencies between inflectional groups, the model that uses inflectional groups as parsing units and exploits contextual information achieves the highest performance.

Probabilistic dependency parsing of Turkish

This paper presents results from the first statistical dependency parser for Turkish. Turkish is a free constituent order language with complex agglutinative inflectional and derivational morphology, and it presents interesting challenges for statistical parsing since, in general, dependency relations hold between "portions" of words, called inflectional groups. We have explored statistical models that use different representational units for parsing. We used the Turkish Dependency Treebank to train and test our parser, but limited this initial exploration to the subset of treebank sentences with only left-to-right, non-crossing dependency links. Our results indicate that the best accuracy in terms of the dependency relations between inflectional groups is obtained when inflectional groups are used as units in parsing and contexts around the dependent are employed.

Turkish shows very different characteristics from the well-studied languages in the parsing literature. Many of these characteristics are common to agglutinative languages such as Basque, Estonian, Finnish, Hungarian, Japanese and Korean. Turkish is a flexible constituent order language: even though in written texts the constituent order of sentences generally conforms to SOV or OSV structures, constituents may freely change position depending on the requirements of the discourse context. From the point of view of dependency structure, Turkish is predominantly (but not exclusively) head-final. Furthermore, Turkish morphotactics is quite complicated: a given word form may involve multiple derivations, and the number of word forms one can generate from a nominal or verbal root is theoretically infinite. Derivation in Turkish is very productive, and the syntactic relations that a word enters into, as a dependent or head element, are determined by the inflectional properties of one or more (possibly intermediate) derived forms.

In this work, we assume that a Turkish word is represented as a sequence of inflectional groups (IGs hereafter), separated by ^DBs denoting derivation boundaries. A sentence is then represented as the sequence of IGs making up its words. When a word is considered as a sequence of IGs, linguistically the last IG of a word determines its role as a dependent, so syntactic relation links emanate only from the last IG of a (dependent) word and land on one of the IGs of a (head) word to the right (with minor exceptions). Again with minor exceptions, the dependency links between IGs, when drawn above the IG sequence, do not cross.

We implemented three baseline parsers (the first two are sketched in code below):

1. The first baseline parser links a word-final IG to the first IG of the next word on the right.
2. The second baseline parser links a word-final IG to the last IG of the next word on the right.
3. The third baseline parser is a deterministic rule-based parser that links each word-final IG to an IG on the right, following the approach of Nivre (2003). It uses 23 unlexicalized linking rules and a heuristic that attaches any non-punctuation word left unlinked by the parser to the last IG of the last word as a dependent.
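To make the IG representation and the first two baselines concrete, here is a minimal Python sketch. The tuple-based link encoding, the function names, and the toy morphological analyses are our own illustrative assumptions, not the paper's implementation.

```python
from typing import List, Tuple

# A word is a sequence of IG feature strings; a sentence is a sequence of words.
# A link is (dep_word, dep_ig, head_word, head_ig); the dependent IG is always
# the final IG of the dependent word.
Word = List[str]
Link = Tuple[int, int, int, int]

def split_into_igs(morph_analysis: str) -> Word:
    """Split a morphological analysis at ^DB derivation boundaries."""
    return morph_analysis.split("^DB")

def baseline_first_ig(sentence: List[Word]) -> List[Link]:
    """Baseline 1: link each word-final IG to the FIRST IG of the next word."""
    return [(i, len(w) - 1, i + 1, 0) for i, w in enumerate(sentence[:-1])]

def baseline_last_ig(sentence: List[Word]) -> List[Link]:
    """Baseline 2: link each word-final IG to the LAST IG of the next word."""
    return [(i, len(w) - 1, i + 1, len(sentence[i + 1]) - 1)
            for i, w in enumerate(sentence[:-1])]

# "mavi arabadaki" ("the one in the blue car"): "arabadaki" has two IGs, a
# locative noun and a -ki adjective derived from it, split at the ^DB.
sentence = [split_into_igs("mavi+Adj"),
            split_into_igs("araba+Noun+A3sg+Loc^DB+Adj")]
print(baseline_first_ig(sentence))  # [(0, 0, 1, 0)] -> attaches to the noun IG
print(baseline_last_ig(sentence))   # [(0, 0, 1, 1)] -> attaches to the adjective IG
```

Note that the two baselines disagree precisely when the head word contains more than one IG; in this toy example the linguistically correct head of "mavi" is the noun IG, which is what the first baseline happens to produce.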
In addition to these, we implemented three probabilistic models (a toy sketch of this kind of model follows below):

1. "Unlexicalized" word-based model, where words are represented as the concatenation of their IGs and are used as the parsing units.
2. IG-based model, where each word is split into its IGs, which are then used as the smallest parsing units.
3. IG-based model with word-final IG contexts, where the IGs are again used as the parsing units. This model differs from the previous one in the way it uses contextual units and computes the distances between units.

Our results indicate that all of our models perform better than the three baseline parsers, even when no contexts around the dependent and head units are used. We obtain our best results with Model 3, where IGs are used as units for parsing and contexts consist of word-final IGs. The highest accuracy in terms of the percentage of correctly extracted IG-to-IG relations, excluding punctuation (73.5%), was obtained when one word is used as context on each side of the dependent. We also noted that training our models on a smaller treebank did not significantly reduce accuracy. This indicates that the unlexicalized models are quite effective, but it may also hint that a larger treebank would not improve link accuracy under unlexicalized modeling.
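As a rough illustration of how such an unlexicalized model might score candidate attachments, the sketch below uses relative-frequency estimates of P(head IG tags | dependent IG tags, distance) with a crude probability floor for unseen events. The actual models in the paper use richer context features and smoothing; all names, the floor constant, and the toy data here are assumptions for illustration only.

```python
# A minimal sketch of an unlexicalized IG-to-IG attachment model, assuming
# relative-frequency estimates from treebank counts. Illustrative only: the
# paper's factorization and smoothing differ.

from collections import Counter

pair_counts = Counter()  # (dep_IG_tags, head_IG_tags, distance) -> count
dep_counts = Counter()   # (dep_IG_tags, distance) -> count

def train(gold_links):
    """gold_links: iterable of (dep_IG_tags, head_IG_tags, distance) tuples."""
    for dep, head, dist in gold_links:
        pair_counts[(dep, head, dist)] += 1
        dep_counts[(dep, dist)] += 1

def link_prob(dep, head, dist, floor=1e-6):
    """Estimate P(head | dep, dist); fall back to a small floor when unseen."""
    total = dep_counts[(dep, dist)]
    return pair_counts[(dep, head, dist)] / total if total else floor

def attach(dep, candidates):
    """Pick the (head_IG_tags, distance) candidate with the highest probability.
    In Model 3, dep would also carry the word-final IGs of neighboring words
    as context, and distances would be computed over word-final IGs."""
    return max(candidates, key=lambda c: link_prob(dep, c[0], c[1]))

# Toy usage with made-up IG tag strings:
train([("Adj", "Noun", 1), ("Adj", "Noun", 1), ("Adj", "Verb", 1)])
print(attach("Adj", [("Noun", 1), ("Verb", 1)]))  # -> ('Noun', 1)
```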

___

  • Charniak, E. (2000). A maximum-entropy inspired parser. Proceedings, 1st Conference of the North American Chapter of the Association for Computational Linguistics, 132-139, Seattle, WA.
  • Chung, H. and Rim, H. (2004). Unlexicalized dependency parser for variable word order languages based on local contextual pattern. Proceedings, Computational Linguistics and Intelligent Text Processing, 109-120, Seoul.
  • Collins, M., Hajič, J., Ramshaw, L. and Tillmann, C. (1999). A statistical parser for Czech. Proceedings, 37th Annual Meeting of the Association for Computational Linguistics (ACL), 505-518, Maryland.
  • Collins, M. (1996). A new statistical parser based on bigram lexical dependencies. Proceedings, 34th ACL, 184-191, Santa Cruz, CA.
  • Collins, M. (1997). Three generative, lexicalised models for statistical parsing. Proceedings, 35th ACL, 16-23, Madrid.
  • Eryiğit, G. and Oflazer, K. (2006). Statistical dependency parsing of Turkish. Proceedings, 11th Conference of the European Chapter of the Association for Computational Linguistics, 89-96, Trento.
  • Klein, D. and Manning, C. (2003). Accurate unlexicalized parsing. Proceedings, 41st ACL, 423-430, Sapporo.
  • Kudo, T. and Matsumoto, Y. (2000). Japanese dependency analysis based on support vector machines. Proceedings, Empirical Methods in Natural Language Processing and Very Large Corpora, 18-25, Hong Kong.
  • Kudo, T. and Matsumoto, Y. (2002). Japanese dependency analysis using cascaded chunking. Proceedings, 6th Conference on Natural Language Learning, 63-69, Taipei.
  • Nivre, J. and Nilsson, J. (2005). Pseudo-projective dependency parsing. Proceedings, 43rd ACL, 99-106, Ann Arbor, MI.
  • Nivre, J., Hall, J. and Nilsson, J. (2004). Memory-based dependency parsing. Proceedings, 8th Conference on Computational Natural Language Learning, 49-56, Boston, MA.
  • Nivre, J. (2003). An efficient algorithm for projective dependency parsing. Proceedings, 8th International Workshop on Parsing Technologies, 23-25, Nancy.
  • Oflazer, K., Say, B., Hakkani-Tür, D. and Tür, G. (2003). Building a Turkish treebank. In Abeillé, A. (ed.), Building and Exploiting Syntactically-annotated Corpora, Kluwer Academic Publishers, 261-277.
  • Oflazer, K. (1994). Two-level description of Turkish morphology. Literary and Linguistic Computing, 9, 2, 137-148.
  • Oflazer, K. (2003). Dependency parsing with an extended finite-state approach. Computational Linguistics, 29, 4, 515-544.
  • Sekine, S., Uchimoto, K. and Isahara, H. (2000). Backward beam search algorithm for dependency analysis of Japanese. Proceedings, 17th International Conference on Computational Linguistics, 754-760, Saarbrücken.
  • Yamada, H. and Matsumoto, Y. (2003). Statistical dependency analysis with support vector machines. Proceedings, 8th International Workshop on Parsing Technologies, 195-206, Nancy.