Identification of OOV words in Turkish texts (Türkçe metinlerde sözlük dışı kelime tespiti)

In this study, we present a semantic graph network model capable of detecting out-of-vocabulary (OOV) words in Turkish texts. In natural language processing (NLP), morphological analyzers can encounter unknown words (UW) during word analysis. This mostly occurs when such tools depend on a dictionary to find candidate lemmas during parsing. Sometimes an analyzer finds no lemma candidate at all because none of the candidates exists in the dictionary, which can degrade the parsing output. The proposed model identifies OOV words that are suitable for inclusion in dictionaries. In addition, co-occurrence relations among the lemmas in texts are modelled in the graph database as a semantic sub-graph, which is used to discover new collocations that can be proposed as lemma candidates.

Keywords:

Unknown words, Collocation, Co-occurrence, OOV words
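
The two ideas in the abstract, dictionary-dependent lemma lookup and co-occurrence counting for collocation discovery, can be illustrated with a minimal Python sketch. This is not the authors' model: the function names (`find_oov`, `cooccurrence_counts`, `collocation_candidates`), the prefix-based lemma lookup, the adjacent-pair definition of co-occurrence, and the `min_count` threshold are all assumptions made for illustration, and a plain `Counter` stands in for the graph database.

```python
from collections import Counter


def find_oov(tokens, lexicon):
    """Return tokens for which no lemma candidate exists in the lexicon.

    Assumption: because Turkish is largely suffixing, every prefix of a
    surface form is treated as a lemma candidate. A real morphological
    analyzer would also handle stem alternations and suffix validity.
    """
    oov = []
    for token in tokens:
        candidates = (token[:i] for i in range(1, len(token) + 1))
        if not any(c in lexicon for c in candidates):
            oov.append(token)
    return oov


def cooccurrence_counts(sentences):
    """Count adjacent lemma pairs; each pair is a weighted directed edge."""
    edges = Counter()
    for lemmas in sentences:
        for left, right in zip(lemmas, lemmas[1:]):
            edges[(left, right)] += 1
    return edges


def collocation_candidates(edges, known_phrases=frozenset(), min_count=2):
    """Return frequent adjacent pairs that are not yet dictionary entries."""
    phrases = (" ".join(pair) for pair, n in edges.items() if n >= min_count)
    return sorted(p for p in phrases if p not in known_phrases)


if __name__ == "__main__":
    lexicon = {"kitap", "ev", "git"}
    print(find_oov(["kitaplar", "evde", "tivit"], lexicon))

    sentences = [["kara", "kutu", "bul"], ["kara", "kutu", "aç"]]
    print(collocation_candidates(cooccurrence_counts(sentences)))
```

In this toy setting, `tivit` is reported as OOV because none of its prefixes is in the lexicon, and `kara kutu` is proposed as a collocation because the pair occurs twice. A model along the lines described in the abstract would store such edges in a graph database and rank candidates with richer association evidence than raw counts.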
