Exploring feature sets for Turkish word sense disambiguation

This paper explores and evaluates a diverse set of features that influence word sense disambiguation (WSD) performance. As one of the most crucial steps in natural language processing (NLP), WSD has the potential to improve many NLP tasks. Exploiting effective features and removing redundant ones is known to improve results. Two groups of features are used to disambiguate senses and select the most appropriate sense from a set of candidates: collocational features and bag-of-words (BoW) features. We examine the effects of these two feature sets on the Turkish Lexical Sample Dataset (TLSD), which comprises the most ambiguous verb and noun samples. We also apply the two feature groups jointly to measure any additional improvement. Our results suggest that the joint setting improves accuracy by up to 7%. The effective window size around the ambiguous word has been determined for the noun and verb sets. Additionally, the suggested feature set has been evaluated on a different corpus that was used in previous studies on Turkish WSD. Experiments on diverse morphological groups show that the word root and the case marker are significant features for disambiguating senses.
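The abstract does not define the two feature groups in detail, so the following is only a minimal sketch of how collocational and BoW features are commonly extracted around an ambiguous word, and how the two might be combined in a joint setting. The function names, the padding symbol, the window sizes, and the placeholder tokens are all assumptions made for illustration; they are not taken from the paper or from the TLSD.

```python
from collections import Counter


def collocational_features(tokens, target_idx, window=2, pad="<PAD>"):
    """Position-specific features: the token found at each offset from the target.

    Word order matters here: the same token at offset -1 and +1 yields
    two different features. Positions outside the sentence get `pad`.
    """
    feats = {}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue  # skip the ambiguous word itself
        i = target_idx + offset
        feats[f"w[{offset:+d}]"] = tokens[i] if 0 <= i < len(tokens) else pad
    return feats


def bow_features(tokens, target_idx, window=5):
    """Unordered counts of the tokens within `window` positions of the target."""
    lo = max(0, target_idx - window)
    hi = min(len(tokens), target_idx + window + 1)
    context = tokens[lo:target_idx] + tokens[target_idx + 1:hi]
    return Counter(context)


def joint_features(tokens, target_idx, colloc_window=2, bow_window=5):
    """Joint setting (assumed here to be a simple union of both groups)."""
    feats = dict(collocational_features(tokens, target_idx, colloc_window))
    for word, count in bow_features(tokens, target_idx, bow_window).items():
        feats[f"bow={word}"] = count
    return feats


# Placeholder tokens, not real data: "T" stands for the ambiguous word.
tokens = ["a", "b", "c", "T", "d", "e"]
colloc = collocational_features(tokens, target_idx=3, window=2)
bow = bow_features(tokens, target_idx=3, window=5)
joint = joint_features(tokens, target_idx=3)
```

Each resulting dictionary could then be turned into a feature vector for a classifier. Varying `window` separately for nouns and verbs is one way the effective window size mentioned above could be probed, though the actual experimental setup is not described in the abstract.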
