Enlarging multiword expression dataset by co-training

In multiword expressions (MWEs), multiple words unite to build a new unit in language. When MWE identification is treated as a binary classification task, one of the most important factors in performance is training the classifier with a sufficient number of labelled samples. Since manual labelling is a time-consuming task, the performance of MWE recognition studies is limited by the size of the training sets. In this study, we propose comparison-based and common-decision co-training approaches in order to enlarge the MWE dataset. In the experiments, the performances of the proposed approaches were compared to those of standard co-training [1] and manual labelling, where statistical and linguistic features are employed as two different views of the MWE dataset [2]. A number of tests with different settings were performed on a Turkish MWE dataset. Ten different classifiers were utilized in the experiments, and the best-performing classifier pair was observed to be the SMO-SMO pair. The experimental results showed that the common-decision co-training approach is an alternative to hand-labelling of large MWE datasets, and both newly proposed approaches outperform standard co-training [2] when the training set is to be enlarged in MWE classification.
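The contrast between standard co-training and the common-decision variant described above can be sketched as follows. This is an illustrative toy implementation, not the paper's actual setup: the two-view data, the nearest-centroid learner, and the distance-margin confidence score are all assumptions made for the example; the study itself uses statistical and linguistic feature views with classifiers such as SMO.

```python
# Toy sketch of standard vs. common-decision co-training on a
# two-view binary dataset. The nearest-centroid learner and the
# margin-based confidence are illustrative assumptions only.
import math

def centroid_fit(X, y):
    """Compute one centroid per class for a nearest-centroid learner."""
    cents = {}
    for label in set(y):
        pts = [x for x, t in zip(X, y) if t == label]
        cents[label] = [sum(c) / len(pts) for c in zip(*pts)]
    return cents

def centroid_predict(cents, x):
    """Return (label, confidence); confidence is the distance margin
    between the two nearest class centroids (binary task assumed)."""
    d = {lab: math.dist(x, c) for lab, c in cents.items()}
    ranked = sorted(d, key=d.get)
    best, second = ranked[0], ranked[1]
    return best, d[second] - d[best]   # larger margin = more confident

def co_train(labelled, unlabelled, rounds=5, k=2, common_decision=False):
    """labelled: list of ((view_a, view_b), y); unlabelled: list of (view_a, view_b).
    Each round, confidently labelled samples are moved into the labelled pool."""
    L, U = list(labelled), list(unlabelled)
    for _ in range(rounds):
        if not U:
            break
        ca = centroid_fit([v[0] for v, _ in L], [y for _, y in L])
        cb = centroid_fit([v[1] for v, _ in L], [y for _, y in L])
        preds = []
        for i, (va, vb) in enumerate(U):
            pa, sa = centroid_predict(ca, va)
            pb, sb = centroid_predict(cb, vb)
            preds.append((i, pa, sa, pb, sb))
        if common_decision:
            # keep only samples where both views agree on the label
            agreed = [(i, pa, sa + sb) for i, pa, sa, pb, sb in preds if pa == pb]
            agreed.sort(key=lambda t: -t[2])
            chosen = {i: lab for i, lab, _ in agreed[:k]}
        else:
            # standard co-training: each view labels its top-k samples
            top_a = sorted(preds, key=lambda t: -t[2])[:k]
            top_b = sorted(preds, key=lambda t: -t[4])[:k]
            chosen = {i: pa for i, pa, _, _, _ in top_a}
            chosen.update({i: pb for i, _, _, pb, _ in top_b})
        L += [(U[i], lab) for i, lab in chosen.items()]
        U = [u for i, u in enumerate(U) if i not in chosen]
    return L
```

With `common_decision=True`, a sample enters the enlarged training set only when both views assign it the same label, which trades growth speed for labelling reliability.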

___

  • Blum A, Mitchell T. Combining labeled and unlabeled data with co-training. In: 11th Annual Conference on Computational learning Theory; 24–26 July 1998; Madison, Wisconsin, USA. USA: ACM. pp. 92-100.
  • Kumova Metin S. Standard co-training in multiword expression detection. In: International Conference on Intelligent Human Computer Interaction; 10–11 December 2017; Evry, France. Springer: Evry. pp. 178-188.
  • Klose A, Kruse R. Semi-supervised learning in knowledge discovery. Fuzzy Sets Syst 2005; 149: 209-233.
  • Chapelle O, Schölkopf B, Zien A. Semi-supervised learning. Interdiscip Sci 2006; 2: 151-5.
  • Constant M, Eryiğit G, Monti J, van der Plas L, Ramisch C, Rosner M et al. Multiword expression processing: a survey. Comput Linguist 2017; 43: 837-892.
  • Tsvetkov Y, Wintner S. Identification of multiword expressions by combining multiple linguistic information sources. Comput Linguist 2014; 40: 449-468.
  • Nigam K, Ghani R. Analyzing the effectiveness and applicability of co-training. In: 9th International Conference on Information and Knowledge Management; 6–11 November 2000; McLean, Virginia, USA. USA: ACM. pp. 86-93.
  • Mihalcea R. Co-training and self-training for word sense disambiguation. In: 8th Conference on Computational Natural Language Learning; 2004; Boston, MA, USA. pp. 182-183.
  • Sarkar A. Applying Co-training methods to statistical parsing. In: 2nd ACL; 1–7 June 2001; Pittsburgh, PA, USA. pp. 175-182.
  • Frantzi K, Ananiadou S, Mima H. Automatic recognition of multi-word terms: the C-value/NC-value method. Int J Digit Libr 2000; 3: 115-130.
  • Kumova Metin S. Feature selection in multiword expression recognition. Expert Syst Appl 2018; 92: 106-123.
  • Kumova Metin S, Taze M, Aka Uymaz H, Okur E. Multiword expression detection in Turkish using linguistic features. In: 25th Signal Processing and Communications Applications Conference; 15–18 May 2017, Antalya, Turkey. IEEE. pp. 1-4.
  • Tur G, Hakkani-Tur D, Oflazer K. A statistical information extraction system for Turkish. Nat Lang Eng 2003; 9: 181-210.
  • Quasthoff U, Richter M, Biemann C. Corpus portal for search in monolingual corpora. In: 5th Int. Conf. on Lang. Resources and Evaluation; May 2006; Genoa, Italy. pp. 1799-1802.
  • Say B, Zeyrek D, Oflazer K, Umut Ö. Development of a corpus and a treebank for present-day written Turkish. In: 11th Conference of Turkish Linguistics; January 2002. pp. 83-192.
  • John GHG, Langley P. Estimating continuous distributions in Bayesian classifiers. In: 11th Conf Uncertain Artif Intell; 1995; Quebec, Canada. pp. 338-345.
  • Freund Y, Schapire RE. Experiments with a new boosting algorithm. In: Mach Learn Conf (ICML ’96); 03–06 July 1996; Bari, Italy. USA: Morgan Kaufmann. pp. 148-156.
  • Rosenblatt F. The perceptron: a probabilistic model for information storage and organization in the brain. Psychol Rev 1958; 65: 386-408.
  • Platt JC. Sequential minimal optimization: a fast algorithm for training support vector machines. Technical Report MSR-TR-98-14; 1998. Microsoft Research.
  • Breiman L. Random forests. Mach Learn 2001; 45: 5-32.
  • Holte RC. Very simple classification rules perform well on most commonly used datasets. Mach Learn 1993; 11: 63-90.
  • Tabachnick BG, Fidell LS. Using Multivariate Statistics. 6th Ed. Boston, MA, USA: Pearson/Allyn & Bacon, 2007.
  • Quinlan JR. C4.5: Programs for Machine Learning. San Francisco, CA, USA: Morgan Kaufman, 1993.
  • Pearl J, Russell S. Bayesian networks. In: Arbib MA, editor. The Handbook of Brain Theory and Neural Networks. USA: MIT Press, 2003. pp. 1-11.
  • Sulubacak U, Eryiğit G. Implementing Universal Dependency, Morphology and Multiword Expression Annotation Standards for Turkish Language Processing. Turk J Elec Eng & Comp Sci 2018; 26: 1662-1672.
  • Sulubacak U, Gökırmak M, Tyers F, Çöltekin Ç, Nivre J, Eryiğit G. Universal dependencies for Turkish. In: 26th Int Conf on Computational Linguistics; 11–17 December 2016; Osaka, Japan. pp. 3444-3454.
  • Pamay T, Sulubacak U, Torunaoğlu-Selamet D, Eryiğit G. The annotation process of the ITU web treebank. In: Proceedings of the 9th Linguistic Annotation Workshop; 5 June 2015; Denver, CO, USA. USA: ACL, pp. 95-101.
  • Kumova Metin S. Neighbour unpredictability measure in multiword expression extraction. Comput Syst Sci Eng 2016; 31: 209-221.