An index-based joint multilingual/cross-lingual text categorization using topic expansion via BabelNet

Abstract: The majority of state-of-the-art text categorization algorithms are supervised and therefore require prior training. Besides the rigor involved in developing training datasets and the need to repeat training for different texts, working with multilingual texts poses additional unique challenges. One of these challenges is that the developer must deal with many different languages. Term expansion, such as query expansion, has been applied in numerous applications; however, a major drawback of most of these applications is that the actual meaning of terms is not usually taken into consideration. Considering the semantics of terms is necessary because most natural language words are polysemous. In this paper, as a specific contribution to the document index approach for text categorization, we present a joint multilingual/cross-lingual text categorization algorithm (JointMC) based on semantic term expansion of class topic terms through an optimized knowledge-based word sense disambiguation. The lexical knowledge in BabelNet is used for the word sense disambiguation and expansion of the topics’ terms. The categorization algorithm computes the distributed semantic similarity between the expanded class topics and the text documents in the test corpus. We evaluate our categorization algorithm on a multilabel text categorization task using the JRC-Acquis dataset, which is based on the subject domain classification of the European Commission’s EuroVoc microthesaurus. We compare the performance of the classifier with a model of it that uses the original class topics. Furthermore, we compare the performance of our classifier with two state-of-the-art supervised algorithms (one each for the multilingual and cross-lingual tasks) using the same dataset. Empirical results obtained on five experimental languages show that categorization with expanded topics outperforms categorization with the original topics by a very wide margin. Our algorithm outperforms the existing supervised technique, which used the same dataset. Cross-language categorization surprisingly shows similar performance and is marginally better for some of the languages.
