Turkish entity discovery with word embeddings

Entity-linking systems link noun-phrase mentions in a text to their corresponding knowledge base entities in order to enrich the text with metadata. Wikipedia is a popular and comprehensive knowledge base that is widely used in entity-linking systems. However, long-tail entities are not popular enough to have their own Wikipedia articles, so a knowledge base built from Wikipedia entities is limited to popular entities. To overcome this coverage limitation of Wikipedia-based entity-linking systems, this paper presents an entity-discovery system that can detect the semantic types of entities that are not defined in Wikipedia. The effectiveness of the proposed system was validated empirically on generated data sets for the Turkish language. The experimental results show that, in terms of accuracy, our system performs competitively with previous methods in the literature. Its high performance is achieved through a method that learns word embeddings for candidate entities.
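As a rough illustration of how word embeddings can be used to assign a semantic type to a candidate entity, the sketch below compares a candidate's vector with the centroid of a few example vectors per type and picks the most similar type. This is not the method of the paper, which the abstract does not specify; the nearest-centroid rule, the function names (`cosine`, `centroid`, `predict_type`), the type labels, and the three-dimensional toy vectors are all assumptions made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def predict_type(candidate_vec, typed_examples):
    """Return the type whose centroid is most similar to the candidate.

    typed_examples maps a type label to a list of embedding vectors of
    entities already known to have that type (hypothetical input format).
    """
    centroids = {label: centroid(vecs) for label, vecs in typed_examples.items()}
    return max(centroids, key=lambda label: cosine(candidate_vec, centroids[label]))


# Toy, made-up embeddings; real embeddings would be learned from a corpus.
typed_examples = {
    "person": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "location": [[0.1, 0.9, 0.1], [0.0, 0.8, 0.2]],
}

# Hypothetical embedding of a candidate entity that has no Wikipedia article.
candidate = [0.2, 0.7, 0.1]
predicted = predict_type(candidate, typed_examples)
print(predicted)
```

In practice the candidate vector would come from an embedding model trained on text that mentions the entity, and the typing step could be any classifier; the centroid rule is used here only because it is short to write down.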

___

Turkish Journal of Electrical Engineering and Computer Sciences
  • ISSN: 1300-0632
  • Publication frequency: 6 issues per year
  • Publisher: TÜBİTAK