Exploring the power of supervised learning methods for company name disambiguation in microblog posts

Twitter is an online social networking website where people can post short messages on any subject, and these messages become visible to other users. Users often express their opinions about companies or products in these microblog posts. Analyzing such messages can help reveal what customers think about a company's products or what their overall sentiment is. Identifying tweets that refer to products and companies has therefore become an important task. However, company names are often ambiguous, so the first step is to locate the messages that are relevant to a given company. In this paper, we present a number of supervised learning techniques to decide whether a given tweet is about a company, e.g., whether a message containing the term ‘amazon’ is related to the company Amazon Inc. or not. This task is more challenging than classical text classification, mainly because tweets and company names carry very little information. To make the task tractable, we use external resources to obtain richer data about each company. More specifically, we generate several profiles for each organization from these resources. We then perform feature extraction to obtain both numerical and categorical features, and feature selection to identify the attributes most relevant to our task. Finally, we train several supervised classifiers. Our classifiers and carefully selected features achieve high accuracy on the WePS-3 dataset. Our results show a considerable improvement in accuracy, 11% over baseline approaches.
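The pipeline outlined in the abstract (company profiles, feature extraction, supervised classification) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the profile keywords, the two overlap-count features, the toy tweets, and the use of a simple perceptron in place of the paper's classifiers, which the abstract does not name. It is not the authors' implementation.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_features(tweet, profile):
    """Map a tweet to numeric features using a company profile.

    `profile` is a hypothetical dict holding 'positive' keywords (suggesting
    the company sense) and 'negative' keywords (suggesting another sense).
    """
    tokens = set(tokenize(tweet))
    positive_hits = len(tokens & profile["positive"])
    negative_hits = len(tokens & profile["negative"])
    # The trailing 1.0 acts as a bias term.
    return [float(positive_hits), float(negative_hits), 1.0]


def score(weights, features):
    """Dot product of the weight vector and a feature vector."""
    return sum(w * x for w, x in zip(weights, features))


def predict(weights, features):
    """Return 1 if the tweet is judged to be about the company, else 0."""
    return 1 if score(weights, features) > 0 else 0


def train_perceptron(samples, labels, epochs=20):
    """Train a perceptron (a stand-in for the paper's unnamed classifiers)."""
    weights = [0.0] * len(samples[0])
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            if predict(weights, features) != label:
                direction = 1.0 if label == 1 else -1.0
                weights = [w + direction * x for w, x in zip(weights, features)]
    return weights


# Hypothetical profile for the ambiguous name 'amazon'.
amazon_profile = {
    "positive": {"kindle", "order", "shipping", "prime", "bezos"},
    "negative": {"river", "rainforest", "jungle", "brazil"},
}

# Toy labeled tweets: 1 = about Amazon Inc., 0 = not about the company.
train_tweets = [
    ("my amazon order arrived with free prime shipping", 1),
    ("reading on the new amazon kindle", 1),
    ("the amazon river flows through brazil", 0),
    ("deforestation in the amazon rainforest", 0),
]

X = [extract_features(text, amazon_profile) for text, _ in train_tweets]
y = [label for _, label in train_tweets]
weights = train_perceptron(X, y)
```

A new tweet is classified by passing it through the same feature extraction, for example `predict(weights, extract_features("amazon prime shipping is fast", amazon_profile))`. A real system would build the profiles from external resources and apply feature selection before training, as the abstract describes.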

___

Turkish Journal of Electrical Engineering and Computer Sciences
  • ISSN: 1300-0632
  • Publication frequency: 6 issues per year
  • Publisher: TÜBİTAK