Event-related microblog retrieval in Turkish

Microblogs, such as tweets, are short messages in which users share opinions and information. Microblogs are mostly related to real-life events reported in news articles. Finding event-related microblogs is important for analyzing online social networks and understanding public opinion on events. However, it is a challenging task because microblogs are dynamic and limited in length. In this study, we find event-related microblogs in Turkish by treating news articles as queries and microblogs as documents. To represent news articles and microblogs, we examine several encoding methods: the traditional bag-of-words model and word embeddings provided by BERT and FastText, pretrained language models based on deep learning. We measure the text similarity, or relatedness, between a news article and a microblog as the distance between their encodings, and then rank microblogs according to their relatedness to the input query. The experimental results show that (i) the BERT-based model outperforms the other encoding methods in Turkish, although bag-of-words with Dice similarity performs competitively on short text; (ii) the news title successfully represents the event as a query; and (iii) preprocessing Turkish microblogs has a positive impact on bag-of-words and FastText embeddings, while BERT embeddings are robust to noise in Turkish.
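
The retrieval setup described in the abstract (encode the query and each microblog, score their relatedness, rank by score) can be illustrated with a minimal sketch of the bag-of-words variant with Dice similarity. This is not the authors' implementation: the function names `tokenize`, `dice_similarity`, and `rank_microblogs`, the whitespace tokenizer, and the toy Turkish strings are all assumptions made for illustration. An embedding-based variant would presumably replace the scoring function with a vector similarity over BERT or FastText encodings.

```python
from typing import List, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Return the set of lowercase whitespace-separated tokens.

    Assumption: a stand-in for real Turkish preprocessing. Python's
    str.lower() is not Turkish-aware (it maps 'I' to 'i', not to 'ı').
    """
    return set(text.lower().split())


def dice_similarity(a: Set[str], b: Set[str]) -> float:
    """Dice coefficient between two token sets: 2|A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty texts
    return 2 * len(a & b) / (len(a) + len(b))


def rank_microblogs(query: str, microblogs: List[str]) -> List[Tuple[float, str]]:
    """Score every microblog against the query and sort by descending score."""
    query_tokens = tokenize(query)
    scored = [(dice_similarity(query_tokens, tokenize(m)), m) for m in microblogs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    # Hypothetical news title and tweets, invented for this example.
    news_title = "izmir'de deprem oldu"
    tweets = [
        "bugün hava çok güzel",
        "izmir'de deprem oldu geçmiş olsun",
        "deprem çok korkunçtu",
    ]
    for score, tweet in rank_microblogs(news_title, tweets):
        print(f"{score:.3f}  {tweet}")
```

Using token sets keeps the Dice score bounded between 0 and 1 and independent of word order, which suits short texts such as a news title and a tweet; term weighting and morphological normalization are deliberately left out of this sketch.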

Turkish Journal of Electrical Engineering and Computer Sciences
  • ISSN: 1300-0632
  • Publication frequency: 6 issues per year
  • Publisher: TÜBİTAK