An Active Learning Based Emoji Prediction Method in Turkish

Emojis have become standard on social media platforms because they can convey feelings that short text alone cannot. Recent advances in machine learning make it possible to suggest emojis automatically as users write short messages. However, predicting the emoji for a given short message can be difficult, because users may attach meanings to an emoji that go beyond what its designers intended. Automatically extracting training samples from large volumes of unlabelled tweets is therefore not well suited to this task. In this paper, we present an active learning method for predicting the emoji of a tweet from a small labelled Turkish emoji dataset. To simulate human-machine collaborative learning, we train an initial classifier on this dataset and then update it with relevant samples selected from a large pool of unlabelled data. For evaluation, we randomly select 25% of the single-emoji tweets in the generated dataset as the test set. Our active learning method achieves an F1 score of 0.901 and outperforms the baseline supervised learning methods.
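
The train-then-update loop described above can be illustrated with a minimal pool-based sketch. The abstract does not name the classifier or the rule used to select samples from the pool, so everything here is an assumption made for illustration: a small multinomial naive Bayes model, least-confidence sampling as the selection rule, the function names `train_nb`, `predict_proba`, and `active_learning`, and the toy tweets and labels. The `oracle` callable stands in for the human annotator.

```python
import math
from collections import Counter, defaultdict

def train_nb(labelled):
    """Fit a multinomial naive Bayes model on (text, label) pairs."""
    class_counts = Counter(label for _, label in labelled)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labelled:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_proba(model, text):
    """Return {label: probability} using add-one smoothing."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    log_scores = {}
    for label, n in class_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(n / total)
        for tok in text.lower().split():
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        log_scores[label] = score
    top = max(log_scores.values())
    exp = {label: math.exp(s - top) for label, s in log_scores.items()}
    z = sum(exp.values())
    return {label: v / z for label, v in exp.items()}

def active_learning(labelled, pool, oracle, rounds, batch_size):
    """Pool-based loop: query the least confident samples, then retrain."""
    labelled, pool = list(labelled), list(pool)
    model = train_nb(labelled)
    for _ in range(rounds):
        if not pool:
            break
        # Least-confidence sampling (assumed): lowest top-class probability first.
        pool.sort(key=lambda t: max(predict_proba(model, t).values()))
        queried, pool = pool[:batch_size], pool[batch_size:]
        labelled.extend((t, oracle(t)) for t in queried)
        model = train_nb(labelled)
    return model, labelled, pool

# Hypothetical toy data; labels are emoji names, not the paper's dataset.
seed = [("seni seviyorum canim", "heart"), ("bu video komik", "joy")]
unlabelled = ["komik bir video", "canim seni seviyorum", "yarin hava nasil"]
hidden = {"komik bir video": "joy",
          "canim seni seviyorum": "heart",
          "yarin hava nasil": "thinking"}

model, labelled, remaining = active_learning(
    seed, unlabelled, oracle=hidden.get, rounds=1, batch_size=1)
print(labelled[-1])  # the sample the loop chose to send to the annotator
```

In this toy run the tweet that shares no words with the seed set is the one the model is least sure about, so it is the first to be labelled. A different selection rule (for example margin or entropy sampling) would only change the `sort` key.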
