Isnen HADİ AL GHOZALİ, Arif PİRMAN, Indra INDRA

Comparison of SVM and Naïve Bayes Algorithms with InNER enriched to Predict Hate Speech

Hate speech is one of the harmful consequences of social media abuse. It can be classified into insults, defamation, unpleasant acts, provocation, incitement, and the spreading of fake news (hoaxes). The purpose of this study is to compare the SVM and Naïve Bayes methods, using features extracted with Indonesian NER (InNER), for detecting hate speech. To obtain the best model, this study applies five steps: a) data collection; b) data preprocessing; c) feature engineering; d) model development; and e) model evaluation and comparison. We collected 7100 tweets as an initial dataset. Manual annotation yielded 1681 labeled tweets: 548 insult tweets, 288 blasphemy tweets, 272 provocative tweets, and 573 neutral tweets. This study uses two Python libraries that accommodate NER in Indonesian, namely the NLTK library and the Polyglot library. Based on the evaluation of the proposed models, model 5, which combines the SVM algorithm with the NLTK library, performs best. This model achieves an accuracy of 92.88% with a precision of 0.93, a recall of 0.93, and an F1-score of 0.92.
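
The evaluation metrics reported above can be illustrated with a minimal sketch. The code below computes accuracy and macro-averaged precision, recall, and F1 for a four-class labeling task using only the Python standard library. The class counts are taken from the abstract; the toy label lists, the helper name `macro_scores`, and the choice of macro averaging are assumptions for illustration, since the abstract does not state how its scores were averaged.

```python
from collections import Counter

# Class counts after manual annotation, as reported in the abstract.
CLASS_COUNTS = {"insult": 548, "blasphemy": 288, "provocative": 272, "neutral": 573}


def macro_scores(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall, macro F1).

    Hypothetical helper: macro averaging is an assumption made here,
    not a choice stated in the abstract.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class was wrong
            fn[truth] += 1   # true class was missed

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    accuracy = sum(tp.values()) / len(y_true)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy labels (made up for illustration; not data from the study).
y_true = ["insult", "insult", "blasphemy", "provocative", "neutral", "neutral"]
y_pred = ["insult", "neutral", "blasphemy", "provocative", "neutral", "neutral"]

if __name__ == "__main__":
    acc, prec, rec, f1 = macro_scores(y_true, y_pred)
    print(f"total tweets: {sum(CLASS_COUNTS.values())}")
    print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} f1={f1:.4f}")
```

Because the four classes are unevenly sized (272 to 573 tweets), macro and weighted averages can differ noticeably, so it is worth stating which one a reported score uses.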

Keywords:

SVM, Naïve Bayes, NER, Hate Speech
