Analysis of Machine Learning Methods and Language Models for Spam Detection in Turkish Emails

With the recent growth of technology and social networks, online interaction and the sharing of opinions on any topic have gained considerable importance. These interactions have benefits, but they also have many drawbacks. Obtaining users' information on social networks and impersonating those users is a serious security problem, because it enables fraud and similar abuse carried out in the users' names. The most common way to impersonate users is to send spam messages, emails, and the like. To address this security problem, measures such as spam filtering and the development of spam detection methods are applied. This study analyzes four machine learning methods (Random Forest, Logistic Regression, Naive Bayes, and Artificial Neural Networks) and four language models (BERT, ELECTRA, ALBERT, and DistilBERT) for detecting spam in Turkish emails. The aim is to show the effect of language models on spam email classification for Turkish. In the experiments, all language models outperformed the machine learning methods at classifying spam emails. Among the machine learning methods, artificial neural networks reached 90.15% accuracy, while the most successful language models, BERT and ELECTRA, reached 94.08% accuracy.
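
To make the classification task concrete, the following is a minimal sketch of one of the classical baselines named above, Naive Bayes, written in plain Python. It is not the study's implementation: the abstract does not describe the features, tokenization, or dataset used, so the whitespace tokenizer, the Laplace smoothing, the function names (`tokenize`, `train_naive_bayes`, `predict`, `accuracy`), and the six toy training messages are all assumptions made for illustration only.

```python
import math
from collections import Counter

# Toy, invented examples (NOT the study's data). "ham" means non-spam.
TRAIN = [
    ("bedava hediye kazandınız hemen tıklayın", "spam"),
    ("kredi kartı borcunuzu silin hemen tıklayın", "spam"),
    ("büyük indirim bedava kupon kazandınız", "spam"),
    ("toplantı yarın saat onda başlayacak", "ham"),
    ("rapor ekte yarın görüşelim", "ham"),
    ("proje toplantı notları ekte", "ham"),
]
TEST = [
    ("bedava kupon için hemen tıklayın", "spam"),
    ("yarın toplantı notları", "ham"),
]


def tokenize(text):
    # Assumption: plain lowercasing and whitespace splitting. Real Turkish text
    # would need locale-aware casing and morphological handling.
    return text.lower().split()


def train_naive_bayes(samples):
    """Count documents and tokens per label; returns the fitted model as a dict."""
    doc_counts = Counter()
    word_counts = {}
    vocab = set()
    for text, label in samples:
        doc_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for token in tokenize(text):
            counts[token] += 1
            vocab.add(token)
    return {"doc_counts": doc_counts, "word_counts": word_counts, "vocab": vocab}


def predict(model, text):
    """Return the label with the highest log-posterior under multinomial Naive Bayes."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label in sorted(model["doc_counts"]):
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        # Log prior: share of training documents carrying this label.
        score = math.log(model["doc_counts"][label] / total_docs)
        for token in tokenize(text):
            if token not in model["vocab"]:
                continue  # tokens never seen in training are ignored
            # Laplace (add-one) smoothed log likelihood of the token.
            score += math.log((counts[token] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, samples):
    # Fraction of samples whose predicted label matches the true label.
    correct = sum(1 for text, label in samples if predict(model, text) == label)
    return correct / len(samples)


model = train_naive_bayes(TRAIN)
print("toy accuracy:", accuracy(model, TEST))
```

Accuracy here is the same metric the abstract reports (the fraction of emails labeled correctly), but a figure computed on two toy messages says nothing about the 90.15% and 94.08% results, which come from the study's own experiments.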
