Recep Sinan ARSLAN

A Detection Method for Malicious Web Pages Using the Doc2Vec Model and Machine Learning

(Original Turkish title: Kötücül Web Sayfalarının Tespitinde Doc2Vec Modeli ve Makine Öğrenmesi Yaklaşımı)

As more transactions move to digital environments, protecting data in them becomes harder. Because so many systems are connected to the internet, web security has become a major problem. The most common way to launch internet-borne attacks is through malicious URLs: attackers obtain large amounts of data through websites prepared for this purpose. The traditional way to detect such URLs or websites is a blacklist, but blacklists fail to detect newly created malicious URLs. This study proposes a machine learning approach that aims to improve the efficiency of malicious URL detection and to avoid dependence on databases such as blacklists. Doc2Vec is used for feature extraction, and several machine learning algorithms were tried for classification. Classification relies only on features obtained from the URL itself. In the first stage of the tests on the ISCX2016URL dataset, URLs were classified as malicious or benign; Logistic Regression reached 99.2% accuracy, with precision, recall (sensitivity), and F-score of 98.9%, 99.1%, and 99.2%, respectively. In the second stage, malicious URLs were classified into the spam, phishing, malware-distributing, and defacement classes; the SVC classifier reached 88.1% accuracy. The resulting model can be deployed on any proxy server or on a network controller platform.

Keywords:

Uniform Resource Locator (URL), Doc2Vec, web security, machine learning, URL filtering.
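
The pipeline summarized in the abstract (Doc2Vec features extracted from the URL string, followed by a Logistic Regression classifier and evaluation by accuracy, precision, recall, and F-score) could look roughly like the Python sketch below. It is an illustration under stated assumptions, not the authors' implementation: the tokenization rule, the hyperparameters (`vector_size`, `epochs`), and all function names are hypothetical, and the gensim and scikit-learn libraries are assumed to be installed for the training step.

```python
import re


def tokenize_url(url: str) -> list[str]:
    """Split a URL into lowercase alphanumeric tokens.

    Assumption: the abstract does not say how URLs are tokenized;
    splitting on non-alphanumeric characters is one simple choice.
    """
    return [tok for tok in re.split(r"[^A-Za-z0-9]+", url.lower()) if tok]


def binary_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Accuracy, precision, recall and F-score, with 1 = malicious, 0 = benign."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


def fit_doc2vec_classifier(urls, labels, vector_size=50, epochs=40):
    """Train Doc2Vec on tokenized URLs, then Logistic Regression on the vectors.

    The hyperparameter defaults are illustrative, not taken from the paper.
    """
    # Third-party imports are local so the helpers above run without them.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from sklearn.linear_model import LogisticRegression

    docs = [TaggedDocument(tokenize_url(u), [i]) for i, u in enumerate(urls)]
    model = Doc2Vec(vector_size=vector_size, min_count=1, epochs=epochs)
    model.build_vocab(docs)
    model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)
    # Infer one fixed-length vector per URL to use as classifier features.
    features = [model.infer_vector(d.words) for d in docs]
    clf = LogisticRegression(max_iter=1000).fit(features, labels)
    return model, clf


def predict_url(model, clf, url: str) -> int:
    """Return 1 if the URL is predicted malicious, else 0."""
    return int(clf.predict([model.infer_vector(tokenize_url(url))])[0])
```

In use, `fit_doc2vec_classifier` would be called on a labeled training split, `predict_url` on held-out URLs, and `binary_metrics` on the resulting predictions. Because only the URL string is consumed, the same functions could sit behind a proxy that inspects requests before any page content is fetched.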
