Comparison of machine learning techniques for classification of phishing web sites

(Original Turkish title: Kimlik hırsızı web sitelerinin sınıflandırılması için makine öğrenmesi yöntemlerinin karşılaştırılması)

Machine learning methods are now widely used to help computers act more accurately. One application is the detection of phishing web sites. Phishing is an online attack in which fake web sites that mimic trustworthy ones are created to steal important personal information. To prevent this threat before it causes harm, it is important to predict from a site's features whether it is a phishing site. In this study, the AdaBoost, multilayer perceptron, support vector machine, decision tree, k-nearest neighbors, Naïve Bayes and random forest techniques are compared on this prediction task, using a dataset of 1353 instances with 9 features. The experimental evaluation is performed in two settings. The first setting is based on splitting the data into training and test sets; here the random forest algorithm, an ensemble learning approach built from decision trees, outperforms the other compared approaches. In the second setting, which is based on cross-validation, the multilayer perceptron achieves better performance.

Keywords:

Machine learning, Classification, Phishing
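
The two evaluation settings mentioned in the abstract, a single training/test split and cross-validation, can give different rankings because they estimate accuracy from different subsets of the data. The sketch below illustrates both settings with a small k-nearest neighbors classifier written in plain Python. It is not the study's code: the data are synthetic (9 features as in the study, but the feature values and the labeling rule are invented for illustration), and the split fraction, fold count, and all function names are assumptions.

```python
import random
from collections import Counter


def make_synthetic_data(n_samples=300, n_features=9, seed=0):
    """Hypothetical stand-in data: features in {-1, 0, 1}, label = sign of the feature sum."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_samples):
        row = [rng.choice((-1, 0, 1)) for _ in range(n_features)]
        total = sum(row)
        X.append(row)
        y.append(1 if total > 0 else (-1 if total < 0 else 0))
    return X, y


def knn_predict(X_train, y_train, x, k=5):
    """Majority vote among the k nearest training points (squared Euclidean distance)."""
    nearest = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(X_train, y_train)
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(X, y, train_idx, test_idx, k=5):
    """Fraction of test points whose predicted label matches the true label."""
    X_train = [X[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    hits = sum(knn_predict(X_train, y_train, X[i], k) == y[i] for i in test_idx)
    return hits / len(test_idx)


def holdout_accuracy(X, y, test_fraction=0.3, seed=0):
    """Setting 1: one shuffled split into training and test sets."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    n_test = int(len(idx) * test_fraction)
    return accuracy(X, y, idx[n_test:], idx[:n_test])


def cross_val_accuracy(X, y, n_folds=10, seed=0):
    """Setting 2: k-fold cross-validation, averaging accuracy over the folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    scores = []
    for f, test_idx in enumerate(folds):
        train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
        scores.append(accuracy(X, y, train_idx, test_idx))
    return sum(scores) / n_folds


X, y = make_synthetic_data()
holdout_score = holdout_accuracy(X, y)
cv_score = cross_val_accuracy(X, y)
print(f"hold-out accuracy:        {holdout_score:.3f}")
print(f"10-fold CV mean accuracy: {cv_score:.3f}")
```

In cross-validation every instance is used for testing exactly once, so the averaged score depends less on one particular split than the hold-out score does. This is one plausible reason why the best-performing method can differ between the two settings.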
