Comparison of Performance of Classification Algorithms Using Standard Deviation-based Feature Selection in Cyber Attack Datasets

Supervised machine learning techniques are widely used in areas such as finance, education, healthcare, and engineering because they can learn from past data. However, such techniques can be very slow when the dataset is high-dimensional, and irrelevant features may also reduce classification success. Feature selection or feature reduction techniques are therefore commonly used to overcome these issues. Information security is crucial for both people and networks, and it must be ensured without delay. Hence, feature selection approaches that speed up the algorithms without reducing classification success are needed. In this study, we compare both the classification success and the run-time performance of state-of-the-art classification algorithms that use standard deviation-based feature selection on security datasets. For this purpose, we applied standard deviation-based feature selection to the KDD Cup 99 and Phishing Legitimate datasets to select the most relevant features, and then ran the selected classification algorithms on the datasets to compare the results. According to the results, the classification success of all algorithms was satisfactory, with Decision Tree (DT) performing best. In terms of speed, Decision Tree, k-Nearest Neighbors, and Naïve Bayes (BN) were sufficiently fast, whereas Support Vector Machine (SVM) and Artificial Neural Networks (ANN or NN) were too slow.

Keywords:

classification, cyber security, feature selection, information security, machine learning
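
The abstract does not spell out how the standard deviation-based selection is computed, so the following is only a minimal sketch under stated assumptions: each feature is min-max scaled to [0, 1], features are ranked by population standard deviation, and the k features with the largest spread are kept. The function names (`min_max_scale`, `select_by_std`, `project`), the value of `k`, and the toy data are hypothetical and are not taken from the study.

```python
from statistics import pstdev


def min_max_scale(column):
    """Rescale one feature column to [0, 1]; a constant column maps to all 0.0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def select_by_std(rows, k):
    """Return the indices of the k features with the largest standard deviation.

    Assumption (not confirmed by the source): a larger spread after
    min-max scaling is treated as a sign of a more informative feature.
    """
    columns = list(zip(*rows))  # transpose rows into feature columns
    spreads = [pstdev(min_max_scale(col)) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: spreads[j], reverse=True)
    return sorted(ranked[:k])  # keep the original column order


def project(rows, indices):
    """Keep only the selected feature columns in every row."""
    return [[row[j] for j in indices] for row in rows]


# Hypothetical toy data: 4 samples, 4 numeric features.
rows = [
    [0.0, 5.0, 1.0, 100.0],
    [1.0, 5.0, 1.1, 100.0],
    [0.0, 5.0, 0.9, 300.0],
    [1.0, 5.0, 1.0, 100.0],
]

selected = select_by_std(rows, k=2)
reduced = project(rows, selected)
print(selected)
print(reduced)
```

In this toy example the constant second feature has zero spread and is dropped first. The reduced rows would then be passed to each classifier, with training and prediction timed separately to compare run-time performance.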

