Oumout CHOUSEİNOGLOU, İlker ŞAHİN

Metin Madenciliği, Makine ve Derin Öğrenme Algoritmaları ile Web Sayfalarının Sınıflandırılması

Web Page Categorization with Text Mining, Machine and Deep Learning Algorithms

The number of Web sites is growing rapidly, and classifying Web pages by their content offers a way both to block malicious content that may be found on these sites and to reach useful information more easily. With such a classification, access to specific sites can be allowed, or sites can be filtered so that access to them is blocked. This study examines the Web site classification problem using several machine learning methods and artificial neural networks. Two approaches are applied to the problem, Binary Classification and Multi-class Classification; both are tested on Web sites collected for this study, and their performances are compared. Taken together, the experimental results indicate that the Binary Classification approach is the more effective one when the task is only to filter a single desired class of Web sites. Among the binary classifiers, Logistic Regression performs best. Among the algorithms applied in the Multi-class Classification approach, Support Vector Machines (SVM) achieve the highest performance. In addition, different word vectorization methods are tried for the Multi-class Classification problem and their performances are compared. Testing the algorithms of the Binary and Multi-class Classification approaches separately and with different vectorization methods allows Web page classification and content filtering to be addressed together, which distinguishes this study from similar work in the field.

Keywords:

Web Page Classification, Text Mining, Natural Language Processing, Machine Learning
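
To make the distinction between the two approaches in the abstract concrete, the following is a minimal sketch, not the study's implementation. It assumes a hypothetical toy corpus, plain bag-of-words counts as the vectorization, and a small logistic regression trained by batch gradient descent. The binary approach trains one "target class vs. rest" filter. The multi-class approach is approximated here by one-vs-rest logistic regression, a stand-in for the SVM reported in the abstract. All names (`PAGES`, `train_binary`, `is_shopping`, `classify`) are illustrative.

```python
import math
from collections import Counter

# Hypothetical toy corpus (page text, class label); not the study's data set.
PAGES = [
    ("breaking news election results government", "news"),
    ("government announces new election law", "news"),
    ("buy cheap shoes online shop discount", "shopping"),
    ("online shop discount sale buy now", "shopping"),
    ("football match score league goal", "sports"),
    ("league goal football team wins match", "sports"),
]

def tokenize(text):
    """Lowercase and split on whitespace (assumed preprocessing)."""
    return text.lower().split()

def build_vocab(texts):
    """Map every word seen in the corpus to a column index."""
    words = sorted({w for t in texts for w in tokenize(t)})
    return {w: i for i, w in enumerate(words)}

def vectorize(text, vocab):
    """Bag-of-words count vector; words outside the vocabulary are ignored."""
    counts = Counter(tokenize(text))
    vec = [0.0] * len(vocab)
    for word, n in counts.items():
        if word in vocab:
            vec[vocab[word]] = float(n)
    return vec

def score(model, x):
    """Linear score w.x + b; positive means 'target class'."""
    w, b = model
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def train_binary(X, y, lr=0.5, epochs=300):
    """Logistic regression fitted with batch gradient descent."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-score((w, b), xi)))
            err = p - yi
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

texts = [t for t, _ in PAGES]
labels = [c for _, c in PAGES]
vocab = build_vocab(texts)
X = [vectorize(t, vocab) for t in texts]

# Binary approach: a single filter for one desired class.
shopping_model = train_binary(X, [1.0 if c == "shopping" else 0.0 for c in labels])

def is_shopping(text):
    return score(shopping_model, vectorize(text, vocab)) > 0

# Multi-class approach, approximated by one binary model per class.
classes = sorted(set(labels))
models = {
    c: train_binary(X, [1.0 if lab == c else 0.0 for lab in labels])
    for c in classes
}

def classify(text):
    x = vectorize(text, vocab)
    return max(classes, key=lambda c: score(models[c], x))
```

In this sketch, `is_shopping("buy discount online shop")` plays the role of a content filter for a single class, while `classify(...)` assigns one of several classes to a page. Swapping the count vectors for another vectorization only requires replacing `vectorize`.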

