Performance of Using Tag-based Feature Sets in Web Page Classification

As the Web is a large collection of data that grows daily, an automatic Web page classification mechanism is needed to reach useful information effectively. The majority of Web pages are HTML documents; the aim of this study is therefore to explore the effect of HTML tags on the classification process and to determine the most valuable HTML tags for feature extraction in the classification task. To this end, we employ 13 different datasets and 5 popular classifiers: SVM, naïve Bayes (NB), kNN, C4.5, and OneR. Statistical analysis shows that features extracted solely from the anchor, <p>, or <title> tags can be used as an alternative to features extracted from the whole Web page. SVM is the best of the classifiers used in this study. Using HTML tags for feature extraction improves classification accuracy.
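
To make the idea of tag-based feature extraction concrete, the following minimal sketch builds a bag-of-words term count from the text inside a chosen set of HTML tags. It is an illustration only, not the pipeline used in the study: the names `TagTextCollector` and `tag_term_counts`, the sample page, and the simple lowercase tokenizer are all assumptions, and steps such as stemming or term weighting are omitted.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class TagTextCollector(HTMLParser):
    """Hypothetical helper: collects text found inside a chosen set of tags."""

    def __init__(self, wanted_tags):
        super().__init__()
        self.wanted_tags = set(wanted_tags)
        self.open_depth = 0  # number of wanted tags currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted_tags:
            self.open_depth += 1

    def handle_endtag(self, tag):
        if tag in self.wanted_tags and self.open_depth > 0:
            self.open_depth -= 1

    def handle_data(self, data):
        # Keep text only while at least one wanted tag is open.
        # Assumption: wanted tags are explicitly closed; an unclosed <p>
        # would cause all following text to be collected as well.
        if self.open_depth > 0:
            self.chunks.append(data)


def tag_term_counts(html, wanted_tags):
    """Return term counts for the text inside the given tags."""
    collector = TagTextCollector(wanted_tags)
    collector.feed(html)
    collector.close()
    # Assumed tokenizer: lowercase runs of ASCII letters.
    tokens = re.findall(r"[a-z]+", " ".join(collector.chunks).lower())
    return Counter(tokens)


# Made-up sample page for illustration.
page = """<html><head><title>Course Home Page</title></head>
<body><h1>Welcome</h1>
<p>Lecture notes and exams</p>
<a href="syllabus.html">Course syllabus</a></body></html>"""

title_features = tag_term_counts(page, ["title"])
anchor_features = tag_term_counts(page, ["a"])
combined_features = tag_term_counts(page, ["title", "p", "a"])
```

Each resulting count dictionary could then be turned into a feature vector and passed to any classifier, which allows per-tag feature sets to be compared against features taken from the whole page.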

Süleyman Demirel Üniversitesi Fen Bilimleri Enstitüsü Dergisi
  • ISSN: 1300-7688
  • Publication frequency: 3 issues per year
  • First published: 1995
  • Publisher: Süleyman Demirel Üniversitesi