Performance of Using Tag-based Feature Sets in Web Page Classification
The Web is a large collection of data that grows daily, so an automatic Web page classification mechanism is needed to reach useful information effectively. The majority of Web pages are HTML documents; the aim of this study is therefore to explore the effect of HTML tags on the classification process and to determine the most valuable HTML tags for the feature extraction stage of the classification task. To achieve this goal, we employ 13 different datasets and 5 popular classifiers: SVM, naïve Bayes (NB), kNN, C4.5, and OneR. The statistical analysis shows that the features extracted by using solely the anchor, […], or […] tags can be used as an alternative to the features extracted from the whole Web page. SVM is the best among the classifiers used in this study. Using HTML tags for feature extraction improves classification accuracy.
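To make the idea of tag-based feature extraction concrete, the following is a minimal sketch in Python, not the procedure used in the study. It assumes that a "feature" is a simple term count, that tokens are lowercase alphabetic words, and that no stemming or weighting is applied. The class `TagTextExtractor`, the function `tag_term_counts`, and the sample page are all hypothetical names and data introduced only for illustration.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collects the text that appears inside a chosen set of HTML tags."""

    def __init__(self, wanted_tags):
        super().__init__()
        self.wanted_tags = set(wanted_tags)
        self.depth = 0    # number of wanted tags currently open
        self.chunks = []  # text fragments found inside wanted tags

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted_tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.wanted_tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def tag_term_counts(html, wanted_tags):
    """Return term counts for the text enclosed by the wanted tags."""
    parser = TagTextExtractor(wanted_tags)
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    # Assumption: a token is a run of ASCII letters; real systems may differ.
    return Counter(re.findall(r"[a-z]+", text))


# Hypothetical sample page, used only to exercise the sketch.
page = (
    "<html><head><title>Course Page</title></head><body>"
    "<p>Welcome to the course.</p>"
    '<a href="/syllabus">Course syllabus</a> '
    '<a href="/staff">Teaching staff</a>'
    "</body></html>"
)

# Features from anchor text only versus features from the whole page.
anchor_features = tag_term_counts(page, ["a"])
whole_page_features = tag_term_counts(page, ["html"])
```

Here `anchor_features` contains only the terms that occur inside `<a>` elements, while `whole_page_features` counts every term under the root element. Either count vector could then be passed to a classifier; the comparison between such reduced and full feature sets is the question the study addresses.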