New use of the HITS algorithm for fast web page classification

The immense number of documents published on the web calls for automatic classifiers that can organize these large resources and extract information from them. Automatic web page classifiers typically handle millions of web pages, tens of thousands of features, and hundreds of categories. Most classifiers represent the dataset of web pages with the vector space model, computing the components of each vector with the term frequency-inverse document frequency (TFIDF) scheme. Unfortunately, TFIDF-based classifiers must cope with very large input data, which leads to long processing times and increased resource requirements. There is therefore a growing demand to alleviate these problems by reducing the size of the input data without affecting the classification results. In this paper, we propose a novel approach that improves web page classifiers by reducing the size of the input data (i.e. reducing both the number of web pages and the number of features) using the hyperlink-induced topic search (HITS) algorithm. We also use the HITS results to weight the remaining features. We evaluate the proposed approach by comparing it with a TFIDF-based classifier and demonstrate that it significantly reduces the time needed for classification.
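
The abstract does not describe how the HITS scores are computed or how they are used for reduction and weighting. As a rough illustration of the underlying algorithm only, the sketch below runs the standard hub/authority iteration on a toy link graph. The function name `hits`, the `links` dictionary format, the iteration count, and the toy graph are all assumptions made for this example, not details from the paper.

```python
import math


def hits(links, iterations=50):
    """Standard HITS iteration (illustrative sketch, not the paper's pipeline).

    links: dict mapping each page to the list of pages it links to
           (hypothetical input format chosen for this example).
    Returns two dicts: hub scores and authority scores, each L2-normalized.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links) | {dst for targets in links.values() for dst in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority of a page = sum of hub scores of pages linking to it.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]

        # Hub of a page = sum of authority scores of pages it links to.
        new_hub = {
            p: sum(new_auth[dst] for dst in links.get(p, [])) for p in pages
        }

        # Normalize so the scores do not grow without bound.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}

    return hub, auth


if __name__ == "__main__":
    # Toy graph (assumed): pages "a" and "b" both link to "c"; "c" links to "d".
    toy_links = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}
    hub, auth = hits(toy_links)
    for page in sorted(auth):
        print(page, round(hub[page], 3), round(auth[page], 3))
```

In a setting like the one the abstract outlines, scores of this kind could plausibly serve to rank pages or features and to discard low-scoring ones before classification, but the selection and weighting rules themselves are not given here.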

Turkish Journal of Electrical Engineering and Computer Sciences
  • ISSN: 1300-0632
  • Publication frequency: 6 issues per year
  • Publisher: TÜBİTAK