A new algorithm for detection of link spam contributed by zero-out link pages
Link spammers constantly seek new methods and strategies to deceive search engine ranking algorithms, and search engines must in turn develop new approaches to counter them and maintain the integrity of their rankings. In this paper, we propose a methodology to detect link spam contributed by zero-out link (dangling) pages. We randomly selected a target page from live web pages, induced link spam according to our proposed methodology, and applied our algorithm to detect the link spam. Detailed results from amazon.com pages showed a considerable improvement in their PageRank after the link spam was induced; our proposed method detected the link spam by using eigenvectors and eigenvalues.
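The effect described above can be illustrated with a minimal sketch. It is not the detection algorithm of this paper, whose details are not given in the abstract. It only shows, on a hypothetical six-page toy graph, how a target page's PageRank rises once dangling pages are made to link to it. The page names, the graph, the damping factor of 0.85, and the uniform redistribution of dangling-page rank are all assumptions made for illustration. PageRank is computed here by power iteration, which approximates the dominant eigenvector (eigenvalue 1) of the Google matrix.

```python
def pagerank(links, damping=0.85, tol=1e-12, max_iter=1000):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to. A page with an
    empty list is a dangling (zero-out-link) page; its rank is spread
    uniformly over all pages (a common convention, assumed here).
    """
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Rank currently held by pages that have no out-links.
        dangling_mass = sum(rank[p] for p in pages if not links[p])
        # Teleportation share plus the uniformly spread dangling share.
        base = (1.0 - damping) / n + damping * dangling_mass / n
        new_rank = {p: base for p in pages}
        for p in pages:
            if links[p]:
                share = damping * rank[p] / len(links[p])
                for q in links[p]:
                    new_rank[q] += share
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical toy graph: "target" is the page to be promoted,
# "d1".."d3" are dangling pages with no out-links.
clean = {
    "home": ["target", "a", "d1", "d2", "d3"],
    "a": ["home"],
    "target": ["home"],
    "d1": [],
    "d2": [],
    "d3": [],
}

# Induced spam: every dangling page now links to the target.
spammed = dict(clean, d1=["target"], d2=["target"], d3=["target"])

before = pagerank(clean)["target"]
after = pagerank(spammed)["target"]
print(f"target PageRank before: {before:.4f}, after: {after:.4f}")
```

Comparing `before` and `after` shows the kind of rank inflation that a detector would need to flag. How the eigenvalues and eigenvectors are used for the detection step itself is left to the full paper.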