Machine Learning Methods for Spamdexing Detection

In this paper, we present recent contributions to the battle against one of the main problems faced by search engines: spamdexing, or web spamming. Spamdexing comprises malicious techniques used in web pages to circumvent search engines and achieve good visibility in search results. To better understand the problem and to find the best setup and methods for avoiding this virtual plague, we present a comprehensive performance evaluation of several established machine learning techniques. In our experiments, we employed two large, real, public datasets: the WEBSPAM-UK2006 and WEBSPAM-UK2007 collections. The samples are represented by content-based, link-based, and transformed link-based features, as well as their combinations. The results indicate that bagging of decision trees, multilayer perceptron neural networks, random forest, and adaptive boosting of decision trees are promising for the task of web spam classification.
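
As a rough illustration of two ideas named in the abstract, combining feature sets and bagging of decision trees, the following minimal Python sketch may help. It is not the authors' implementation: it uses decision stumps (depth-one trees) in place of full decision trees, the feature values are made-up toy numbers rather than WEBSPAM-UK data, and all function names (`combine_features`, `fit_stump`, `fit_bagging`, `predict_bagging`) are hypothetical.

```python
import random
from collections import Counter


def combine_features(content, link):
    """Concatenate each host's content-based and link-based feature vectors."""
    return [c + l for c, l in zip(content, link)]


def fit_stump(X, y):
    """Return the (feature, threshold, label_above) rule with the fewest training errors."""
    best = None  # (errors, feature index, threshold, label predicted when value > threshold)
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for above in (0, 1):
                errors = sum(
                    (above if row[j] > t else 1 - above) != label
                    for row, label in zip(X, y)
                )
                if best is None or errors < best[0]:
                    best = (errors, j, t, above)
    return best[1:]


def predict_stump(stump, row):
    j, t, above = stump
    return above if row[j] > t else 1 - above


def fit_bagging(X, y, n_estimators=25, seed=0):
    """Train one stump per bootstrap sample (sampling with replacement)."""
    rng = random.Random(seed)
    n = len(X)
    ensemble = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        ensemble.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return ensemble


def predict_bagging(ensemble, row):
    """Majority vote over all stumps in the ensemble."""
    votes = Counter(predict_stump(s, row) for s in ensemble)
    return votes.most_common(1)[0][0]


# Toy data (assumption): two content-based and one link-based feature per host.
content = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.3], [0.05, 0.25],
           [0.8, 0.9], [0.9, 0.7], [0.85, 0.8], [0.7, 0.95]]
link = [[1.0], [2.0], [1.5], [0.5], [8.0], [9.0], [7.5], [8.5]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 1 = spam host, 0 = non-spam host

X = combine_features(content, link)
ensemble = fit_bagging(X, labels)
print(predict_bagging(ensemble, [0.95, 0.9, 9.5]))
print(predict_bagging(ensemble, [0.05, 0.1, 0.4]))
```

Each stump sees a different bootstrap sample, and the final label is decided by majority vote, which is the core of bagging; an odd number of estimators avoids ties in the two-class case.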

___

International Journal of Information Security Science
  • Publication frequency: 4 issues per year
  • Established: 2012
  • Publisher: Şeref SAĞIROĞLU