Towards SMS Spam Filtering: Results under a New Dataset

The growth in the number of mobile phone users has led to a dramatic increase in SMS spam messages. Recent reports clearly indicate that the volume of mobile phone spam is rising year by year. In practice, fighting this plague is made difficult by several factors, including the lower rate of SMS, which has allowed many users and service providers to ignore the issue, and the limited availability of mobile phone spam-filtering software. Probably one of the major concerns in academic settings is the scarcity of public SMS spam datasets, which are sorely needed to validate and compare different classifiers. Moreover, traditional content-based filters may have their performance seriously degraded, since SMS messages are fairly short and their text is generally rife with idioms and abbreviations. In this paper, we present details about a new real, public and non-encoded SMS spam collection that is, as far as we know, the largest one available. We also offer a comprehensive analysis of this dataset to ensure that it contains no duplicated messages coming from previously existing datasets, since such duplicates may ease the task of learning SMS spam classifiers and could compromise the evaluation of methods. Additionally, we compare the performance achieved by several established machine learning techniques. In summary, the results indicate that the procedure followed to build the collection does not lead to near-duplicates and that, among the classifiers, Support Vector Machines outperform the other evaluated techniques and can therefore be used as a good baseline for further comparison.
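To make the idea of a near-duplicate check concrete, the following is a minimal sketch, not the procedure used in the paper. It assumes that two messages count as near-duplicates when the Jaccard overlap of their word 3-gram sets reaches a threshold. The function names, the n-gram length and the 0.5 threshold are all illustrative assumptions.

```python
import re
from itertools import combinations


def word_ngrams(text, n=3):
    """Return the set of lowercase word n-grams in a message."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < n:
        # Very short messages are represented by a single short tuple.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets score 0.0."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def near_duplicates(messages, n=3, threshold=0.5):
    """Return index pairs (i, j), i < j, whose n-gram overlap meets the threshold.

    The values of n and threshold are assumptions chosen for illustration.
    """
    grams = [word_ngrams(m, n) for m in messages]
    return [
        (i, j)
        for i, j in combinations(range(len(messages)), 2)
        if jaccard(grams[i], grams[j]) >= threshold
    ]


messages = [
    "Free entry to win a prize, call now",
    "FREE entry to win a prize call now!",
    "Are we still meeting for lunch today?",
]
pairs = near_duplicates(messages)
```

Here the first two messages differ only in case and punctuation, so they are flagged as a pair, while the third is not. The pairwise comparison is quadratic in the number of messages, which is acceptable for a sketch but would need indexing or hashing for a large collection.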

___

International Journal of Information Security Science
  • Publication frequency: 4 issues per year
  • First published: 2012
  • Publisher: Şeref SAĞIROĞLU