Fast and de-noise support vector machine training method based on fuzzy clustering method for large real world datasets

Classifying large, real-world datasets is a challenging problem in machine learning. Among machine learning methods, the support vector machine (SVM) is a well-known approach with high generalization ability. Unfortunately, as the number of training samples increases and the data contain noise, the performance of the SVM decreases significantly. In this paper, a fast, de-noising two-stage method for training SVMs on large, real-world datasets is proposed. In the first stage, data that contain noise or are suspected to be noisy are identified and eliminated from the training dataset. Identification and elimination are based on the movement of the center of the convex hull points of the training dataset. The convex hull points are computed via the QHull algorithm. In the second stage, the well-known fuzzy clustering method (FCM) is applied to compress the training dataset and reduce its size. Finally, the reduced and purified cluster centers are used for training the SVM. A set of experiments is conducted on four benchmark datasets from the UCI database, and the training time and generalization of the proposed approach are compared with those of FCM-SVM and a standard SVM. The results indicate that the proposed method reduces training time and is considerably successful in removing noisy data from the training dataset. The proposed method can therefore achieve higher generalization performance than the other methods on large, real-world datasets.
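The compression stage can be illustrated with a minimal sketch of fuzzy c-means clustering, the standard alternating update of memberships and centers. This is not the paper's implementation: the function names `fuzzy_c_means` and `compress_by_class`, the deterministic initialization, the fuzzifier `m = 2`, and the choice to cluster each class separately so that every center inherits a label are all assumptions made for illustration. The convex-hull noise filter and the SVM training step are omitted.

```python
import math


def fuzzy_c_means(points, n_clusters, m=2.0, max_iter=200, tol=1e-9):
    """Fuzzy c-means on a list of equal-length numeric tuples.

    Returns (centers, memberships); memberships[i][k] is the degree to
    which points[i] belongs to cluster k, and each row sums to 1.
    """
    n, dim = len(points), len(points[0])
    # Assumption: deterministic initialisation from evenly spaced samples.
    centers = [tuple(points[(k * n) // n_clusters]) for k in range(n_clusters)]
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for _ in range(max_iter):
        # Membership update: u_ik = 1 / sum_j (d_ik / d_ij)^(2 / (m - 1)).
        memberships = []
        for p in points:
            dists = [max(math.dist(p, c), 1e-12) for c in centers]
            memberships.append([
                1.0 / sum((dk / dj) ** exponent for dj in dists)
                for dk in dists
            ])
        # Center update: mean of the points weighted by u_ik^m.
        new_centers = []
        for k in range(n_clusters):
            weights = [row[k] ** m for row in memberships]
            total = sum(weights)
            new_centers.append(tuple(
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dim)
            ))
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, memberships


def compress_by_class(samples_by_label, n_clusters):
    """Hypothetical helper: replace each class by its FCM cluster centers."""
    reduced = []
    for label, samples in samples_by_label.items():
        centers, _ = fuzzy_c_means(samples, n_clusters)
        reduced.extend((center, label) for center in centers)
    return reduced


# Toy data: two small, well-separated groups of 2-D points.
offsets = [(-0.5, -0.5), (-0.5, 0.5), (0.5, -0.5), (0.5, 0.5), (0.0, 0.0)]
blob_a = [(x, y) for x, y in offsets]
blob_b = [(x + 10.0, y + 10.0) for x, y in offsets]

centers, memberships = fuzzy_c_means(blob_a + blob_b, n_clusters=2)
reduced = compress_by_class({-1: blob_a, +1: blob_b}, n_clusters=1)
```

In this sketch, `reduced` holds one labeled center per class, and a list of this shape is what would be passed to an SVM trainer in place of the full dataset. The number of clusters controls the trade-off between training time and how faithfully the centers represent the original data.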

___

Turkish Journal of Electrical Engineering and Computer Sciences
  • ISSN: 1300-0632
  • Publication frequency: 6 issues per year
  • Publisher: TÜBİTAK