Design development and performance analysis of distributed least square twin support vector machine for binary classification

Machine learning (ML) on big data has outgrown the capacity of traditional machines and technologies, and ML for large-scale datasets is a current focus of research. Most ML algorithms suffer primarily from memory constraints, heavy computation, and scalability issues. The least squares twin support vector machine (LSTSVM) is an extension of the support vector machine (SVM); it is much faster than SVM and is widely used for classification tasks. However, when applied to large-scale datasets with millions or billions of samples and/or a large number of classes, it runs into computational and storage bottlenecks. This paper proposes a novel scalable design for LSTSVM, named distributed LSTSVM (DLSTSVM), which exploits distributed computation on a cluster of machines to provide a scalable solution. Very large datasets are partitioned and distributed as resilient distributed datasets on top of the Spark cluster-computing engine. LSTSVM is trained to generate two nonparallel hyperplanes, obtained by solving two systems of linear equations built from the data instances of the two classes. In designing DLSTSVM, we employ distributed matrix operations in the MapReduce paradigm of computing to spread the work over the machines in the cluster; memory constraints on extremely large datasets are thereby averted. Experimental results show a reduction in training time compared with existing scalable solutions for SVM and its variants, and detailed experiments demonstrate the scalability of the proposed design on large datasets.
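For context, the "two systems of linear equations" mentioned above have a well-known closed form in the standard LSTSVM formulation (this restates the published LSTSVM result, not anything specific to DLSTSVM). In our notation, $A$ and $B$ hold the samples of the two classes, $e_1$ and $e_2$ are all-ones vectors, and $c_1$, $c_2$ are penalty parameters:

$$
E = \begin{bmatrix} A & e_1 \end{bmatrix}, \qquad F = \begin{bmatrix} B & e_2 \end{bmatrix},
$$

$$
\begin{bmatrix} w_1 \\ b_1 \end{bmatrix}
= -\Bigl(F^{\top}F + \tfrac{1}{c_1}\,E^{\top}E\Bigr)^{-1} F^{\top}e_2,
\qquad
\begin{bmatrix} w_2 \\ b_2 \end{bmatrix}
= \Bigl(E^{\top}E + \tfrac{1}{c_2}\,F^{\top}F\Bigr)^{-1} E^{\top}e_1.
$$

The distributed step then amounts to computing $E^{\top}E$, $F^{\top}F$, $F^{\top}e_2$, and $E^{\top}e_1$ over partitioned data: each is a sum of per-partition contributions, so a map over partitions followed by a reduce suffices, and only the small $(d+1)\times(d+1)$ results ever reach the driver. Below is a minimal PySpark sketch of that idea, run in local mode; the names (`gram`, `c1`, `c2`, etc.) are our own illustration, not the authors' implementation.

```python
# Illustrative sketch only: distributed Gram-matrix step of an LSTSVM-style
# solver. E^T E (and F^T F) is a sum of per-partition outer products, so it
# can be computed with mapPartitions + reduce; the driver then solves the two
# small linear systems locally. All names here are hypothetical.
import numpy as np
from pyspark import SparkContext

def gram(rdd_rows):
    """Compute (Z^T Z, Z^T 1) for Z = [X 1], with the rows of X in an RDD."""
    def partial(rows):
        X = np.array(list(rows))
        if X.size == 0:          # empty partition: contribute nothing
            return
        Z = np.hstack([X, np.ones((X.shape[0], 1))])  # append bias column
        yield (Z.T @ Z, Z.T @ np.ones(Z.shape[0]))
    return rdd_rows.mapPartitions(partial).reduce(
        lambda a, b: (a[0] + b[0], a[1] + b[1]))

if __name__ == "__main__":
    sc = SparkContext("local[*]", "dlstsvm-sketch")
    rng = np.random.default_rng(0)
    A = rng.normal(+1.0, 1.0, (1000, 5))   # class +1 samples (synthetic)
    B = rng.normal(-1.0, 1.0, (1000, 5))   # class -1 samples (synthetic)
    EtE, Ete = gram(sc.parallelize(A.tolist(), 8))
    FtF, Fte = gram(sc.parallelize(B.tolist(), 8))
    c1 = c2 = 1.0
    # Closed-form LSTSVM planes (see the equations above).
    v1 = -np.linalg.solve(FtF + EtE / c1, Fte)   # [w1; b1]
    v2 = +np.linalg.solve(EtE + FtF / c2, Ete)   # [w2; b2]
    print("plane 1:", v1)
    print("plane 2:", v2)
    sc.stop()
```

In this sketch only the aggregated $(d+1)\times(d+1)$ matrices are collected, so driver memory is independent of the number of samples, which mirrors the scalability argument made in the abstract.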
