A depth-based nearest neighbor algorithm for high-dimensional data classification
Nearest neighbor algorithms such as k-nearest neighbors (kNN) are fundamental supervised learning techniques that classify a query instance based on the class labels of its neighbors. However, large datasets are often not fully labeled, and the unknown probability distribution of the instances may be uneven. Moreover, kNN suffers from challenges such as the curse of dimensionality, selecting the optimal number of neighbors, and scalability to high-dimensional data. To overcome these challenges, we propose an improved approach to classification via a depth representation of subspace clusters formed from high-dimensional data. We offer a consistent and principled approach to dynamically choosing the nearest neighbors for classifying a query point by i) identifying the structures and distributions of the data, ii) extracting relevant features, and iii) deriving an optimal value of k that depends on the structure of the data, as captured by a data depth function. The resulting depth-based classification algorithm improves performance in terms of both execution time and accuracy. Experiments on real-world datasets show that the proposed approach is at least two orders of magnitude faster on high-dimensional datasets and at least as accurate as traditional kNN.
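To make the idea of classification via a data depth function concrete, the sketch below implements the classical maximum-depth rule with Mahalanobis depth: each class's training sample induces a depth value for the query point, and the query is assigned to the class in which it lies "deepest." This is a minimal illustration of depth-based classification in general, not the paper's full algorithm (which additionally uses subspace clusters and a data-dependent k); the function names are hypothetical.

```python
import numpy as np

def mahalanobis_depth(x, X):
    """Mahalanobis depth of point x w.r.t. sample X: 1 / (1 + d_M^2),
    where d_M is the Mahalanobis distance to the sample mean."""
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    inv = np.linalg.pinv(cov)  # pseudo-inverse: robust when cov is singular in high dimensions
    diff = x - mu
    return 1.0 / (1.0 + diff @ inv @ diff)

def max_depth_classify(x, classes):
    """Assign x to the class whose training sample gives it the largest depth.

    classes: dict mapping class label -> (n_samples, n_features) array.
    """
    return max(classes, key=lambda label: mahalanobis_depth(x, classes[label]))
```

A query near the center of a class's point cloud receives depth close to 1 for that class and depth near 0 for distant classes, so the rule reduces to comparing how central the query is within each class rather than counting a fixed number of neighbors.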