A Distributed K Nearest Neighbor Classifier for Big Data

The K-Nearest Neighbor (K-NN) classifier is a well-known and widely applied method in data mining. However, its high computational and memory cost makes the classical K-NN infeasible for today’s Big Data analysis applications. To overcome the cost drawbacks of established data mining methods, several distributed alternatives have emerged, among which the Hadoop MapReduce ecosystem has attracted significant attention. Recently, several distributed K-NN based classification algorithms have been proposed that are tested in Hadoop environments and suited to emerging data analysis needs. In this work, a new distributed algorithm, Z-KNN, is proposed. It improves the classification accuracy of the K-NN algorithm by exploiting the representativeness relationship among instances belonging to different data classes. The proposed algorithm relies on class representations derived from the Z data instances of each class that are closest to the test instance. Z-KNN was tested on a physical Hadoop cluster using several real datasets from different application areas. The results of extensive experiments are presented in this paper and show that the proposed Z-KNN algorithm is a competitive alternative to other methods recently proposed in the literature.
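
The abstract does not specify how the class representations are built or compared, so the following is only a minimal single-machine sketch of the general idea, not the authors' Z-KNN or its MapReduce implementation. It assumes Euclidean distance, assumes that a class representation is the mean of the Z instances of that class nearest to the test instance, and assumes that the predicted label is the class whose representation lies closest to the test instance. All function names are hypothetical.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def class_representations(train, test_point, z):
    """Build one representation per class from its z instances nearest to test_point.

    ASSUMPTION: the representation is the component-wise mean of those
    z instances; the source abstract does not define it.
    """
    by_class = defaultdict(list)
    for features, label in train:
        by_class[label].append(features)

    reps = {}
    for label, rows in by_class.items():
        # Keep the z instances of this class closest to the test instance.
        nearest = sorted(rows, key=lambda r: euclidean(r, test_point))[:z]
        dim = len(test_point)
        reps[label] = [sum(r[i] for r in nearest) / len(nearest) for i in range(dim)]
    return reps


def classify(train, test_point, z):
    """Predict the class whose representation is closest to test_point.

    ASSUMPTION: nearest-representation decision rule, chosen for illustration.
    """
    reps = class_representations(train, test_point, z)
    return min(reps, key=lambda label: euclidean(reps[label], test_point))


# Toy data: (features, label) pairs for two well-separated classes.
train = [
    ((0.0, 0.0), "a"),
    ((0.0, 1.0), "a"),
    ((1.0, 0.0), "a"),
    ((5.0, 5.0), "b"),
    ((6.0, 5.0), "b"),
    ((5.0, 6.0), "b"),
]
prediction_near_a = classify(train, (0.4, 0.3), z=2)
prediction_near_b = classify(train, (5.6, 5.1), z=2)
```

In a distributed setting, the per-class nearest-instance search would presumably be split across data partitions and merged afterwards; that step is omitted here because the abstract gives no details about it.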
