Mehmet TEKEREK, Göksu Zekiye OZEN, Rayimbek SULTANOV

Hadoop framework implementation and performance analysis on a cloud

The Hadoop framework uses the MapReduce programming paradigm to process big data by distributing data across a cluster and aggregating the results. MapReduce is one of the methods used to process big data hosted on large clusters. In this method, jobs are divided into small pieces and distributed over nodes. Parameters such as the method of distribution over nodes, the number of jobs run in parallel, and the number of nodes in the cluster affect the execution time of jobs. The aim of this paper is to determine how the numbers of nodes, maps, and reduces affect the performance of the Hadoop framework in a cloud environment. For this purpose, tests were carried out on a Hadoop cluster with 10 nodes hosted in a cloud environment by running the PiEstimator, Grep, Teragen, and Terasort benchmarking tools on it. Based on the test results, these benchmarking tools, which are available under the Hadoop framework, are classified as CPU-intensive or CPU-light applications. In CPU-light applications, increasing the numbers of nodes, maps, and reduces does not improve efficiency; it even increases the time spent on jobs by using system resources unnecessarily. Therefore, in CPU-light applications, keeping the numbers of nodes, maps, and reduces to a minimum is found to minimize the time spent on a process. In CPU-intensive applications, depending on the phase in which the small job pieces are processed, setting the number of maps or reduces equal to the total number of CPUs in the cluster is found to minimize the time spent on a process.
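
The tuning rule stated in the abstract can be sketched as a small helper. This is an illustration only, not the authors' code: the function name `suggest_task_counts`, the `heavy_phase` argument, and the choice of 1 as the "minimum" task count are assumptions made for the example, and the CPU total used below is a hypothetical figure (the abstract gives only the node count).

```python
def suggest_task_counts(cpu_intensive, total_cpus, heavy_phase="map", minimum=1):
    """Return (maps, reduces) following the rule of thumb in the abstract.

    cpu_intensive: True for CPU-intensive jobs, False for CPU-light jobs.
    total_cpus:    total number of CPUs in the cluster.
    heavy_phase:   "map" or "reduce", the phase in which the small job
                   pieces are processed (assumed to be known by the caller).
    minimum:       smallest task count to use; 1 is an assumption here.
    """
    if total_cpus < 1 or minimum < 1:
        raise ValueError("total_cpus and minimum must be positive")
    if heavy_phase not in ("map", "reduce"):
        raise ValueError("heavy_phase must be 'map' or 'reduce'")

    # CPU-light jobs: extra tasks only consume resources, so stay at the minimum.
    maps = reduces = minimum

    # CPU-intensive jobs: match the task count of the heavy phase to the CPU total.
    if cpu_intensive:
        if heavy_phase == "map":
            maps = total_cpus
        else:
            reduces = total_cpus
    return maps, reduces
```

For instance, on a hypothetical cluster with 20 CPUs in total, `suggest_task_counts(True, 20, "map")` yields 20 maps and the minimum number of reduces, while a CPU-light job keeps both counts at the minimum.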

