Hadoop framework implementation and performance analysis on a cloud

The Hadoop framework uses the MapReduce programming paradigm to process big data by distributing work across a cluster and aggregating the results. MapReduce is one of the methods used to process big data hosted on large clusters; in this method, jobs are divided into small pieces that are distributed over the nodes for processing. Parameters such as the method of distribution over the nodes, the number of tasks executed in parallel, and the number of nodes in the cluster affect the execution time of jobs. The aim of this paper is to determine how the numbers of nodes, maps, and reduces affect the performance of the Hadoop framework in a cloud environment. For this purpose, tests were carried out on a 10-node Hadoop cluster hosted in a cloud environment by running the PiEstimator, Grep, TeraGen, and TeraSort benchmarking tools on it. Based on the test results, these benchmarking tools, which ship with the Hadoop framework, are classified as either CPU-intensive or CPU-light applications. For CPU-light applications, increasing the numbers of nodes, maps, and reduces does not improve efficiency; it can even lengthen job execution times by consuming system resources unnecessarily. Therefore, for CPU-light applications, keeping the numbers of nodes, maps, and reduces at a minimum is found to minimize the time spent on a job. For CPU-intensive applications, depending on the phase in which the job pieces are processed, setting the number of maps or reduces equal to the total number of CPUs in the cluster is found to minimize the time spent on a job.
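
To make the tuning recommendation concrete, the sketch below shows how the numbers of maps and reduces might be set on a Hadoop job through the standard MapReduce Java API. The 10-node cluster size comes from the paper, but the per-node core count, the class name, and the job name are illustrative assumptions rather than details from the study.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Minimal sketch of the tuning described above; TunedJobSetup and the
// core counts are illustrative assumptions, not values from the paper.
public class TunedJobSetup {
    public static void main(String[] args) throws Exception {
        int nodes = 10;          // cluster size used in the paper's tests
        int coresPerNode = 4;    // assumed; the paper does not state this
        int totalCpus = nodes * coresPerNode;

        Configuration conf = new Configuration();
        // The map count is only a hint: it is honored by input formats that
        // consult it (e.g. the synthetic inputs of TeraGen and PiEstimator),
        // while file-based inputs derive their map counts from input splits.
        conf.setInt("mapreduce.job.maps", totalCpus);

        Job job = Job.getInstance(conf, "cpu-intensive-tuned-job");
        // The reduce count is set directly; for CPU-intensive workloads the
        // paper's finding is to match it to the total number of CPUs.
        job.setNumReduceTasks(totalCpus);

        System.out.println("maps hint = " + conf.getInt("mapreduce.job.maps", -1)
                + ", reduces = " + job.getNumReduceTasks());
        // ... remaining setup (mapper, reducer, input/output paths) would
        //     follow before job.waitForCompletion(true) ...
    }
}
```

The example tools expose the same knobs on the command line; PiEstimator, for instance, takes the number of maps as its first argument (`hadoop jar hadoop-mapreduce-examples.jar pi <nMaps> <nSamples>`), which is presumably how the map counts were varied across test runs.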