Wajahat Hussain MIR, K Hemant Kumar REDDY, Diptendu SINHA ROY

A counter based approach for reducer placement with augmented Hadoop rack awareness

As the data-driven paradigm for intelligent systems design gains prominence, performance requirements have become very stringent, leading to numerous fine-tuned versions of Hadoop and its MapReduce programming model. However, very few researchers have investigated the effect of intelligent reducer placement on Hadoop’s performance. This paper examines this largely ignored reducer placement phase as a means of improving Hadoop’s performance and proposes spawning the reduce phase of Hadoop tasks asynchronously across the nodes of a Hadoop cluster. The main contributions of this paper are: (i) tracking when the map phase of each task completes, (ii) counting the number of completed maps, and (iii) assigning reducers to Hadoop nodes based on map counts such that run-time data copying is minimized. To this end, the paper presents a novel counter-based reducer placement (CBRP) algorithm that uses counter values maintained by the JobTracker at the rack and node levels. Experiments demonstrate the merit of the proposed reducer placement, with average improvements of between 5% and 17% across different benchmarks under both late shuffle and early shuffle.
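The counting idea in contributions (i) to (iii) can be illustrated with a small sketch. This is not the paper's CBRP algorithm, whose details the abstract does not give; it only shows one plausible way to rank nodes by rack-level and node-level map counts. The function name `place_reducers`, the input format (one `(rack, node)` pair per completed map), the tie-breaking order, and the round-robin wrap when reducers outnumber nodes are all assumptions made for illustration.

```python
from collections import Counter


def place_reducers(completed_maps, num_reducers):
    """Pick a (rack, node) host for each reducer from map-completion counts.

    completed_maps: list of (rack, node) pairs, one per finished map task
    (a hypothetical input format chosen for this sketch).
    """
    completed_maps = list(completed_maps)
    if not completed_maps or num_reducers <= 0:
        return []

    # Node-level counter: maps completed on each (rack, node).
    node_counts = Counter(completed_maps)
    # Rack-level counter: maps completed anywhere in each rack.
    rack_counts = Counter(rack for rack, _ in completed_maps)

    # Assumption: prefer the rack holding the most map output, then the
    # node holding the most within it; names break remaining ties.
    ranked = sorted(
        node_counts,
        key=lambda rn: (-rack_counts[rn[0]], -node_counts[rn], rn),
    )

    # Assumption: wrap around if there are more reducers than nodes.
    return [ranked[i % len(ranked)] for i in range(num_reducers)]


maps_done = [
    ("r1", "n1"), ("r1", "n1"), ("r1", "n2"), ("r2", "n3"),
    ("r2", "n3"), ("r2", "n3"), ("r1", "n2"), ("r1", "n1"),
]
print(place_reducers(maps_done, 2))
```

In this sketch a reducer lands where the most map output already resides, which is the intuition behind minimizing run-time data copying; a real scheduler would also have to account for free reduce slots and partition sizes.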

