A novel data placement strategy to reduce data traffic during run-time
High-impact scientific applications processed in distributed data centers often involve big data. To avoid the intolerable delays caused by moving huge volumes of data across data centers during processing, the concept of moving tasks to data was introduced in the last decade. Even after this concept, termed data locality, was realized, the expected quality of service was not achieved. Later, data colocality was introduced, in which data groupings are identified first and data chunks are then placed accordingly. However, the data traffic expected during run time is generally not considered when placing data. The history of data movements is useful for estimating this traffic. This work uses that history and proposes an approach that selects the nodes on which data groups are placed so that the least possible data movement results. Log files are scrutinized systematically, and a gain matrix is generated based on maximum likelihood estimation of data movements. Formally, the entries of the gain matrix are inversely proportional to the expected data traffic inside the data center. The matrix reflects the performance gain obtained by assigning a block to the node with the lowest possible future data movements. To identify the optimal placement, an algorithm based on the many-to-one assignment problem is presented. Experimental analysis shows that the proposed approach significantly reduces data movement and considerably improves performance.
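The pipeline described in the abstract (log analysis, gain matrix, many-to-one assignment) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the log format (pairs of data group and accessing node), the function names, the use of relative access frequencies as the maximum likelihood estimate, the definition of gain as the inverse of the estimated remote-access fraction, the per-node capacities, and the exhaustive search that stands in for the paper's assignment algorithm.

```python
from itertools import product


def estimate_access_probs(log, groups, nodes):
    """MLE of a categorical distribution: the relative frequency with which
    each node appears among the logged accesses to a data group."""
    counts = {g: {n: 0 for n in nodes} for g in groups}
    for group, node in log:
        counts[group][node] += 1
    probs = {}
    for g in groups:
        total = sum(counts[g].values())
        # With no history for a group, fall back to a uniform guess (assumption).
        probs[g] = {n: (counts[g][n] / total if total else 1 / len(nodes))
                    for n in nodes}
    return probs


def gain_matrix(probs, groups, nodes, eps=1e-6):
    """gain[g][n] is the inverse of the estimated fraction of accesses to g
    that would be remote if g were stored on n (eps avoids division by zero)."""
    return {g: {n: 1.0 / ((1.0 - probs[g][n]) + eps) for n in nodes}
            for g in groups}


def best_placement(gain, groups, nodes, capacity):
    """Exhaustive many-to-one assignment: every group goes to exactly one node,
    and node n holds at most capacity[n] groups. Only viable for tiny inputs."""
    best, best_score = None, float("-inf")
    for choice in product(nodes, repeat=len(groups)):
        if any(choice.count(n) > capacity[n] for n in nodes):
            continue  # violates a node's capacity
        score = sum(gain[g][n] for g, n in zip(groups, choice))
        if score > best_score:
            best, best_score = dict(zip(groups, choice)), score
    return best


# Hypothetical access log: (data group, node that requested it).
log = [("g1", "n1"), ("g1", "n1"), ("g1", "n2"),
       ("g2", "n1"), ("g2", "n1"), ("g2", "n1"),
       ("g3", "n2"), ("g3", "n2"), ("g3", "n1")]
groups = ["g1", "g2", "g3"]
nodes = ["n1", "n2"]

probs = estimate_access_probs(log, groups, nodes)
gain = gain_matrix(probs, groups, nodes)
placement = best_placement(gain, groups, nodes, {"n1": 2, "n2": 2})
print(placement)
```

The exhaustive search is exponential in the number of groups and is used here only because it is easy to verify. For realistic sizes, a polynomial-time assignment solver would presumably be needed, for example by replicating each node once per unit of capacity and solving the resulting one-to-one assignment problem.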