Equi-Depth Histogram Construction Methodology for Big Data Tools

In recent decades, countless data sources such as social media, machines, and networks have been constantly pushing data into the digital world, and the volume of data has grown exponentially. Equi-depth histograms are essential for summarizing the statistical properties of data used in query optimization. In this paper, we present approximate equi-depth histogram construction for big data using both Apache Pig scripts and a Java web interface that interacts with Apache Hadoop. We adopt approaches for equi-depth histogram construction with quality guarantees for big data and implement them with Apache Hadoop MapReduce and Apache Pig user-defined functions. We introduce a prototype that constructs approximate equi-depth histograms from a JavaServer Faces page using Apache Hadoop jobs and the Hadoop Distributed File System (HDFS), and we evaluate these methods through a demonstration. We also explain Apache Pig scripting techniques for creating equi-depth histograms over big data. The results indicate that our system supports writing multiple jobs with Apache Pig, allowing programmers to exploit the advantages of Apache Pig to create histograms while avoiding the complexity of implementing MapReduce jobs directly.
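
As an illustration of the Pig-based approach, the sketch below computes approximate equi-depth bucket boundaries for a single numeric column by globally ranking the values and keeping every (N/k)-th one. This is a minimal sketch under our own assumptions, not the exact script from the paper: the input path, output path, script name, and the TOTAL (row count) and NUM_BUCKETS parameters are illustrative.

    -- equidepth.pig (hypothetical name): approximate equi-depth boundaries.
    -- Assumes one numeric value per input line; $TOTAL and $NUM_BUCKETS
    -- are supplied via -param at launch. Requires Pig 0.11+ for RANK.
    vals   = LOAD 'input/values.txt' AS (v:double);
    ranked = RANK vals BY v ASC;  -- global sort; prepends a 1-based rank column
    -- keep every (TOTAL / NUM_BUCKETS)-th value as a bucket boundary
    bounds = FILTER ranked BY $0 % (long)($TOTAL / $NUM_BUCKETS) == 0;
    STORE bounds INTO 'output/equidepth_boundaries';

A run such as pig -param TOTAL=1000000 -param NUM_BUCKETS=100 equidepth.pig would emit roughly 100 boundary values. Because RANK assigns tied rows the same rank, duplicate values can shift boundaries slightly, which is consistent with the approximate nature of the construction.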
