Equi-depth histogram construction methodology for big data tools

In recent years, countless data sources such as social media, machines, and networks have been constantly pushing data into the digital world, and the size of this data has been growing exponentially. Equi-depth histograms are an essential tool for understanding the statistical properties of the data at hand and for query optimization. In this paper, we present approximate equi-depth histogram construction for big data using both Apache Pig scripts and a Java web interface that interacts with Apache Hadoop. We use equi-depth histogram construction methods with quality guarantees for big data, present them experimentally from a technical perspective, and implement them with Apache Hadoop MapReduce and Apache Pig user-defined functions. We introduce a prototype that constructs quality-guaranteed approximate equi-depth histograms from a JavaServer Faces page, using Apache Hadoop MapReduce jobs and the Hadoop Distributed File System in the background, and we evaluate the methods using this prototype. We also explain Apache Pig script techniques for creating equi-depth histograms over big data. The results indicate that, with Apache Pig, our system makes it simple to write multi-job workflows that also require histograms; programmers can use Apache Pig to create histograms and avoid the complex implementation of MapReduce jobs.
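For readers unfamiliar with the term, an equi-depth histogram chooses bucket boundaries so that every bucket holds roughly the same number of values. The following is a minimal in-memory sketch of the exact version of that idea; it is not the approximate, quality-guaranteed MapReduce method the paper describes, and the class name `EquiDepthSketch`, the method `boundaries`, and the sample data are hypothetical names chosen for illustration.

```java
import java.util.Arrays;

// Illustrative only: exact equi-depth boundaries computed by sorting in memory.
// The paper's approach is approximate and distributed; this sketch is neither.
class EquiDepthSketch {

    /**
     * Returns the (buckets - 1) separator values that split the sorted input
     * into buckets of (nearly) equal size. Separator i is the largest value
     * that falls in bucket i.
     */
    public static long[] boundaries(long[] values, int buckets) {
        if (buckets < 1 || values.length < buckets) {
            throw new IllegalArgumentException("need at least one value per bucket");
        }
        long[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        long[] separators = new long[buckets - 1];
        for (int i = 1; i < buckets; i++) {
            // 1-based rank of the last element in bucket i: ceil(i * n / buckets)
            int rank = (int) (((long) i * n + buckets - 1) / buckets);
            separators[i - 1] = sorted[rank - 1];
        }
        return separators;
    }

    public static void main(String[] args) {
        long[] data = {7, 1, 3, 9, 4, 12, 5, 8};
        // Four buckets of two values each: {1,3} {4,5} {7,8} {9,12}
        System.out.println(Arrays.toString(boundaries(data, 4))); // prints [3, 5, 8]
    }
}
```

Sorting the full data set is what becomes impractical at big-data scale, which is why approximate constructions over distributed partitions are of interest.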

