Vinita BANIWAL, Meenu CHAWLA

Optimization in the catalyst optimizer of Spark SQL

Apache Spark is one of the most technically challenging frameworks for cluster computing, in which data are processed in parallel. The cluster consists of unreliable machines. Spark processes large amounts of data faster than the MapReduce framework. To provide optimized and fast SQL query processing, a new module named Spark SQL was developed in Apache Spark. It allows users to combine relational processing and functional programming in one place. Its core, the Catalyst optimizer, provides many optimizations and has many rules for optimizing queries for efficient execution. In this paper, we discuss a scenario in which the Catalyst optimizer is not able to optimize the query competently for a specific case. This results in inefficient memory usage and increases the time Spark SQL requires to execute the query. To deal with this issue, we propose a solution by which the query is optimized to the fullest extent. This significantly reduces the time and memory consumed by the shuffling process.
