Parameter tuning and performance evaluation with HiBench on Hive and Spark SQL

Permalink

http://urn.fi/URN:NBN:fi:hulib-201804131685
Title: Parameter tuning and performance evaluation with HiBench on Hive and Spark SQL
Author: Shrestha, Shiva Ram
Contributor: University of Helsinki, Faculty of Science
Thesis level: master's thesis
Abstract: Apache Hadoop has provided solutions to the obstacles of Big Data processing. Hadoop stores large datasets in HDFS across a distributed network of commodity hardware and processes them in parallel. Hadoop's parallel computing power comes from the MapReduce framework, in which map/reduce programs are written as Java code, whereas in data analytics the Structured Query Language (SQL) has long been the dominant tool. To make data analytics more efficient and effective, the Hadoop ecosystem therefore gained SQL engines such as Hive, Spark SQL, and Impala. With these engines sitting on top of the Hadoop ecosystem, end users can write data analytics applications in a well-understood SQL-like language and focus solely on the analytics. The thesis provides a comparative performance analysis of two SQL engines in particular, Hive and Spark SQL. Apache Hadoop and its components, such as HDFS, MapReduce, YARN, and Spark, together with their design and workflow, are presented as requisite background knowledge. Both SQL engines, Hive and Spark SQL, process data using HiveQL statements executed on several Hadoop components. The experiments were accompanied by tuning of the configuration parameters of the Hadoop components to provide a more in-depth understanding of both SQL engines. The experimental Hadoop cluster, configured with limited resources, together with the data size and the evaluation tool (HiBench), was used to provide a fair comparison between the engines. The configuration parameters that resulted in optimal performance were then selected to evaluate and compare the performance of Hive and Spark SQL. The experimental results shed light on how cluster performance changes with the configuration parameters of Hadoop. In addition, the comparison between Hive and Spark SQL showed that Spark SQL performed better than Hive even when configured with minimal cluster resources.
URI: URN:NBN:fi:hulib-201804131685
http://hdl.handle.net/10138/234250
Date: 2018-04-16
Discipline: Networking and Service

