Parameter tuning and performance evaluation with HiBench on Hive and Spark SQL


Permalink

http://urn.fi/URN:NBN:fi:hulib-201804131685
Title: Parameter tuning and performance evaluation with HiBench on Hive and Spark SQL
Author: Shrestha, Shiva Ram
Contributor: Helsingin yliopisto (University of Helsinki), Matemaattis-luonnontieteellinen tiedekunta (Faculty of Science)
Thesis level: master's thesis (pro gradu)
Abstract: Apache Hadoop has provided solutions to the obstacles of Big Data processing. Hadoop stores large datasets in HDFS across a distributed network of commodity hardware and processes them in parallel. Hadoop's parallel computing power comes from the MapReduce framework, in which map/reduce programs are written in Java, whereas Structured Query Language (SQL) has long been the dominant tool for data analytics. To make data analytics more efficient and effective, the Hadoop ecosystem therefore gained SQL engines such as Hive, Spark SQL, and Impala. With these engines sitting on top of the Hadoop ecosystem, end users can write data analytics applications in a well-understood SQL-like language and focus only on the analytics. This thesis provides a comparative performance analysis focusing mainly on two SQL engines, Hive and Spark SQL. Apache Hadoop and its components, such as HDFS, MapReduce, YARN, and Spark, together with their design and workflow, are presented as requisite background knowledge. Both SQL engines, Hive and Spark SQL, process data with HiveQL statements executed on several Hadoop components. The experiments were accompanied by tuning of the configuration parameters of the Hadoop components to provide a more in-depth understanding of both SQL engines. The experimental Hadoop cluster was configured with limited resources and data size, and an evaluation tool (HiBench) was used to provide a fair comparison between the engines. The configuration parameters that yielded the best performance were then selected to evaluate and compare Hive and Spark SQL. The experimental results shed light on how cluster performance changes with Hadoop's configuration parameters. In addition, the comparison showed that Spark SQL performs better than Hive, even when configured with minimal cluster resources.
URI: URN:NBN:fi:hulib-201804131685
http://hdl.handle.net/10138/234250
Date: 2018-04-16
Subject: Networking and Service

