Data Analysis Using Hadoop Technological Stack

Seminar 12. Mock Test

Task Overview

The aim of this mock test is to perform data analysis using the Hadoop technology stack. You will apply big data tools to a sample dataset in practice. You need to work on Cloudera Quickstart VM version 5.12.0 and use different technologies (Hive, Spark, MapReduce, Pig) to perform the analysis. Instructions on how to install the Cloudera Quickstart VM are provided on the module page on WIUT’s Intranet.

In the course of this assignment, you should analyze the dataset and implement solutions to answer the questions asked below. For the real test you will also produce a small report describing your findings.

Practical Task 

You need to create a folder in HDFS named after your ID number to store all your data. Copy your dataset file(s) from the local file system to this HDFS folder.
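As a minimal sketch of this step, the snippet below only builds the HDFS shell commands and prints them, so they can be reviewed before being run in a terminal on the VM. The student ID, the file names and the /user/cloudera root are illustrative assumptions, not values given in this brief.

```python
import shlex


def hdfs_setup_commands(student_id, local_files, hdfs_root="/user/cloudera"):
    """Build (but do not run) the commands that create the folder and copy files.

    hdfs_root is an assumption: adjust it to your own HDFS home directory.
    """
    target = f"{hdfs_root}/{student_id}"
    # -p creates parent folders as needed and does not fail if the folder exists
    commands = [["hdfs", "dfs", "-mkdir", "-p", target]]
    for path in local_files:
        # -f overwrites a previous copy of the same file in HDFS
        commands.append(["hdfs", "dfs", "-put", "-f", path, target + "/"])
    return commands


# Hypothetical ID and file names, used only for illustration
for cmd in hdfs_setup_commands("00001234", ["results.csv", "races.csv"]):
    print(" ".join(shlex.quote(part) for part in cmd))
```

Each printed line can be pasted into the shell, or passed to subprocess.run if you prefer to execute the commands from Python.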

You should answer the questions listed below using Hive and Spark. Both frameworks should be used to answer each question. Then compare the results to make sure you answered the questions correctly. You need to work on the data you placed into the HDFS folder.

You need to provide all the scripts to:

  • create folders
  • copy the dataset files to HDFS
  • load the data from HDFS to analytical applications
  • query the stored data
  • show descriptive statistics of the dataset (min, max, average, etc.).
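For the descriptive statistics, a small local cross-check can help when comparing the Hive and Spark output. The sketch below uses plain Python on a tiny in-memory sample; the column name milliseconds and the sample values are assumptions made for illustration, not data taken from the real files.

```python
import csv
import io
from statistics import mean


def describe(rows, column):
    """Return count, min, max and mean of a numeric column, skipping empty cells."""
    values = [float(row[column]) for row in rows if row[column] != ""]
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
    }


# Made-up sample with an assumed header; replace with a few rows from your file
SAMPLE = """raceId,driverId,lap,milliseconds
1,1,1,98109
1,1,2,93006
1,2,1,95500
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
stats = describe(rows, "milliseconds")
print(stats)
```

The same four figures should come out of the equivalent Hive query and Spark job when they are run on the same rows.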

You should submit a Jupyter notebook if you are working with PySpark. You also need to submit your map and reduce scripts for the MapReduce task(s).
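The following is a hedged sketch of what a map and reduce pair could look like, written as two functions so the logic can be tried locally. It assumes a lap-times file with the column order raceId, driverId, lap, position, time, milliseconds, which you should verify against the actual header. For a real Hadoop Streaming job, each function would go into its own script that reads sys.stdin and prints its output.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'raceId<TAB>milliseconds' for every valid data row."""
    for line in lines:
        fields = line.strip().split(",")
        # Assumed layout: raceId,driverId,lap,position,time,milliseconds
        if len(fields) < 6 or not fields[5].isdigit():
            continue  # skips the header line and malformed rows
        yield f"{fields[0]}\t{fields[5]}"


def reducer(sorted_lines):
    """Emit the smallest lap time per key; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{min(int(value) for _, value in group)}"


# Made-up rows; sorted() stands in for the shuffle-and-sort phase of Hadoop
sample = [
    "raceId,driverId,lap,position,time,milliseconds",
    "841,20,1,1,1:38.109,98109",
    "841,20,2,1,1:33.006,93006",
    "842,20,1,1,1:45.000,105000",
]
result = list(reducer(sorted(mapper(sample))))
print(result)
```

Keeping the reducer dependent only on sorted key order mirrors what Hadoop guarantees between the map and reduce phases, so the local run and the cluster run should agree.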

Proper instructions on how to execute the scripts should be provided in the report during the real exam.

Dataset

Formula One, also called F1 for short, is an international auto racing sport. F1 is the highest level of single-seat, open-wheel and open-cockpit professional motor racing.

The objective of a Formula 1 contest is to determine the winner of a race. The driver who crosses the finish line first after completing a pre-determined number of laps is declared the winner.

A series of Formula One races is held over a period of time, usually a year, called the ‘Formula One World Championship season’. Each race in a season is called a ‘Grand Prix’ (GP).

The number of Grand Prix in a season has varied through the years, starting with 7 races in 1950. This number kept increasing up to a maximum of 20 GPs a year (in 2012). Normally there are 19 to 20 GPs in a season now. The top 10 drivers at the end of each Grand Prix receive points based on their finishing positions, and these points contribute towards determining both the drivers’ and the constructors’ champions.

The results of all the Grand Prix races in a season are taken together to determine the annual Championship awards. If you need more information about F1, you can check the following website: https://www.tutorialspoint.com/formula_one/formula_one_quick_guide.htm

The website kaggle.com published a dataset related to the F1 race data. This dataset contains data from 1950 all the way through the 2017 season, and consists of tables describing constructors, race drivers, lap times, pit stops and more. You can find the dataset at https://www.kaggle.com/cjgdev/formula-1-race-data-19502017. 

Questions

Download this dataset and use Big Data tools to answer the following questions:

  1. Who became F1 World Champion in 2006?
  2. What is the fastest lap time for each circuit and who set these records?
  3. Which drivers spent the least amount of time on pit stops? Show the result for each year.
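Questions of this kind usually reduce to a filter, a join and an aggregation. The sketch below illustrates that pattern in plain Python on made-up rows: it filters races by year, sums points per driver and ranks the totals. The field names raceId, year, driverId and points, and the idea that summing race points is enough to rank a season, are assumptions to check against the dataset. The same logic can then be expressed in HiveQL and in Spark.

```python
from collections import defaultdict


def season_points(races, results, year):
    """Rank drivers by total points scored in the races of one season."""
    # Filter: keep only the race IDs that belong to the requested year
    race_ids = {race["raceId"] for race in races if race["year"] == year}
    totals = defaultdict(float)
    # Join on raceId and aggregate points per driver
    for row in results:
        if row["raceId"] in race_ids:
            totals[row["driverId"]] += row["points"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Fictional rows for illustration only; they are not taken from the dataset
races = [
    {"raceId": 1, "year": 2006},
    {"raceId": 2, "year": 2006},
    {"raceId": 3, "year": 2007},
]
results = [
    {"raceId": 1, "driverId": "driver_a", "points": 10.0},
    {"raceId": 1, "driverId": "driver_b", "points": 8.0},
    {"raceId": 2, "driverId": "driver_a", "points": 6.0},
    {"raceId": 2, "driverId": "driver_b", "points": 10.0},
    {"raceId": 3, "driverId": "driver_a", "points": 10.0},
]
ranking = season_points(races, results, 2006)
print(ranking)
```

Running a small reference like this on a few hand-picked rows gives a third result to compare the Hive and Spark answers against.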