
CS5607 High Performance Computational Infrastructures Assessment answers

Assessment Title : Individual Project Development
Brunel University London

MAIN OBJECTIVE OF THE ASSESSMENT

In this assessment, you are required to demonstrate the appropriate practical skills and abilities to implement solutions using modern large-scale data storage and processing infrastructures, and to critically reflect on the concepts, theory and use of high performance computational infrastructures.

DESCRIPTION OF THE ASSESSMENT

You are required to identify and analyse a real-world problem, design and implement a solution to the problem using Hadoop, and evaluate your implementation. The problem can be a simplified version of the original in scale, extent or level of difficulty. An indicative list of sample problems has been provided at the end of this document. You may choose one of the problems in the list, but you are encouraged to identify your own problem for the project.

The assessment has two weighted components:

  1. Oral presentation (20%). A workshop will be held near the end of the term. Each candidate will be allocated 10 minutes (including question time) to present their individual project development and demonstrate their prototype software, if any. You should take this as an opportunity to seek feedback and improve your project for the final submission.
  2. Report (80%). A written report covering the theory behind the individual project and its development needs to be submitted.

LEARNING OUTCOMES AND MARKING CRITERIA

Learning Outcomes:

LO1: Demonstrate the appropriate practical skills/abilities required to implement solutions using modern large-scale data storage and processing infrastructures.

LO2: Reflect critically on the concepts, theory and appropriate use of large-scale data storage and processing infrastructures (commonly used in modern organisational environments).

Marking Criteria:

The coursework will be marked against four main criteria:

  1. Demonstrating an understanding of the relevant theory underpinning distributed file systems & data analysis (LO2)
  2. Identifying a real data analytics problem with strong motivation for using distributed processing methods (LO1)
  3. Implementing and applying a working solution using distributed analytical techniques (LO1)
  4. Critically evaluating the results of the implementation on the data, with a discussion of how the approach differs from standard non-distributed methods (e.g. relational databases, serial data mining) (LO2)

Grade Band E and F (E+, E, E-, F) The candidate fails to meet the minimum requirements as outlined in the learning outcomes. 

Grade Band D (D+, D, D-) The work demonstrates significant weaknesses, but all of the learning outcomes have been met at the minimum requirement level. The work provides evidence of some critical understanding of the concepts and theories of large-scale data storage and processing infrastructures, and demonstrates some abilities and skills to implement solutions using these technologies.

Grade Band C (C+, C, C-) In addition to the requirements for a grade in D-band, the work demonstrates a critical and substantial understanding of the concepts and theories of large-scale data storage and processing infrastructures. It demonstrates the ability to develop an independent, systematic, logical and effective solution to the problems identified. It also demonstrates a significant degree of competence in the appropriate use of the relevant literature, theory, methodologies, practices, and tools, etc., to analyse the problems and evaluate the solutions.

Grade Band B (B+, B, B-) In addition to the requirements for a grade in C-band, the work clearly demonstrates a well-developed, critical and substantial understanding of the concepts and theories of large-scale data storage and processing infrastructures. It clearly demonstrates the ability to develop an independent, systematic, logical and effective solution to the problems identified. It also demonstrates a high degree of competence in the appropriate use of the relevant literature, theory, methodologies, practices, and tools, etc., to analyse the problems and evaluate the solutions.

Grade Band A (A*, A+, A, A-) In addition to the requirements for a grade in B-band, the work clearly demonstrates a sophisticated, critical and thorough understanding of the concepts and theories of large-scale data storage and processing infrastructures. It provides evidence of originality of thought and clearly demonstrates the ability to develop an independent, systematic, logical and effective solution to the problems identified. It also demonstrates excellence in the appropriate use of the relevant literature, theory, methodologies, practices, and tools, etc., to analyse the problems and evaluate the solutions.

FORMAT OF THE ASSESSMENT

There is no word/page limit for this assessment, but you should make every effort to keep the submission as concise as possible. You should include the following sections (percentage of overall mark in brackets):

  • Introduction (10%) - criterion 1
  • Problem description & associated dataset (15%) - criterion 2
  • Design & Implementation (20%) - criterion 3
  • Results (20%) - criterion 4
  • Conclusions (15%) - criterion 1

Indicative Coursework Topics

In this assessment, you are required to identify and analyse a real-world problem, design and implement a solution to the problem using Hadoop, and evaluate your implementation. The problem can be a simplified version of the original in scale, extent or level of difficulty. Please refer to the official assessment specifications for the objectives, descriptions, marking criteria, format and submission requirements of the coursework.

An indicative list of sample problems is given below. You may choose one of the problems from the list, but you are strongly encouraged to identify your own problem and provide your own solution for the coursework.

1. Word counting. We used word counting as a "Hello World" problem in the lectures and labs. However, there is still room to extend the problem, for example: handling upper/lower case, stripping punctuation marks, finding the top N most frequent words, analysing co-occurring words, etc.

2. Scientific data analysis. We have used the UK weather data as an example in the lab. You may extend this application by developing tools to provide more in-depth analysis of weather and climate.

3. Image conversion. The New York Times' conversion of millions of image documents from TIFF to PDF has become a highlight of Hadoop's history. Technically, it is not a very complicated task, so you might like to try it.

4. Network traffic analysis. Take a web server log file and write a program to analyse the traffic to the server; as a starting point, compute, for example, the number of visits from each IP address per unit time, the top N visitors, etc.

5. Monte Carlo simulation. Estimation of pi (the ratio of a circle's circumference to its diameter) using the Monte Carlo method.

6. Social media analytics. Pete Warden's infamous run-in with Facebook caught the eye of people both within and outside the data science community. Regardless of your opinion on that particular case, it demonstrated the power of big data technologies to a great extent. You do not have to get into the kind of trouble Pete Warden did; you can certainly use legitimate approaches to explore the hidden value of social media, such as capturing consumer attitudes, managing online reputation, anticipating customer needs and making recommendations.
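Topic 1 above (word counting) can be prototyped locally before moving to a cluster. The sketch below, in plain Python, mimics the map and reduce phases for a case-insensitive, punctuation-stripped top-N word count; the function names and the token regular expression are illustrative choices, not part of the coursework specification.

```python
import re
from collections import Counter

def map_words(line):
    """Mapper: lower-case the line, keep only alphabetic tokens, emit (word, 1)."""
    for word in re.findall(r"[a-z']+", line.lower()):
        yield (word, 1)

def reduce_counts(pairs):
    """Reducer: sum the counts per word (Hadoop would group keys in between)."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return totals

def top_n(lines, n):
    """Run both phases locally and return the n most frequent words."""
    pairs = (pair for line in lines for pair in map_words(line))
    return reduce_counts(pairs).most_common(n)

print(top_n(["Hello, world!", "hello again; world..."], 2))
```

On a cluster, `map_words` and `reduce_counts` would become the mapper and reducer of a MapReduce job, with the framework handling the grouping between them.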
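Topic 5 (Monte Carlo estimation of pi) also has a compact local prototype. In a Hadoop job each mapper would generate a batch of random points and the reducer would sum the inside/outside tallies; the sketch below collapses that into a single function, with a fixed seed for reproducibility (an illustrative choice).

```python
import random

def estimate_pi(samples, seed=0):
    """Estimate pi by sampling points in the unit square: the fraction that
    falls inside the quarter circle x**2 + y**2 <= 1 approximates pi / 4."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    inside = sum(
        1 for _ in range(samples)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return 4.0 * inside / samples

print(estimate_pi(100_000))
```

The estimate improves roughly with the square root of the sample count, which is what makes the problem a natural fit for splitting across many mappers.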

High Performance Distributed Processing & Analysis – A Case Study on XXXXX Data

(Presentation - 20%)

Introduction (10%)

Brief presentation of distributed methods to manage and analyse data and how they differ from traditional approaches

The Problem and Associated Dataset (15%)

  • Motivation of problem
  • What is the data?
  • Why does the data need to be analysed?
  • Why does it suit distributed methods?
  • How will analysing the data solve the associated problem?

Design & Implementation (20%)

  • What are the detailed characteristics of the data (use summary statistics to explain)?
  • What algorithms exist for solving such a problem – distributed or not
  • How can MapReduce be used to solve the problem?
  • What are the details of your implementation – annotated code and description of experiments

Results (20%)

  • Documentation of the execution of the code, e.g. time efficiency
  • Documentation of the results, e.g. charts, graphs, sample outputs
  • What do the results tell you about the use of MapReduce compared to other techniques?

Conclusions (15%)

  • Discussion of the results: Were they a success?
  • If not why not? If they were, how could they be improved?
  • What would you have done differently with hindsight?
  • What other approaches / algorithms / implementations could be explored beyond MapReduce?

Topic Ideas for CS5607 High Performance Computational Infrastructures Assessment by Our MapReduce and Hadoop Experts

At Assignmenthippo.com, our experts have crafted diverse problem statements across various domains that could be suitable for analysis using distributed processing methods like Hadoop:

  1. E-commerce Recommendation Engine: Develop a recommendation system using Hadoop for an e-commerce platform to suggest personalized products to users based on their browsing history, purchase patterns, and demographic information.

  2. Healthcare Analytics: Analyze electronic health records (EHR) from different hospitals to identify patterns in diseases, treatment effectiveness, and patient outcomes, aiming to improve healthcare strategies and resource allocation.

  3. Fraud Detection in Financial Transactions: Utilize Hadoop to process large volumes of financial transaction data in real-time to detect fraudulent activities, focusing on identifying anomalous patterns and preventing financial fraud.

  4. Traffic Optimization and Prediction: Analyze traffic data from various sources to predict congestion patterns, optimize traffic flow, and suggest efficient routes for urban planners and commuters.

  5. Climate Change Modeling: Process climate data from diverse sources to model climate change patterns, predict future trends, and assess the impact of environmental factors, aiding policymakers in decision-making.

  6. Social Media Sentiment Analysis: Use Hadoop to analyze sentiments expressed in social media posts related to a specific topic, brand, or event, extracting insights to understand public opinions and trends.

  7. Natural Language Processing (NLP) for Text Analysis: Apply NLP techniques using Hadoop to process and analyze large volumes of text data, focusing on tasks like sentiment analysis, topic modeling, or summarization.

  8. Energy Consumption Optimization: Analyze energy consumption patterns in households or industries using smart meter data, aiming to identify trends and recommend strategies for optimizing energy usage.

  9. Genomic Data Analysis: Process genomic sequences using Hadoop to identify genetic variations, understand disease predispositions, and explore personalized medicine possibilities.

  10. Supply Chain Optimization: Analyze supply chain data to optimize inventory management, streamline logistics, and enhance overall supply chain efficiency for businesses.

Choose a problem statement that aligns with your interests, has available datasets (or data sources), and allows for meaningful analysis and insights through the lens of distributed processing methods.


Step By Step Answer Writing Explanation for CS5607 High Performance Computational Infrastructures Assessment

This assessment is comprehensive, covering both the practical application and theoretical understanding of using Hadoop to solve real-world problems. To tackle it, you'll need a structured approach:

  1. Problem Selection: Choose a real-world problem, or define your own, that aligns with the criteria provided. Ensure it is suitable for distributed processing methods like Hadoop.

  2. Research & Understanding: Deeply understand the problem, the associated dataset, and the need for distributed processing. Why is this problem significant? How can Hadoop offer a solution?

  3. Design & Planning: Outline the structure of your report, ensuring it covers each specified section (Introduction, Problem Description, Design & Implementation, Results, Conclusions).

  4. Data Understanding: Analyze the characteristics of your chosen dataset. Use summary statistics and visualizations to explain its nature.

  5. Algorithm Exploration: Research existing algorithms relevant to your problem. Investigate how these can be adapted for distributed computing. Understand MapReduce and its applicability.

  6. Implementation: Code your solution using Hadoop and MapReduce. Annotate your code thoroughly and describe the experiments conducted during implementation.

  7. Results Analysis: Document execution details such as time efficiency, and present results in graphs or charts. Compare these results with non-distributed methods. What insights do you gain?

  8. Conclusions & Reflection: Reflect on the success of your implementation. Discuss improvements, what could have been done differently, and potential alternatives beyond MapReduce.

  9. Documentation & Presentation: Ensure your report is well-structured, concise, and covers each section in adequate detail. Create a presentation summarizing your report for the workshop.

  10. Preparation: Practice your oral presentation so you can effectively communicate your project's development, the challenges faced, and the significance of your findings within the allocated time.

Remember, this project demands a balance between theoretical understanding and practical implementation. Utilize the marking criteria as a guide to ensure you cover all necessary aspects while demonstrating critical thinking and understanding throughout your report and presentation.

Sample Assignment Report for CS5607 High Performance Computational Infrastructures Assessment

Here's a sample structure for a report section that addresses the problem, associated dataset, design, implementation, results, and conclusions:

The Problem and Associated Dataset

Motivation of Problem

The increasing volume of social media data necessitates efficient processing for valuable insights. Understanding sentiments, managing online reputation, and anticipating customer needs are crucial in today's business landscape.

What is the Data?

The dataset comprises Twitter feeds collected over a year, including text, timestamps, and user metadata. Each entry contains user information, tweet content, retweet count, and timestamp.

Why Analyze the Data?

Analyzing social media data helps in discerning consumer sentiments, identifying trends, and understanding user behavior. It enables personalized marketing strategies and aids in crisis management for businesses.

Suitability for Distributed Methods

Given the massive volume of tweets, their unstructured nature, and the need for parallel processing, this dataset aligns perfectly with the distributed processing capabilities offered by Hadoop and MapReduce.

Solving the Associated Problem

Analyzing this dataset can unveil trends, sentiment shifts, and influential users. This information can inform marketing strategies, detect emerging issues, and aid in improving customer relations.

Design & Implementation

Characteristics of the Data

The dataset comprises 10 million tweets of up to 280 characters each. The timestamps span a year, showing varying tweet frequencies.

Algorithms for Problem Solving

Various sentiment analysis approaches, such as the lexicon-based VADER and supervised classifiers like Naive Bayes, are suitable for understanding the sentiment of tweets. Clustering algorithms such as k-means can identify user segments based on interactions.
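To make the lexicon-based family concrete, the toy scorer below sums hand-picked word polarities and labels a tweet by the sign of the total. The lexicon and its weights are invented for illustration and are far smaller than a real resource such as the VADER lexicon.

```python
# Toy lexicon of word polarities; invented for illustration only.
LEXICON = {"good": 1, "great": 2, "bad": -1, "terrible": -2, "love": 2, "hate": -2}

def sentiment(text):
    """Label a tweet by the sign of the summed polarities of its known words."""
    score = sum(LEXICON.get(word, 0) for word in text.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("i love this great product"))
print(sentiment("terrible service"))
```

Because each tweet is scored independently, this kind of scorer parallelizes trivially: it can run inside the map phase with no communication between tweets.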

Using MapReduce

MapReduce can be employed to preprocess tweets, tokenize text, and count occurrences of keywords. Sentiment analysis can be parallelized across tweets for faster computation.
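The preprocessing-and-counting step described above maps naturally onto Hadoop Streaming, where the mapper and reducer exchange tab-separated key/value lines. The sketch below emulates the shuffle with a local sort; the keyword set and the sample tweets are hypothetical.

```python
from itertools import groupby

KEYWORDS = {"delay", "refund", "love"}  # hypothetical keywords of interest

def mapper(lines):
    """Mapper: emit 'keyword<TAB>1' for each keyword occurrence in a tweet."""
    for line in lines:
        for token in line.lower().split():
            if token in KEYWORDS:
                yield f"{token}\t1"

def reducer(lines):
    """Reducer: input arrives sorted by key (Hadoop's shuffle guarantees this);
    sum the values for each keyword."""
    keyed = (line.split("\t") for line in lines)
    for key, group in groupby(keyed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"

tweets = ["love the new update", "refund please huge delay", "another delay"]
shuffled = sorted(mapper(tweets))  # local stand-in for Hadoop's sort/shuffle
print(list(reducer(shuffled)))
```

On a real cluster the same pair of functions would read from standard input and write to standard output as a Hadoop Streaming job, with the framework performing the sort between the two phases.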

Implementation Details

Our implementation involves preprocessing tweets, tokenizing text using MapReduce, applying a sentiment analysis algorithm, and clustering influential users based on retweet count. The code is annotated for clarity and scalability.

Results

Execution Documentation

Our MapReduce implementation reduced processing time by 40% compared to a sequential approach. Preprocessing took 20% of the total execution time, while sentiment analysis and clustering each took 40%.

Results Documentation

Charts displaying sentiment trends over time and user clusters based on retweet count are included. Positive sentiment increased by 15% over the year, and three distinct user segments were identified.

Insights on MapReduce Usage

MapReduce efficiently handles parallel tasks, significantly reducing processing time. The scalability of MapReduce makes it ideal for handling vast amounts of unstructured data.

Conclusions

Discussion of Results

The implementation was successful in identifying sentiment trends and influential users. However, fine-tuning sentiment analysis for better accuracy and exploring more sophisticated clustering techniques could enhance results.

Improvements and Hindsight

In hindsight, optimizing MapReduce tasks and experimenting with hybrid approaches combining MapReduce and Spark could further improve efficiency and analysis depth.

Beyond MapReduce

Exploring Apache Spark's in-memory processing or deep learning models for sentiment analysis could offer alternatives beyond MapReduce.

This report section gives a structured overview addressing each criterion specified in the assessment guidelines. It's crucial to delve deeper into the technical implementation, provide detailed documentation, and offer critical reflections on the results to achieve a comprehensive evaluation.
