# ENN543, Data Analytics and Optimisation

Supplementary Assignment ENN543, Data Analytics and Optimisation, Semester 2, 2019 Queensland University of Technology

**Problem 1. Clustering.** Bike share systems are becoming increasingly common in cities across the world, but their usage is highly variable and depends on factors such as local weather.

You have been provided with two months of data from the New York Bike Share system: one month in summer (Q1/JC-201707-citibike-tripdata.csv) and one month in winter (Q1/JC-201801-citibike-tripdata.csv). From the size of the files alone it is evident that there are substantially fewer trips in winter than in summer; however, it is unclear whether the actual pattern of use (i.e. the typical types of trips) is different.

Using this data and the clustering method of your choice, you are to attempt to answer the question: ‘aside from the overall number of trips, do usage patterns change from summer to winter?’. In doing this you should cluster the data using the following five dimensions:

- start station latitude;
- start station longitude;
- end station latitude;
- end station longitude;
- tripduration.

Note that this means that clusters will contain 5 dimensions, and visualisation of clusters in a single 2D plot will not be possible.

Your answer should demonstrate and discuss how usage patterns are similar or dissimilar (depending on what you find), and should also consider different time periods (morning, afternoon, etc.) to better explore how the service is used.
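
One way to consider different time periods is to label each trip by its start hour and then cluster each period separately. The sketch below is illustrative only: the period boundaries in `PERIODS`, the function name `time_period`, and the timestamp format (`YYYY-MM-DD HH:MM:SS`, optionally followed by fractional seconds) are assumptions and should be checked against the supplied files.

```python
from datetime import datetime

# Hypothetical period boundaries (start hour inclusive, end hour exclusive).
PERIODS = [
    ("night", 0, 6),
    ("morning", 6, 12),
    ("afternoon", 12, 18),
    ("evening", 18, 24),
]


def time_period(start_time: str) -> str:
    """Map a trip start timestamp to a coarse time-of-day label."""
    # Keep only the first 19 characters so fractional seconds are ignored.
    hour = datetime.strptime(start_time[:19], "%Y-%m-%d %H:%M:%S").hour
    for label, first_hour, end_hour in PERIODS:
        if first_hour <= hour < end_hour:
            return label
    raise ValueError(f"hour out of range: {hour}")
```

Other groupings (for example weekday commute peaks versus weekends) may be equally valid, provided the choice is explained.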

Your answer should explain all decisions made when conducting the analysis, including details such as:

- the clustering method selected;
- any parameters that are required for the clustering;
- any outlier removal that is conducted; and
- any data normalisation or scaling that is performed.
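
As an illustration of the last two points, the sketch below removes very long trips and then standardises each feature with a z-score, $z = (x - \mu)/\sigma$, so that trip duration does not dominate the coordinate features in a distance-based clustering. It is a minimal sketch under stated assumptions: rows are ordered `[start_lat, start_lon, end_lat, end_lon, tripduration]`, duration is assumed to be in seconds, and the three-hour cut-off and both function names are hypothetical.

```python
from statistics import mean, pstdev


def remove_long_trips(rows, duration_index=4, max_duration=3 * 3600):
    """Drop trips longer than max_duration (hypothetical cut-off, in seconds)."""
    return [row for row in rows if row[duration_index] <= max_duration]


def standardise(rows):
    """Z-score each column so every feature has mean 0 and unit variance."""
    columns = list(zip(*rows))
    means = [mean(column) for column in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [pstdev(column) or 1.0 for column in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]
```

If summer and winter are to be compared in the same feature space, the means and standard deviations would need to be computed once and applied to both months, rather than fitted to each month separately.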

**Problem 2. Classification.** Software systems are complex, and errors in deployed software can be very costly and difficult to correct. In an effort to help detect faulty software, a number of metrics have been proposed that measure software complexity.

You have been provided with data (Q2/pc1.csv) which contains various code metrics for a number of software examples, as well as a flag to indicate if the software contains a fault or not. For clarity:

- The first 21 columns contain predictors that measure some aspect of the software complexity, and may be used to determine if software is faulty or not;
- The last column contains a value of true or false, indicating if the software has a defect or not.

Using this data, you are to train a support vector machine (SVM) to separate defective software from error free software. You are to report on the accuracy of the developed model, and on any problems or challenges that you encounter in developing the model. In doing this you should:

- Divide the data into appropriate training, validation and testing datasets;
- Consider what SVM parameters (box constraint, kernel type, etc.) you should use;
- Consider the class distribution of the data, and make allowances within the model as needed.
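
The first and last points can be sketched as follows: a stratified split keeps the proportion of defective samples roughly equal across the training, validation and testing sets, and inverse-frequency weights are one common allowance for class imbalance. This is a sketch under assumptions, not a required approach: the 60/20/20 fractions, the function names and the weighting heuristic are all illustrative choices.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return (train, val, test) index lists that roughly preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    return {c: n_samples / (len(counts) * k) for c, k in counts.items()}
```

With a strongly imbalanced dataset, overall accuracy alone can look high for a model that rarely predicts the minority class, so per-class results are worth reporting alongside it.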

Please note that allowing MATLAB to optimise hyper-parameters in place of properly investigating parameter settings **is not acceptable** as a justification for hyper-parameter selection, though a grid search (which is a more systematic approach) will be accepted.
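
A grid search of this kind can be sketched in a few lines. In the sketch below, `evaluate` is a hypothetical callable standing in for "train an SVM with these parameters on the training set and return a score on the validation set"; the parameter names `box_constraint` and `kernel_scale` and the log-spaced values are illustrative assumptions, not prescribed settings.

```python
from itertools import product


def grid_search(evaluate, grid):
    """Score every parameter combination and return the best one.

    `evaluate` is a hypothetical callable that trains a model with the given
    keyword parameters and returns a validation score (higher is better).
    """
    names = list(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((evaluate(**params), params))
    best_score, best_params = max(results, key=lambda item: item[0])
    return best_params, best_score, results


# Hypothetical log-spaced grid over two SVM hyper-parameters.
grid = {
    "box_constraint": [10.0 ** k for k in range(-2, 3)],
    "kernel_scale": [10.0 ** k for k in range(-2, 3)],
}
```

Keeping the full `results` list, rather than only the winner, makes it possible to show how the validation score varies across the grid when justifying the final choice.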

Your answer should explain the choice of parameters in the final model, and discuss its performance.

**Problem 3. Dimension Reduction and Classification.** Recognising content in images can be a challenging problem due to the high dimensional nature of the input data. As such, dimension reduction methods can be used to reduce a problem space and make tasks more computationally feasible.

You have been provided with data (Q3/shvn test.mat) that shows images of single digits (0, 1, 2, 3, 4, 5, 6, 7, 8 and 9) of house numbers, extracted from Google Street View data. Using this data you are to train classifiers (the type of classifier is up to you) to classify the observed digit in the image. Prior to classification, you are to reduce the data using:

- PCA;
- LDA;

That is, you should train two classifiers: one on data reduced using PCA, and one on data reduced using LDA. You are then to evaluate the two classifiers and compare their performance.

In completing this question you should:

- Divide the data into appropriate training, validation and testing datasets;
- Consider what type of classifier to use;
- Determine an appropriate number of dimensions to retain.
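
For PCA, one common way to approach the last point is to retain the smallest number of components that explains a chosen share of the total variance. The sketch below assumes the eigenvalues of the data covariance matrix have already been computed; the function name and the 95% default are hypothetical. For LDA the choice is more constrained, since it yields at most one fewer discriminant direction than there are classes (nine for the ten digits here).

```python
def components_for_variance(eigenvalues, target=0.95):
    """Smallest number of leading components whose variance share reaches target."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= target:
            return k
    return len(ordered)
```

A variance threshold is only a starting point; validation-set accuracy at several candidate dimensions gives more direct evidence for the final choice.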

Also note that due to memory constraints, it may not be possible to train the PCA or LDA space on all samples, and you may need to use only a subset of the data to compute the PCA and LDA transforms.
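
Selecting such a subset can be as simple as drawing row indices at random without replacement, with a fixed seed so the result is reproducible. The function name and defaults below are hypothetical.

```python
import random


def sample_indices(n_samples, subset_size, seed=0):
    """Pick a reproducible random subset of row indices without replacement."""
    rng = random.Random(seed)
    size = min(subset_size, n_samples)  # never request more rows than exist
    return sorted(rng.sample(range(n_samples), size))
```

The subset would normally be drawn from the training data only, with the resulting transform then applied unchanged to the validation and testing data.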

Your answer should explain the choices made in arriving at your solution (type of classifier, number of dimensions retained, etc.), and discuss the performance of the two methods, relating this to what the two transforms (PCA and LDA) are seeking to achieve.