MIS772 Predictive Models for Business

MIS772 2020 T2

A1 Predictive Models for Business

AirbnbAI has approached you to develop a RapidMiner process for determining whether Sydney rental accommodation is suitable for tourists, based on the customer feedback given in terms of their overall_satisfaction.

AirbnbAI provided you with a sample of 221,000 listings of all AirBnB Sydney rentals for the period of 2014-2017. They have also isolated the last two months of the data available for 2017, with just over 50,000 rentals. Both data sets include the following information:

  • Property / room id, its type and price per night (in US$)
  • Id of the property host (the person owning it or renting it out)
  • Property geo-location and the name of its neighbourhood
  • The number of reviews recorded so far (since first listing)
  • Minimum number of nights to be booked (if applicable)
  • The number of occupants allowed in a rental
  • Last date and time of data collection, etc.

AirbnbAI would like you to use RapidMiner to generate some insights into the rental listings. The following questions are of interest:

  A. Which Sydney neighbourhoods have rentals with the highest satisfaction?
  B. How popular are the different room types in each neighbourhood?
  C. How can rental success (satisfaction ≥ 4) be reliably predicted for a new listing?
  D. Explain the classifier’s failures.
  E. Challenge just for fun: Identify the properties which were reviewed in the period Jun-Jul 2017 and whose customer satisfaction dropped by at least 50%.
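The assignment itself must be completed in RapidMiner. Purely as an illustration of the logic behind the neighbourhood question and the success threshold (satisfaction ≥ 4) above, a minimal Python sketch might look as follows. The sample rows and the column name `neighbourhood` are assumptions made for the example; only `overall_satisfaction` is named in the brief. Leaving a missing rating unlabelled is also an assumption, not a prescribed rule.

```python
from collections import defaultdict

# Hypothetical sample rows; "neighbourhood" is an assumed column name.
listings = [
    {"neighbourhood": "Manly", "overall_satisfaction": 4.5},
    {"neighbourhood": "Manly", "overall_satisfaction": 5.0},
    {"neighbourhood": "Sydney", "overall_satisfaction": 3.5},
    {"neighbourhood": "Sydney", "overall_satisfaction": 4.0},
    {"neighbourhood": "Ryde", "overall_satisfaction": None},  # missing rating
]


def is_success(satisfaction, threshold=4.0):
    """Binary label: True when satisfaction >= threshold, None when missing."""
    if satisfaction is None:
        return None
    return satisfaction >= threshold


def mean_satisfaction_by_neighbourhood(rows):
    """Average the non-missing ratings per neighbourhood, highest first."""
    ratings = defaultdict(list)
    for row in rows:
        if row["overall_satisfaction"] is not None:
            ratings[row["neighbourhood"]].append(row["overall_satisfaction"])
    means = {name: sum(vals) / len(vals) for name, vals in ratings.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


ranking = mean_satisfaction_by_neighbourhood(listings)
labels = [is_success(row["overall_satisfaction"]) for row in listings]
```

In RapidMiner the same idea would typically be expressed with aggregation and attribute-generation operators rather than code.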

AirbnbAI wants you to use RapidMiner to clean up and explore the provided data, then develop and evaluate a classifier to predict the rentals’ long-term viability, and to minimise misclassifications.

This assignment aims for students to learn how to:

  • Articulate problems and solutions in business terms
  • Gain insights from data

The following mini-case study will be used in assignment A1.

Data: http://www.deakin.edu.au/~jlcybuls/pred/data/AirBnB-Sydney.zip

Original Source: http://tomslee.net/airbnb-data-collection-get-the-data

Unzip and use these files:

  • For P/C level use: AirBnB Sydney Last.csv
  • For D/HD level use: AirBnB Sydney All.csv

Individual Tasks and Deliverables

Always submit the entire report, including all completed and incomplete sections.

Partial Submission / Questions A, B – To be marked with the final report.

The first two sections of the report are to be completed and submitted.

Exec Problem: Define your problem in business terms, in doing so answer questions A and B, cross-reference with other report sections for support.

Data Exploration: Visualise the selected attribute characteristics. Use the visualisations to support answering questions A and B.

Final Submission / Questions C, D, E – Entire report will be marked at this point in time.

Submitted report must include the previously completed and updated sections, plus the following sections.

Exec Solution: Describe your solution in business terms, in doing so answer question C, cross-reference with other report sections for support.

Data Preparation: Deal with duplicates and missing values. Transform attributes or create new ones as needed. Use appropriate analysis and data visualisation to investigate relationships between attributes. Interpret results.

Model: Create and explain one or two classification models, i.e. k-NN and Decision Tree, to address question C. Explain and justify your models’ properties. Investigate and deal with the class imbalance.

Evaluation: Use hold-out or cross-validation of the model. Include honest testing. Compare the performance of different models and select the best. Answer question C and address issue D. Optionally, take challenge E (not for marks).

The rubric explains the differentiation between Pass/Credit and Distinction/HD achievement levels. Submissions are compulsory.

In addition, this assignment aims for students to learn how to:

  • Prepare data for different models
  • Develop classification models
  • Assess and report model performance

Individual Assignment A1 / Template

Executive problem statement (one page)

Aim

To clearly articulate your understanding of the business problem to management.

Clearly state who is interested in solving the problem and why.

Ensure that this section is improved and included in the final submission.

Simple Data Set

Overall satisfaction is defined in simple terms (≥ 4).

Cross-reference your problem statement with tables or charts from the following section.

Answers to business questions (A) and (B) are given and justified (highlight them).

Complex Data Set

The concept of overall satisfaction is extended with justification to be relative to other attributes, e.g. the number of reviews, price, room type or neighbourhood.

Cross-reference your problem statement with tables or charts from the following section.

Answers to business questions (A) and (B) are given and justified (highlight them).

Hints

Make sure your exec summary is very clear.

You can restate or rephrase the problem statement as you gain better understanding.

Do not invent your own problem – it has been given to you but may not be achievable in its current form.

Ensure that whatever problem you describe can be solved using the provided data.

Make sure the statement describes the problem from the business perspective and not a technical perspective.

Use business language and not computer / mathematical / statistical / data science language.

The problem statement should describe the high level aims and not the methods of their achieving.

Think and state the likely benefit of this project for the company and its management.

Think and state who the company clients are and what the likely benefits of this project are for them.

Do not include any charts or tables in the problem statement section.

However, cross-reference your problem statement with tables or charts from the following section, e.g. you can refer to them as “… (see Figure 1)” or “As shown in Table 4…”.

If you need to support your statements / analysis / argument with references to any published materials, use Harvard citation style as described in: http://www.deakin.edu.au/students/studying/study-support/referencing/harvard. As the executive summary should take less than a full page, we suggest including your bibliographic references at the bottom of this page, immediately below the executive summary (or problem description).

All comments, such as this, which are not part of your submission should be deleted to save space.

Data exploration (one page)

Aim

To demonstrate your understanding of data and report any insights emerging from data analysis.

Ensure that this section is improved and included in the final submission.

Simple Data Set

Data obtained.

RM project prepared.

Attributes selected and analysed.

Their characteristics are tabulated and visualised, with brief annotations (using text and arrows).

Complex Data Set

In addition, all tables and charts are analysed for the relevant and important business insights, which are explicitly reported. All visualisations are included selectively to support further work.

Hints:

Take screenshots of the relevant parts of the screen, not the whole window or the whole screen.

On Win10 use Snipping tool, on Mac or Linux use Screenshot app, or install Spectacle.

Include here the text of your analysis with visual evidence to support the analysis.

If you include any charts or tables you must describe them (e.g. by using arrows / boxes).

Make sure that any included chart is readable (so do not shrink it into microscopic size).

If you scale the included screen shots keep their proportions (do not distort images).

Most importantly describe what those data features mean and how important they are, and why.

Do not include here any parts of the RM process – it has its own section further in the report.

If your analysis or results could only be determined by inspecting the process and running it, the marks will be reduced: if it is not in the report, it does not exist for the marker!

Your analysis and description could include:

  • What is the distribution of the selected attributes (e.g. using histograms, not just the statistics tab)
  • What are attribute features (e.g. using a scatter / block / bar / stacked bars chart)
  • What are the important statistics of the selected attributes (e.g. using statistics tab)
  • Are there any missing values in the selected attributes (e.g. using the statistics tab)
  • What should be done about missing values (e.g. eliminate attributes / replace missing values)
  • Any other more creative visualisations / tabulations, possibly with some value aggregation
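The report itself should rely on RapidMiner's statistics tab and charts. Only to make the ideas of a missing-value count and a value distribution concrete, here is a small Python sketch. The sample values are invented for illustration, and `None` is assumed to mark a missing value.

```python
from collections import Counter

# Hypothetical values for one attribute; None marks a missing value.
overall_satisfaction = [4.5, 5.0, None, 3.5, 4.5, None, 4.0, 5.0, 4.5, 2.0]


def missing_summary(values):
    """Return (number missing, fraction missing) for one attribute."""
    missing = sum(1 for v in values if v is None)
    return missing, missing / len(values)


def histogram(values):
    """Count how often each non-missing value occurs, sorted by value."""
    counts = Counter(v for v in values if v is not None)
    return sorted(counts.items())


def text_bars(hist):
    """Render the histogram as simple text bars, one line per value."""
    return [f"{value:>4}: {'#' * count}" for value, count in hist]


n_missing, share_missing = missing_summary(overall_satisfaction)
hist = histogram(overall_satisfaction)
for line in text_bars(hist):
    print(line)
```

A high share of missing values in an attribute is one possible reason to eliminate it, while a small share may be better handled by replacement; the choice needs to be justified in the report.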

Avoid indiscriminate “dumping” of tables, charts or code into this section – all content must have its purpose.

All included charts, tables or RM processes (or their parts) have to be described or used in the discussion.

Make sure that all charts, tables and important results are labelled for cross-referencing, e.g. “Figure 1 - Histogram of Overall Rating” or “Table 4 – Comparison of model performance”.

All comments, such as this, which are not part of your submission should be deleted to save space.

Executive solution statement (one page)

Aim

To clearly articulate your understanding of the business solution to management.

Simple Data Set

The business solution is succinctly described for executives and justified.

Cross-references with the technical sections of the report provided for support, e.g. to tables, charts and plots.

Business answer to question (C) is given and justified (highlight it).

Complex Data Set

In addition, business decisions and actions enabled by the solution are explained.

Cross-refs with technical sections support exec summary.

Business answers to questions (C-D) and optionally (E) are given and justified (highlight them).

Hints

Ensure that whatever problem you describe can be solved using the provided data.

Make sure the exec summary describes the solution from the business perspective and not a technical perspective.

Use business language and not computer / mathematical / statistical / data science language.

The solution statement should describe the high level benefit and not the methods of their delivery.

Think and state who the company clients are and what the likely benefits of this project are for them.

Ensure that your solution clearly matches the problem statement.

Ensure that the solution is formulated in terms of achieving the high-level business aim.

Do not include any charts or tables in the solution statement section.

However, cross-reference your solution statement with tables or charts from the other sections, e.g. you can refer to them as “… (see Figure 1)” or “As shown in Table 4…”.

If you need to support your statements / analysis / argument with references to any published materials, use Harvard citation style as described in: http://www.deakin.edu.au/students/studying/study-support/referencing/harvard. As the executive summary should take less than a full page, we suggest including your bibliographic references at the bottom of this page, immediately below the executive summary (or solution description).

All comments, such as this, which are not part of your submission should be deleted to save space.

Data Preparation (one page)

Aim

To demonstrate your understanding of data by describing complex relationships between attributes.
Depending on the selected model some attributes may need to be transformed or new attributes created.

Simple Data Set

Relationships between attributes are explored and visualised.

Labels and predictors are selected and justified.

New attributes are generated and old ones transformed as needed.

All charts annotated (with text and arrows) to highlight important insights.

Complex Data Set

In addition, attribute weights are used to select the most useful attributes.

All missing values, duplicates and data errors handled adequately.

Hints

Many hints are identical to those in the section on “Data Exploration” so read them!

Some preliminary data exploration has already been conducted in the previous sections.

Focus on depicting attribute relationships and not their individual characteristics.

Include here the text of your analysis with tables and charts.

Your analysis and description could include:

  • What attribute is to be used as a label and why
  • What attributes are to be used as predictors and why
  • What relationships exist between numerical attributes (e.g. using correlation tables or scatter plots)
  • What relationships exist between nominal attributes (e.g. using stacked bars, block, heat/tree maps)
  • What are the weights of the predictor attributes with respect to the label and what do they mean
  • Any other more creative visualisations / tabulations, possibly with some value aggregation
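In RapidMiner a correlation table would normally be produced with a dedicated operator. As a hedged illustration of what one entry in such a table measures, the sketch below computes a Pearson correlation coefficient by hand. The attribute names `price` and `reviews` and all sample values are assumptions made for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values: nightly price and number of reviews for five rentals.
price = [80.0, 120.0, 150.0, 200.0, 300.0]
reviews = [40, 35, 20, 12, 5]
r = pearson(price, reviews)
```

A value near -1 or +1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship; it says nothing about non-linear patterns, which is why scatter plots remain useful.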


Avoid indiscriminate “dumping” of tables, charts or code into this section – all content must have some purpose.

All included charts, tables or RM processes (or their parts) have to be described or used in the discussion.

Make sure that all charts, tables and important results are labelled for cross-referencing, e.g. “Figure 1 - Histogram of Overall Rating” or “Table 4 – Comparison of model performance”.

All comments, such as this, which are not part of your submission should be deleted to save space.

Model Development (one page limit)

Aim

To explain details of developed classification models and selected methods for data preparation and reporting.

Simple Data Set

k-NN classification model developed.

The process, its operators and their parameters described and annotated (with text and arrows).

The values of the model parameters are justified.

Operators annotated (with text and arrows) to highlight important insights.
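The k-NN model for this assignment is to be built with RapidMiner operators. To show what the parameter k controls, here is a minimal Python sketch of a majority-vote k-NN classifier using Euclidean distance. The two features, their values and the class names are invented for illustration, and the features are assumed to be already normalised to a common scale.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict the majority label among the k nearest training examples."""
    if k < 1 or k > len(train_X):
        raise ValueError("k must be between 1 and the number of examples")
    # Pair each training example's distance to the query with its label.
    neighbours = sorted(
        ((math.dist(features, query), label)
         for features, label in zip(train_X, train_y)),
        key=lambda pair: pair[0],
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]


# Hypothetical, already normalised features (e.g. price, number of reviews).
train_X = [(0.1, 0.9), (0.2, 0.8), (0.15, 0.85), (0.9, 0.1), (0.8, 0.2)]
train_y = ["success", "success", "success", "not_success", "not_success"]

prediction_a = knn_predict(train_X, train_y, (0.2, 0.9), k=3)
prediction_b = knn_predict(train_X, train_y, (0.85, 0.15), k=3)
```

Because the vote depends on distances, attributes on larger scales would dominate without normalisation, and an odd k avoids tied votes when there are two classes. Both points are worth considering when justifying parameter values.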

Complex Data Set

In addition, a Decision Tree (or forest) is included as the second classifier.

Class imbalance is investigated, dealt with and justified.
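One common way of dealing with class imbalance is to downsample the majority class in the training data. It is not the only option, and it is not necessarily the one expected here. The Python sketch below illustrates the idea; the row structure, the label name `success` and the 90/10 split are assumptions made for the example.

```python
import random
from collections import Counter


def downsample_majority(rows, label_key, seed=42):
    """Randomly reduce every class to the size of the smallest class."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    smallest = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


# Hypothetical training set: 90 successful rentals, 10 unsuccessful ones.
rows = [{"id": i, "success": True} for i in range(90)]
rows += [{"id": i, "success": False} for i in range(90, 100)]

before = Counter(row["success"] for row in rows)
balanced = downsample_majority(rows, "success")
after = Counter(row["success"] for row in balanced)
```

Balancing is usually applied to the training data only, so that validation and test results still reflect the class proportions found in real listings.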

Hints

Your textbook will be extremely helpful in this task.

Include here screenshots of all or parts of the RM process.

If your process is very large, consider splitting it into sub-processes or separate processes.

If your process does not fit into this page, include only the most important parts.

By including arrows and text boxes (e.g. with numbers to refer to) annotate each operator and its properties.

Note that some of your justifications may utilise cross-referencing with tables or charts from other sections.

Avoid indiscriminate “dumping” of RM processes/models into this section – all content must have some purpose.

You may include a brief description of the operators and what they did but this is NOT the aim of this section.

Do not include definition of terms or a “textbook” description of operations – we already know this!

All comments, such as this, which are not part of your submission should be deleted to save space.

Model Evaluation (one page)

Aim

To report and explain the performance of developed classification models.

Simple Data Set

The model is hold-out validated using accuracy and kappa.

Validation results are analysed, interpreted and reported.

A justified statement is included on the degree to which the model’s advice can actually be trusted (based on the performance measurements).

Technical answer to question (C) is given and justified (highlight it).
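RapidMiner reports accuracy and kappa directly. To make the difference between the two measures concrete, the following Python sketch computes both from the four cells of a binary confusion table. The counts are invented for illustration.

```python
def accuracy_and_kappa(tp, fp, tn, fn):
    """Return (accuracy, Cohen's kappa) for a 2x2 confusion table."""
    total = tp + fp + tn + fn
    observed = (tp + tn) / total  # accuracy = observed agreement
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    if expected == 1:
        return observed, float("nan")  # kappa is undefined in this case
    return observed, (observed - expected) / (1 - expected)


# Hypothetical model with a reasonable balance of correct predictions.
acc_a, kappa_a = accuracy_and_kappa(tp=40, fp=10, tn=35, fn=15)

# Hypothetical model that predicts "success" for every rental
# on a data set where 90% of rentals are successful.
acc_b, kappa_b = accuracy_and_kappa(tp=90, fp=10, tn=0, fn=0)
```

The second case reaches 90% accuracy yet has a kappa of zero, because it does no better than chance given the class proportions. This is one reason accuracy alone is not enough to decide how far a model's advice can be trusted.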

Complex Data Set

In addition, all models are cross-validated and “honestly” tested.

Parameters of all models are experimented with and their selection is justified.

The performance of all models is tabulated and compared, and the best model is identified.

In addition to accuracy and kappa, measures and charts such as AUC and ROC are also used.

Also, technical answers to questions (D) and optionally (E) are given and justified (highlight them).
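AUC can be read as the probability that a randomly chosen successful rental receives a higher confidence score than a randomly chosen unsuccessful one. The Python sketch below computes it that way, by comparing every positive with every negative, which is simple but slow on large data. The labels and scores are invented for illustration; in the assignment these figures come from RapidMiner's performance operators.

```python
def roc_auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly."""
    positives = [s for lab, s in zip(labels, scores) if lab]
    negatives = [s for lab, s in zip(labels, scores) if not lab]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Hypothetical true labels (1 = success) and model confidence scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = roc_auc(labels, scores)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.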

Hints

Your textbook will be extremely helpful in this task.

If you have few results to report, include here screenshots of your results, e.g. confusion table or ROC charts.

If you have many results to report, include here a table of all results.

You need to describe and explain your results.

Most importantly, include here a detailed analysis of your results: explain the impact of the obtained results on the future use of the model to support decision making.

Avoid indiscriminate “dumping” of performance results – all content must have some purpose.

All comments, such as this, which are not part of your submission should be deleted to save space.

Any materials, analysis or reports that do not fit into seven (7) pages in total, including the front page, will not be assessed or marked. The only exception is the inclusion of your response to the challenge question.

Challenge Just for Fun

Aim

To undertake a challenging task requiring independent research.

We will definitely look at your work reported here but we will not mark it.

Simple Data Set or Complex Data Set

You can use either of the two data sets.

Include your descriptive analysis of the problem.

Include the screenshot of your RapidMiner process (with annotations).

Include results generated by the process.

Provide some assessment and reflection on the insights generated.
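As a starting point for thinking about the challenge, and not as a prescribed solution, the Python sketch below flags properties whose latest rating recorded in Jun-Jul 2017 is at least 50% lower than their latest rating before that period. The record layout (room id, collection timestamp, satisfaction), the sample values and this interpretation of "dropped by at least 50%" are all assumptions. The sketch also does not check whether a new review was actually recorded in the period.

```python
from datetime import datetime

# Hypothetical records: (room id, collection timestamp, overall_satisfaction).
records = [
    (101, "2017-03-10 09:00:00", 4.5),
    (101, "2017-06-15 09:00:00", 2.0),
    (102, "2017-04-02 12:30:00", 5.0),
    (102, "2017-07-20 12:30:00", 4.5),
    (103, "2017-06-05 08:00:00", 3.0),  # no earlier record to compare with
    (104, "2016-12-01 10:00:00", 4.0),
    (104, "2017-07-01 10:00:00", 2.0),
]


def large_drops(rows, start, end, min_drop=0.5):
    """Map room id -> relative drop for rooms dropping by >= min_drop."""
    history = {}
    for room_id, stamp, rating in rows:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        history.setdefault(room_id, []).append((when, rating))
    flagged = {}
    for room_id, events in history.items():
        events.sort()  # oldest first
        before = [rating for when, rating in events if when < start]
        during = [rating for when, rating in events if start <= when < end]
        if not before or not during or before[-1] <= 0:
            continue  # nothing to compare, or no meaningful baseline
        drop = (before[-1] - during[-1]) / before[-1]
        if drop >= min_drop:
            flagged[room_id] = drop
    return flagged


result = large_drops(records, datetime(2017, 6, 1), datetime(2017, 8, 1))
```

In RapidMiner the equivalent logic would need date filtering, a join of the two periods on the property id, and a generated attribute for the relative change.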

Hints

Your textbook, RapidMiner built-in help and web resources will be extremely helpful in this task.

All comments, such as this, which are not part of your submission should be deleted to save space.