Data Preprocessing

Solutions to Chapter 2

Describe the possible negative effects of proceeding directly to mine data that has not been preprocessed.

Neglecting to preprocess the data adequately before data modeling begins will likely produce data models that are unreliable and whose results should be considered dubious at best. Performing data cleaning and data transformation during the data preparation phase is absolutely necessary for successful data mining efforts.

For example, suppose you are analyzing a data set that includes a person’s Age and Date_of_Birth attributes, and you want to calculate the average Age. If 5% of the records contain a value of 0 for Age, the mean would be misleadingly low. One solution to this problem is to derive Age for the records with a zero Age from the information contained in the Date_of_Birth variable. The mean value for Age is then more representative of the persons in the data set.
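The repair described above can be sketched in Python. This is only an illustration: the two records, the reference date, and the helper name age_on are hypothetical and do not come from the text.

```python
from datetime import date

def age_on(birth: date, today: date) -> int:
    """Whole years elapsed between birth and today."""
    # Subtract one year if this year's birthday has not happened yet.
    had_birthday = (today.month, today.day) >= (birth.month, birth.day)
    return today.year - birth.year - (0 if had_birthday else 1)

# Hypothetical records: an Age of 0 marks a value that was never filled in.
records = [
    {"Age": 34, "Date_of_Birth": date(1990, 3, 15)},
    {"Age": 0, "Date_of_Birth": date(1980, 11, 2)},
]
reference = date(2024, 6, 1)  # fixed reference date so the result is reproducible

for r in records:
    if r["Age"] == 0:
        r["Age"] = age_on(r["Date_of_Birth"], reference)

ages = [r["Age"] for r in records]
print(ages)
```

A fixed reference date is used instead of the current date so that the example gives the same result whenever it is run.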

Refer to the income attribute of the five customers in Table 2.1, before preprocessing.

Find the mean income before preprocessing.

The mean value for Income before preprocessing is 2,037,599.8. It is distorted by the inclusion of the Income values -40,000 (erroneous), 99,999 (probably indicating a missing value), and 10,000,000 (possible outlier).

What does this number actually mean?

In this case the mean value has little meaning because we are combining real data values with erroneous values.

Now, calculate the mean income for the three values left after preprocessing. Does this value have a meaning?

The mean value for Income produced by the values 80,000 (78,000 rounded to the nearest 5,000), 50,000, and 10,000 is 46,666.67. This value is certainly more representative of the true mean for Income, now that the records containing questionable values have been excluded.
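Both means can be checked with a few lines of Python. The raw list below is an assumption pieced together from the Income values quoted in these answers, since Table 2.1 itself is not reproduced here.

```python
from statistics import mean

# Assumed raw Income values, taken from the figures quoted in the answers.
raw_income = [78_000, -40_000, 10_000_000, 50_000, 99_999]
# The three values the answer uses after preprocessing.
cleaned_income = [80_000, 50_000, 10_000]

mean_raw = mean(raw_income)
mean_cleaned = mean(cleaned_income)

print(round(mean_raw, 1), round(mean_cleaned, 2))
```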

Explain why zip codes should be considered text variables rather than numeric.

Zip codes should be considered text variables because they cannot be quantified on any numeric scale. Even their order has no numerical significance.
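A practical side effect of storing zip codes as numbers can be shown with a short Python sketch. The zip code used is only an illustrative value.

```python
zip_text = "02134"            # illustrative zip code with a leading zero
zip_number = int(zip_text)    # numeric storage silently drops the leading zero
round_trip = str(zip_number)  # converting back does not restore it

print(zip_text, zip_number, round_trip)
```

Keeping the field as text avoids this loss and also discourages meaningless arithmetic such as averaging zip codes.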

What is an outlier? Why do we need to treat outliers carefully?

Consider a set of numerical observations and the center of this observation set. An outlier is an observation that lies much farther away from the center than the majority of the other observations in the set.

We must treat outliers carefully because, when they lie significantly farther away than the other observations in the set, they can cause us to misrepresent the true center of the observation set.

Explain why a birthdate variable would be preferred to an age variable in a database.

A birthdate variable is preferable to an age variable in a database because (1) one can always derive age from birthdate by taking the difference from the current date, and (2) age is relative to the current date only and would need to be updated continuously over time in order to remain accurate.

True or false: All things being equal, more information is almost always better.

The answer is true. In general, more information is almost always better. The more information we have to work with, the more insight into the underlying relationships of a particular domain of discourse we can glean from it.

Explain why it is not recommended, as a strategy for dealing with missing data, to simply omit the records or fields with missing values from the analysis.

It is not recommended to omit records or fields from an analysis simply because they have missing values. The rationale for this recommendation is that omitting these fields and records may cause us to lose valuable insight into the underlying relationships that we may have gleaned from the partial information that we do have.

Which of the four methods for handling missing data would tend to lead to an underestimate of the spread (e.g., standard deviation) of the variable? What are some benefits to this method?

Replacing a missing value by the attribute value’s mean artificially reduces the measure of spread for that particular attribute. Although the mean value is not necessarily a typical value, for some data sets this form of substitution may work well. Specifically, the effectiveness of this technique depends on the size of the variation of the underlying population. In other words, the technique works well for populations having small variations, and works less effectively for populations having larger variations.

Benefits of this method include (1) ease of implementation (i.e., only one value to impute) and (2) preservation of the variable’s mean (i.e., filling in the mean leaves the mean of the observed values unchanged).
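The shrinking spread can be seen on a small made-up example. The observed values and the number of missing entries below are hypothetical, chosen only to make the effect visible.

```python
from statistics import mean, stdev

observed = [4.0, 8.0, 10.0, 14.0]  # hypothetical non-missing values
n_missing = 2                      # hypothetical number of missing entries

fill = mean(observed)                    # the single value imputed everywhere
imputed = observed + [fill] * n_missing  # data set after mean imputation

sd_before = stdev(observed)  # sample SD of the observed values only
sd_after = stdev(imputed)    # sample SD after imputation

print(round(sd_before, 2), round(sd_after, 2))
```

The imputed records add nothing to the sum of squared deviations but increase the count, so the standard deviation can only go down.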

What are some of the benefits and drawbacks for the method for handling missing data that chooses values at random from the variable distribution?

By using the data values randomly generated from the variable distribution, the measures of center and spread are most likely to remain similar to the original; however, there is a chance that the resulting records may not make intuitive sense.

Of the four methods for handling missing data, which method is preferred?

Having the analyst choose a constant to replace missing values, based on specific domain knowledge, is probably the most conservative choice overall. If missing values are replaced with a flag such as “missing” or “unknown”, in many situations those records would ultimately be excluded from the modeling process; that is, all remaining valid, potentially important values contained in those records would not be included in the data model.

Make up a classification scheme which is inherently flawed, and would lead to misclassification, as we find in Table 2.2. For example, classes of items bought in a grocery store.

Breakfast	Count
Cold Cereals	72
Sugar Smacks	1
Cheerios	2
Hot Cereals	28
Cream of Wheat	3

Using the table above, the “Breakfast” categorical attribute contains 5 apparent classes. However, upon further inspection the classes are discovered to be inconsistent. For example, both “Sugar Smacks” and “Cheerios” are cold cereals, and “Cream of Wheat” is a hot cereal. Below, the cereals are now classified according to one of two classes, “Cold Cereals” or “Hot Cereals.”

Breakfast	Count
Cold Cereals	75
Hot Cereals	31
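The reclassification can be sketched in Python as a simple mapping from each mislabeled class to its parent class. The dictionary names are illustrative only.

```python
# Counts from the flawed table, keyed by the recorded class label.
raw_counts = {
    "Cold Cereals": 72,
    "Sugar Smacks": 1,
    "Cheerios": 2,
    "Hot Cereals": 28,
    "Cream of Wheat": 3,
}

# Brand names that were recorded as classes, mapped to their proper class.
parent_class = {
    "Sugar Smacks": "Cold Cereals",
    "Cheerios": "Cold Cereals",
    "Cream of Wheat": "Hot Cereals",
}

clean_counts = {}
for label, count in raw_counts.items():
    target = parent_class.get(label, label)  # labels without a mapping stay as they are
    clean_counts[target] = clean_counts.get(target, 0) + count

print(clean_counts)
```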

Make up a data set, consisting of the heights and weights of six children, in which one of the children is an outlier with respect to one of the variables, but not the other. Then alter this data set so that the child is an outlier with respect to both variables.

In the table below, Child #1 is an outlier with respect to Weight only. All children in the table are close in Height, differing at most by 9 inches. However, all children except for Child #1 are close in Weight, differing at most by 7 pounds. Child #1 is an outlier because the Weight differs by 18 pounds from the second-heaviest child (Child #6), making this right-tailed difference in Weight greater than the entire Weight range for the other five children.

Child	Height (in)	Weight (lbs)
1	49	100
2	50	75
3	52	77
4	55	79
5	57	80
6	58	82

In the table below, Child #1 is an outlier with respect to both Height and Weight. All children except for Child #1 in the table are close in Height, differing at most by 8 inches, and are close in Weight, differing at most by 7 pounds. Child #1 is an outlier for both Height and Weight because the Height differs by 14 inches from the second-shortest child (Child #2), which is greater than the entire Height range of the other five children, and the Weight differs by 18 pounds from the second-heaviest child (Child #6), which is greater than the entire Weight range of the other five children.

Child	Height (in)	Weight (lbs)
1	36	100
2	50	75
3	52	77
4	55	79
5	57	80
6	58	82

Use the following stock price data (in dollars) for Exercises 13–18

10 7 20 12 75 15 9 18 4 12 8 14

Calculate the mean, median, and mode stock price.

The mean is calculated as the sum of the data points divided by the number of points as follows:

Mean Stock Price = (10+7+20+12+75+15+9+18+4+12+8+14) / 12 = 204 / 12 = $17.

The median is calculated by placing the prices in order and (a) selecting the middle value if the number of points is odd, or (b) taking the average of the two middle values if the number of points is even. Since we have twelve points, median is calculated as follows:

Median Stock Price = mean of center values {4,7,8,9,10,12,12,14,15,18,20,75} = 24/2 = $12.

The mode is calculated as the value that occurs the most often in the set and is calculated as follows:

Mode Stock Price = highest frequency of {4,7,8,9,10,12,12,14,15,18,20,75} = $12.

Note: for quantitative variables, the mode is usually defined as the class with the highest frequency.
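These three measures can be verified with Python's standard statistics module. This is only a quick check of the arithmetic above.

```python
from statistics import mean, median, mode

prices = [10, 7, 20, 12, 75, 15, 9, 18, 4, 12, 8, 14]

mean_price = mean(prices)      # sum of the prices divided by 12
median_price = median(prices)  # average of the two middle values after sorting
mode_price = mode(prices)      # most frequent value

print(mean_price, median_price, mode_price)
```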

Compute the standard deviation of the stock price. Interpret what this number means.

The standard deviation represents the expected distance of a point chosen at random from a data set to the center of that set and is calculated by taking the square root of the variance. The variance is the average of the sum of squared distances of each point from the data-set mean. Given that the mean is $17 (see Exercise #13) for this set, the variance for the set of stock prices is calculated as follows:

Stock Price Variance (Var) = (sum of squared deviations from the mean) / (n - 1), where the sum of squared deviations is

(4-17)²+(7-17)²+(8-17)²+(9-17)²+(10-17)²+(12-17)²+(12-17)²+(14-17)²+(15-17)²+(18-17)²+(20-17)²+(75-17)² =

(-13)² + (-10)² + (-9)² + (-8)² + (-7)² + (-5)² + (-5)² + (-3)² + (-2)² + (1)² + (3)² + (58)² =

169 + 100 + 81 + 64 + 49 + 25 + 25 + 9 + 4 + 1 + 9 + 3364 = 3900.

Dividing by n - 1 = 11 gives Var = 3900 / 11 = 354.55 $².

Taking the square root of the variance, the standard deviation (SD) is calculated as follows:

Stock Price Standard Deviation (SD) = √(354.55) = $18.83.

An approximate 95% interval for an individual stock price (assuming approximate normality) is given by: 17 ± 1.96(18.83) = 17 ± 36.91 → [-19.91, 53.91]

Assuming that a stock price can never be less than one penny (USD), the lower limit can instead be taken as $0.01.

As we can see, each stock with the exception of the one priced at $75 is priced within this range.
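The variance and standard deviation can be checked in Python. Note that statistics.variance and statistics.stdev divide by n - 1 (the sample versions), which matches the division by 11 used in this solution.

```python
from statistics import mean, stdev, variance

prices = [10, 7, 20, 12, 75, 15, 9, 18, 4, 12, 8, 14]

mu = mean(prices)
sum_sq = sum((x - mu) ** 2 for x in prices)  # sum of squared deviations
var = variance(prices)                       # sum_sq / (n - 1)
sd = stdev(prices)                           # square root of the sample variance

print(sum_sq, round(var, 2), round(sd, 2))
```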

Find the min-max normalized stock price for the stock worth $20.

Min-Max normalization scales an observation relative to the data-set’s range resulting in a value between 0 and 1 (this value has no units) and is formulated as follows:

MinMax(X_i) = [X_i – Min(X)] / [Max(X) – Min(X)]

Therefore, the min-max normalized stock price of $20 is calculated as follows:

MinMax($20) = ($20 - $4) / ($75 - $4) = ($16) / ($71) = 0.2254.
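A minimal Python sketch of min-max normalization follows. The function name min_max is illustrative.

```python
prices = [10, 7, 20, 12, 75, 15, 9, 18, 4, 12, 8, 14]

def min_max(x, values):
    """Scale x to [0, 1] relative to the range of values."""
    lo, hi = min(values), max(values)
    return (x - lo) / (hi - lo)

mm_20 = min_max(20, prices)
print(round(mm_20, 4))
```

The smallest price ($4) maps to 0 and the largest ($75) maps to 1, so every other price falls strictly between them.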

Calculate the midrange stock price.

The midrange stock price is the central price for the entire price range and is formulated as follows:

MidRangeX = [Max(X) + Min(X)] / 2

For the problem at hand we have as follows:

MidRangeX = ($75 + $4) / 2 = ($79) / 2 = $39.5

Compute the Z-score standardized stock price for the stock worth $20.

Z-score standardization rescales the observations so that the mean is zero and the SD is 1, with most values lying between -3 and 3 (the result has no units). It is formulated as follows:

Z-Score(X_i) = [X_i – Mean(X)] / SD(X)

Given the mean of $17 (see Exercise #13) and the SD of $18.83 (see Exercise #14), the Z-score for the stock price of $20 is calculated as follows:

Z-Score($20) = ($20 - $17) / $18.83 = ($3) / $18.83 = 0.1593.

Please note that this value makes sense as it is slightly greater than zero just as $20 is slightly greater than $17.
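The same calculation can be sketched in Python, using the sample standard deviation as in Exercise #14. The function name z_score is illustrative.

```python
from statistics import mean, stdev

prices = [10, 7, 20, 12, 75, 15, 9, 18, 4, 12, 8, 14]
mu = mean(prices)
sd = stdev(prices)  # sample SD (divides by n - 1)

def z_score(x):
    """Number of standard deviations x lies above (+) or below (-) the mean."""
    return (x - mu) / sd

z_20 = z_score(20)
print(round(z_20, 4))
```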

Find the decimal scaling stock price for the stock worth $20.

Decimal standardization scales an observation to a value between -1 and 1 (this value has no units) and is formulated as follows:

Decimal(X_i) = X_i / 10^d

where d is the number of digits in the observation having the largest absolute value in the data set. Since the largest stock price is $75, d = 2, as there are two digits in this price. The decimal-scaled value for the stock worth $20 is then calculated as follows:

Decimal($20) = $20 / 10² = $20 / 100 = 0.20
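A minimal Python sketch of decimal scaling for this data set follows. It assumes that d is taken as the digit count of the largest absolute value; the function name decimal_scale is illustrative.

```python
prices = [10, 7, 20, 12, 75, 15, 9, 18, 4, 12, 8, 14]

def decimal_scale(x, values):
    """Divide x by 10**d, where d is the digit count of the largest |value|."""
    largest = max(abs(v) for v in values)
    d = len(str(int(largest)))  # 75 has two digits, so d = 2
    return x / 10 ** d

scaled_20 = decimal_scale(20, prices)
scaled_75 = decimal_scale(75, prices)
print(scaled_20, scaled_75)
```

Every price is divided by the same power of ten, so the $20 stock becomes 0.20 and the largest stock, $75, becomes 0.75.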

Calculate the skewness of the stock price data.

Skewness is the lack of symmetry in a distribution and can be measured using the following formula:

Skewness = 3 [Mean(X) – Median(X)] / SD(X)

Given the mean of $17 and median of $12 (see Exercise #13), and an SD of $18.83 (see Exercise #14), the skewness for the stock price distribution is calculated as follows:

Skewness = 3 [$17 - $12] / $18.83 = 3[$5] / $18.83 = $15 / $18.83 = 0.7966.

We observe that this distribution is right-skewed since a right-skewed distribution has a mean that is greater than its median yielding a positive skewness value. In contrast, a left-skewed distribution will have a mean that is less than its median and thus a negative skewness value.
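The skewness formula used above, 3 (mean - median) / SD, can be checked in Python as follows.

```python
from statistics import mean, median, stdev

prices = [10, 7, 20, 12, 75, 15, 9, 18, 4, 12, 8, 14]

# Skewness as defined in this solution: 3 * (mean - median) / SD.
skewness = 3 * (mean(prices) - median(prices)) / stdev(prices)

print(round(skewness, 4))
```

A positive result indicates right skew and a negative result indicates left skew.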

Explain why data analysts need to normalize their numeric variables.

Data analysts need to normalize their numeric variables as it places all variables on the same scale. Normalizing all variables to the same scale is critical when performing operations that are sensitive to data variation or spread so that variables having larger variations do not adversely overpower variables having smaller variations. Most (if not all) analytic operations involving linearization (e.g. Regression, PCA, MANOVA, etc.) are sensitive to data spread.

Describe three characteristics of the standard normal distribution.

The three main characteristics of the Standard Normal Distribution are as follows:

The mean is zero
The SD is 1
It is symmetric (equal and opposite in shape and size) about the mean and ‘normal’: the mean has the highest frequency, and frequency decreases symmetrically as distance from the mean increases.

If a distribution is symmetric, does it follow that it is normal? Give a counterexample.

If a distribution is symmetric, it is not guaranteed to be normal. In order for a distribution to be normal, it must also have a single mode (i.e., a single value with the highest frequency) and the characteristic bell shape.

A classic counterexample is the Uniform Distribution, which is symmetric about the center of its interval, yet since all values on the interval occur with equal frequency, it has no single mode, making it non-normal.

Another counterexample is the t-distribution, which is symmetric and bell-shaped but has heavier tails than the normal distribution.

What do we look for in a normal probability plot to indicate non-normality?

A normal probability plot is simply a plot of the quantiles of a given distribution against the quantiles of the Standard Normal Distribution. If the quantiles are approximately equal, then the plot will approximate a straight line, indicating that the given distribution is normal.

In contrast, if the quantiles of the distribution are not equal to those of the Standard Normal Distribution, then the plot will not approximate a straight line, indicating non-normality.

Use the stock price data for Exercises 24–26.

Do the following:

Identify the outlier.

The outlier is the stock price of $75. The difference from the next-closest stock price ($20) is $55, which is nearly 3.5X larger than the entire range of the other eleven stocks (i.e. $16).

Verify that this value is an outlier, using the Z-score method.

We can also verify that $75 is in fact an outlier using the Z-score method. The Z-score for this stock is calculated using our mean of $17 (see Exercise #13) and our SD of $18.83 (see Exercise #14) as follows:

Z-Score($75) = ($75 - $17) / $18.83 = ($58) / $18.83 = 3.08.

Since a Z-score that is less than -3 or greater than 3 is considered an outlier, we conclude that stock price $75 is an outlier as its Z-score is 3.08 which is greater than 3.
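The Z-score rule can be sketched in Python by flagging every price whose absolute Z-score exceeds 3. The sample standard deviation is used, as in Exercise #14.

```python
from statistics import mean, stdev

prices = [10, 7, 20, 12, 75, 15, 9, 18, 4, 12, 8, 14]
mu, sd = mean(prices), stdev(prices)

z_75 = (75 - mu) / sd
# Flag any price whose Z-score is below -3 or above 3.
outliers = [x for x in prices if abs((x - mu) / sd) > 3]

print(round(z_75, 2), outliers)
```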

Verify that this value is an outlier, using the IQR method.

We can also verify that $75 is in fact an outlier using the Inter-Quartile Range or IQR method. The quartiles are determined by placing the stock prices in ascending order and dividing them into four parts as follows:

The ordered stock prices are: {4,7,8,9,10,12,12,14,15,18,20,75}, and since there are an even number of values, we partition as {4,7,8,9,10,12} and {12,14,15,18,20,75}

Location for Q1: (n - 1)(p / 100) + 1 = 11(0.25) + 1 = 3.75, so

Q1 = 8 + (9 - 8)(0.75) = 8.75

Location for Q3: (n - 1)(p / 100) + 1 = 11(0.75) + 1 = 9.25, so

Q3 = 15 + (18 - 15)(0.25) = 15.75

We then calculate IQR = Q3 – Q1 as follows:

IQR = $15.75 – $8.75 = $7

If an observation is an outlier, then it will have a value that is less than Q1 - 1.5IQR or a value greater than Q3 + 1.5IQR. We then calculate the upper and lower boundary values for the stock price set as follows:

LowerBound = Q1 - 1.5IQR = 8.75 – 1.5(7) = 8.75 – 10.5 = -$1.75

UpperBound = Q3 + 1.5IQR = 15.75 + 1.5(7) = 15.75 + 10.5 = $26.25

Since $75 is greater than $26.25, we conclude that $75 is an outlier.
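The IQR calculation can be sketched in Python using the same location rule, (n - 1)(p / 100) + 1, with linear interpolation between neighboring values. The helper name percentile is illustrative; statistics.quantiles with method="inclusive" applies the same rule and is used here only as a cross-check.

```python
from statistics import quantiles

prices = [10, 7, 20, 12, 75, 15, 9, 18, 4, 12, 8, 14]

def percentile(values, p):
    """p-th percentile by linear interpolation at location (n - 1) * p / 100."""
    s = sorted(values)
    pos = (len(s) - 1) * p / 100  # zero-based location in the sorted list
    i = int(pos)
    frac = pos - i
    if i + 1 < len(s):
        return s[i] + (s[i + 1] - s[i]) * frac
    return s[i]

q1, q3 = percentile(prices, 25), percentile(prices, 75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in prices if x < lower or x > upper]

# Cross-check against the standard library's inclusive quartiles.
q1_check, _, q3_check = quantiles(prices, n=4, method="inclusive")

print(q1, q3, iqr, lower, upper, outliers)
```

Other quartile conventions exist and can give slightly different values for Q1 and Q3, so the bounds depend on the rule chosen.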

Identify all possible stock prices that would be outliers, using:

The Z-score method.

The ordered stock prices are: {4,7,8,9,10,12,12,14,15,18,20,75}, where the mean of $17 lies between the $15 and $18 stocks, and the SD is $18.83. We already know that $75 is an outlier, having a Z-score of 3.08 (see Exercise #24). Next, we check the Z-scores for the smallest price, $4, and the largest remaining price, $20:

Z-Score($4) = ($4 - $17) / $18.83 = (-$13) / $18.83 = -0.69.

Z-Score($20) = ($20 - $17) / $18.83 = ($3) / $18.83 = 0.16.

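Since both of these Z-scores lie well within the range from -3 to 3, no other outliers are identified using Z-score standardization. More generally, any stock price below $17 - 3($18.83) = -$39.49 or above $17 + 3($18.83) = $73.49 would be flagged as an outlier by this method.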

The IQR method.

From Exercise #24, we have a Lower Bound of -1.75, and an Upper Bound of 26.25.

Therefore, the stock price of $75 is once again the only outlier, as it is greater than the upper bound of $26.25.

Investigate how the outlier affects the mean and median by doing the following:

Find the mean score and the median score, with and without the outlier.

The mean for the entire set of stock prices is $17 (see Exercise #13), and the mean without the $75 outlier is calculated as follows:

Mean_{No_Outlier} = (10+7+20+12+15+9+18+4+12+8+14) / 11 = 129 / 11 = $11.73.

The median with and without the outlier is calculated as follows:

Median Stock Price = mean of center values {4,7,8,9,10,12,12,14,15,18,20,75} = 24/2 = $12.

Median_{No_Outlier} = middle value of {4,7,8,9,10,12,12,14,15,18,20} = $12.

State which measure, the mean or the median, the presence of the outlier affects more, and why.

The presence of the outlier affects the mean more than the median. It increases the mean by $5.27 (from $11.73 to $17), and has no effect on the median.

For this particular data set, the outlier affects the mean more than the median because the mean determines the numerical center of the data set through interpolation and this data is right-skewed having a large right-tailed outlier. In contrast, the median determines the distributive center of the dataset through physical partitioning and the largest value of the lower half of the data is equal to the smallest value of the upper half of this data set.
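The comparison can be reproduced in Python by computing both measures with and without the $75 stock.

```python
from statistics import mean, median

prices = [10, 7, 20, 12, 75, 15, 9, 18, 4, 12, 8, 14]
without_outlier = [x for x in prices if x != 75]

mean_shift = mean(prices) - mean(without_outlier)        # change in the mean
median_shift = median(prices) - median(without_outlier)  # change in the median

print(round(mean(without_outlier), 2), round(mean_shift, 2), median_shift)
```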

What are the four common methods for binning numerical predictors? Which of these are preferred?

The four common methods for binning numerical predictors are as follows:

Equal width binning – this method divides the numerical predictor into k categories of equal width, where k is chosen by the client or analyst.

Equal frequency binning - this method divides the numerical predictor into k categories, each having n/k records, where n is the total number of records.

Binning by clustering – this method uses a clustering algorithm, such as k-means clustering.

Binning based on predictive value - this method partitions the numerical predictor based on the effect each partition has on the value of the target variable.

The preferred methods are Binning by Clustering (method #3) and Binning based on predictive value (method #4). Both methods determine the partitions by the nature of the data and its underlying relationships.

In contrast, the Equal Width Binning (method #1) and Equal Frequency Binning (method #2) determine the partitions simply by their individual numeric values. In general, Equal Width Binning should not be used for anything more than rough exploration as the use of equal width bins is very susceptible to outliers. The method of Equal Frequency Binning is inherently flawed in that it can produce bins having overlapping values.

Use the following data set for Exercises 28–30: 1 1 1 3 3 7

Bin the data into three bins of equal width (width = 3).

Using equal-width binning with width=3, the bin boundaries are calculated as follows:

Bin1: 0 <= X < 3 containing {1,1,1}

Bin2: 3 <= X < 6 containing {3,3}

Bin3: 6 <= X < 9 containing {7}
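A minimal Python sketch of this equal-width binning follows. It assumes, as in the boundaries above, that the first bin starts at 0 and that each bin includes its lower boundary but not its upper one.

```python
data = [1, 1, 1, 3, 3, 7]
width = 3

bins = {}
for x in data:
    idx = x // width  # 0 -> [0, 3), 1 -> [3, 6), 2 -> [6, 9)
    bins.setdefault(idx, []).append(x)

print(bins)
```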

Bin the data into three bins of two records each.

Binning this set into three bins of two records each is an application of equal-frequency binning with k=3, and since n=6, the bin size is n/k = 6/3 = 2. The bins are as follows:

{1,1}, {1,3}, {3,7}
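Equal-frequency binning can be sketched in Python by sorting the data and slicing it into consecutive groups of n/k records. The sketch assumes that n is evenly divisible by k, as it is here.

```python
data = [1, 1, 1, 3, 3, 7]
k = 3

s = sorted(data)
size = len(s) // k  # n / k records per bin (6 / 3 = 2)
bins = [s[i:i + size] for i in range(0, len(s), size)]

# Values that end up in more than one bin.
shared = sorted({x for a, b in zip(bins, bins[1:]) for x in a if x in b})

print(bins, shared)
```

The list of shared values makes the overlap problem concrete: identical values are forced into different bins purely to keep the bin sizes equal.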

Clarify why each of the binning solutions above are not optimal.

Although this toy data set is relatively small and the clarification may not be obvious, the equal-width binning from Exercise #28 is suboptimal since Bin3 contains the outlier value 7, giving the illusion that the discretized class Bin3 is just as close to Bin2 as Bin1 is, when in fact it is much farther away.

The equal-frequency binning from Exercise #29 is also suboptimal since the three bins contain overlapping values. For example, the value 1 lies in both the first and second bins, and the value 3 lies in both the second and third bins. Therefore, models constructed from this new set of discretized classes are bound to produce unpredictable results.

Explain why we might not want to remove a variable that had 90% or more missing values.

In general, analysts should not remove variables that have a large number of missing values. The small number of values that are present may in fact be representative of the underlying population, and it would therefore be worthwhile to attempt to impute the missing values. However, even if the small number of values is not representative of the underlying population, the fact that the variable has missing values may itself be correlated with other variables and provide predictive power that would be lost if this data were discarded.

Explain why we might not want to remove a variable just because it is highly correlated with another variable.

In general, analysts should not remove variables, even when they are highly correlated (e.g. correlation >0.9). An analyst may be tempted to remove one of a pair of highly correlated variables in order to avoid over-emphasizing a particular informational characteristic. However, although removing a highly correlated variable avoids potential “double-counting”, doing so may also cause a loss of valuable predictive relationships to other variables that the target variable is not highly correlated with.

As an alternative to removing highly correlated variables, it is recommended that the analyst employ Principal Component Analysis to translate the highly correlated variables into a set of uncorrelated principal components.
