Week 6: Machine Learning


Business Analytics

© 2021 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Predictive Data Mining

Introduction
• An observation, or record, is the set of recorded values of variables
associated with a single entity.
• Supervised learning: Data mining methods for predicting an
outcome based on a set of input variables, or features.
• Supervised learning can be used for:
• Estimation of a continuous outcome.
• Classification of a categorical outcome.

Introduction
The data mining process comprises the following steps:
1. Data sampling. Extract a sample of data that is relevant to the business
problem under consideration.
2. Data preparation. Manipulate the data to put it in a form suitable for formal
modeling.
3. Data partitioning. Divide the sample data into sets for training, validating, and
testing the performance of the data mining algorithm.
4. Model construction. Apply the appropriate data mining technique to the
training data set to accomplish the desired data mining task (classification or
estimation).
5. Model assessment. Evaluate models by comparing performance on the
training and validation data sets.
Data Sampling, Preparation, and Partitioning

• When dealing with large volumes of data, best practice is to extract a representative sample for analysis.
• A sample is representative if the analyst can draw the same conclusions from it as from the entire population of data.
• Data mining algorithms typically are more effective given more data.
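
As a concrete illustration, a minimal sketch of drawing a representative random sample with pandas; the file name customer_records.csv and the sample size are hypothetical.

```python
import pandas as pd

# Hypothetical source of a large volume of data.
df = pd.read_csv("customer_records.csv")

# Draw a simple random sample of 10,000 observations;
# a fixed random_state makes the sample reproducible.
sample = df.sample(n=10_000, random_state=42)

# A quick representativeness check: compare summary statistics
# of the sample against the entire population of data.
print(df.describe())
print(sample.describe())
```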

Data Sampling, Preparation, and Partitioning

• When obtaining a representative sample, it is generally best to include as many variables as possible in the sample.
• After exploring the data with descriptive statistics and visualization, the analyst can eliminate variables that are not of interest.
• The abundance of data in data mining applications simplifies the process of assessing the accuracy of data-based estimates of variable effects.

Data Sampling, Preparation, and Partitioning

• Overfitting occurs when the analyst builds a model that does a great job of explaining the sample of data on which it is based, but fails to accurately predict outside the sample data.
• We can use the abundance of data to guard against the potential for overfitting by splitting the data set into different subsets, as sketched below, for:
• The training (or construction) of candidate models.
• The validation and testing (or performance comparison/assessment) of candidate models and of the future performance of a selected model.
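
A minimal sketch of such a split using scikit-learn; the 60/20/20 training/validation/test proportions and the synthetic data are illustrative assumptions, not a prescribed rule.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the sample data: X = features, y = outcome.
X, y = make_classification(n_samples=1000, random_state=42)

# Hold out 20% of the observations as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

# Split the remainder into training (60% of the total) and
# validation (20% of the total) sets: 0.25 of the remaining 80% is 20%.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)
```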

Data Sampling, Preparation, and Partitioning

k-Fold Cross-Validation
• k-Fold Cross-Validation: A robust procedure to train and validate models in which the observations used to train and validate the model are randomly divided into k subsets called folds. In each iteration, one fold is designated as the validation set and the remaining k − 1 folds are designated as the training set. The results of the k iterations are then combined and evaluated.
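
A minimal sketch of k-fold cross-validation with scikit-learn, assuming k = 5 and a logistic regression classifier purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=42)

# cv=5 divides the observations into 5 folds; each fold serves once
# as the validation set while the other 4 folds train the model.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)         # performance (accuracy) on each fold
print(scores.mean())  # combined result across the 5 iterations
```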

Data Sampling, Preparation, and Partitioning
Class Imbalanced Data
• There are two basic sampling approaches for modifying the class
distribution of the training set:
• Undersampling: Balances the number of Class 1 and Class 0
observations in a training set by removing majority class
observations from the training set.
• Oversampling: Balances the number of Class 1 and Class 0
observations in a training set by inserting copies of minority class
observations into the training set.
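
A minimal sketch of both approaches using pandas and scikit-learn's resample utility; the 900/100 class split is a hypothetical example of imbalance.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced training set: 900 Class 0, 100 Class 1.
train = pd.DataFrame({"x": range(1000), "y": [0] * 900 + [1] * 100})
majority = train[train["y"] == 0]
minority = train[train["y"] == 1]

# Undersampling: remove majority class observations (keep only 100).
under = pd.concat(
    [resample(majority, replace=False, n_samples=100, random_state=42),
     minority])

# Oversampling: insert copies of minority class observations (up to 900).
over = pd.concat(
    [majority,
     resample(minority, replace=True, n_samples=900, random_state=42)])

print(under["y"].value_counts())
print(over["y"].value_counts())
```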

Performance Measures
Evaluating the Classification of Categorical Outcomes
Evaluating the Estimation of Continuous Outcomes

Performance Measures
Evaluating the Classification of Categorical Outcomes:
• By counting the classification errors on a sufficiently large validation set
and/or test set that is representative of the population, we will generate
an accurate measure of the model’s classification performance.
• Classification confusion matrix: Displays a model's correct and incorrect classifications, using the counts of true positives (TP), false negatives (FN), false positives (FP), and true negatives (TN):

                              Predicted Class
                            Positive    Negative
  Actual Class   Positive      TP          FN
                 Negative      FP          TN
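
A minimal sketch with scikit-learn; the label vectors are hypothetical, and labels=[1, 0] is passed so the matrix matches the layout above (positive class first).

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted class labels (1 = positive).
y_actual = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred   = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows = actual class, columns = predicted class, positive first.
cm = confusion_matrix(y_actual, y_pred, labels=[1, 0])
tp, fn, fp, tn = cm.ravel()
print(cm)                                  # [[3 1]
                                           #  [1 3]]
print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
```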

Performance Measures
Evaluating the Classification of Categorical Outcomes (cont.):
• Overall error rate: Percentage of misclassified observations:

  Overall error rate = (FN + FP) / (TP + FN + FP + TN)

• One minus the overall error rate is often referred to as the accuracy of the model:

  Accuracy = (TP + TN) / (TP + FN + FP + TN)

• While the overall error rate conveys an aggregate measure of misclassification, it counts misclassifying an actual Class 0 observation as a Class 1 observation (a false positive) the same as misclassifying an actual Class 1 observation as a Class 0 observation (a false negative).
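
Continuing the counts from the confusion-matrix sketch above, the two aggregate measures are simple arithmetic:

```python
# Confusion matrix counts from the hypothetical example above.
tp, fn, fp, tn = 3, 1, 1, 3

n = tp + fn + fp + tn
error_rate = (fn + fp) / n   # proportion of misclassified observations
accuracy = (tp + tn) / n     # equals 1 - error_rate
print(error_rate, accuracy)  # 0.25 0.75
```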

Performance Measures
Evaluating the Classification of Categorical Outcomes (cont.):
• The ability to correctly predict Class 1 (positive) observations is commonly expressed as sensitivity, or recall, and is calculated as:

  Sensitivity = Recall = TP / (TP + FN)

• The ability to correctly predict Class 0 (negative) observations is commonly expressed as specificity and is calculated as:

  Specificity = TN / (TN + FP)
Performance Measures
Evaluating the Classification of Categorical Outcomes (cont.):
• Precision is a measure that corresponds to the proportion of observations predicted to be Class 1 by a classifier that are actually in Class 1:

  Precision = TP / (TP + FP)

• The F1 Score combines precision and sensitivity into a single measure and is defined as:

  F1 Score = 2 × (Precision × Sensitivity) / (Precision + Sensitivity)
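
And a minimal sketch of both measures with scikit-learn, again on the hypothetical labels:

```python
from sklearn.metrics import f1_score, precision_score

y_actual = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred   = [1, 0, 0, 1, 0, 1, 1, 0]

precision = precision_score(y_actual, y_pred)  # TP / (TP + FP)
f1 = f1_score(y_actual, y_pred)                # harmonic mean of P and R
print(precision, f1)  # 0.75 0.75
```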

Performance Measures
Evaluating the Classification of Categorical Outcomes (cont.):
• The receiver operating characteristic (ROC) curve is an alternative
graphical approach for displaying the tradeoff between a classifier’s ability
to correctly identify Class 1 observations and its Class 0 error rate.
• In general, we can evaluate the quality of a classifier by computing the area
under the ROC curve, often referred to as the AUC.
• The greater the area under the ROC curve, i.e., the larger the AUC, the
better the classifier performs.
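
A minimal sketch of computing the ROC curve and the AUC with scikit-learn; the predicted Class 1 probabilities are hypothetical.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical actual labels and predicted Class 1 probabilities.
y_actual = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob   = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]

# tpr is sensitivity (correctly identified Class 1 observations);
# fpr is the Class 0 error rate (1 - specificity).
fpr, tpr, thresholds = roc_curve(y_actual, y_prob)
print(roc_auc_score(y_actual, y_prob))  # area under the ROC curve
```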

Performance Measures
Figure: Receiver Operating Characteristic (ROC) Curve

