Classification and Prediction


Classification and Prediction
Introduction

Databases are rich with hidden information that can be used for intelligent decision making.
Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends.
Classification predicts categorical (discrete, unordered) labels, whereas prediction models continuous-valued functions.
Ex: Build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to estimate the expenditures in dollars of potential customers given their income and occupation.
Most algorithms are memory resident, assuming a small data size. Recent data mining research has been developing scalable classification and prediction techniques capable of handling large, disk-resident data.
What is Classification?

A bank loans officer needs analysis of her data in order to learn which loan applicants
are “safe” and which are “risky” for the bank.
A marketing manager at AllElectronics needs data analysis to help guess whether a
customer with a given profile will buy a new computer.
A medical researcher wants to analyze breast cancer data in order to predict which
one of three specific treatments a patient should receive.
In each of these examples, the data analysis task is classification, where a model or
classifier is constructed to predict categorical labels:
- “safe” or “risky” for the loan application data;
- “yes” or “no” for the marketing data;
- “treatment A,” “treatment B,” or “treatment C” for the medical data.
What is Prediction?

A marketing manager would like to predict how much a given customer will spend
during a sale at AllElectronics.
This data analysis task is an example of numeric prediction, where the model
constructed predicts a continuous-valued function, or ordered value. This model is
a predictor.
Regression analysis is a statistical methodology that is most often used for numeric
prediction.
Classification and numeric prediction are the two major types of prediction problems.
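
For instance, a minimal numeric-prediction sketch using linear regression (scikit-learn assumed; the income and expenditure figures below are invented purely for illustration):

# Numeric prediction via regression: fit a continuous-valued function of income.
import numpy as np
from sklearn.linear_model import LinearRegression

income = np.array([[30000], [45000], [60000], [80000]])   # predictor attribute (dollars)
spend = np.array([400, 650, 900, 1300])                    # continuous target (dollars)

predictor = LinearRegression().fit(income, spend)
print(predictor.predict(np.array([[55000]])))               # predicted expenditure for a new customer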
General approach to Classification

•Data classification is a two-step process:
–Learning - where a classification model is constructed (training phase)
–Classification - where the model is used to predict class labels for given data (testing phase)
•Learning step: a classification algorithm builds the classifier by analyzing or "learning from" a training set.
–The training set is made up of database tuples and their associated class labels; training tuples are randomly sampled from the database under analysis.
–Data tuples can be referred to as samples, examples, instances, data points, or objects.
–The class label attribute is discrete-valued and unordered; it is categorical (or nominal) in that each value serves as a category or class.
–Because the class label of each training tuple is provided, this step is also known as supervised learning.
•Classification step: the learned model is used to predict class labels for test data.
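
As a rough sketch of this two-step process (scikit-learn assumed; the tiny loan data set and its attribute values are hypothetical), the learning step fits a classifier to labeled training tuples and the classification step applies it to a new tuple:

# Learning step: build a classifier from training tuples with known class labels.
from sklearn.tree import DecisionTreeClassifier

X_train = [[25, 30], [35, 60], [45, 80], [50, 40], [23, 20]]   # (age, income) tuples
y_train = ["risky", "safe", "safe", "safe", "risky"]            # associated loan_decision labels

clf = DecisionTreeClassifier().fit(X_train, y_train)

# Classification step: predict the class label of a previously unseen tuple.
print(clf.predict([[30, 55]]))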
General approach to Classification

•Learning: Training data are analyzed by a classification algorithm. Here, loan_decision is the
class label attribute, and the classification rules represent the learned model or classifier.
General approach to Classification

Classification: Test data are used to estimate the accuracy of the classification rules. If the accuracy is considered acceptable, the rules can be applied to the classification of new data tuples.
General approach to Classification

•This first step of the classification process can also be viewed as the learning of a mapping or function, y = f(X)
–X is a data tuple input to the function
–y is the associated class label output
•This mapping is represented in the form of classification rules, decision trees, or mathematical formulae.
Classification Accuracy

First, the predictive accuracy of the classifier is estimated.

If we were to use the training set to measure the classifier’s accuracy, this
estimate would likely be optimistic, because the classifier tends to overfit the data.

Therefore, a test set is used, made up of test tuples and their associated class
labels. They are independent of the training tuples, meaning that they were not
used to construct the classifier.

The accuracy of a classifier on a given test set is the percentage of test set tuples
that are correctly classified by the classifier. The associated class label of each
test tuple is compared with the learned classifier’s class prediction for that tuple.
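
A small sketch of this accuracy estimate (the test labels below are hypothetical; scikit-learn's accuracy_score simply compares predicted and known labels):

# Accuracy = percentage of test tuples whose predicted label matches the known label.
from sklearn.metrics import accuracy_score

y_test = ["safe", "risky", "safe", "safe"]        # known class labels of the test tuples
y_pred = ["safe", "risky", "risky", "safe"]       # labels predicted by the learned classifier

print(accuracy_score(y_test, y_pred))              # 0.75, i.e. 75% correctly classified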
Issues Regarding Classification and Prediction

The following preprocessing steps may be applied to the data to help improve the accuracy, efficiency,
and scalability of the classification or prediction process.

Data cleaning: the removal or reduction of noise and the treatment of missing values. Although most
classification algorithms have some mechanisms for handling noisy or missing data, this step can help
reduce confusion during learning.

Relevance analysis: Correlation analysis can be used to identify whether any two given attributes are
statistically related.

For example, a strong correlation between attributes A1 and A2 would suggest that one of the two could
be removed from further analysis.

Attribute subset selection can be used to find a reduced set of attributes such that the resulting
probability distribution of the data classes is as close as possible to the original distribution obtained
using all attributes.
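
A sketch of correlation-based relevance analysis, assuming numeric attributes held in a pandas DataFrame (the attribute names A1, A2, A3 and their values are placeholders):

# A strong pairwise correlation suggests one of the two attributes may be dropped.
import pandas as pd

df = pd.DataFrame({
    "A1": [1, 2, 3, 4, 5],
    "A2": [2, 4, 6, 8, 10],    # perfectly correlated with A1 (redundant)
    "A3": [5, 3, 6, 2, 7],
})
print(df.corr())               # correlation matrix; |r| close to 1 flags redundancy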
Issues Regarding Classification and Prediction

Data transformation and reduction

The data may be transformed by normalization, particularly when neural networks or methods involving distance measurements are used in the learning step.

The data can also be transformed by generalizing it to higher-level concepts. Concept hierarchies may be used for this purpose. This is particularly useful for continuous-valued attributes.

Because generalization compresses the original training data, fewer input/output operations may be involved during learning.

Data can also be reduced by applying many other methods, ranging from wavelet transformation and principal components analysis to discretization techniques such as binning, histogram analysis, and clustering.
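
A sketch of two such transformations, assuming scikit-learn: min-max normalization (useful for neural networks and distance-based methods) and equal-width binning as a simple discretization; the income values are illustrative only.

# Min-max normalization scales values to [0, 1]; binning discretizes them into intervals.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, MinMaxScaler

income = np.array([[20000.0], [35000.0], [50000.0], [90000.0]])

normalized = MinMaxScaler().fit_transform(income)
binned = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform").fit_transform(income)
print(normalized.ravel(), binned.ravel())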
Comparing Classification and Prediction Methods

Classification and prediction methods can be compared and evaluated according to the following criteria:

Accuracy: The accuracy refers to the ability of a given classifier to correctly predict the class label of new or previously unseen data (i.e., tuples without class label information). Similarly, the accuracy of a predictor refers to how well a given predictor can guess the value of the predicted attribute for new or previously unseen data.

Speed: This refers to the computational costs involved in generating and using the given classifier or
predictor.

Robustness: This is the ability of the classifier or predictor to make correct predictions given noisy data or data with missing values.

Scalability: This refers to the ability to construct the classifier or predictor efficiently given large amounts of
data.

Interpretability: This refers to the level of understanding and insight that is provided by the classifier or
predictor. Interpretability is subjective and therefore more difficult to assess.
Classification by Decision Tree Induction

Decision tree induction is the learning of decision trees from class-labeled training tuples.

A decision tree is a flowchart-like tree structure where
- each internal node (non-leaf node) denotes a test on an attribute
- each branch represents an outcome of the test
- each leaf node (or terminal node) holds a class label.

Fig: A decision tree for the concept buys_computer, indicating whether an AllElectronics customer is likely to purchase a computer.
Classification by Decision Tree Induction

“How are decision trees used for classification?”

● Given a tuple, X, for which the associated class label is unknown, the attribute values of the tuple
are tested against the decision tree.
● A path is traced from the root to a leaf node, which holds the class prediction for that tuple. Decision
trees can easily be converted to classification rules.
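
For the buys_computer tree sketched earlier, tracing such a path amounts to evaluating nested attribute tests; the rules below follow the usual version of that figure and should be read as illustrative:

# The buys_computer decision tree expressed as nested attribute tests (classification rules).
def buys_computer(age, student, credit_rating):
    if age == "youth":
        return "yes" if student == "yes" else "no"
    if age == "middle_aged":
        return "yes"
    # age == "senior": the test on credit_rating decides the class
    return "yes" if credit_rating == "fair" else "no"

# Trace for one tuple X: age=youth -> student=yes -> leaf "yes"
print(buys_computer("youth", "yes", "fair"))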

“Why are decision tree classifiers so popular?”

● The construction of decision tree classifiers does not require any domain knowledge or parameter
setting.
● Decision trees can handle multidimensional data.
● The learning and classification steps of decision tree induction are simple and fast with good
accuracy.
● Widely used for classification in many application areas, such as medicine, manufacturing and production, financial analysis, astronomy, and molecular biology.
Decision Tree Induction Algorithm

● During the late 1970s and early 1980s, J. Ross Quinlan, a researcher in
machine learning, developed a decision tree algorithm known as ID3
(Iterative Dichotomiser).
● This work expanded on earlier work on concept learning systems,
described by E. B. Hunt, J. Marin, and P. T. Stone. Quinlan later presented
C4.5 (a successor of ID3), which became a benchmark to which newer
supervised learning algorithms are often compared.
● In 1984, a group of statisticians (L. Breiman, J. Friedman, R. Olshen, and C.
Stone) published the book Classification and Regression Trees (CART),
which described the generation of binary decision trees.
Decision Tree Induction Algorithm

● ID3, C4.5, and CART adopt a greedy (i.e., non-backtracking) approach in which decision trees are constructed in a top-down recursive divide-and-conquer manner.
● Starts with a training set of tuples and their associated class labels. The training set is recursively partitioned into smaller subsets as the tree is being built.
Fig: Basic algorithm for inducing a decision tree from training tuples. An attribute selection measure (information gain, gain ratio, or Gini index) supplies the splitting criterion; node N is labeled with the splitting criterion (step 6), and partitioning continues until each partition is pure or a terminating condition holds.
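
Since the algorithm figure does not reproduce well as text, here is a hedged, runnable sketch of the top-down induction idea (an ID3-style split on information gain over categorical attributes; simplified in that branches are created only for attribute values actually present in the partition):

# Minimal ID3-style sketch: recursive partitioning with information gain as the measure.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def generate_tree(rows, labels, attributes):
    if len(set(labels)) == 1:                       # all tuples belong to the same class
        return labels[0]
    if not attributes:                              # no attributes left: majority voting
        return Counter(labels).most_common(1)[0][0]

    def gain(a):                                    # information gain of splitting on a
        g = entropy(labels)
        for v in set(r[a] for r in rows):
            sub = [l for r, l in zip(rows, labels) if r[a] == v]
            g -= len(sub) / len(labels) * entropy(sub)
        return g

    best = max(attributes, key=gain)                # splitting criterion labels this node
    tree = {best: {}}
    for v in set(r[best] for r in rows):
        keep = [i for i, r in enumerate(rows) if r[best] == v]
        tree[best][v] = generate_tree([rows[i] for i in keep],
                                      [labels[i] for i in keep],
                                      [a for a in attributes if a != best])
    return tree

# Tiny illustrative data set (hypothetical values).
rows = [{"age": "youth", "student": "no"}, {"age": "youth", "student": "yes"},
        {"age": "senior", "student": "no"}, {"age": "senior", "student": "yes"}]
labels = ["no", "yes", "no", "yes"]
print(generate_tree(rows, labels, ["age", "student"]))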
Decision Tree Induction Algorithm
The recursive partitioning stops only when any one of the following terminating conditions
is true:
● All the tuples in partition D (represented at node N) belong to the same class (steps 2 and 3).
● There are no remaining attributes on which the tuples may be further partitioned (step 4). In
this case, majority voting is employed (step 5). This involves converting node N into a leaf and
labeling it with the most common class in D.
● There are no tuples for a given branch, that is, a partition Dj is empty (step 12). In this case, a
leaf is created with the majority class in D (step 13).

The computational complexity of the algorithm given training set D is O(n × |D| × log(|D|)), where n is the number of attributes describing the tuples in D and |D| is the number of training tuples in D.

Attribute Selection Measures: Decision Tree Induction

● An attribute selection measure is a heuristic for selecting the splitting criterion that "best" separates a given data partition, D, of class-labeled training tuples into individual classes.
● Provides a ranking for each attribute describing the given training tuples.
● The attribute having the best score for the measure is chosen as the splitting attribute for the given tuples.
● If the splitting attribute is continuous-valued or if we are restricted to binary trees, then either a split point or a splitting subset must also be determined as part of the splitting criterion.

Three popular attribute selection measures are information gain, gain ratio, and the Gini index.

Attribute Selection Measures: Information Gain

Used Notations:

D is a training set of class-labeled tuples.

Suppose the class label attribute has m distinct values defining m distinct classes, Ci (for i = 1, ..., m).

Ci,D is the set of tuples of class Ci in D.

|D| and |Ci,D| denote the number of tuples in D and Ci,D, respectively.
Attribute Selection Measures: Information Gain
•ID3 uses information gain as its attribute selection measure.
•Let node N represent or hold the tuples of partition D. Choose the attribute with the highest information gain as the splitting attribute for node N.
•This attribute minimizes the information needed to classify the tuples in the resulting partitions and reflects the least randomness or impurity in the partitions.
•The expected information needed to classify a tuple in D is given by

Info(D) = − Σ (i = 1 to m) pi log2(pi)

•pi is the nonzero probability that an arbitrary tuple in D belongs to class Ci and is estimated by |Ci,D| / |D|.
•The log function to the base 2 is used because the information is encoded in bits.
•Info(D) is also known as the entropy of D: the average amount of information needed to identify the class label of a tuple in D.
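
A tiny sketch of evaluating Info(D), using an illustrative class distribution of 9 "yes" and 5 "no" tuples:

# Info(D) = - sum_i p_i * log2(p_i); the class counts are illustrative.
from math import log2

counts = {"yes": 9, "no": 5}
total = sum(counts.values())
info_D = -sum((c / total) * log2(c / total) for c in counts.values())
print(round(info_D, 3))        # about 0.940 bits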
Attribute Selection Measures: Information Gain

How much more information would we still need (after the partitioning) to arrive at an exact classification? Suppose attribute A has v distinct values, partitioning D into v subsets D1, D2, ..., Dv. This amount is measured by

InfoA(D) = Σ (j = 1 to v) (|Dj| / |D|) × Info(Dj)

The term |Dj| / |D| acts as the weight of the jth partition.

InfoA(D) is the expected information required to classify a tuple from D based on the partitioning by A. The smaller the expected information (still) required, the greater the purity of the partitions.
Attribute Selection Measures: Information Gain

Information gain is defined as the difference between the original information requirement (i.e., based on just the proportion of classes) and the new requirement (i.e., obtained after partitioning on A):

Gain(A) = Info(D) − InfoA(D)

The attribute A with the highest information gain is chosen as the splitting attribute at node N.
Example
Classification by Decision Tree Induction

A (root) node N is created for the tuples in D. To find the splitting criterion for
these tuples, we must compute the information gain of each attribute.

Next, compute the expected information requirement for each attribute.
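
A hedged sketch of this computation for one candidate attribute; the per-value class counts below are illustrative rather than taken from the (image-only) slides:

# Gain(A) = Info(D) - Info_A(D); class counts per attribute value are illustrative.
from math import log2

def info(counts):                       # expected information for a class distribution
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c)

D = [9, 5]                              # overall class distribution: (yes, no)
partitions = {"youth": [2, 3], "middle_aged": [4, 0], "senior": [3, 2]}  # split on 'age'

info_D = info(D)
info_age = sum(sum(p) / sum(D) * info(p) for p in partitions.values())
print(round(info_D - info_age, 3))      # Gain(age), about 0.246 bits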


