
ML ASSIGNMENT-2

UNIT 3

1. What is pruning? Explain with an example.

● Pruning is a technique in machine learning and search algorithms that reduces the size of decision trees by removing sections of the tree that provide little power to classify instances.
● Pruning reduces the complexity of the final classifier, and hence improves
predictive accuracy by the reduction of overfitting.
● Pruning can be done in two ways:
○ Prepruning
○ Postpruning

Prepruning:

● Prepruning halts the construction of a subtree at some node after checking some measure.
● These measures can be information gain, Gini index, etc.
● If partitioning the tuples at a node would result in a split that falls below a prespecified threshold, then pruning is done.

1. Minimum number of objects pruning:

● In this method of pruning, a minimum number of objects is specified as the threshold value.
● In the Weka interface, there is a parameter in J48 (the Weka implementation of C4.5) called minNumObj that is set to specify this threshold value.
● Whenever a split yields a child leaf that represents fewer than minNumObj examples from the data set, the parent node and child nodes are compressed into a single node.
● As shown in the figure, minNumObj is specified as 30; when inducing the tree, a candidate split yields child nodes with 34 and 9 examples. Since 9 is below the threshold of 30, the node is not split (pruning is done).
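As a rough Python analogy (not the Weka J48 implementation itself), scikit-learn's min_samples_leaf parameter plays a role similar to minNumObj: a split is only kept if every child leaf would contain at least the specified number of training examples. The dataset below is synthetic.

```python
# Hypothetical illustration: min_samples_leaf behaves like a minimum-objects
# pre-pruning threshold -- splits producing smaller leaves are not made.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

unpruned = DecisionTreeClassifier(random_state=0).fit(X, y)
prepruned = DecisionTreeClassifier(min_samples_leaf=30, random_state=0).fit(X, y)

print("unpruned leaves:", unpruned.get_n_leaves())
print("pre-pruned leaves:", prepruned.get_n_leaves())
```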

2. Chi-square pruning:

● This approach to pruning applies a statistical test to the data to determine whether a split on some feature Xk is statistically significant, in terms of the effect of the split on the distribution of classes in the partitions induced by the split.
● The null hypothesis is that the data is independently distributed according to a distribution consistent with that at the current node.

Consider an example using the statistic for Pearson's chi-square test, which is used here as a test of independence.

● To think about this, suppose that at the current node the data is split
10:10 between negative and positive examples.
● Furthermore, suppose there are 8 instances for which Xk is false, and
12 for which Xk is true. We’d like to understand whether a split that
generates labeled data 3:5 (on the Xk false branch) and 7:5 (on the Xk
true branch) could occur simply by chance.
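A minimal sketch of the worked example above, assuming SciPy's chi2_contingency is used for Pearson's chi-square test of independence; the counts come directly from the 10:10 parent node and the 3:5 / 7:5 branches.

```python
# Contingency table for the candidate split on Xk.
from scipy.stats import chi2_contingency

observed = [[3, 5],   # Xk = False: 3 negative, 5 positive (8 examples)
            [7, 5]]   # Xk = True:  7 negative, 5 positive (12 examples)

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(f"chi2 = {chi2:.3f}, p-value = {p:.3f}")  # chi2 ~ 0.833, p ~ 0.36
# A large p-value means such a split could easily arise by chance, so the
# null hypothesis of independence is not rejected and the split is pruned.
```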

Postpruning:
● Grow the decision tree to its entirety.
● Trim the nodes of the decision tree in a bottom-up fashion; postpruning is done by replacing a node with a leaf.
● If the error improves after trimming, replace the subtree by a leaf node.

1. Reduced Error Pruning:

● This method was proposed by Quinlan. It is the simplest and most understandable method in decision tree pruning.
● This method considers each of the decision nodes in the tree to be a candidate for pruning; pruning consists of removing the subtree rooted at that node, making it a leaf node.
● The available data is divided into three parts: the training examples, the validation examples used for pruning the tree, and a set of test examples used to provide an unbiased estimate of accuracy over future unseen examples.

2. Error complexity pruning:

● Error complexity pruning is concerned with calculating the error cost of a node. It finds the error complexity at each node. The error cost of a node is calculated using the following equation:

R(t) = r(t) * p(t)

r(t) is the error rate of the node, given as:

r(t) = number of examples misclassified in the node / number of all examples in the node

p(t) is the probability of occurrence of the node:

p(t) = number of examples in the node / total number of examples
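A small sketch of the error-cost calculation in plain Python; the example counts (40 examples in the node, 10 of them misclassified, 200 examples in total) are made up for illustration.

```python
# Error cost of a single node: R(t) = r(t) * p(t), using the definitions above.
def node_error_cost(misclassified_in_node, examples_in_node, total_examples):
    r_t = misclassified_in_node / examples_in_node   # node error rate
    p_t = examples_in_node / total_examples          # probability of reaching the node
    return r_t * p_t

# e.g. a node holding 40 of 200 training examples, 10 of them misclassified
print(node_error_cost(10, 40, 200))  # 0.25 * 0.2 = 0.05
```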

The method consists of the following steps:

❖ The error-complexity measure α is computed for each node.
❖ The node with the minimum α is pruned.
❖ The above is repeated and a forest of pruned trees is formed.
❖ The tree with the best accuracy is selected.

3. Minimum Error pruning:

The method consists of the following steps:

● At each non-leaf node in the tree, calculate the expected error rate if that subtree is pruned.
● Calculate the expected error rate for that node if the subtree is not pruned.
● If pruning the node leads to a greater expected error rate, then keep the subtree; otherwise, prune it.

4. Cost-based pruning:

This is one of the postpruning techniques. In this method, not only the error rate at each node but also a cost is considered.

That is, for pruning the decision tree, both the error rate and the cost of deciding on the selection of one or more class-label attributes are considered.

Here an example is explained for a healthy-or-sick classification.


2. What is a regression tree? What are the different ways to construct a decision tree?

● A decision tree is a flowchart-like structure in which each internal node represents a test on a feature, each leaf node represents a class label, and branches represent conjunctions of features.
● Decision trees where the target values can take continuous values (typically real values) are called regression trees.
● Decision trees where the target variables cannot take continuous values are called classification trees.

Example:
● The algorithms used in decision trees:
○ ID3 (Iterative Dichotomiser 3)
○ CART (Classification and Regression Trees)
○ C4.5
○ CHAID (Chi-squared Automatic Interaction Detection; performs multi-level splits)
○ MARS (Multivariate Adaptive Regression Splines)
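As a quick, hedged illustration of a regression tree, the sketch below uses scikit-learn's DecisionTreeRegressor (its CART implementation) on synthetic data with a continuous target; each leaf predicts a real value.

```python
# A regression tree: the target is continuous, and each leaf predicts the
# mean of the training targets that reach it (piecewise-constant function).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

reg_tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(reg_tree.predict([[1.0], [4.0]]))  # real-valued predictions
```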

ID3 Algo:

● This algorithm builds decision trees using a top-down greedy search approach through the space of possible branches with no backtracking. A greedy algorithm, as the name suggests, always makes the choice that seems to be the best at that moment.

Steps in ID3:

● It begins with the original set S as the root.
● On each iteration, the algorithm iterates through every unused attribute of set S and calculates the entropy (H) and information gain (IG) of this attribute (a small sketch of this computation follows these steps).
● It then selects the attribute which has the smallest entropy or largest information gain.
● The set S is split by the selected attribute to produce subsets of the data.
● The algorithm continues to recurse on each subset, considering only attributes never selected before.
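A minimal sketch of the entropy and information-gain computation referred to in the steps above, assuming a small categorical toy attribute ("windy") and class label ("play") invented for illustration.

```python
# Entropy H(S) and information gain IG(S, A) for a categorical attribute.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    n = len(labels)
    remainder = 0.0
    for v in set(attribute_values):                  # weighted entropy of each branch
        subset = [lab for a, lab in zip(attribute_values, labels) if a == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Toy data: does "windy" help predict "play"?
windy = ["no", "no", "yes", "yes", "no", "yes"]
play  = ["yes", "yes", "no", "no", "yes", "yes"]
print(information_gain(windy, play))  # ~0.459 bits
```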

CART:

● The CART algorithm is a classification algorithm for building a decision tree based on the Gini impurity index as a splitting criterion.
● CART builds a binary tree by splitting each node into two child nodes. The algorithm works repeatedly in three steps:
○ Find each feature's best split. For each feature with k different values there exist k-1 possible splits. Find the split which maximizes the splitting criterion.
○ Find the node's best split. Among the best splits from step 1, find the one which maximizes the splitting criterion.
○ Split the node using the best node split from step 2, until a stopping criterion is satisfied.
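A rough sketch of the CART splitting criterion described above: for a numeric feature with k distinct values, the k-1 candidate thresholds are evaluated and the one giving the largest decrease in Gini impurity is chosen. The function and data names are illustrative, not part of any library.

```python
# Gini impurity and a brute-force search over the k-1 candidate thresholds.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    order = sorted(set(values))
    parent, n, best = gini(labels), len(labels), None
    for lo, hi in zip(order, order[1:]):              # k-1 candidate splits
        thr = (lo + hi) / 2
        left = [lab for v, lab in zip(values, labels) if v <= thr]
        right = [lab for v, lab in zip(values, labels) if v > thr]
        gain = parent - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)
        if best is None or gain > best[1]:
            best = (thr, gain)
    return best

print(best_split([2.0, 3.0, 6.5, 7.0], ["a", "a", "b", "b"]))  # (4.75, 0.5)
```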

UNIT-4
1. What are the basic concepts of hypothesis testing?

Hypothesis testing is a statistical technique that is used in a variety of situations. Though the technical details differ from situation to situation, all hypothesis tests use the same core set of terms and concepts. The following descriptions of common terms and concepts refer to a hypothesis test in which the means of two populations are being compared.

Null hypothesis
The null hypothesis is a clear statement about the relationship between two (or more) statistical objects. These objects may be measurements, distributions, or categories. Typically, the null hypothesis, as the name implies, states that there is no relationship.

In the case of two population means, the null hypothesis might state that the means of the two populations are equal.

Alternative hypothesis
Once the null hypothesis has been stated, it is easy to construct the alternative hypothesis. It is essentially the statement that the null hypothesis is false. In our example, the alternative hypothesis would be that the means of the two populations are not equal.

Significance
The significance level is a measure of the statistical strength of the hypothesis test. It is often characterized as the probability of incorrectly concluding that the null hypothesis is false.

The significance level is something that you should specify up front. In applications, the significance level is typically one of three values: 10%, 5%, or 1%. A 1% significance level represents the strongest test of the three: it requires stronger evidence before rejecting the null hypothesis than a 10% level does.

Power
Related to significance, the power of a test measures the probability of correctly rejecting the null hypothesis when it is in fact false. Power is not something that you can choose. It is determined by several factors, including the significance level you select and the size of the difference between the things you are trying to compare.

Unfortunately, significance and power are inversely related: making the test more stringent (for example, moving from a 5% to a 1% significance level) decreases power. This makes it difficult to design experiments that have both very high significance and high power.

Test statistic
The test statistic is a single measure that captures the statistical nature of the relationship between observations you are dealing with. The test statistic depends fundamentally on the number of observations that are being evaluated. It differs from situation to situation.

Distribution of the test statistic
The whole notion of hypothesis testing rests on the ability to specify (exactly or approximately) the distribution that the test statistic follows. In the case of this example, the difference between the means will be approximately normally distributed (assuming there are a relatively large number of observations).

One-tailed vs. two-tailed tests
Depending on the situation, you may want (or need) to employ a one- or two-tailed test. These tails refer to the right and left tails of the distribution of the test statistic. A two-tailed test allows for the possibility that the test statistic is either very large or very small (negative values count as small). A one-tailed test allows for only one of these possibilities.

In an example where the null hypothesis states that the two population means are equal, you need to allow for the possibility that either one could be larger than the other. The test statistic could be either positive or negative. So, you employ a two-tailed test.

The null hypothesis might have been slightly different, namely that the mean of population 1 is no larger than the mean of population 2 (so the alternative is that it is larger). In that case, you don't need to account statistically for the situation where the first mean is smaller than the second. So, you would employ a one-tailed test.

Critical value
The critical value in a hypothesis test is based on two things: the distribution of the test statistic and the significance level. The critical value(s) refer to the point(s) in the test statistic distribution that give the tails of the distribution an area (meaning probability) exactly equal to the significance level that was chosen.

Decision
Your decision to reject or accept the null hypothesis is based on comparing the test statistic to the critical value. If the test statistic exceeds the critical value, you should reject the null hypothesis. In this case, you would say that the difference between the two population means is significant. Otherwise, you accept the null hypothesis.

P-value
The p-value of a hypothesis test gives you another way to evaluate the null hypothesis. The p-value is the smallest significance level at which your particular test statistic would justify rejecting the null hypothesis. For example, if you have chosen a significance level of 5%, and the p-value turns out to be .03 (or 3%), you would be justified in rejecting the null hypothesis.
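A minimal end-to-end illustration of the concepts above, assuming SciPy is available: a two-tailed, two-sample t-test comparing the means of two populations, using synthetic data.

```python
# Two-sample t-test: null hypothesis is that the two population means are equal.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample1 = rng.normal(loc=10.0, scale=2.0, size=50)
sample2 = rng.normal(loc=11.0, scale=2.0, size=50)

alpha = 0.05                                          # significance level chosen up front
t_stat, p_value = stats.ttest_ind(sample1, sample2)   # two-tailed by default

print(f"test statistic = {t_stat:.3f}, p-value = {p_value:.4f}")
if p_value <= alpha:
    print("Reject the null hypothesis: the means differ significantly.")
else:
    print("Fail to reject the null hypothesis.")
```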

2. Explain the ROC (Receiver Operating Characteristic) curve.

In many applications, there's a need to decide between two alternatives. In the
military, radar operators look at approaching objects and decide if it's a
threat. Doctors look at an image and decide if it's a tumour. For facial
recognition, an algorithm has to decide if it's a match. In Machine Learning,
we call this binary classification while in radar we call it signal detection.
The decision depends on a threshold. Receiver Operating Characteristic
(ROC) Curve is a graphical plot that helps us see the performance of a binary
classifier or diagnostic test when the threshold is varied. Using the ROC
Curve, we can select a threshold that best suits our application. The idea is to
maximize correct classification or detection while minimizing false positives.
ROC Curve is also useful when comparing alternative classifiers or
diagnostic tests.
Let's take a binary classification problem that has two distributions: one for positives and one for negatives. To classify subjects into one of these two classes, we select a threshold. Anything above the threshold is classified as positive. The accuracy of the classifier depends directly on the threshold we use. The ROC curve is plotted by varying the threshold and recording the classifier's performance at each threshold.

The ROC curve plots True Positive Rate (TPR) versus False Positive Rate (FPR). TPR is also called recall or sensitivity; it is the probability that we detect a signal when it is present. FPR is the complement of specificity (1 - specificity); it is the probability that we detect a signal when it is not present. Being based only on recall and specificity, the ROC curve is independent of prevalence, that is, how common the condition is in the population.

An ideal classifier has an ROC curve that rises sharply from the origin, reaching a high TPR before the FPR begins to rise. Each point on the ROC curve represents the performance of the classifier at one threshold value.
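A hedged sketch of how an ROC curve is obtained by varying the threshold, assuming scikit-learn's roc_curve and roc_auc_score and synthetic classifier scores.

```python
# ROC curve: one (FPR, TPR) point per decision threshold.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = np.concatenate([np.zeros(100), np.ones(100)])   # negatives, positives
scores = np.concatenate([rng.normal(0.0, 1.0, 100),      # scores for negatives
                         rng.normal(1.5, 1.0, 100)])     # scores for positives

fpr, tpr, thresholds = roc_curve(y_true, scores)
print("AUC =", round(roc_auc_score(y_true, scores), 3))
# Plotting tpr against fpr gives the ROC curve; each (fpr[i], tpr[i]) pair is
# the classifier's performance at thresholds[i].
```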

Which application domains are using ROC curves?

ROC started in radar applications. It was later applied in many other domains including psychology, medicine, radiology, biometrics, and meteorology. More recently, it's being used in machine learning and data mining.

In medical practice, it's used for assessing diagnostic biomarkers, imaging tests or even risk assessment. It's been used to analyse information processing in the brain during sensory difference testing. In bioinformatics and computational genomics, ROC analysis is being applied; in particular, it's used to classify biological sequences and protein structures.

ROC has been used to describe the performance of instruments built to detect explosives. In engineering, it's been used to evaluate the accuracy of pipeline reliability analysis and to predict the failure threshold value.
ROC curves beyond binary classification

The extension of ROC curves for classification problems with more than two classes has always been cumbersome, as the degrees of freedom increase quadratically with the number of classes, and the ROC space has c(c - 1) dimensions, where c is the number of classes. Some approaches have been made for the particular case with three classes (three-way ROC). The calculation of the volume under the ROC surface (VUS) has been analyzed and studied as a performance metric for multi-class problems. However, because of the complexity of approximating the true VUS, some other approaches based on an extension of AUC are more popular as an evaluation metric.

Given the success of ROC curves for the assessment of classification models, the
extension of ROC curves for other supervised tasks has also been investigated.
Notable proposals for regression problems are the so-called regression error
characteristic (REC) Curves and the Regression ROC (RROC) curves. In the latter,
RROC curves become extremely similar to ROC curves for classification, with the
notions of asymmetry, dominance and convex hull. Also, the area under RROC
curves is proportional to the error variance of the regression model.
UNIT-5
1. Explain various ensemble methods

Ensemble methods are meta-algorithms that combine several machine learning techniques into one predictive model in order to decrease variance (bagging), decrease bias (boosting), or improve predictions (stacking).

● Bagging and boosting are commonly used terms among data enthusiasts around the world. But what exactly do bagging and boosting mean, and how do they help? The following sections describe bagging and boosting and how they are used.
● Both bagging and boosting are part of a family of statistical techniques called ensemble methods.
● Bagging is used to decrease the model's variance.
● Boosting is used to decrease the model's bias.

Bagging
● The idea behind bagging is combining the results of multiple models (for
instance, all decision trees) to get a generalized result. Bootstrapping is a
sampling technique in which we create subsets of observations from the
original dataset, with replacement. The size of the subsets is the same as the
size of the original set.
● The bagging (or bootstrap aggregating) technique uses these subsets (bags) to get a fair idea of the distribution of the complete set. The size of the subsets created for bagging may be less than that of the original set.

1. Multiple subsets are created from the original dataset, selecting observations
with replacement.
2. A base model (weak model) is created on each of these subsets.
3. The models run in parallel and are independent of each other.
4. The final predictions are determined by combining the predictions from all
the models.
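A minimal sketch of the four bagging steps above: bootstrap samples drawn with replacement, one independent base model per sample, and predictions combined by majority vote. The code is illustrative, not a library API, and uses synthetic data.

```python
# Hand-rolled bagging with decision trees as the base (weak) models.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
rng = np.random.default_rng(0)

models = []
for _ in range(10):                                   # 1. bootstrap subsets
    idx = rng.integers(0, len(X), size=len(X))        #    sampled with replacement
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))  # 2-3. independent base models

votes = np.array([m.predict(X) for m in models])      # 4. combine predictions
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)  #    by majority vote
print("training accuracy:", (bagged_pred == y).mean())
```
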
BOOSTING
● Boosting is a sequential process, where each subsequent model attempts to correct the errors of the previous model.
● The succeeding models are dependent on the previous model.
● To understand the way boosting works, go through the following steps:
1. A subset is created from the original dataset.
2. Initially, all data points are given equal weights.
3. A base model is created on this subset.
4. This model is used to make predictions on the whole dataset.
5. Errors are calculated using the actual values and predicted values.
6. The observations which are incorrectly predicted are given higher weights. (Here, the three misclassified blue-plus points in the figure will be given higher weights.)
7. Another model is created and predictions are made on the dataset. (This model tries to correct the errors from the previous model.)
8. Similarly, multiple models are created, each correcting the errors of the previous model.
9. The final model (strong learner) is the weighted mean of all the models (weak learners).

Thus, the boosting algorithm combines a number of weak learners to form a strong learner. The individual models would not perform well on the entire dataset, but they work well for some part of the dataset. Thus, each model actually boosts the performance of the ensemble.
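The reweighting scheme described above is essentially what AdaBoost does; a hedged illustration using scikit-learn's AdaBoostClassifier, whose default weak learner is a depth-1 decision tree (a stump), on synthetic data.

```python
# Sequential boosting of weak learners (decision stumps by default).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=300, random_state=0)

boosted = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print("training accuracy:", boosted.score(X, y))
```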

RANDOM FORESTS
● Random Forest Models can be thought of as BAGGing, with a slight tweak.
● When deciding where to split and how to make decisions, BAGGed Decision
Trees have the full disposal of features to choose from. Therefore, although the
bootstrapped samples may be slightly different, the data is largely going to
break off at the same features throughout each model.
● In contrast, Random Forest models decide where to split based on a random
selection of features. Rather than splitting at similar features at each node
throughout, Random Forest models implement a level of differentiation
because each tree will split based on different features.
● This level of differentiation provides a greater ensemble to aggregate over,
ergo producing a more accurate predictor. Refer to the image for a better
understanding.
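A hedged sketch contrasting a random forest with plain bagging: the max_features setting restricts each split to a random subset of the features, which is the "level of differentiation" described above. The data is synthetic.

```python
# Random forest: bagged trees plus random feature selection at each split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",   # each split considers only sqrt(n_features) features
    random_state=0,
).fit(X, y)

print("training accuracy:", forest.score(X, y))
```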
GRADIENT BOOSTING
● Gradient boosting is a machine learning technique for regression and
classification problems, which produces a prediction model in the form of an
ensemble of weak prediction models, typically decision trees.
● The objective of any supervised learning algorithm is to define a loss function
and minimize it. Let’s see how math works out for the Gradient Boosting
algorithm. Say we have mean squared error (MSE) as the loss, defined as

Loss = MSE = Σ (y_i − y_i^p)², where y_i is the i-th actual value and y_i^p is the i-th predicted value.
● We want our predictions, such that our loss function (MSE) is minimum. By
using gradient descent and updating our predictions based on a learning
rate, we can find the values where MSE is minimum.

● So, we are basically updating the predictions such that the sum of our
residuals is close to 0 (or minimum) and predicted values are sufficiently close
to actual values.
● So, the intuition behind gradient boosting algorithms is to repetitively
leverage the patterns in residuals and strengthen a model with weak
predictions and make it better. Once we reach a stage where residuals do not
have any pattern that could be modeled, we can stop modeling residuals
(otherwise it might lead to overfitting). Algorithmically, we are minimizing
our loss function, such that test loss reaches its minimum.
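A minimal sketch of the residual-fitting intuition above, assuming MSE loss: each small regression tree is fit to the current residuals and the predictions are updated with a learning rate. Everything here (data, depths, rates) is illustrative.

```python
# Gradient boosting by hand for regression with MSE loss.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
pred = np.full_like(y, y.mean())          # start from a constant prediction
for _ in range(100):
    residuals = y - pred                  # negative gradient of MSE w.r.t. pred
    stump = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    pred += learning_rate * stump.predict(X)

print("final training MSE:", np.mean((y - pred) ** 2))
```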

STACKING

● Stacking is another ensemble model, where a new model is trained from the
combined predictions of two (or more) previous models.
● The predictions from the models are used as inputs for each sequential layer,
and combined to form a new set of predictions. These can be used on
additional layers, or the process can stop here with a final result.
● Ensemble stacking can be referred to as blending, because all the numbers are
blended to produce a prediction or classification.
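A hedged sketch of stacking using scikit-learn's StackingClassifier: the predictions of two base models are blended by a logistic-regression meta-model. The choice of base models is arbitrary and the data is synthetic.

```python
# Stacking: a meta-model is trained on the base models' predictions.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

stacked = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(max_depth=3)),
                ("svm", SVC(probability=True))],
    final_estimator=LogisticRegression(),   # blends the base predictions
).fit(X, y)

print("training accuracy:", stacked.score(X, y))
```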

2. Difference between Bagging and Boosting.

BAGGING

1. Simplest way of combining predictions that belong to the same type.
2. Aims to decrease variance, not bias.
3. Each model receives equal weight.
4. Each model is built independently.
5. Different training data subsets are randomly drawn with replacement from the entire training dataset.
6. Bagging tries to solve the over-fitting problem.
7. If the classifier is unstable (high variance), then apply bagging.
8. Example: Random Forest.

BOOSTING

1. A way of combining predictions that belong to different types.
2. Aims to decrease bias, not variance.
3. Models are weighted according to their performance.
4. New models are influenced by the performance of previously built models.
5. Every new subset contains the elements that were misclassified by previous models.
6. Boosting tries to reduce bias.
7. If the classifier is stable and simple (high bias), then apply boosting.
8. Example: Gradient Boosting.
