ML Assignment-2: Unit 3
UNIT 3
Prepruning:
2. Chi-square pruning:
Consider one example. The Pearson chi-square statistic will be used here as a
test of independence.
● To think about this, suppose that at the current node the data is split
10:10 between negative and positive examples.
● Furthermore, suppose there are 8 instances for which Xk is false, and
12 for which Xk is true. We’d like to understand whether a split that
generates labeled data 3:5 (on the Xk false branch) and 7:5 (on the Xk
true branch) could occur simply by chance.
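The chance-split question above can be checked numerically. The sketch below computes Pearson's chi-square statistic for the 3:5 / 7:5 split by hand; the 3.841 cut-off is the standard chi-square critical value for one degree of freedom at the 5% level (a table value, not from the notes):

```python
# Contingency table for the split at the 10:10 node from the example.
observed = [[3, 5],   # Xk false branch: 3 negative, 5 positive
            [7, 5]]   # Xk true branch:  7 negative, 5 positive

row = [sum(r) for r in observed]          # branch sizes: [8, 12]
col = [sum(c) for c in zip(*observed)]    # class totals: [10, 10]
total = sum(row)                          # 20 examples at the node

# Pearson chi-square: sum of (observed - expected)^2 / expected.
chi2 = sum((observed[i][j] - row[i] * col[j] / total) ** 2
           / (row[i] * col[j] / total)
           for i in range(2) for j in range(2))

# Critical value for df = 1 at the 5% significance level is about 3.841.
could_be_chance = chi2 < 3.841
```

Since the statistic (about 0.83) is far below the critical value, this split could easily have occurred by chance, so prepruning would not split on Xk here.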
Postpruning:
● Grow the decision tree to its entirety.
● Trim the nodes of the decision tree in a bottom-up fashion; post-pruning is
done by replacing a node with a leaf.
● If the error improves after trimming, replace the subtree by a leaf node.
R(t) = r(t) * p(t), where r(t) is the misclassification rate at node t and p(t) is
the proportion of the data that reaches node t.
● At each non-leaf node in the tree, calculate the expected error rate if that
subtree is pruned.
● Calculate the expected error rate for that node if the subtree is not pruned.
● If pruning the node leads to a greater expected error rate, then keep the
subtree; otherwise, prune it.
This is one of the post-pruning techniques. In this method, not only the error
rate at each node is considered but also a cost. That is, for pruning the decision
tree, both the error rate and the cost of deciding on one or more class-label
attributes are considered.
● The algorithms used in decision trees:
○ ID3 (Iterative Dichotomiser 3)
○ CART (Classification And Regression Trees)
○ C4.5
○ CHAID (CHi-squared Automatic Interaction Detection; performs
multi-level splits)
○ MARS (Multivariate Adaptive Regression Splines)
ID3 Algo:
● This algorithm builds decision trees using a top-down greedy search
through the space of possible branches, with no backtracking. A greedy
algorithm, as the name suggests, always makes the choice that seems
best at that moment.
Steps in ID3 :
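At each node, ID3 picks the attribute with the highest information gain (entropy reduction). A minimal sketch of that computation, reusing the 10:10 node and the 3:5 / 7:5 split from the chi-square example above (the helper names are mine):

```python
import math
from collections import Counter

def entropy(labels):
    # H = -sum p * log2(p) over the class proportions
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    # parent entropy minus the size-weighted entropy of the child groups
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)

labels = ['+'] * 10 + ['-'] * 10          # the 10:10 node
left   = ['+'] * 3 + ['-'] * 5            # Xk false branch (3:5)
right  = ['+'] * 7 + ['-'] * 5            # Xk true branch (7:5)
gain = information_gain(labels, [left, right])
```

The gain here is small (about 0.03 bits), which agrees with the chi-square test's verdict that this split is barely better than chance.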
CART :
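CART grows binary trees and, for classification, typically chooses splits by the decrease in Gini impurity rather than entropy. A minimal sketch on the same example split (helper names are mine):

```python
from collections import Counter

def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

parent = ['+'] * 10 + ['-'] * 10   # the 10:10 node
left   = ['+'] * 3 + ['-'] * 5     # Xk false branch
right  = ['+'] * 7 + ['-'] * 5     # Xk true branch

n = len(parent)
weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
decrease = gini(parent) - weighted   # CART prefers the split maximizing this
```

As with information gain, the impurity decrease (about 0.021) is tiny, so CART would rank this candidate split poorly.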
UNIT-4
1. What are the basic concepts of Hypothesis testing?
Though the technical details differ from situation to situation, all hypothesis tests use
the same core set of terms and concepts. The following descriptions of common
terms and concepts refer to a hypothesis test in which the means of two populations
are being compared.
Null hypothesis
The null hypothesis is a clear statement about the relationship between two
(or more) sets of data. In our example, the null hypothesis would be that the
means of the two populations are equal.
Alternative hypothesis
Once the null hypothesis has been stated, it is easy to construct the
alternative hypothesis: it is essentially the statement that the null hypothesis is
false. In our example, the alternative hypothesis would be that the means of
the two populations are not equal.
Significance
The significance level is a measure of the statistical strength of the hypothesis
test. It can be interpreted as the probability of incorrectly rejecting a true null
hypothesis. In many applications, the significance level is typically one of three
values: 10%, 5%, or 1%. A 1% significance level represents the strongest test of
the three; results at that level are sometimes described as highly significant.
Power
Related to significance, the power of a test measures the probability of
correctly rejecting the null hypothesis when it is false. Power is not something
you can choose directly; it is determined by factors such as the
significance level you select and the size of the difference between the things
you are comparing.
Test statistic
The test statistic is a single measure that captures the statistical nature of the
relationship between the observations you are dealing with. The test statistic
is computed from the observed data, and you know (at least
approximately) the distribution that the test statistic follows. In the case of this
example, you must also decide between a two-tailed and a one-
tailed test. These tails refer to the right and left tails of the distribution of the
test statistic. A two-tailed test allows for the possibility that the test statistic is
either very large or very small (negative is small). A one-tailed test allows for
only one of these possibilities.
In an example where the null hypothesis states that the two population means
are equal, you need to allow for the possibility that either one could be larger
than the other. The test statistic could be either positive or negative. So, you
would use a two-tailed test.
The null hypothesis might have been slightly different, namely that the mean
of population 1 is larger than the mean of population 2. In that case, you don’t
need to account statistically for the situation where the first mean is smaller
than the second, so a one-tailed test would be appropriate.
Critical value
The critical value in a hypothesis test is based on two things: the distribution of
the test statistic and the significance level. The critical value(s) refer to the
points in the test statistic distribution that give the tails of the distribution an
area (meaning probability) exactly equal to the significance level that was
chosen.
Decision
Your decision to reject or accept the null hypothesis is based on comparing
the test statistic to the critical value. If the test statistic exceeds the critical
value, you should reject the null hypothesis. In this case, you would say that
the difference between the two population means is statistically significant;
otherwise, you accept the null hypothesis.
P-value
The p-value of a hypothesis test gives you another way to evaluate the null
hypothesis: it is the probability, assuming the null hypothesis is true, of
observing a test statistic at least as extreme as the one actually observed. For
example, if you have chosen a significance level of 5%, and the p-value turns
out to be .03 (or 3%), you would be justified in rejecting the null hypothesis.
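The concepts above can be tied together in a short sketch: a two-tailed z-test comparing two means. The sample means and the standard error are made-up numbers; 1.96 is the standard two-tailed critical value at the 5% level:

```python
import math

def normal_cdf(x):
    # standard normal CDF via the error function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# hypothetical summary statistics for the two samples
mean1, mean2 = 5.2, 4.8
se_diff = 0.15                           # assumed standard error of (mean1 - mean2)

z = (mean1 - mean2) / se_diff            # test statistic
p_value = 2 * (1 - normal_cdf(abs(z)))   # two-tailed p-value
critical = 1.96                          # critical value, 5% significance
reject_null = abs(z) > critical          # decision
```

Here z is about 2.67, so it exceeds the critical value and the p-value (about 0.008) falls below 5%: both routes lead to rejecting the null hypothesis.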
ROC curves
A scoring classifier produces two score distributions: one
for positives and one for negatives. To classify subjects into one of
the two classes, a threshold is applied to the score. The
ROC curve plots True Positive Rate (TPR) versus False Positive
Rate (FPR) as this threshold is varied. An ideal classifier will have an
ROC curve that rises sharply from the origin: FPR rises only when TPR is
already high. Each point on the ROC curve represents the performance of the
classifier at one threshold value. ROC curves are used in many areas, for
example in data mining, in the prediction of protein structures, and to improve
the accuracy of pipeline reliability analysis and predict the failure threshold value.
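Each ROC point at a given threshold can be computed directly from scores and labels. A minimal sketch with a made-up scored sample (function name and data are illustrative):

```python
def roc_points(scores, labels, thresholds):
    """One (FPR, TPR) point per threshold: predict positive when score >= t."""
    P = sum(labels)            # number of positives
    N = len(labels) - P        # number of negatives
    pts = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        pts.append((fp / N, tp / P))
    return pts

scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1,   1,   0,   1,   0,   0]
points = roc_points(scores, labels, thresholds=[0.95, 0.7, 0.5, 0.35, 0.0])
```

Lowering the threshold moves along the curve from (0, 0) toward (1, 1), trading false positives for true positives.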
ROC curves beyond binary classification
The extension of ROC curves for classification problems with more than two classes
has always been cumbersome, as the degrees of freedom increase quadratically with
the number of classes, and the ROC space has dimensions, where
is the number of classes. Some approaches have been made for the particular case
with three classes (three-way ROC). The calculation of the volume under the ROC
surface (VUS) has been analyzed and studied as a performance metric for multi-class
problems.However, because of the complexity of approximating the true VUS, some
other approaches based on an extension of AUC are more popular as an evaluation
metric.
Given the success of ROC curves for the assessment of classification models, the
extension of ROC curves for other supervised tasks has also been investigated.
Notable proposals for regression problems are the so-called regression error
characteristic (REC) Curves and the Regression ROC (RROC) curves. In the latter,
RROC curves become extremely similar to ROC curves for classification, with the
notions of asymmetry, dominance and convex hull. Also, the area under RROC
curves is proportional to the error variance of the regression model.
UNIT-5
1. Explain various ensemble methods
● Bagging and boosting are commonly used terms by various data enthusiasts
around the world. But what exactly bagging and boosting mean and how does
it help the data science world. From this post you will learn about bagging,
boosting and how they are used.
● Both bagging and boosting are part of a family of statistical techniques called
ensemble methods.
● Bagging is used to decrease the model’s variance.
● Boosting is used to decrease the model’s bias.
Bagging
● The idea behind bagging is combining the results of multiple models (for
instance, all decision trees) to get a generalized result. Bootstrapping is a
sampling technique in which we create subsets of observations from the
original dataset, with replacement. The size of the subsets is the same as the
size of the original set.
● The bagging (Bootstrap Aggregating) technique uses these subsets (bags) to get
a fair idea of the distribution (complete set). In general, the size of the subsets
created for bagging may be less than that of the original set.
1. Multiple subsets are created from the original dataset, selecting observations
with replacement.
2. A base model (weak model) is created on each of these subsets.
3. The models run in parallel and are independent of each other.
4. The final predictions are determined by combining the predictions from all
the models.
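The four steps above can be sketched in a few lines; the "base model" here is deliberately trivial (it just predicts its bag's majority label), and the data is a made-up toy set:

```python
import random
from collections import Counter

random.seed(0)
data = [(i, 1) for i in range(9)] + [(9, 0)]   # toy (x, label) points, mostly class 1

def bootstrap_sample(data):
    # step 1: sample with replacement, same size as the original set
    return [random.choice(data) for _ in data]

def train_majority(bag):
    # step 2: a weak "model" that always predicts the bag's majority label
    label = Counter(y for _, y in bag).most_common(1)[0][0]
    return lambda x: label

# step 3: independent models, one per bootstrap subset
models = [train_majority(bootstrap_sample(data)) for _ in range(5)]

def bagging_predict(models, x):
    # step 4: combine predictions by majority vote
    return Counter(m(x) for m in models).most_common(1)[0][0]

prediction = bagging_predict(models, x=0)
```

A real bagging ensemble would train decision trees (or another learner) on each bag, but the sample-train-vote skeleton is the same.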
BOOSTING
● Boosting is a sequential process, where each subsequent model attempts to
correct the errors of the previous model.
● The succeeding models are dependent on the previous model.
● To understand how boosting works, go through the following steps:
1. A subset is created from the original dataset.
2. Initially, all data points are given equal weights.
3. A base model is created on this subset.
4. This model is used to make predictions on the whole dataset.
5. Errors are calculated using the actual values and the predicted values.
6. The observations that were predicted incorrectly are given higher weights.
7. Another model is created, which tries to correct the errors of the previous
model, and predictions are made again on the whole dataset.
8. Similarly, multiple models are created sequentially, each correcting the errors
of the previous one.
9. The final model (strong learner) is the weighted mean of all the models (weak
learners).
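One round of the weighting loop above can be sketched in the AdaBoost style; the dataset, the decision-stump weak learner, and the 4.5 threshold are all made up for illustration:

```python
import math

X = [1, 2, 3, 4, 5, 6]
y = [1, 1, 1, -1, -1, -1]          # labels in {-1, +1}
w = [1 / len(X)] * len(X)          # step 2: equal weights

def stump(threshold):
    # weak learner: predict +1 below the threshold, -1 above
    return lambda x: 1 if x < threshold else -1

h = stump(4.5)                                                # step 3: base model
err = sum(wi for xi, yi, wi in zip(X, y, w) if h(xi) != yi)   # step 5: weighted error
alpha = 0.5 * math.log((1 - err) / err)                       # this model's weight
# step 6: misclassified points (here x = 4) gain weight, correct ones lose it
w = [wi * math.exp(-alpha * yi * h(xi)) for xi, yi, wi in zip(X, y, w)]
s = sum(w)
w = [wi / s for wi in w]                                      # renormalize
```

Steps 7-9 repeat this with new stumps on the reweighted data; the final prediction is sign(sum of alpha * h(x)) over all stumps.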
RANDOM FORESTS
● Random Forest models can be thought of as bagging, with a slight tweak.
● When deciding where to split and how to make decisions, BAGGed Decision
Trees have the full disposal of features to choose from. Therefore, although the
bootstrapped samples may be slightly different, the data is largely going to
break off at the same features throughout each model.
● In contrast, Random Forest models decide where to split based on a random
selection of features. Rather than splitting at similar features at each node
throughout, Random Forest models implement a level of differentiation
because each tree will split based on different features.
● This level of differentiation provides a greater ensemble to aggregate over,
and therefore produces a more accurate predictor.
GRADIENT BOOSTING
● Gradient boosting is a machine learning technique for regression and
classification problems, which produces a prediction model in the form of an
ensemble of weak prediction models, typically decision trees.
● The objective of any supervised learning algorithm is to define a loss function
and minimize it. Let’s see how the math works out for the gradient boosting
algorithm. Say we have mean squared error (MSE) as the loss, defined as
Loss = MSE = sum over i of (yi − yi^p)^2, where yi is the i-th target value and
yi^p is the i-th predicted value.
● We want our predictions such that our loss function (MSE) is at a minimum. By
using gradient descent and updating our predictions based on a learning
rate, we can find the values where the MSE is minimum.
● The gradient of the MSE with respect to a prediction is −2(yi − yi^p), which is
proportional to the residual. So, we are basically updating the predictions such
that the sum of our residuals is close to 0 (or minimal) and the predicted
values are sufficiently close to the actual values.
● So, the intuition behind gradient boosting algorithms is to repetitively
leverage the patterns in residuals and strengthen a model with weak
predictions and make it better. Once we reach a stage where residuals do not
have any pattern that could be modeled, we can stop modeling residuals
(otherwise it might lead to overfitting). Algorithmically, we are minimizing
our loss function, such that test loss reaches its minimum.
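The residual-fitting loop can be sketched on a toy 1-D regression problem; the fixed-split stump below stands in for the decision trees a real gradient boosting model would fit (the data and the split point are made up):

```python
X = [1, 2, 3, 4]
y = [3.0, 5.0, 9.0, 11.0]

def fit_stump(X, residuals, split=2.5):
    # weak learner: fixed-split stump predicting the mean residual on each side
    left  = [r for x, r in zip(X, residuals) if x <  split]
    right = [r for x, r in zip(X, residuals) if x >= split]
    lmean, rmean = sum(left) / len(left), sum(right) / len(right)
    return lambda x: lmean if x < split else rmean

pred = [sum(y) / len(y)] * len(y)   # start from the mean prediction
lr = 0.5                            # learning rate
for _ in range(10):
    residuals = [yi - pi for yi, pi in zip(y, pred)]   # pattern left to explain
    h = fit_stump(X, residuals)                        # model the residuals
    pred = [pi + lr * h(xi) for pi, xi in zip(pred, X)]
```

Each round shrinks the residuals; the predictions converge toward the per-side means (4 and 10), after which the remaining residuals have no pattern this stump can model, which is the stopping condition described above.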
STACKING
● Stacking is another ensemble model, where a new model is trained from the
combined predictions of two (or more) previous models.
● The predictions from the models are used as inputs for each sequential layer,
and combined to form a new set of predictions. These can be used on
additional layers, or the process can stop here with a final result.
● Ensemble stacking can be referred to as blending, because all the numbers are
blended to produce a prediction or classification.
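A minimal stacking/blending sketch: the predictions of two hypothetical base models are combined by a level-1 "meta-model", here just the best convex blend found by a grid scan (all numbers are made up):

```python
y      = [1.0, 2.0, 3.0, 4.0]
pred_a = [1.1, 2.1, 2.9, 4.2]    # hypothetical level-0 model A predictions
pred_b = [0.8, 2.3, 3.1, 3.8]    # hypothetical level-0 model B predictions

def mse(p):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(y)

# level-1 meta-model: the blend w*A + (1-w)*B minimizing MSE, by grid scan
best_mse, best_w = min(
    ((mse([w / 100 * a + (1 - w / 100) * b for a, b in zip(pred_a, pred_b)]),
      w / 100)
     for w in range(101)),
    key=lambda t: t[0],
)
blended = [best_w * a + (1 - best_w) * b for a, b in zip(pred_a, pred_b)]
```

A full stacking setup would fit the meta-model on held-out predictions to avoid leakage, but even this crude blend can do no worse than the better of the two base models, since w = 0 and w = 1 are in the scan.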