RB's ML2 Notes
Decision Trees: A decision tree is a graph that uses branching methods to illustrate a course of action and various outcomes. At each node a variable is evaluated to decide which path to follow. It is a way to display an algorithm with only conditional control statements.
Elements: Root Node, Branch, Decision Node, Leaf Node
The process of creating a decision tree is as follows:
1. Select the best variable for the split
2. Apply the decision parameters and split
3. Repeat the process for the subsets obtained
4. Continue the process until a stopping criterion is reached
Note: It is a top-down and greedy approach
Classification trees are built for unordered/discrete dependent variables. The method used is recursive partitioning, i.e. using certain parameters, the data set is partitioned over and over again until you reach a state of homogeneity. Classification trees are usually fast and easily interpreted. They aim to classify objects into one or more categories by taking the mode of the subset's dependent variable.
Regression trees are built for continuous dependent variables; they create similar subsets and then assign the average of each subset as the value of the dependent variable. For regression trees, the variance reduction method is employed for splitting. This involves measuring the net change in variance when you split a node. For the final leaf nodes, take the average of the output variable and assign it to any new observation that falls in that leaf.
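As a small illustration (using a synthetic, purely illustrative dataset), sklearn's DecisionTreeRegressor splits by variance reduction and predicts the mean target value of the training samples in each leaf; in recent sklearn versions the corresponding criterion is named "squared_error":
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic, purely illustrative data: a noisy sine curve
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 6, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

# "squared_error" splits by variance reduction; each leaf predicts the
# mean target value of the training samples that fall into it
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=3)
reg.fit(X, y)
print(reg.predict([[1.5], [4.5]]))  # leaf averages for two new observations
```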
Homogeneity – It is used to assess whether subsequent partitions of a dataset are homogeneous (pure) or not
Impurity – The opposite of homogeneity, i.e. as you subset further or go down the tree, impurity should decrease.
Both Gini and Entropy are measures to quantify impurity of a node. A node having multiple classes is impure whereas a node
having only one class is pure.
Gini Impurity is used in decision tree algorithms to decide the optimal split from a root node, and subsequent splits. It tells us the probability of misclassifying an observation.
GI = ∑ P(x) * (1 – P(x)); GI lies between 0 and 0.5 (for two classes); the lower the GI, the higher the purity.
Entropy quantifies the amount of impurity present in any random variable.
Entropy = –∑ P(x) * log₂ P(x); entropy lies between 0 and 1 (for two classes); the lower, the better.
Information gain is calculated by comparing the net change in entropy before and after a split is created in the decision tree.
Information Gain = Entropy (parent) – [Weighted avg] * Entropy (children), higher the better
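To make these three measures concrete, here is a minimal NumPy sketch (the function names and the toy labels are my own, purely for illustration):
```python
import numpy as np

def gini(labels):
    """Gini impurity: sum over classes of p * (1 - p)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p * (1 - p))

def entropy(labels):
    """Entropy: -sum over classes of p * log2(p)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    """Entropy(parent) minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

parent = [0, 0, 1, 1]                               # a perfectly mixed node
print(gini(parent), entropy(parent))                # 0.5 1.0
print(information_gain(parent, [[0, 0], [1, 1]]))   # 1.0 for a perfect split
```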
The DecisionTreeClassifier function in sklearn provides the following hyperparameters which control the tree:
1. max_depth – the maximum allowed depth of the tree
2. min_samples_split – the minimum number of samples needed in order to split a node
3. min_samples_leaf – the minimum allowed sample size of the resultant leaves
4. criterion – the impurity measure, e.g. Gini impurity or entropy
Some commonly used algorithms for decision trees are as follows:
CART (Classification and Regression Tree) – a classic supervised learning model that produces only binary splits
ID3 (Iterative Dichotomiser) – maximises information gain, i.e. minimises entropy
C4.5 & C5 – advanced versions of ID3 that can handle both continuous and categorical variables
CHAID (Chi-squared Automatic Interaction Detector) – a supervised model that produces multiple (non-binary) splits
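A minimal sketch of passing these hyperparameters to sklearn's DecisionTreeClassifier (the dataset and the particular values are illustrative, not recommendations):
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Hyperparameter values below are illustrative only
clf = DecisionTreeClassifier(
    criterion="gini",        # impurity measure: "gini" or "entropy"
    max_depth=3,             # maximum depth of the tree
    min_samples_split=10,    # minimum samples required to split a node
    min_samples_leaf=5,      # minimum samples required in a leaf
    random_state=42,
)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```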
Model Selection
Model selection is a process of selecting the best model for a problem statement from a plethora of ML models
Classification of Models
Based on learning algorithm (K-NN, Naive Bayes, Decision Tree, etc.)
Based on hyperparameters (K in K-NN, max tree depth in a decision tree, etc.)
Underfitting means not being able to extract information from the data. An underfit model fails to adequately capture the relationship between the input values and the target variables.
Ex. The model function does not have enough complexity (parameters) to fit the true function correctly
What are the causes of underfitting?
1. Due to scarcity of data, the model may not capture the true relationship between the source and target variables
2. The model is not good enough to learn the underlying trend in the data
3. For a true polynomial relationship, trying to implement a linear regression model would lead to a bad model
How to fix an underfitted model?
1. Include more parameters and information in the data set, i.e. gather more useful information to help the model understand the trend and make better predictions
2. Add complexity via inbuilt parameters, or choose a completely new, more complex model like KNN or a decision tree to capture the trend in the data
Overfitting refers to a model that models the training data too well. It happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model. This means that the noise or random fluctuations in the training data are picked up and learned as concepts by the model. The problem is that these concepts do not apply to new data and negatively impact the model's ability to generalize.
Ex. Decision trees are a nonparametric machine learning algorithm that is very flexible and is subject to overfitting the training data. This problem can be addressed by pruning a tree.
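A minimal sketch of addressing that overfitting by constraining and pruning the tree, assuming a recent sklearn version (the dataset, depth, and ccp_alpha value are illustrative):
```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unpruned tree: tends to fit the training data (near) perfectly
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pruned tree: limit depth and apply cost-complexity pruning
pruned = DecisionTreeClassifier(max_depth=4, ccp_alpha=0.01,
                                random_state=0).fit(X_train, y_train)

for name, model in [("unpruned", full), ("pruned", pruned)]:
    print(name, model.score(X_train, y_train), model.score(X_test, y_test))
```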
Causes of overfitting?
1. An ML model overfits when it captures the noise of the data.
2. Rather than focusing on the underlying trend, it moves towards fitting every point in the dataset.
3. Highly complex models are vulnerable to overfitting.
4. A large number of input features increases model complexity and hurts its ability to generalise.
How to fix an overfitted model?
1. Adding more training data to the model would improve its ability to distinguish an outlier from a signal, thus helping it to fit better.
2. Removing features would reduce the complexity of the model and help it learn better from the more important features.
Underfitting
1. It fails to extract the trend in the data
2. An underfitted model shows consistently high errors in prediction over similar datasets
3. The model has a biased assumption that its limited features are highly related to the target variable
4. The errors in prediction are consistently high over every dataset
Overfitting
1. The model tends to learn the data instead of capturing the trend
2. An overfitted model shows good prediction accuracy on the train dataset but high errors in prediction over the test dataset
3. The model is highly sensitive to changes in the information. The high sensitivity is a result of learning predictions from the noise in the dataset
4. A complex prediction pattern implies that the target variable is not highly dependent on a single feature
Signal is the data which supports or transmits the underlying trend in the dataset. These data points occur more frequently and establish the underlying trend of the data.
Noise is data that is basically an outlier. It occurs less frequently and does not contribute to establishing the underlying trend of the data.
Bias
1. Bias is a phenomenon in which the predictions of the model are not close to the actual answer, but the difference between the prediction and the actual answer is consistent.
2. Bias can be understood as a model which is not accurate but is precise in its predictions.
Variance
1. Variance is a phenomenon in which the predictions of the model are close to the actual answer for one dataset but show high errors when tested on another similar dataset.
2. Variance can be understood as a model with good accuracy in its predictions but which shows high variation across similar datasets.
What is bias variance trade-off?
1. Objective – The objective of any supervised model is to attain low variance and low bias to make better predictions
2. The trade-off – Increasing the bias will decrease the variance and vice versa
3. Complexity selection – The parameter selection for an algorithm is always a challenge to maintain a bias-variance equilibrium
Model Stability
Model Optimisation is fine-tuning the model in order to generate high model scores for the given dataset.
Model validation is a method to evaluate the trained model against a test data set. The main objective of using the test data set is to test the ability of the trained model to generalize the underlying logic.
What are the drawbacks of train_test_split?
Randomness of split – Since the split of data is random, there is a possibility that the split might be unhealthy for the model.
o For example, the data split is such that the training data has all the signal points while the test data holds a majority of the noise points.
Model performance – Model scores are not consistent and change with every iteration of the function. Thus, you cannot decide which model fits best for the particular dataset.
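A minimal sketch of that inconsistency: the same model scored on different random splits of the same data gives different results (dataset and seeds are illustrative):
```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# The test score changes with every different random split
for seed in [0, 1, 2, 3]:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    print(f"split {seed}: test accuracy = {model.score(X_test, y_test):.3f}")
```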
KFold is a cross validation technique used to evaluate the model in an unbiased way. K in KFold defines the number of splits
in the cross-validation step.
1. Data splitting: The initial sample is randomly divided into k equal-size subsamples
2. Train vs test: A single subsample is retained as validation data, and the remaining k-1 subsamples are used as training data
3. Re-iteration: The process is repeated k times, with each of the k subsamples being used exactly once as test data
4. Averaging: The k results can then be averaged to produce a single estimate
Advantages of using KFold cross-validation?
1. Uniformity: All observations are used for both training and validation, hence the issue of irregular data splits is removed
2. Unbiased validation: Every observation is used exactly once for validation, which avoids a bias towards some data points and minimizes the chance of overfitting
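A minimal sketch of k-fold cross-validation with sklearn (the dataset, model, and k=5 are illustrative choices):
```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# 5-fold CV: each observation is used exactly once for validation
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=kf)
print(scores)          # one score per fold
print(scores.mean())   # averaged into a single estimate
```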
Feature selection is a process to select the best feature combination to improve the efficiency of the ML model.
Advantages of feature selection?
1. Noise reduction – A large number of features and a low sample/feature ratio adds noise to the dataset
2. Effort reduction – Reducing the number of features minimizes running time and saves a good amount of computational power
3. Simplicity – A reduced feature set is easier for humans to understand and allows you to concentrate on the key sources of predictability
Different filter methods to perform feature selection:
SelectKBest
1. Use the appropriate test for the variable types (chi-square, ANOVA, correlation) to evaluate the importance of the different features to the target variable
2. Select the 'K' best features out of all which are most highly related to the target variable
Correlation matrix
1. Establish a correlation matrix amongst the variables of the dataset
2. Eliminate redundant features based on their absolute correlation values to the target variable
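A minimal sketch of both filter methods (the dataset, the scoring test, and k=10 are illustrative choices):
```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)

# SelectKBest: score each feature against the target with the ANOVA F-test
# and keep only the 10 highest-scoring features (k is illustrative)
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)
print(X.shape, "->", X_selected.shape)   # (569, 30) -> (569, 10)

# Correlation-matrix approach: rank features by absolute correlation with
# the target, then drop the weakly related (or mutually redundant) ones
df = pd.DataFrame(X)
print(df.corrwith(pd.Series(y)).abs().sort_values(ascending=False).head())
```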
Ensembles combine the results of various models together to get a model with better accuracy. The outputs from different base learners can be combined using the three techniques given below:
Max voting (classification problems) – The final output is obtained by voting among the outputs of all the models. The class with the maximum votes is considered the predicted class.
Average (regression problems) – The final output is obtained by taking the mean of the predicted outputs of all the models.
Weighted average (regression) – The final output is obtained by taking the weighted mean of the predicted outputs of all the models. The base models with higher accuracy tend to have more weight.
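A minimal NumPy sketch of the three combination rules (the base-model predictions and weights below are made up purely for illustration):
```python
import numpy as np

# Hypothetical predictions from three base models (rows) for several observations
clf_preds = np.array([[0, 1, 1, 0],      # model A (predicted classes)
                      [0, 1, 0, 0],      # model B
                      [1, 1, 1, 0]])     # model C
reg_preds = np.array([[2.0, 3.5, 1.0],   # model A (continuous outputs)
                      [2.4, 3.1, 1.2],   # model B
                      [1.8, 3.8, 0.9]])  # model C

# Max voting: the most frequent class across models wins
votes = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, clf_preds)
print(votes)                             # [0 1 1 0]

# Simple average for regression
print(reg_preds.mean(axis=0))

# Weighted average: the more accurate models get larger weights
weights = np.array([0.5, 0.3, 0.2])
print(np.average(reg_preds, axis=0, weights=weights))
```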
An optimal way of assigning the weights to different base learners while combining the outputs is an automated
approach known as stacking. Stacking uses different ML algorithms to build first-level learners and then the outputs
from these are given as inputs to another model called the meta-learner to form a second level learner. It is also
called Stacked Generalization.
Why Stacking? – combining o/p of multiple models manually is not optimal.
Solution – Multiple models make the intermediate predictions on the training data. Then a new model uses
these predictions to make the final prediction. The model is said to be stacked on top of other models.
Steps for the working of the stacking algorithm are:
1. Train-test split – Split the data into training (learning model) and testing (prediction model) datasets
2. Base model building – Base models are built using different ML algorithms like KNN, SVM etc. to make predictions
3. Data set creation – A new data set is created using the predictions made by the base models as independent variables and the original target as the variable to be predicted
4. Aggregating the models – A new model, known as the meta estimator, is built on the newly created dataset to make the final predictions
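A minimal sketch of stacking with sklearn's StackingClassifier (available in recent sklearn versions; the dataset, base learners, and meta-learner are illustrative choices):
```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# First-level (base) learners built with different algorithms
base_learners = [
    ("knn", KNeighborsClassifier()),
    ("svm", SVC()),
    ("tree", DecisionTreeClassifier(random_state=0)),
]

# Meta-learner (second level) is trained on the base learners' predictions
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(max_iter=1000))
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))
```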
The first two types of ensemble models, classified on the basis of their model type are:
Homogeneous – Ensemble methods using the same learning algorithm for all base learners, i.e. learners of the same type
Heterogeneous – Ensemble methods which use different learning algorithms for the base learners
The next two types of ensemble models, classified on the basis of the algorithm:
Parallel
Sequential
Parallel Ensembles
Bootstrapping is a technique of choosing random samples of observations from a dataset with replacement. These samples are then used to train each of the base learners in bagging. The outputs of the base learners are then combined as described below.
Bagging is a parallel ensemble which uses homogeneous base learners; the base learners are independent of each other. It uses bootstrapped aggregation, i.e. a technique of choosing random samples of observations from the dataset with replacement, meaning observations can be repeated in the subsets. The outputs from each base learner are then aggregated using a specific method to predict the result. Bagging is used for both regression (mean/average) and classification (max voting/mode) problems.
Random Forest is a variant of bagging. It combines several decision trees as base learners to build an ensemble learner. Random forest selects a random sample of data points (a bootstrap sample) to build each tree, and a random sample of features while splitting a node. Randomly selecting features ensures that each tree is diverse. Please note that bagging is a general technique and does not pertain to random forests only. On the other hand, random forest is a specific type of bagging algorithm that only uses decision trees as the base learners.
The process flow of the bagging algorithm:
1. Train-test split – Split data into training and testing
2. Bootstrapping – Split training data into subsets by randomly selecting data points with replacement
3. Build base learners – For each sample, build a base learner using the same algorithm and the same hyperparameters
4. Aggregation – The final prediction is made by averaging or voting among all the predictions made by the base learners
The process flow of the random forest algorithm:
1. Train-test split – Split data into training and testing
2. Bootstrapping – Split training data into subsets by randomly selecting data points with replacement
3. Feature subsets – For each sample, build a decision tree by choosing random subsets of features for splitting
4. Aggregation – The final prediction is made by averaging or voting among all the predictions made by each decision tree
Benefits of bagging –
1. Works well with high-variance algorithms like decision trees, KNN, SVM, neural networks
2. Easy to parallelize
3. Faster execution
Limitations –
1. Loss of interpretability
2. If some features dominate, the models end up being similar
Why random forest?
1. Decision trees tend to overfit
2. RF introduces more generalization in the model
3. Introduces diversity by random sampling of training data points and features
4. Combines the outputs of several decision trees
Sequential Ensembles
Boosting is a sequential ensemble technique which combines several weak learners iteratively to build a strong
learner. Each learner at a given iteration depends upon the learner at the previous iteration.
There are following three popular boosting methods:
1. Adaptive boosting
2. Gradient boosting
3. XGBoost
Adaptive boosting leverages the ability to build a series of models specifically targeted at the data points that have
been incorrectly predicted by the other models in the ensemble.
The important steps in the working of the adaptive boosting algorithm are mentioned below:
1. Equal weights are assigned to all training samples (initial weight distribution S_t).
2. A base learner B_t: X -> Y is generated by calling the base learning algorithm.
3. The weights of the incorrectly classified examples are increased.
4. The updated weight distribution S_(t+1) is obtained and another base learner is generated.
5. The process is repeated T times and the final learner is derived by weighted majority voting of the T base learners.
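A minimal sketch of adaptive boosting with sklearn's AdaBoostClassifier (the dataset, the depth-1 base learner, and the hyperparameter values are illustrative; in sklearn versions before 1.2 the argument is named base_estimator rather than estimator):
```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Weak learners are depth-1 trees (stumps); each successive stump focuses on
# the samples the previous ones misclassified, and the final prediction is a
# weighted majority vote of all T stumps
ada = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=100, learning_rate=0.5, random_state=0)
ada.fit(X_train, y_train)
print(ada.score(X_test, y_test))
```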