RB's ML2 Notes
Decision Trees: A decision tree is a graph that uses branching methods to illustrate a course of action and various outcomes. At each node a variable is evaluated to decide which path to follow. It is a way to display an algorithm with only conditional control statements.
Elements: Root Node, Branch, Decision Node, Leaf Node
The process of creating a decision tree is as follows:
1. Select the best variable for the split
2. Apply the decision parameters and split
3. Repeat the process for the subsets obtained
4. Continue the process until a stopping criterion is reached
Note: It is a top-down and greedy approach
Classification trees are built for unordered/discrete dependent variables. The method used is recursive partitioning, i.e. using certain parameters, the data set is partitioned over and over again until you reach a state of homogeneity. Classification trees are usually fast and easily interpreted. They aim to classify objects into one or more categories by taking the mode of the subset's dependent variable.
Regression trees are built for continuous dependent variables; they create similar subsets and then assign the average of each subset as the value of the dependent variable. For regression trees, the variance reduction method is employed for splitting. This involves measuring the net change in variance when you split a node. For the final leaf nodes, take the average of the output variable and assign it to any new observation that falls in that leaf.
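As a small illustration (using a synthetic, purely illustrative dataset), sklearn's DecisionTreeRegressor splits by variance reduction and predicts the mean target value of the training samples in each leaf; in recent sklearn versions the corresponding criterion is named "squared_error":
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic, purely illustrative data: a noisy sine curve
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 6, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

# "squared_error" splits by variance reduction; each leaf predicts the
# mean target value of the training samples that fall into it
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=3)
reg.fit(X, y)
print(reg.predict([[1.5], [4.5]]))  # leaf averages for two new observations
```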
Homogeneity – It is used to assess whether subsequent partitions of a dataset are homogeneous (pure) or not
Impurity – The opposite of homogeneity, i.e. as you subset further or go down the tree, impurity should decrease.
Both Gini and Entropy are measures to quantify impurity of a node. A node having multiple classes is impure whereas a node
having only one class is pure.
Gini Impurity is used in decision tree algorithms to decide the optimal split from a root node, and subsequent splits. It tells us the probability of misclassifying an observation.
GI = ∑ P(x) * (1 – P(x)); GI lies between 0 and 0.5 (for two classes); the lower the GI, the higher the purity.
Entropy quantifies the amount of impurity present in any random variable.
Entropy = –∑ P(x) * log₂ P(x); entropy lies between 0 and 1 (for two classes); the lower, the better.
Information gain is calculated by comparing the net change in entropy before and after a split is created in the decision tree.
Information Gain = Entropy (parent) – [Weighted avg] * Entropy (children), higher the better
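To make these three measures concrete, here is a minimal NumPy sketch (the function names and the toy labels are my own, purely for illustration):
```python
import numpy as np

def gini(labels):
    """Gini impurity: sum over classes of p * (1 - p)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p * (1 - p))

def entropy(labels):
    """Entropy: -sum over classes of p * log2(p)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    """Entropy(parent) minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

parent = [0, 0, 1, 1]                               # a perfectly mixed node
print(gini(parent), entropy(parent))                # 0.5 1.0
print(information_gain(parent, [[0, 0], [1, 1]]))   # 1.0 for a perfect split
```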
The DecisionTreeClassifier function in sklearn provides the following hyperparameters which control the tree:
1. max_depth – the maximum allowed depth of the tree
2. min_samples_split – the minimum number of samples needed in order to split a node
3. min_samples_leaf – the minimum allowed sample size of the resultant leaves
4. criterion – the impurity measure, e.g. Gini impurity or entropy
Some commonly used algorithms for decision trees are as follows:
CART (Classification and Regression Tree) – a classic supervised learning model that produces only binary splits
ID3 (Iterative Dichotomiser) – maximises information gain, i.e. minimises entropy
C4.5 & C5 – advanced versions of ID3 that can handle both continuous and categorical variables
CHAID (Chi-squared Automatic Interaction Detector) – a supervised model that produces multiple (non-binary) splits
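A minimal sketch of passing these hyperparameters to sklearn's DecisionTreeClassifier (the dataset and the particular values are illustrative, not recommendations):
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Hyperparameter values below are illustrative only
clf = DecisionTreeClassifier(
    criterion="gini",        # impurity measure: "gini" or "entropy"
    max_depth=3,             # maximum depth of the tree
    min_samples_split=10,    # minimum samples required to split a node
    min_samples_leaf=5,      # minimum samples required in a leaf
    random_state=42,
)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```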
Model Selection
Model selection is a process of selecting the best model for a problem statement from a plethora of ML models
Classification of Models
Based on learning algorithm (K-NN, Naive Bayes, Decision Tree, etc.)
Based on hyperparameters (K in K-NN, max tree depth in a decision tree, etc.)
Underfitting means not being able to extract information from the data. An underfit model fails to adequately capture the relationship between the input values and the target variables.
Ex. The model function does not have enough complexity (parameters) to fit the true function correctly
What are the causes of underfitting?
1. Due to scarcity of data, the model may not capture the true relationship between the source and target variables
2. The model is not good enough to learn the underlying trend in the data
3. For a true polynomial relationship, trying to implement a linear regression model would lead to a bad model
How to fix an underfitted model?
1. Include more parameters and information in the data set, i.e. gather more useful information to help the model understand the trend and make better predictions
2. Add complexity via inbuilt parameters, or choose a completely new, more complex model like KNN or a decision tree to capture the trend in the data
Overfitting refers to a model that models the training data too well. It happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model. This means that the noise or random fluctuations in the training data are picked up and learned as concepts by the model. The problem is that these concepts do not apply to new data and negatively impact the model's ability to generalize.
Ex. Decision trees are a nonparametric machine learning algorithm that is very flexible and is subject to overfitting the training data. This problem can be addressed by pruning a tree.
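A minimal sketch of addressing that overfitting by constraining and pruning the tree, assuming a recent sklearn version (the dataset, depth, and ccp_alpha value are illustrative):
```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unpruned tree: tends to fit the training data (near) perfectly
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pruned tree: limit depth and apply cost-complexity pruning
pruned = DecisionTreeClassifier(max_depth=4, ccp_alpha=0.01,
                                random_state=0).fit(X_train, y_train)

for name, model in [("unpruned", full), ("pruned", pruned)]:
    print(name, model.score(X_train, y_train), model.score(X_test, y_test))
```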
Causes of overfitting?
1. An ML model overfits when it captures the noise of the data.
2. Rather than focusing on the underlying trend, it moves towards fitting every point in the dataset.
3. Highly complex models are vulnerable to overfitting.
4. A large number of input features increases model complexity and hurts its ability to generalise.
How to fix an overfitted model?
1. Adding more training data to the model would improve its ability to distinguish an outlier from a signal, thus helping it to fit better.
2. Removing features would reduce the complexity of the model and help it learn better from the more important features.
Underfitting
1. It fails to extract the trend in the data
2. An underfitted model shows consistently high errors in prediction over similar datasets
3. The model has a biased assumption that its limited features are highly related to the target variable
4. The errors in prediction are consistently high over every dataset
Overfitting
1. The model tends to learn the data instead of capturing the trend
2. An overfitted model shows good prediction accuracy on the train dataset but high errors in prediction over the test dataset
3. The model is highly sensitive to changes in the information. The high sensitivity is a result of learning predictions from the noise in the dataset
4. A complex prediction pattern implies that the target variable is not highly dependent on a single feature
Signal is the data which supports or transmits the underlying trend in the dataset. These data points occur more frequently and establish the underlying trend of the data.
Noise is data that is basically an outlier. It occurs less frequently and does not contribute to establishing the underlying trend of the data.
Bias
1. Bias is a phenomenon in which the predictions of the model are not close to the actual answer, but the difference between the prediction and the actual answer is consistent.
2. Bias can be understood as a model which is not accurate but is precise in its predictions.
Variance
1. Variance is a phenomenon in which the predictions of the model are close to the actual answer for one dataset but show high errors when tested on another similar dataset.
2. Variance can be understood as a model with good accuracy in its predictions but which shows high variation across similar datasets.
What is bias variance trade-off?
1. Objective – The objective of any supervised model is to attain low variance and low bias to make better predictions
2. The trade-off – Increasing the bias will decrease the variance and vice versa
3. Complexity selection – The parameter selection for an algorithm is always a challenge to maintain a bias-variance equilibrium
Model Stability
Model Optimisation is fine-tuning the model in order to generate high model scores for the given dataset.
Model validation is a method to evaluate the trained model against a test data set. The main objective of using the test data set is to test the ability of the trained model to generalize the underlying logic.
What are the drawbacks of train_test_split?
Randomness of split – Since the split of data is random, there is a possibility that the split might be unhealthy for the model.
o For example, the data split is such that the training data has all the signal points while the test data holds a majority of the noise points.
Model performance – Model scores are not consistent and change with every iteration of the function. Thus, you cannot decide which model fits best for the particular dataset.
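A minimal sketch of that inconsistency: the same model scored on different random splits of the same data gives different results (dataset and seeds are illustrative):
```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# The test score changes with every different random split
for seed in [0, 1, 2, 3]:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    print(f"split {seed}: test accuracy = {model.score(X_test, y_test):.3f}")
```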
KFold is a cross validation technique used to evaluate the model in an unbiased way. K in KFold defines the number of splits
in the cross-validation step.
1. Data splitting: The initial sample is randomly divided into k equal-size subsamples
2. Train vs test: A single subsample is retained as validation data, and the remaining k-1 subsamples are used as training data
3. Re-iteration: The process is repeated k times, with each of the k subsamples being used exactly once as test data
4. Averaging: The k results can then be averaged to produce a single estimate
Advantages of using KFold cross-validation?
1. Uniformity: All observations are used for both training and validation, hence the issue of irregular data splits is removed
2. Unbiased validation: Every observation is used exactly once for validation, which avoids a bias towards some data points and minimizes the chance of overfitting
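A minimal sketch of k-fold cross-validation with sklearn (the dataset, model, and k=5 are illustrative choices):
```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# 5-fold CV: each observation is used exactly once for validation
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=kf)
print(scores)          # one score per fold
print(scores.mean())   # averaged into a single estimate
```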
Feature selection is a process to select the best feature combination to improve the efficiency of the ML model.
Advantages of feature selection?
1. Noise reduction – A large number of features and a low sample/feature ratio adds noise to the dataset
2. Effort reduction – Reducing the number of features minimizes running time and saves a good amount of computational power
3. Simplicity – A reduced feature set is easier for humans to understand and allows you to concentrate on the key sources of predictability
Different filter methods to perform feature selection:
SelectKBest
1. Use the appropriate test for the variable types (chi-square, ANOVA, correlation) to evaluate the importance of the different features to the target variable
2. Select the 'K' best features out of all which are most highly related to the target variable
Correlation matrix
1. Establish a correlation matrix amongst the variables of the dataset
2. Eliminate redundant features based on their absolute correlation values to the target variable
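A minimal sketch of both filter methods (the dataset, the scoring test, and k=10 are illustrative choices):
```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)

# SelectKBest: score each feature against the target with the ANOVA F-test
# and keep only the 10 highest-scoring features (k is illustrative)
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)
print(X.shape, "->", X_selected.shape)   # (569, 30) -> (569, 10)

# Correlation-matrix approach: rank features by absolute correlation with
# the target, then drop the weakly related (or mutually redundant) ones
df = pd.DataFrame(X)
print(df.corrwith(pd.Series(y)).abs().sort_values(ascending=False).head())
```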
Ensembles combine the results of various models together to get a model with better accuracy. The outputs from different base learners can be combined using the three techniques given below:
Max voting (classification problems) – The final output is obtained by voting among the outputs of all the models. The class with the maximum votes is considered the predicted class.
Average (regression problems) – The final output is obtained by taking the mean of the predicted outputs of all the models.
Weighted average (regression) – The final output is obtained by taking the weighted mean of the predicted outputs of all the models. The base models with higher accuracy tend to have more weight.
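A minimal NumPy sketch of the three combination rules (the base-model predictions and weights below are made up purely for illustration):
```python
import numpy as np

# Hypothetical predictions from three base models (rows) for several observations
clf_preds = np.array([[0, 1, 1, 0],      # model A (predicted classes)
                      [0, 1, 0, 0],      # model B
                      [1, 1, 1, 0]])     # model C
reg_preds = np.array([[2.0, 3.5, 1.0],   # model A (continuous outputs)
                      [2.4, 3.1, 1.2],   # model B
                      [1.8, 3.8, 0.9]])  # model C

# Max voting: the most frequent class across models wins
votes = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, clf_preds)
print(votes)                             # [0 1 1 0]

# Simple average for regression
print(reg_preds.mean(axis=0))

# Weighted average: the more accurate models get larger weights
weights = np.array([0.5, 0.3, 0.2])
print(np.average(reg_preds, axis=0, weights=weights))
```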
An optimal way of assigning the weights to different base learners while combining the outputs is an automated
approach known as stacking. Stacking uses different ML algorithms to build first-level learners and then the outputs
from these are given as inputs to another model called the meta-learner to form a second level learner. It is also
called Stacked Generalization.
Why Stacking? – combining o/p of multiple models manually is not optimal.
Solution – Multiple models make the intermediate predictions on the training data. Then a new model uses
these predictions to make the final prediction. The model is said to be stacked on top of other models.
Steps for the working of the stacking algorithm are:
1. Train-test split – Split the data into training (learning model) and testing (prediction model) datasets
2. Base model building – Base models are built using different ML algorithms like KNN, SVM etc. to make predictions
3. Data set creation – A new data set is created using the predictions made by the base models as independent variables and the original target as the variable to be predicted
4. Aggregating the models – A new model, known as the meta estimator, is built on the newly created dataset to make the final predictions
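A minimal sketch of stacking with sklearn's StackingClassifier (available in recent sklearn versions; the dataset, base learners, and meta-learner are illustrative choices):
```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# First-level (base) learners built with different algorithms
base_learners = [
    ("knn", KNeighborsClassifier()),
    ("svm", SVC()),
    ("tree", DecisionTreeClassifier(random_state=0)),
]

# Meta-learner (second level) is trained on the base learners' predictions
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(max_iter=1000))
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))
```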
The first two types of ensemble models, classified on the basis of their model type are:
Homogeneous – Ensemble methods using the same learning algorithm for all base learners, i.e. learners of the same type
Heterogeneous – Ensemble methods which use different learning algorithms for the base learners
The next two types of ensemble models, classified on the basis of the algorithm:
Parallel
Sequential
Parallel Ensembles
Bootstrapping is a technique of choosing random samples of observations from a dataset with replacement. These samples are then used to train each of the base learners in bagging. The outputs of the base learners are then combined as described below.
Bagging is a parallel ensemble which uses homogeneous base learners; the base learners are independent of each other. It uses bootstrapped aggregation, i.e. a technique of choosing random samples of observations from the dataset with replacement, meaning observations can be repeated in the subsets. The outputs from each base learner are then aggregated using a specific method to predict the result. Bagging is used for both regression (mean/average) and classification (max voting/mode) problems.
Random Forest is a variant of bagging. It combines several decision trees as base learners to build an ensemble learner. Random forest selects a random sample of data points (a bootstrap sample) to build each tree, and a random sample of features while splitting a node. Randomly selecting features ensures that each tree is diverse. Please note that bagging is a general technique and does not pertain to random forests only. On the other hand, random forest is a specific type of bagging algorithm that only uses decision trees as the base learners.
The process flow of the bagging algorithm:
1. Train-test split – Split data into training and testing
2. Bootstrapping – Split training data into subsets by randomly selecting data points with replacement
3. Build base learners – For each sample, build a base learner using the same algorithm and the same hyperparameters
4. Aggregation – The final prediction is made by averaging or voting among all the predictions made by the base learners
The process flow of the random forest algorithm:
1. Train-test split – Split data into training and testing
2. Bootstrapping – Split training data into subsets by randomly selecting data points with replacement
3. Feature subsets – For each sample, build a decision tree by choosing random subsets of features for splitting
4. Aggregation – The final prediction is made by averaging or voting among all the predictions made by each decision tree
Benefits of bagging –
1. Works well with high-variance algorithms like decision trees, KNN, SVM, neural networks
2. Easy to parallelize
3. Faster execution
Limitations –
1. Loss of interpretability
2. If some features dominate, the models end up being similar
Why random forest?
1. Decision trees tend to overfit
2. RF introduces more generalization in the model
3. Introduces diversity by random sampling of training data points and features
4. Combines the outputs of several decision trees
Sequential Ensembles
Boosting is a sequential ensemble technique which combines several weak learners iteratively to build a strong
learner. Each learner at a given iteration depends upon the learner at the previous iteration.
There are following three popular boosting methods:
1. Adaptive boosting
2. Gradient boosting
3. XGBoost
Adaptive boosting leverages the ability to build a series of models specifically targeted at the data points that have
been incorrectly predicted by the other models in the ensemble.
The important steps in the working of the adaptive boosting algorithm are mentioned below:
1. Equal weights are assigned to all training samples (initial weight distribution S_t).
2. A base learner B_t: X -> Y is generated by calling the base learning algorithm.
3. The weights of the incorrectly classified examples are increased.
4. The updated weight distribution S_(t+1) is obtained and another base learner is generated.
5. The process is repeated T times and the final learner is derived by weighted majority voting of the T base learners.
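A minimal sketch of adaptive boosting with sklearn's AdaBoostClassifier (the dataset, the depth-1 base learner, and the hyperparameter values are illustrative; in sklearn versions before 1.2 the argument is named base_estimator rather than estimator):
```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Weak learners are depth-1 trees (stumps); each successive stump focuses on
# the samples the previous ones misclassified, and the final prediction is a
# weighted majority vote of all T stumps
ada = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=100, learning_rate=0.5, random_state=0)
ada.fit(X_train, y_train)
print(ada.score(X_test, y_test))
```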