Lecture Material 11
1 Linear Regression
Linear regression is an algorithm that models a linear relationship between an independent variable
and a dependent variable, and uses that relationship to predict the outcome of future events.
It is a statistical method used in data science and machine learning for predictive analysis.
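To make the idea of a fitted linear relationship concrete, here is a minimal from-scratch sketch of a one-feature least-squares fit. The function name `fit_line` and the toy data are assumptions made for illustration only; the notebook itself relies on scikit-learn's LinearRegression.

```python
# Minimal sketch of simple linear regression (one feature), standard library only.
# The data is made up for illustration; it is not the tips dataset.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]  # generated from y = 2x + 1, no noise

slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0

# Predict the dependent variable for an unseen value of x
prediction = slope * 6.0 + intercept
print(prediction)  # 13.0
```

The slope and intercept play the same role as `coef_` and `intercept_` on a fitted scikit-learn model.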
Multiple Linear Regression
Multiple linear regression uses more than one independent variable to predict a single dependent variable.
[ ]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# metrics used by the evaluation cells further down
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
[ ]: df=sns.load_dataset('tips')
df.head()
[ ]: # Use standard scaler to scale the features
# scaler=StandardScaler()
# X=scaler.fit_transform(X)
[ ]: y=df["tip"]
# Use standard scaler to scale the target (scalers expect a 2D array)
# scaler=StandardScaler()
# y=scaler.fit_transform(y.values.reshape(-1, 1))
# Or use min max scaler to scale the target
# scaler=MinMaxScaler()
# y=scaler.fit_transform(y.values.reshape(-1, 1))
[ ]: print(model.coef_)
print(model.intercept_)
[ ]: model.predict([[15]])
4 Mean Absolute Error (MAE)
Mean Absolute Error is an evaluation metric used to calculate the accuracy of a regression model.
MAE measures the average absolute difference between the predicted values and actual values.
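The definition above can be sketched directly in a few lines. The helper name `mean_absolute_error_manual` and the sample values are assumptions for illustration; in the notebook, `sklearn.metrics.mean_absolute_error(y_test, y_pred)` would be the usual way to get the same quantity.

```python
# Minimal sketch of MAE: the average of |actual - predicted|.

def mean_absolute_error_manual(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [3.0, 2.0, 4.5, 1.0]   # made-up actual values
y_pred = [2.5, 2.0, 5.0, 2.0]   # made-up predictions

mae = mean_absolute_error_manual(y_true, y_pred)
print(mae)  # 0.5  (absolute errors are 0.5, 0.0, 0.5, 1.0)
```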
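Root Mean Squared Error (RMSE)
The Root Mean Squared Error is the square root of the mean of the squared residuals (which equals the
residuals’ variance when the residuals average to zero). It describes how well the observed data points
match the predicted values, or the model’s absolute fit to the data.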
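As a small worked example, RMSE can be computed by hand as the square root of the mean squared error. The helper name `rmse` and the sample values are assumptions for illustration; the notebook obtains the same number with `np.sqrt(mean_squared_error(y_test, y_pred))`.

```python
import math

# Minimal sketch of RMSE: square root of the mean of the squared errors.

def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


y_true = [3.0, 2.0, 4.0, 1.0]   # made-up actual values
y_pred = [1.0, 2.0, 4.0, 3.0]   # made-up predictions, two of them off by 2

rmse_value = rmse(y_true, y_pred)
print(rmse_value)  # about 1.414, since the mean squared error is 2.0
```

For the same data the mean absolute error is 1.0, so RMSE is larger: squaring gives big errors more weight than small ones.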
[ ]: print(mean_squared_error(y_test, y_pred))
print(r2_score(y_test, y_pred))
print(np.sqrt(mean_squared_error(y_test, y_pred)))
[ ]: # load the saved model
import pickle
with open('linearmodel.pk', 'rb') as f:  # rb = read binary; the file is closed automatically
    model_load = pickle.load(f)
[ ]: model_load.predict([[10]])
[ ]: plt.scatter(X_test, y_test)
plt.plot(X_test, y_pred, color='r')
Note: Re-run the model using the standard scaler and the MinMax scaler on the X variable.
[ ]: df.head()
[ ]: X=df[["total_bill", "size"]]
y=df["tip"]
[ ]: model_multi=LinearRegression()
model_multi.fit(X_train, y_train)
[ ]: model_multi.coef_
[ ]: print(model_multi.intercept_)
[ ]: model_multi.predict([[10, 2]])
[ ]: print(mean_squared_error(y_test, y_pred))
print(r2_score(y_test, y_pred))
print(np.sqrt(mean_squared_error(y_test, y_pred)))
6 Logistic Regression
Logistic regression is a supervised machine learning algorithm used for classification tasks, where
the goal is to predict the probability that an instance belongs to a given class. It is a
statistical algorithm that analyzes the relationship between the input features and the outcome.
For example, with two classes, Class 0 and Class 1, if the value of the logistic function for an input
is greater than 0.5 (the threshold value), the input is assigned to Class 1; otherwise it is assigned to Class 0. It’s referred
to as regression because it is the extension of linear regression but is mainly used for classification
problems.
Key Points:
Logistic regression predicts the output of a categorical dependent variable. Therefore, the outcome
must be a categorical or discrete value. It can be Yes or No, 0 or 1, True or False, etc., but
instead of giving the exact value 0 or 1, the model gives probabilistic values that lie between 0 and
1. In logistic regression, instead of fitting a regression line, we fit an “S”-shaped logistic function
whose output is bounded by the two limiting values 0 and 1.
Logistic Function – Sigmoid Function
The sigmoid function is a mathematical function used to map the predicted values to probabilities.
It maps any real value to a value between 0 and 1. The output of logistic regression must stay
between 0 and 1 and cannot go beyond these limits, so it forms an “S”-shaped curve. This S-shaped
curve is called the sigmoid function or the logistic function, sigmoid(z) = 1 / (1 + e^(-z)). In logistic
regression, we use a threshold value to turn the probability into a class label of 0 or 1:
values above the threshold are assigned to 1, and values below the threshold are assigned
to 0.
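The mapping from a raw score to a probability and then to a class can be sketched as follows. The names `sigmoid` and `predict_class` are assumptions for illustration, and the two-branch form is just one common way to avoid overflow for large negative inputs.

```python
import math

# Minimal sketch of the sigmoid (logistic) function and a 0.5 threshold.

def sigmoid(z):
    """Map any real number z to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use an equivalent form that cannot overflow
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_class(z, threshold=0.5):
    """Return 1 if the predicted probability exceeds the threshold, else 0."""
    return 1 if sigmoid(z) > threshold else 0


print(sigmoid(0.0))          # 0.5, the midpoint of the S-curve
print(predict_class(2.0))    # 1, because sigmoid(2.0) is above 0.5
print(predict_class(-2.0))   # 0, because sigmoid(-2.0) is below 0.5
```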
Binomial: In binomial logistic regression, the dependent variable can take only two possible
values, such as 0 or 1, Pass or Fail, etc.
Multinomial: In multinomial logistic regression, the dependent variable can take 3 or more possible
unordered values, such as “cat”, “dog”, or “sheep”.
Ordinal: In ordinal logistic regression, the dependent variable can take 3 or more possible ordered
values, such as “Low”, “Medium”, or “High”.
[ ]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report, f1_score, precision_score, recall_score)
[ ]: titanic=sns.load_dataset('titanic')
titanic.head()
[ ]: titanic.info()
[ ]: # convert all the categorical and object columns to numerical using a for loop
for col in titanic.columns:
    if titanic[col].dtype=="object" or titanic[col].dtype=="category":
        le=LabelEncoder()
        titanic[col]=le.fit_transform(titanic[col])
[ ]: titanic.head()
[ ]: titanic.isnull().sum()
[ ]: X=titanic.drop('survived', axis=1)
y=titanic['survived']
[ ]: model=LogisticRegression()
[ ]: # fit the model
model.fit(X_train, y_train)
[ ]: y_predict=model.predict(X_test)
[ ]: print(classification_report(y_test, y_predict))
[ ]: plt.figure(figsize=(5, 5))
sns.heatmap(confusion_matrix(y_test, y_predict), annot=True, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('Actual')
9 Accuracy score:
Accuracy score is a metric used to evaluate the performance of a classification model. It measures
the proportion of correctly classified instances (or samples) out of the total number of instances in
the dataset.
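A minimal sketch of this proportion, using made-up labels. The helper name `accuracy` is an assumption for illustration; `sklearn.metrics.accuracy_score(y_test, y_predict)` is the usual way to compute the same value in the notebook.

```python
# Minimal sketch of accuracy: correct predictions / all predictions.

def accuracy(y_true, y_pred):
    """Proportion of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # made-up actual labels
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]   # made-up predicted labels

acc = accuracy(y_true, y_pred)
print(acc)  # 0.7, because 7 of the 10 predictions are correct
```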
10 Recall Score:
Recall score, also known as sensitivity or true positive rate, is a metric used to evaluate the
performance of a classification model, particularly in scenarios where the identification of positive
instances is critical. Recall score measures the proportion of actual positive instances (true positives)
that are correctly identified by the model out of the total number of positive instances in the dataset.
A high recall score indicates that the model is effectively identifying most of the positive instances
in the dataset.
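Recall can be sketched as TP / (TP + FN). The helper name `recall` and the labels below are assumptions for illustration; `sklearn.metrics.recall_score` is the library counterpart.

```python
# Minimal sketch of recall: true positives / all actual positives.

def recall(y_true, y_pred, positive=1):
    """Proportion of actual positives that the model identified."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    # Return 0.0 when there are no actual positives, to avoid dividing by zero
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # made-up actual labels (4 positives)
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]   # made-up predicted labels

rec = recall(y_true, y_pred)
print(rec)  # 0.75, because 3 of the 4 actual positives were found
```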
11 Precision score:
Precision score measures the proportion of correctly predicted positive instances (true positives) out
of all instances that are predicted as positive by the model, regardless of whether they are actually
positive or negative. A high precision score indicates that the model is effectively minimizing false
positives and making accurate positive predictions.
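Precision can be sketched as TP / (TP + FP). The helper name `precision` and the labels below are assumptions for illustration; `sklearn.metrics.precision_score` is the library counterpart.

```python
# Minimal sketch of precision: true positives / all predicted positives.

def precision(y_true, y_pred, positive=1):
    """Proportion of predicted positives that are actually positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    # Return 0.0 when nothing was predicted positive, to avoid dividing by zero
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # made-up actual labels
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]   # made-up predictions (5 predicted positive)

prec = precision(y_true, y_pred)
print(prec)  # 0.6, because 3 of the 5 predicted positives are correct
```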
The F1 score is the harmonic mean of precision and recall. It provides a single score that takes
into account both false positives and false negatives. Because the harmonic mean is used instead of
a simple arithmetic mean, the F1 score is high only when both precision and recall are high.
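The difference between the harmonic and the arithmetic mean is easiest to see with numbers. This is a minimal sketch; the helper name `f1_from` and the precision and recall values are assumptions for illustration, and `sklearn.metrics.f1_score` is the library counterpart when working from labels.

```python
# Minimal sketch of F1 = 2 * P * R / (P + R), the harmonic mean of P and R.

def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


balanced = f1_from(0.6, 0.75)
print(round(balanced, 4))     # 0.6667, close to the arithmetic mean 0.675

# Very high precision cannot compensate for very low recall
unbalanced = f1_from(1.0, 0.1)
print(round(unbalanced, 4))   # 0.1818, far below the arithmetic mean 0.55
```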
12 Confusion Matrix
A confusion matrix is a table that is often used to describe the performance of a classification
model on a set of test data for which the true values are known. It allows visualization of the
performance of an algorithm and provides insights into the types of errors made by the classifier.
The confusion matrix is especially useful for evaluating the performance of a classification model
in binary classification problems, where there are only two classes (e.g., positive and negative).
However, it can also be extended to multiclass classification problems. A confusion matrix typically
consists of four entries:
True Positives (TP): The number of instances that are correctly predicted as belonging to the positive class.
False Positives (FP): The number of instances that are incorrectly predicted as belonging to the positive class when they actually belong to the negative class. Also known as Type I error.
False Negatives (FN): The number of instances that are incorrectly predicted as belonging to the negative class when they actually belong to the positive class. Also known as Type II error.
True Negatives (TN): The number of instances that are correctly predicted as belonging to the negative class.
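The four entries can be counted by hand for a binary problem. This is a minimal sketch; the helper name `confusion_counts` and the labels are assumptions for illustration. The final layout, with rows as actual classes and columns as predicted classes, follows the convention used by `sklearn.metrics.confusion_matrix` for labels 0 and 1.

```python
# Minimal sketch: count TP, FP, FN and TN for binary labels (1 = positive).

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels 0 and 1."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1          # correctly predicted positive
        elif t == 0 and p == 1:
            fp += 1          # Type I error
        elif t == 1 and p == 0:
            fn += 1          # Type II error
        else:
            tn += 1          # correctly predicted negative
    return tp, fp, fn, tn


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # made-up actual labels
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]   # made-up predicted labels

tp, fp, fn, tn = confusion_counts(y_true, y_pred)

# Rows are actual classes (0, 1); columns are predicted classes (0, 1)
matrix = [[tn, fp],
          [fn, tp]]
print(matrix)  # [[4, 2], [1, 3]]
```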
tasks, such as text classification, image classification, spam detection, handwriting identification,
gene expression analysis, face detection, and anomaly detection. SVMs are adaptable and efficient
in a variety of applications because they can manage high-dimensional data and nonlinear relationships.
The main objective of the SVM algorithm is to find the optimal hyperplane in an N-dimensional
space that separates the data points of different classes in the feature space. The hyperplane
is chosen so that the margin between the closest points of different classes is as large as
possible.
Hyperplane: The hyperplane is the decision boundary that separates the data points of
different classes in a feature space. For linear classification, it is given by a linear equation,
i.e. w·x + b = 0.
Support Vectors: Support vectors are the data points closest to the hyperplane;
they play a critical role in determining the hyperplane and the margin.
Margin: The margin is the distance between the support vectors and the hyperplane. The main objective
of the support vector machine algorithm is to maximize the margin. A wider margin generally indicates
better classification performance.
Kernel: A kernel is a mathematical function used in SVM to map the original input data
points into a high-dimensional feature space, so that a separating hyperplane can be found even if
the data points are not linearly separable in the original input space. Some common kernel
functions are linear, polynomial, radial basis function (RBF), and sigmoid.
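The hyperplane and margin definitions can be illustrated numerically. This is a minimal sketch with a hypothetical weight vector `w` and bias `b` chosen by hand, not values learned by an SVM; the helper names are also assumptions. It evaluates w·x + b for a few points, uses the sign as the predicted side, and computes each point's distance to the hyperplane as |w·x + b| / ||w||.

```python
import math

# Hypothetical hyperplane parameters (not fitted): 3*x1 + 4*x2 - 5 = 0
w = [3.0, 4.0]
b = -5.0


def decision_value(x):
    """Value of w.x + b; zero means x lies exactly on the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def distance_to_hyperplane(x):
    """Perpendicular distance from x to the hyperplane w.x + b = 0."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return abs(decision_value(x)) / norm_w


def side(x):
    """Class label by sign of the decision value (+1 or -1)."""
    return 1 if decision_value(x) >= 0 else -1


print(decision_value([3.0, -1.0]))          # 0.0, the point is on the hyperplane
print(side([3.0, 4.0]), distance_to_hyperplane([3.0, 4.0]))   # 1 4.0
print(side([0.0, 0.0]), distance_to_hyperplane([0.0, 0.0]))   # -1 1.0

# In the usual SVM scaling, support vectors satisfy |w.x + b| = 1,
# so the full margin width is 2 / ||w||.
margin_width = 2.0 / math.sqrt(sum(wi * wi for wi in w))
print(margin_width)  # 0.4
```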
Hard Margin: The maximum-margin hyperplane or the hard margin hyperplane is a hyperplane
that properly separates the data points of different categories without any misclassifications.
Soft Margin: When the data is not perfectly separable or contains outliers, SVM permits a
soft margin technique. The soft-margin SVM formulation introduces a slack variable for each data
point, which softens the strict margin requirement and permits certain misclassifications or
violations. It finds a compromise between increasing the margin and reducing violations.
C: The regularisation parameter C in SVM balances margin maximisation against misclassification
penalties. It determines the penalty for crossing the margin or misclassifying data points. A
greater value of C imposes a stricter penalty, which results in a smaller margin and perhaps
fewer misclassifications.
Hinge Loss: A typical loss function in SVMs is hinge loss. It punishes incorrect classifications or
margin violations. The objective function in SVM is frequently formed by combining it with the
regularisation term.
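A minimal sketch of hinge loss and of how it combines with the regularisation term into a soft-margin objective, 0.5 * ||w||^2 + C * sum(hinge). The weights, bias, and data points are hypothetical values chosen by hand, the helper names are assumptions, and labels are taken to be -1 or +1.

```python
# Minimal sketch of hinge loss: max(0, 1 - y * f(x)), with labels y in {-1, +1}.

def hinge_loss(y, score):
    """Zero when the point is on the correct side and outside the margin."""
    return max(0.0, 1.0 - y * score)


def soft_margin_objective(w, b, X, y, C):
    """Regularisation term plus C times the total hinge loss."""
    reg = 0.5 * sum(wi * wi for wi in w)
    total_hinge = 0.0
    for xi, yi in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, xi)) + b
        total_hinge += hinge_loss(yi, score)
    return reg + C * total_hinge


print(hinge_loss(1, 2.0))    # 0.0, correct and outside the margin
print(hinge_loss(1, 0.5))    # 0.5, correct but inside the margin
print(hinge_loss(1, -1.0))   # 2.0, misclassified

# Hypothetical parameters and data, for illustration only
w, b = [1.0, 0.0], 0.0
X = [[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]]
y = [1, 1, -1]

obj_small_c = soft_margin_objective(w, b, X, y, C=1.0)
obj_large_c = soft_margin_objective(w, b, X, y, C=10.0)
print(obj_small_c, obj_large_c)  # 1.0 5.5
```

With the same weights, a larger C makes the single margin violation cost ten times more, which is why a large C pushes the optimiser towards fewer violations at the price of a narrower margin.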
Dual Problem: SVM can be solved through the dual of the optimisation problem, which requires finding
the Lagrange multipliers associated with the support vectors. The dual formulation enables
the use of the kernel trick and more efficient computation.
[ ]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.svm import SVC, SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report, f1_score, precision_score, recall_score)
[ ]: # import dataset
iris=sns.load_dataset('iris')
iris.head()
[ ]: X=iris.drop('species', axis=1)
y=iris['species']
[ ]: y_predict
[ ]: # Evaluate the model
print(accuracy_score(y_test, y_predict))
print(confusion_matrix(y_test, y_predict))
[ ]: print(classification_report(y_test, y_predict))
[ ]: # import dataset
iris=sns.load_dataset('iris')
iris.head()
[ ]: y=iris['petal_width']