Lecture Material 11

Lecture_material_11
March 16, 2024
1 Linear Regression
Linear regression is an algorithm that models a linear relationship between one or more independent
variables and a dependent variable, so that the dependent variable can be predicted for new inputs.
It is a statistical method used in data science and machine learning for predictive analysis.
Types of Linear Regression

Simple Linear Regression
This is the simplest form of linear regression, and it involves only one independent variable
and one dependent variable. The equation for simple linear regression is
$Y = \beta_0 + \beta_1 X$
where Y is the dependent variable, X is the independent variable, $\beta_0$ is the intercept, and $\beta_1$ is the slope.
Multiple Linear Regression
This involves more than one independent variable and one dependent variable: $Y = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n$.
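As a minimal sketch of how the intercept and slope of a simple linear regression can be estimated by ordinary least squares, the example below uses NumPy on hypothetical toy data (the arrays `x` and `y` are made up for illustration and are not part of the lecture dataset).

```python
import numpy as np

# Hypothetical toy data lying exactly on the line y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x

# Least-squares estimates for simple linear regression:
#   slope     = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   intercept = mean_y - slope * mean_x
x_mean, y_mean = x.mean(), y.mean()
beta_1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
beta_0 = y_mean - beta_1 * x_mean

print("intercept:", beta_0)
print("slope:", beta_1)
```

Because the toy data has no noise, the estimates recover the intercept 1 and slope 2 used to generate it. `LinearRegression` solves the same least-squares problem and exposes the results as `intercept_` and `coef_`.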
[ ]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
[ ]: df=sns.load_dataset('tips')
df.head()
[ ]: # Separate the features and the target/label

X=df[["total_bill"]]
[ ]: # Use StandardScaler to scale the features
# scaler=StandardScaler()
# X=scaler.fit_transform(X)
# Or use MinMaxScaler to scale the features

# scaler=MinMaxScaler()
# X=scaler.fit_transform(X)
[ ]: y=df["tip"]
# Use StandardScaler to scale the target (scalers expect a 2-D input, hence to_frame())
# scaler=StandardScaler()
# y=scaler.fit_transform(y.to_frame())
# Or use MinMaxScaler to scale the target
# scaler=MinMaxScaler()
# y=scaler.fit_transform(y.to_frame())
[ ]: # Splitting the data into training and testing data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
[ ]: # Build the model

model=LinearRegression()
# Train the model
model.fit(X_train, y_train)
[ ]: print(model.coef_)
print(model.intercept_)
[ ]: # Predict on the test set

y_pred=model.predict(X_test)
y_pred
[ ]: model.predict([[15]])
[ ]: from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
2 Evaluation Metrics for Linear Regression
3 Mean Square Error (MSE)

Mean Squared Error (MSE) is an evaluation metric that calculates the average of the squared
differences between the actual and predicted values for all the data points. The difference
is squared to ensure that negative and positive differences don't cancel each other out.
4 Mean Absolute Error (MAE)
Mean Absolute Error is an evaluation metric used to calculate the accuracy of a regression model.
MAE measures the average absolute difference between the predicted values and actual values.
Root Mean Squared Error (RMSE)
Root Mean Squared Error is the square root of the mean of the squared residuals, that is, the square
root of the MSE. It describes how closely the observed data points match the predicted values, or
the model's absolute fit to the data, in the same units as the target.

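Coefficient of Determination (R-squared)
R-squared is a statistic that indicates how much of the variation in the target the developed model
can explain or capture. On the training data of a linear regression with an intercept it lies between
0 and 1; on held-out data it can be negative when the model fits worse than always predicting the
mean. In general, the better the model matches the data, the greater the R-squared number.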
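The four metrics described above can be sketched directly with NumPy. The arrays `y_true` and `y_hat` below are hypothetical values chosen so the arithmetic is easy to follow; they are not results from the lecture's model.

```python
import numpy as np

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.0, 5.0, 4.0, 8.0])

residuals = y_true - y_hat                       # [1, 0, -2, -1]

mse = np.mean(residuals ** 2)                    # average squared difference
mae = np.mean(np.abs(residuals))                 # average absolute difference
rmse = np.sqrt(mse)                              # square root of the MSE

ss_res = np.sum(residuals ** 2)                  # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2 = 1.0 - ss_res / ss_tot                       # coefficient of determination

print("MSE :", mse)
print("MAE :", mae)
print("RMSE:", rmse)
print("R2  :", r2)
```

These are the same quantities that `mean_squared_error`, `mean_absolute_error` and `r2_score` from `sklearn.metrics` compute.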
[ ]: print(mean_squared_error(y_test, y_pred))
print(r2_score(y_test, y_pred))
print(np.sqrt(mean_squared_error(y_test, y_pred)))
[ ]: # save the model

import pickle
with open('linearmodel.pk', 'wb') as f: # wb=write binary
    pickle.dump(model, f)
[ ]: # load the saved model
import pickle
with open('linearmodel.pk', 'rb') as f: # rb=read binary
    model_load=pickle.load(f)
[ ]: model_load.predict([[10]])
[ ]: plt.scatter(X_test, y_test)
plt.plot(X_test, y_pred, color='r')
Note: Run the model again using StandardScaler and MinMaxScaler for the X variable.
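One way to try the exercise in the note is to chain the scaler and the model with `make_pipeline`, so the scaler is fitted together with the model. The sketch below uses synthetic data (`X_demo`, `y_demo` are assumptions made up for illustration, not the tips dataset) and compares predictions with and without scaling.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic single-feature data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(5, 50, size=(100, 1))
y_demo = 1.0 + 0.1 * X_demo[:, 0] + rng.normal(0, 0.5, size=100)

# The same model without scaling, with StandardScaler, and with MinMaxScaler
plain = LinearRegression().fit(X_demo, y_demo)
standard = make_pipeline(StandardScaler(), LinearRegression()).fit(X_demo, y_demo)
minmax = make_pipeline(MinMaxScaler(), LinearRegression()).fit(X_demo, y_demo)

X_new = np.array([[15.0], [30.0]])
print(plain.predict(X_new))
print(standard.predict(X_new))
print(minmax.predict(X_new))
```

For ordinary least squares, rescaling the features changes the fitted coefficients but not the predictions, so the three printed arrays should agree up to floating-point error. Scaling matters more for models that are sensitive to feature magnitudes, such as support vector machines.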
5 Multiple linear regression

[ ]: # Write code to use multiple features to predict the target
# Use the same dataset
# Use the same model
[ ]: df.head()
[ ]: X=df[["total_bill", "size"]]
y=df["tip"]
[ ]: # split the data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
[ ]: model_multi=LinearRegression()
model_multi.fit(X_train, y_train)
[ ]: model_multi.coef_
[ ]: print(model_multi.intercept_)
[ ]: model_multi.predict([[10, 2]])
[ ]: # Predict with the multiple-feature model before evaluating it
y_pred=model_multi.predict(X_test)
print(mean_squared_error(y_test, y_pred))
print(r2_score(y_test, y_pred))
print(np.sqrt(mean_squared_error(y_test, y_pred)))
6 Logistic Regression
Logistic regression is a supervised machine learning algorithm used for classification tasks where
the goal is to predict the probability that an instance belongs to a given class. It is a
statistical algorithm that analyzes the relationship between the input features and a categorical outcome.
For example, with two classes, Class 0 and Class 1, if the value of the logistic function for an input
is greater than 0.5 (the threshold value) then the input is assigned to Class 1; otherwise it is
assigned to Class 0. It is referred to as regression because it is an extension of linear regression,
but it is mainly used for classification problems.
Key Points:
Logistic regression predicts the output of a categorical dependent variable. Therefore, the outcome
must be a categorical or discrete value. It can be either Yes or No, 0 or 1, True or False, etc., but
instead of giving the exact value as 0 or 1, it gives probabilistic values which lie between 0 and
1. In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic function,
whose output is bounded by the two extreme values 0 and 1.
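To illustrate the point that logistic regression outputs probabilities rather than hard 0/1 values, here is a small sketch on synthetic data (`X_demo` and `y_demo` are hypothetical). It uses scikit-learn's `predict_proba` and applies the 0.5 threshold by hand.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

clf = LogisticRegression().fit(X_demo, y_demo)

# One row per sample; columns are P(class 0) and P(class 1)
proba = clf.predict_proba(X_demo)

# Applying the 0.5 threshold to P(class 1) gives the class labels
labels_from_threshold = (proba[:, 1] > 0.5).astype(int)

print(proba[:5])
print(labels_from_threshold[:5])
print(clf.predict(X_demo)[:5])
```

For a binary problem, thresholding the class-1 probability at 0.5 should give the same labels as `predict`. A different threshold can be used when false positives and false negatives have different costs.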
Logistic Function – Sigmoid Function
The sigmoid function is a mathematical function used to map the predicted values to probabilities.
It maps any real value into a value between 0 and 1. Because the output of logistic regression
cannot go beyond these limits, it forms an "S"-shaped curve. This S-shaped curve is called the
sigmoid function or the logistic function. In logistic regression, we use a threshold value to turn
the probability into a class label: values above the threshold are assigned to class 1, and values
below the threshold are assigned to class 0.
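The sigmoid function has the closed form $\sigma(z) = 1 / (1 + e^{-z})$. A minimal NumPy sketch follows; the input values in `z` are hypothetical, and the strict comparison with 0.5 is one possible thresholding convention.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear-model outputs (any real values are allowed)
z = np.array([-4.0, 0.0, 4.0])

probs = sigmoid(z)                    # probabilities between 0 and 1
classes = (probs > 0.5).astype(int)   # threshold at 0.5

print(probs)
print(classes)
```

Large negative inputs map close to 0, large positive inputs map close to 1, and an input of 0 maps to exactly 0.5, which is why 0.5 is the natural default threshold.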
7 Types of Logistic Regression

On the basis of the categories, Logistic Regression can be classified into three types:
Binomial: In binomial logistic regression, the dependent variable can take only two possible
values, such as 0 or 1, Pass or Fail, etc.
Multinomial: In multinomial logistic regression, the dependent variable can take 3 or more possible
unordered values, such as "cat", "dog", or "sheep".
Ordinal: In ordinal logistic regression, the dependent variable can take 3 or more possible ordered
values, such as "Low", "Medium", or "High".
[ ]: import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, f1_score, precision_score, recall_score
[ ]: titanic=sns.load_dataset('titanic')
titanic.head()
[ ]: titanic.drop("deck", axis=1, inplace=True)
[ ]: # impute the missing values in age, fare, embarked and embark_town
# (assign the result back instead of using inplace=True on a single column)
titanic["age"]=titanic["age"].fillna(titanic["age"].mean())
titanic["fare"]=titanic["fare"].fillna(titanic["fare"].mean())
titanic["embarked"]=titanic["embarked"].fillna(titanic["embarked"].mode()[0])
titanic["embark_town"]=titanic["embark_town"].fillna(titanic["embark_town"].mode()[0])
[ ]: titanic.info()
[ ]: # convert all the categorical and object columns to numerical using for loop
for col in titanic.columns:
    if titanic[col].dtype=="object" or titanic[col].dtype=="category":
        le=LabelEncoder()
        titanic[col]=le.fit_transform(titanic[col])
[ ]: titanic.head()
[ ]: titanic.isnull().sum()
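[ ]: # 'alive' is a yes/no copy of 'survived', so drop it as well to avoid leaking the target
X=titanic.drop(['survived', 'alive'], axis=1)
y=titanic['survived']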
[ ]: # train test split the data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
[ ]: model=LogisticRegression()
[ ]: # fit the model
model.fit(X_train, y_train)
[ ]: y_predict=model.predict(X_test)
[ ]: # Evaluate the model

print(accuracy_score(y_test, y_predict))
print(confusion_matrix(y_test, y_predict))
print(f1_score(y_test, y_predict))
[ ]: print(classification_report(y_test, y_predict))
[ ]: plt.figure(figsize=(5, 5))
sns.heatmap(confusion_matrix(y_test, y_predict), annot=True, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('Actual')
8 Evaluation Metrics for Logistic Regression
9 Accuracy score:
Accuracy score is a metric used to evaluate the performance of a classification model. It measures
the proportion of correctly classified instances (or samples) out of the total number of instances in
the dataset.
10 Recall Score:
Recall score, also known as sensitivity or true positive rate, is a metric used to evaluate the
performance of a classification model, particularly in scenarios where the identification of positive
instances is critical. Recall score measures the proportion of actual positive instances (true positives)
that are correctly identified by the model out of the total number of positive instances in the dataset.
A high recall score indicates that the model is effectively identifying most of the positive instances
in the dataset.
11 Precision score:
Precision score measures the proportion of correctly predicted positive instances (true positives) out
of all instances that are predicted as positive by the model, regardless of whether they are actually
positive or negative. A high precision score indicates that the model is effectively minimizing false
positives and making accurate positive predictions.
The F1 score is the harmonic mean of precision and recall. It provides a single score that takes
into account both false positives and false negatives. In other words, the F1 score averages
precision and recall with equal weight, using the harmonic mean instead of a simple arithmetic
mean, so it is high only when both precision and recall are high.
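The accuracy, precision, recall and F1 definitions above can be sketched from the four outcome counts of a binary classifier. The counts below (`tp`, `fp`, `fn`, `tn` for true positives, false positives, false negatives and true negatives) are hypothetical numbers chosen for easy arithmetic.

```python
# Hypothetical outcome counts for a binary classifier
tp, fp, fn, tn = 40, 10, 20, 30

accuracy = (tp + tn) / (tp + fp + fn + tn)            # correct / all      = 70/100
precision = tp / (tp + fp)                            # TP / predicted pos = 40/50
recall = tp / (tp + fn)                               # TP / actual pos    = 40/60
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean of the two

print("accuracy :", accuracy)
print("precision:", precision)
print("recall   :", recall)
print("f1       :", f1)
```

Here the F1 score (about 0.727) lies between recall and precision but closer to the lower of the two, which is the effect of using the harmonic mean.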
12 Confusion Matrix
A confusion matrix is a table that is often used to describe the performance of a classification
model on a set of test data for which the true values are known. It allows visualization of the
performance of an algorithm and provides insights into the types of errors made by the classifier.
The confusion matrix is especially useful for evaluating the performance of a classification model
in binary classification problems, where there are only two classes (e.g., positive and negative).
However, it can also be extended to multiclass classification problems. For binary classification, a
confusion matrix consists of four entries:
True Positives (TP): The number of instances that are correctly predicted as belonging to the
positive class.
False Positives (FP): The number of instances that are incorrectly predicted as belonging to the
positive class when they actually belong to the negative class. Also known as Type I error.
False Negatives (FN): The number of instances that are incorrectly predicted as belonging to the
negative class when they actually belong to the positive class. Also known as Type II error.
True Negatives (TN): The number of instances that are correctly predicted as belonging to the
negative class.
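The four entries can be counted by hand from actual and predicted labels, as in the sketch below. The label arrays are hypothetical, and the 2x2 layout (rows are actual classes, columns are predicted classes, class 0 first) is chosen to match the layout returned by scikit-learn's `confusion_matrix`.

```python
import numpy as np

# Hypothetical actual and predicted labels (1 = positive, 0 = negative)
y_actual = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_predicted = np.array([1, 0, 0, 1, 1, 0, 1, 0])

tp = int(np.sum((y_actual == 1) & (y_predicted == 1)))  # correctly predicted positive
fp = int(np.sum((y_actual == 0) & (y_predicted == 1)))  # Type I error
fn = int(np.sum((y_actual == 1) & (y_predicted == 0)))  # Type II error
tn = int(np.sum((y_actual == 0) & (y_predicted == 0)))  # correctly predicted negative

# Rows: actual class (0, 1); columns: predicted class (0, 1)
matrix = np.array([[tn, fp],
                   [fn, tp]])
print(matrix)
```

Reading the heatmap of a binary confusion matrix in this layout, the diagonal holds the correct predictions (TN and TP) and the off-diagonal cells hold the two kinds of error.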
13 Support vector machines

SVM is used for classification (SVC) as well as regression (SVR), and also for outlier detection.
Support Vector Machine (SVM) is a powerful machine learning algorithm that handles both linear and
nonlinear problems. SVMs can be used for a variety of tasks, such as text classification, image
classification, spam detection, handwriting identification, gene expression analysis, face detection,
and anomaly detection. SVMs are adaptable and efficient in a variety of applications because they
can manage high-dimensional data and nonlinear relationships.
The main objective of the SVM algorithm is to find the optimal hyperplane in an N-dimensional
space that can separate the data points in different classes in the feature space. The hyperplane
is chosen so that the margin between the closest points of different classes is as large as
possible.
Hyperplane: The hyperplane is the decision boundary that is used to separate the data points of
different classes in a feature space. In the case of linear classification, it is a linear equation,
i.e. w·x + b = 0.
Support Vectors: Support vectors are the closest data points to the hyperplane, which play a
critical role in deciding the hyperplane and margin.
Margin: The margin is the distance between the support vectors and the hyperplane. The main
objective of the support vector machine algorithm is to maximize the margin. A wider margin
generally indicates better classification performance on unseen data.
Kernel: The kernel is the mathematical function used in SVM to map the original input data
points into high-dimensional feature spaces, so that the hyperplane can be found easily even if
the data points are not linearly separable in the original input space. Some of the common kernel
functions are linear, polynomial, radial basis function (RBF), and sigmoid.
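The hyperplane, support vectors and margin defined above can be inspected on a fitted linear SVM. The sketch below uses a tiny hypothetical dataset (`X_demo`, `y_demo`) and a very large `C`, which is an assumption made here to approximate a hard margin on separable data.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable dataset (hypothetical, for illustration only)
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                   [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

# A very large C leaves almost no room for margin violations
svm = SVC(kernel="linear", C=1e6).fit(X_demo, y_demo)

w = svm.coef_[0]        # normal vector of the hyperplane w·x + b = 0
b = svm.intercept_[0]   # offset b

scores = X_demo @ w + b              # signed score of each point
margin = 1.0 / np.linalg.norm(w)     # distance from hyperplane to the support vectors

print("w =", w, "b =", b)
print("support vectors:\n", svm.support_vectors_)
print("margin:", margin)
```

Points with a positive score fall on the class 1 side of the hyperplane and points with a negative score on the class 0 side. The `coef_` attribute is only available for the linear kernel.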
Hard Margin: The maximum-margin hyperplane or the hard margin hyperplane is a hyperplane
that properly separates the data points of different categories without any misclassifications.
Soft Margin: When the data is not perfectly separable or contains outliers, SVM permits a
soft margin technique. The soft-margin SVM formulation introduces a slack variable for each data
point, which softens the strict margin requirement and permits certain misclassifications or
violations. It finds a compromise between increasing the margin and reducing violations.
C: The regularisation parameter C in SVM balances margin maximisation against misclassification
penalties. It decides the penalty for violating the margin or misclassifying data items. A
greater value of C imposes a stricter penalty, which results in a smaller margin and perhaps
fewer misclassifications.
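The effect of C on the margin can be explored with a rough sketch like the one below. The data is synthetic (two overlapping blobs generated with a fixed seed), and the three values of `C` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping Gaussian blobs (hypothetical data, for illustration only)
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                    rng.normal(2.0, 1.0, size=(50, 2))])
y_demo = np.array([0] * 50 + [1] * 50)

results = {}
for C in (0.01, 1.0, 100.0):
    svm = SVC(kernel="linear", C=C).fit(X_demo, y_demo)
    margin = 1.0 / np.linalg.norm(svm.coef_[0])      # distance from hyperplane to margin
    n_support = int(svm.n_support_.sum())            # total number of support vectors
    results[C] = (n_support, margin)
    print(f"C={C}: support vectors={n_support}, margin={margin:.3f}")
```

With a small C the optimiser tolerates many violations and keeps the margin wide; with a large C it narrows the margin to reduce violations. In practice C is usually chosen by cross-validation.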
Hinge Loss: A typical loss function in SVMs is hinge loss. It punishes incorrect classifications or
margin violations. The objective function in SVM is frequently formed by combining it with the
regularisation term.
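As a small sketch of the hinge loss, the example below evaluates max(0, 1 - y·f(x)) for a few hypothetical scores. It assumes the labels are encoded as -1 and +1, which is the usual convention for this loss.

```python
import numpy as np

# Labels encoded as -1 / +1 and hypothetical scores f(x) = w·x + b
y_signed = np.array([1, 1, -1, -1])
scores = np.array([2.0, 0.5, -0.3, 1.0])

# Zero loss when a point is on the correct side with margin >= 1,
# otherwise the loss grows linearly with the size of the violation
hinge = np.maximum(0.0, 1.0 - y_signed * scores)

print(hinge)          # per-sample losses
print(hinge.mean())   # average hinge loss
```

The first point is correctly classified outside the margin and costs nothing. The second and third are on the correct side but inside the margin, and the fourth is misclassified and receives the largest loss.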
Dual Problem: SVM can be solved through the dual of its optimisation problem, which requires
finding the Lagrange multipliers associated with the support vectors. The dual formulation enables
the use of kernel tricks and more efficient computation.
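The kernel trick can be illustrated with a degree-2 polynomial kernel on 2-D inputs. The function names `phi` and `poly2_kernel` and the two vectors are hypothetical, introduced only for this sketch. It shows that the kernel value equals an inner product in a higher-dimensional feature space without ever building that space explicitly.

```python
import numpy as np

def phi(v):
    """Explicit feature map of the homogeneous degree-2 polynomial kernel in 2-D."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

def poly2_kernel(u, v):
    """Kernel computed directly in the original 2-D input space."""
    return np.dot(u, v) ** 2

u = np.array([1.0, 2.0])
v = np.array([3.0, 4.0])

explicit = np.dot(phi(u), phi(v))   # inner product in the 3-D feature space
implicit = poly2_kernel(u, v)       # same value without mapping to 3-D

print(explicit, implicit)
```

Both computations give 121 for these vectors. Because the dual problem only needs inner products between data points, replacing them with a kernel function lets the SVM work in the mapped space at the cost of a computation in the original space.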
[ ]: import numpy as np
from sklearn.svm import SVC, SVR
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, f1_score, precision_score, recall_score
[ ]: # import dataset
iris=sns.load_dataset('iris')
iris.head()
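[ ]: X=iris.drop('species', axis=1)
y=iris['species']
[ ]: # train test split the data
# 80% training and 20% testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)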
[ ]: # call the model

model=SVC(kernel='rbf') # rbf=radial basis function; the kernel is a hyperparameter that can be tuned
[ ]: # fit the model
model.fit(X_train, y_train)
[ ]: # predict the model

y_predict=model.predict(X_test)
[ ]: y_predict
[ ]: print(accuracy_score(y_test, y_predict))
print(confusion_matrix(y_test, y_predict))
[ ]: print(classification_report(y_test, y_predict))
[ ]: sns.heatmap(confusion_matrix(y_test, y_predict), annot=True, fmt='d')
14 Assignment: use the support vector regressor (SVR) for any kind of data

[ ]: import numpy as np
from sklearn.svm import SVC, SVR
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
[ ]: # import dataset
iris=sns.load_dataset('iris')
iris.head()
[ ]: # Drop the species and petal_width columns

X=iris.drop(['species', 'petal_width'], axis=1)
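[ ]: y=iris['petal_width']
[ ]: # train test split the data
# 80% training and 20% testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)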
[ ]: # call the model

model=SVR(kernel='rbf') # rbf=radial basis function; the kernel is a hyperparameter that can be tuned
[ ]: # fit the model
model.fit(X_train, y_train)
[ ]: # predict the model

y_predict=model.predict(X_test)

[ ]: # Evaluate the model
print(mean_squared_error(y_test, y_predict))
print(r2_score(y_test, y_predict))