Lecture_material_11

March 16, 2024

1 Linear Regression
Linear regression is an algorithm that models a linear relationship between one or more independent variables and a dependent variable in order to predict future outcomes.
It is a statistical method widely used in data science and machine learning for predictive analysis.

Types of Linear Regression


Simple Linear Regression
This is the simplest form of linear regression, involving only one independent variable and one dependent variable. The equation for simple linear regression is Y = β0 + β1X, where Y is the dependent variable, X is the independent variable, β0 is the intercept, and β1 is the slope.
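For intuition, the slope and intercept can be computed directly from the least-squares formulas β1 = cov(X, Y) / var(X) and β0 = mean(Y) − β1 · mean(X). A minimal sketch in NumPy, using made-up toy data:

[ ]: import numpy as np

x = np.array([1, 2, 3, 4, 5])             # toy independent variable
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])  # toy dependent variable
b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)  # slope = cov(x, y) / var(x)
b0 = y.mean() - b1 * x.mean()                   # intercept = mean(y) - slope * mean(x)
print(b0, b1)  # close to 0 and 2 for this data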

Multiple Linear Regression
This involves more than one independent variable and one dependent variable; the equation extends to Y = β0 + β1X1 + β2X2 + … + βnXn.

[ ]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

[ ]: df=sns.load_dataset('tips')
df.head()

[ ]: # Separate the features and the target/label
X=df[["total_bill"]]  # double brackets keep X as a 2-D DataFrame

[ ]: # Use StandardScaler to scale the features
# scaler=StandardScaler()
# X=scaler.fit_transform(X)

# Or use MinMaxScaler to scale the features
# scaler=MinMaxScaler()
# X=scaler.fit_transform(X)

[ ]: y=df["tip"]
# Use StandardScaler to scale the target
# (y is a 1-D Series, so reshape it to 2-D before scaling)
# scaler=StandardScaler()
# y=scaler.fit_transform(y.values.reshape(-1, 1)).ravel()
# Or use MinMaxScaler to scale the target
# scaler=MinMaxScaler()
# y=scaler.fit_transform(y.values.reshape(-1, 1)).ravel()

[ ]: # Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

[ ]: # Build the model
model=LinearRegression()
# Train the model
model.fit(X_train, y_train)

[ ]: print(model.coef_)       # slope (β1)
print(model.intercept_)  # intercept (β0)

[ ]: # Predict on the test set
y_pred=model.predict(X_test)
y_pred

[ ]: model.predict([[15]])  # predicted tip for a total bill of 15

[ ]: from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

2 Evaluation Metrics for Linear Regression

3 Mean Square Error (MSE)


Mean Squared Error (MSE) is an evaluation metric that calculates the average of the squared differences between the actual and predicted values over all the data points. The difference is squared so that negative and positive differences don’t cancel each other out.

4 Mean Absolute Error (MAE)
Mean Absolute Error is an evaluation metric used to calculate the accuracy of a regression model.
MAE measures the average absolute difference between the predicted values and actual values.

Root Mean Squared Error (RMSE)
The square root of the average of the squared residuals is the Root Mean Squared Error. It describes how well the observed data points match the predicted values, i.e. the model’s absolute fit to the data.


Coefficient of Determination (R-squared)
R-squared is a statistic that indicates how much of the variation in the target the developed model can explain or capture. It typically lies in the range 0 to 1; in general, the better the model fits the data, the higher the R-squared value.
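For reference, all four metrics can be computed directly from their definitions. A minimal sketch with NumPy, where the toy arrays stand in for y_test and y_pred:

[ ]: import numpy as np

y_true = np.array([3.0, 2.5, 4.0, 1.5])  # toy actual values
y_hat = np.array([2.8, 2.9, 3.6, 1.4])   # toy predictions
mse = np.mean((y_true - y_hat) ** 2)               # Mean Squared Error
mae = np.mean(np.abs(y_true - y_hat))              # Mean Absolute Error
rmse = np.sqrt(mse)                                # Root Mean Squared Error
ss_res = np.sum((y_true - y_hat) ** 2)             # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
r2 = 1 - ss_res / ss_tot                           # Coefficient of Determination
print(mse, mae, rmse, r2)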

[ ]: print(mean_squared_error(y_test, y_pred))           # MSE
print(r2_score(y_test, y_pred))                      # R-squared
print(np.sqrt(mean_squared_error(y_test, y_pred)))   # RMSE

[ ]: # Save the trained model to disk
import pickle
with open('linearmodel.pk', 'wb') as f:  # wb = write binary
    pickle.dump(model, f)

[ ]: # Load the saved model
import pickle
with open('linearmodel.pk', 'rb') as f:  # rb = read binary
    model_load=pickle.load(f)

[ ]: model_load.predict([[10]])  # predicted tip for a total bill of 10

[ ]: plt.scatter(X_test, y_test)          # actual values
plt.plot(X_test, y_pred, color='r')  # fitted regression line
plt.xlabel('total_bill')
plt.ylabel('tip')

Note: Re-run the model using StandardScaler and MinMaxScaler on the X variable and compare the results.
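A minimal sketch of that re-run, assuming the same X, y, and train/test split as above; the scaler is fit on the training split only so no test information leaks in:

[ ]: scaler = StandardScaler()                  # or MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)  # fit on training data only
X_test_s = scaler.transform(X_test)        # reuse the training statistics
model_s = LinearRegression().fit(X_train_s, y_train)
print(r2_score(y_test, model_s.predict(X_test_s)))  # compare with the unscaled fit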

5 Multiple Linear Regression


[ ]: # Write code to use multiple features to predict the target
# Use the same dataset
# Use the same model

[ ]: df.head()

[ ]: X=df[["total_bill", "size"]]
y=df["tip"]

[ ]: # Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

[ ]: model_multi=LinearRegression()
model_multi.fit(X_train, y_train)

[ ]: model_multi.coef_  # one coefficient per feature (total_bill, size)

[ ]: print(model_multi.intercept_)

[ ]: model_multi.predict([[10, 2]])  # predicted tip for total_bill=10, size=2

[ ]: # Evaluate the multiple-feature model on its own predictions
y_pred_multi=model_multi.predict(X_test)
print(mean_squared_error(y_test, y_pred_multi))           # MSE
print(r2_score(y_test, y_pred_multi))                     # R-squared
print(np.sqrt(mean_squared_error(y_test, y_pred_multi)))  # RMSE

6 Logistic Regression
Logistic regression is a supervised machine learning algorithm used for classification tasks, where the goal is to predict the probability that an instance belongs to a given class. It is a statistical algorithm that analyzes the relationship between a set of input features and a categorical outcome.
For example, suppose we have two classes, Class 0 and Class 1. If the value of the logistic function for an input is greater than 0.5 (the threshold value), the input belongs to Class 1; otherwise it belongs to Class 0. It is referred to as regression because it is an extension of linear regression, but it is mainly used for classification problems.
Key Points:
Logistic regression predicts the output of a categorical dependent variable, so the outcome must be a categorical or discrete value: Yes or No, 0 or 1, True or False, etc. Instead of giving exact values of 0 and 1, it gives probabilistic values that lie between 0 and 1. In logistic regression, instead of fitting a straight regression line, we fit an “S”-shaped logistic function, which levels off at the two extreme values (0 and 1).
Logistic Function – Sigmoid Function
The sigmoid function is a mathematical function used to map predicted values to probabilities. It maps any real value to a value in the range 0 to 1. Since the output of logistic regression must lie between 0 and 1 and cannot go beyond this limit, the function forms an “S”-shaped curve, called the sigmoid or logistic function. In logistic regression we use the concept of a threshold value, which defines the boundary between the two classes: values above the threshold tend toward 1, and values below the threshold tend toward 0.
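A minimal sketch of the sigmoid function σ(z) = 1 / (1 + e^(−z)) and the 0.5 threshold rule, using made-up decision values:

[ ]: import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))  # maps any real value into (0, 1)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])  # toy raw model outputs
p = sigmoid(z)                   # predicted probabilities
labels = (p > 0.5).astype(int)   # apply the 0.5 threshold
print(p)
print(labels)                    # [0 0 0 1 1]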

7 Types of Logistic Regression


On the basis of the categories, Logistic Regression can be classified into three types:

Binomial: In binomial logistic regression, there can be only two possible types of the dependent variable, such as 0 or 1, Pass or Fail, etc.
Multinomial: In multinomial logistic regression, there can be 3 or more possible unordered types of the dependent variable, such as “cat”, “dog”, or “sheep”.
Ordinal: In ordinal logistic regression, there can be 3 or more possible ordered types of the dependent variable, such as “low”, “medium”, or “high”. A minimal multinomial sketch follows below.
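To illustrate the multinomial case before the binary example that follows, scikit-learn’s LogisticRegression handles a multiclass target out of the box; a minimal sketch on the iris dataset (three unordered species classes, chosen here only for illustration):

[ ]: import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = sns.load_dataset('iris')
X_i, y_i = iris.drop('species', axis=1), iris['species']  # three-class target
X_tr, X_te, y_tr, y_te = train_test_split(X_i, y_i, test_size=0.2, random_state=42)
clf = LogisticRegression(max_iter=1000)  # multiclass handling is automatic
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))             # multiclass accuracy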

[ ]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, f1_score, precision_score, recall_score

[ ]: titanic=sns.load_dataset('titanic')
titanic.head()

[ ]: titanic.drop("deck", axis=1, inplace=True)  # deck is mostly missing, so drop it

[ ]: # Impute missing values: mean for numeric columns, mode for categorical ones
titanic["age"]=titanic["age"].fillna(titanic["age"].mean())
titanic["fare"]=titanic["fare"].fillna(titanic["fare"].mean())
titanic["embarked"]=titanic["embarked"].fillna(titanic["embarked"].mode()[0])
titanic["embark_town"]=titanic["embark_town"].fillna(titanic["embark_town"].mode()[0])

[ ]: titanic.info()

[ ]: # Convert all the categorical and object columns to numerical using a for loop
for col in titanic.columns:
    if titanic[col].dtype=="object" or titanic[col].dtype=="category":
        le=LabelEncoder()
        titanic[col]=le.fit_transform(titanic[col])

[ ]: titanic.head()

[ ]: titanic.isnull().sum()

[ ]: X=titanic.drop('survived', axis=1)
y=titanic['survived']

[ ]: # Train test split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

[ ]: model=LogisticRegression(max_iter=1000)  # raise max_iter so the solver converges on this data

[ ]: # fit the model
model.fit(X_train, y_train)

[ ]: y_predict=model.predict(X_test)

[ ]: # Evaluate the model
print(accuracy_score(y_test, y_predict))
print(confusion_matrix(y_test, y_predict))
print(f1_score(y_test, y_predict))

[ ]: print(classification_report(y_test, y_predict))

[ ]: plt.figure(figsize=(5, 5))
sns.heatmap(confusion_matrix(y_test, y_predict), annot=True, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('Actual')

8 Evaluation Metrics for Logistic Regression

9 Accuracy Score:
Accuracy score is a metric used to evaluate the performance of a classification model. It measures the proportion of correctly classified instances (or samples) out of the total number of instances in the dataset.

10 Recall Score:
Recall score, also known as sensitivity or true positive rate, is a metric used to evaluate the performance of a classification model, particularly in scenarios where identifying positive instances is critical. Recall measures the proportion of actual positive instances (true positives) that the model correctly identifies out of the total number of positive instances in the dataset. A high recall score indicates that the model is effectively identifying most of the positive instances in the dataset.

11 Precision Score:
Precision score measures the proportion of correctly predicted positive instances (true positives) out of all instances that the model predicts as positive, regardless of whether they are actually positive or negative. A high precision score indicates that the model is effectively minimizing false positives and making accurate positive predictions.

The F1 score is the harmonic mean of precision and recall: F1 = 2 · (precision · recall) / (precision + recall). It provides a single score that takes both false positives and false negatives into account. In other words, the F1 score is a weighted average of precision and recall in which the harmonic mean is used instead of a simple arithmetic mean.

12 Confusion Matrix
A confusion matrix is a table that is often used to describe the performance of a classification model on a set of test data for which the true values are known. It allows visualization of the performance of an algorithm and provides insights into the types of errors made by the classifier. The confusion matrix is especially useful for evaluating a classification model in binary classification problems, where there are only two classes (e.g., positive and negative), but it can also be extended to multiclass problems. A confusion matrix typically consists of four entries:
True Positives (TP): The number of instances that are correctly predicted as belonging to the positive class.
False Positives (FP): The number of instances that are incorrectly predicted as belonging to the positive class when they actually belong to the negative class. Also known as Type I error.
False Negatives (FN): The number of instances that are incorrectly predicted as belonging to the negative class when they actually belong to the positive class. Also known as Type II error.
True Negatives (TN): The number of instances that are correctly predicted as belonging to the negative class.
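A minimal sketch tying these four entries back to the metrics above, computed by hand from made-up counts (with sklearn, the same entries come from confusion_matrix(y_test, y_predict).ravel() in a binary problem):

[ ]: # toy counts, purely illustrative
tn, fp, fn, tp = 50, 10, 5, 35
accuracy = (tp + tn) / (tp + tn + fp + fn)          # correct / total
precision = tp / (tp + fp)                          # of predicted positives, how many are right
recall = tp / (tp + fn)                             # of actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall
print(accuracy, precision, recall, f1)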

13 Support Vector Machines


SVM is used for classification (SVC) as well as regression (SVR); it is also used for outlier detection. Support Vector Machine (SVM) is a powerful machine learning algorithm used for linear or non-linear classification, regression, and even outlier detection tasks. SVMs can be used for a variety of tasks, such as text classification, image classification, spam detection, handwriting identification, gene expression analysis, face detection, and anomaly detection. SVMs are adaptable and efficient in a variety of applications because they can handle high-dimensional data and nonlinear relationships.
The main objective of the SVM algorithm is to find the optimal hyperplane in an N-dimensional space that separates the data points of different classes in the feature space. The hyperplane is chosen so that the margin between the closest points of the different classes is as large as possible.

Hyperplane: The hyperplane is the decision boundary used to separate the data points of different classes in the feature space. In the case of linear classification, it is a linear equation, i.e. wx + b = 0.
Support Vectors: Support vectors are the data points closest to the hyperplane; they play a critical role in deciding the hyperplane and the margin.
Margin: The margin is the distance between the support vectors and the hyperplane. The main objective of the support vector machine algorithm is to maximize this margin; a wider margin generally indicates better classification performance.
Kernel: The kernel is a mathematical function used in SVM to map the original input data points into a high-dimensional feature space, so that a separating hyperplane can be found even when the data points are not linearly separable in the original input space. Some common kernel functions are linear, polynomial, radial basis function (RBF), and sigmoid. A small sketch of these concepts follows below.
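A minimal sketch on a toy linearly separable dataset, inspecting the learned hyperplane and support vectors (make_blobs and the variable names are illustrative choices, not part of the lecture code):

[ ]: from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X_toy, y_toy = make_blobs(n_samples=60, centers=2, random_state=42)  # two separable clusters
svc_lin = SVC(kernel='linear', C=1.0)
svc_lin.fit(X_toy, y_toy)
print(svc_lin.coef_, svc_lin.intercept_)  # w and b of the hyperplane wx + b = 0
print(svc_lin.support_vectors_)           # the points that define the margin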
Hard Margin: The maximum-margin hyperplane or the hard margin hyperplane is a hyperplane
that properly separates the data points of different categories without any misclassifications.
Soft Margin: When the data is not perfectly separable or contains outliers, SVM permits a soft-margin technique. The soft-margin SVM formulation introduces a slack variable for each data point, which relaxes the strict margin requirement and permits certain misclassifications or violations. It finds a compromise between widening the margin and reducing violations.
C: The regularisation parameter C in SVM balances margin maximisation against misclassification penalties: it decides the penalty for violating the margin or misclassifying data points. A greater value of C imposes a stricter penalty, which results in a smaller margin and perhaps fewer misclassifications.
Hinge Loss: Hinge loss is a typical loss function in SVMs. It penalises incorrect classifications and margin violations, and it is frequently combined with the regularisation term to form the SVM objective function. A small numeric sketch follows below.
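A minimal sketch of the hinge loss max(0, 1 − y·f(x)) for labels y in {−1, +1}, using made-up decision values:

[ ]: import numpy as np

y_lbl = np.array([1, 1, -1, -1])       # true labels in {-1, +1}
f_x = np.array([2.0, 0.3, -1.5, 0.4])  # toy decision-function values w·x + b
loss = np.maximum(0, 1 - y_lbl * f_x)  # zero when correct with margin >= 1
print(loss)                            # [0.  0.7 0.  1.4]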
Dual Problem: SVM can be solved via the dual of its optimisation problem, which requires finding the Lagrange multipliers associated with the support vectors. The dual formulation enables the use of kernel tricks and more efficient computation.

[ ]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.svm import SVC, SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, f1_score, precision_score, recall_score

from sklearn.preprocessing import StandardScaler, LabelEncoder, MinMaxScaler

[ ]: # Import the dataset
iris=sns.load_dataset('iris')
iris.head()

[ ]: X=iris.drop('species', axis=1)
y=iris['species']

[ ]: # Train test split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # 80% training and 20% testing; random_state gives reproducible results

[ ]: # Call the model
model=SVC(kernel='rbf')  # rbf = radial basis function (a hyperparameter to tune)

[ ]: # Fit the model
model.fit(X_train, y_train)

[ ]: # Predict on the test set
y_predict=model.predict(X_test)

[ ]: y_predict

[ ]: # Evaluate the model
print(accuracy_score(y_test, y_predict))
print(confusion_matrix(y_test, y_predict))

[ ]: print(classification_report(y_test, y_predict))

[ ]: sns.heatmap(confusion_matrix(y_test, y_predict), annot=True, fmt='d')

14 Assignment: use the support vector regressor (SVR) for any kind of data
[ ]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.svm import SVC, SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler, LabelEncoder, MinMaxScaler

[ ]: # Import the dataset
iris=sns.load_dataset('iris')
iris.head()

[ ]: # Drop the species and petal_width columns to form the feature matrix
X=iris.drop(['species', 'petal_width'], axis=1)

[ ]: y=iris['petal_width']

[ ]: # Train test split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # 80% training and 20% testing; random_state gives reproducible results

[ ]: # Call the model
model=SVR(kernel='rbf')  # rbf = radial basis function (a hyperparameter to tune)

[ ]: # Fit the model
model.fit(X_train, y_train)

[ ]: # Predict on the test set
y_predict=model.predict(X_test)

[ ]: # Evaluate the model
print(mean_squared_error(y_test, y_predict))    # MSE
print(mean_absolute_error(y_test, y_predict))   # MAE (imported above)
print(r2_score(y_test, y_predict))              # R-squared
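Since SVR is sensitive to feature scale, a scaled variant is worth comparing; a minimal sketch using a pipeline (the pipeline is an added suggestion, not part of the original lecture code):

[ ]: from sklearn.pipeline import make_pipeline

svr_scaled = make_pipeline(StandardScaler(), SVR(kernel='rbf'))  # scale, then fit SVR
svr_scaled.fit(X_train, y_train)
print(r2_score(y_test, svr_scaled.predict(X_test)))  # compare with the unscaled model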
