Email Classification: Roll No-41463 (LP-3)


Roll No- 41463 (LP-3)

Email Classification

Classify emails using a binary classification method. Email spam detection has two
states: a) Normal state: Not Spam, b) Abnormal state: Spam. Use K-Nearest Neighbors and
Support Vector Machine for classification, and analyze their performance.

Dataset used: https://www.kaggle.com/datasets/balaka18/email-spam-classification-dataset-csv



In [1]: import numpy as np
import pandas as pd

from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.metrics import accuracy_score

In [2]: df = pd.read_csv("emails.csv")
df.head()

Out[2]:
   Email No.  the  to  ect  and  for  of    a  you  hou  ...  connevey  jay  valued  lay  ...
0  Email 1      0   0    1    0    0   0    2    0    0  ...         0    0       0    0  ...
1  Email 2      8  13   24    6    6   2  102    1   27  ...         0    0       0    0  ...
2  Email 3      0   0    1    0    0   0    8    0    0  ...         0    0       0    0  ...
3  Email 4      0   5   22    0    5   1   51    2   10  ...         0    0       0    0  ...
4  Email 5      7   6   17    1    5   2   57    0    9  ...         0    0       0    0  ...

5 rows × 3002 columns


In [3]: df.tail()

Out[3]:
      Email No.   the  to  ect  and  for  of    a  you  hou  ...  connevey  jay  valued  lay  ...
5167  Email 5168    2   2    2    3    0   0   32    0    0  ...         0    0       0    0  ...
5168  Email 5169   35  27   11    2    6   5  151    4    3  ...         0    0       0    0  ...
5169  Email 5170    0   0    1    1    0   0   11    0    0  ...         0    0       0    0  ...
5170  Email 5171    2   7    1    0    2   1   28    2    0  ...         0    0       0    0  ...
5171  Email 5172   22  24    5    1    6   5  148    8    2  ...         0    0       0    0  ...

5 rows × 3002 columns

In [4]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5172 entries, 0 to 5171
Columns: 3002 entries, Email No. to Prediction
dtypes: int64(3001), object(1)
memory usage: 118.5+ MB
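The summary above reports a single object column among the 3002. As a minimal sketch on a made-up stand-in frame (the name `toy` and its values are hypothetical, not taken from the dataset), the non-numeric columns can be listed with `select_dtypes` so they can be excluded from the features:

```python
import pandas as pd

# Hypothetical stand-in for df: one text column plus integer word counts.
toy = pd.DataFrame({
    "Email No.": ["Email 1", "Email 2", "Email 3"],
    "the": [0, 8, 0],
    "to": [0, 13, 0],
    "Prediction": [0, 0, 1],
})

# Columns whose dtype is not numeric cannot be fed to KNN or SVM directly.
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

On the real frame the same call would be expected to return only the identifier column, which is consistent with skipping column 0 when the features are selected.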

In [5]: df.describe()

Out[5]:
               the           to          ect          and          for           of           a  ...
count  5172.000000  5172.000000  5172.000000  5172.000000  5172.000000  5172.000000  5172.00000  ...
mean      6.640565     6.188128     5.143852     3.075599     3.124710     2.627030    55.51740  ...
std      11.745009     9.534576    14.101142     6.045970     4.680522     6.229845    87.57417  ...
min       0.000000     0.000000     1.000000     0.000000     0.000000     0.000000     0.00000  ...
25%       0.000000     1.000000     1.000000     0.000000     1.000000     0.000000    12.00000  ...
50%       3.000000     3.000000     1.000000     1.000000     2.000000     1.000000    28.00000  ...
75%       8.000000     7.000000     4.000000     3.000000     4.000000     2.000000    62.25000  ...
max     210.000000   132.000000   344.000000    89.000000    47.000000    77.000000  1898.00000  ...

8 rows × 3001 columns
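`describe()` summarizes the word-count columns; for a binary task the share of each label is also worth checking, because accuracy is harder to interpret when one class dominates. A minimal sketch on made-up labels follows. The frame `toy` is hypothetical, and the column name `Prediction` is taken from the `df.info()` output, assumed here to use 1 for spam.

```python
import pandas as pd

# Hypothetical labels; the real ones live in df["Prediction"].
toy = pd.DataFrame({"Prediction": [0, 0, 0, 1, 0, 1, 0, 0]})

counts = toy["Prediction"].value_counts()                 # absolute count per class
shares = toy["Prediction"].value_counts(normalize=True)   # fraction per class

print(counts.to_dict())
print(shares.to_dict())
```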


In [6]: df.isnull().sum()

Out[6]:
Email No.         0
the               0
to                0
ect               0
and               0
for               0
of                0
a                 0
you               0
hou               0
in                0
on                0
is                0
this              0
enron             0
i                 0
be                0
that              0
will              0
have              0
with              0
your              0
at                0
we                0
s                 0
are               0
it                0
by                0
com               0
as                0
                 ..
decisions         0
produced          0
ended             0
greatest          0
degree            0
solmonson         0
imbalances        0
fall              0
fear              0
hate              0
fight             0
reallocated       0
debt              0
reform            0
australia         0
plain             0
prompt            0
remains           0
ifhsc             0
enhancements      0
connevey          0
jay               0
valued            0
lay               0
infrastructure    0
military          0
allowing          0
ff                0
dry               0
Prediction        0
Length: 3002, dtype: int64
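With about 3000 columns, the per-column listing is truncated in the middle, so it cannot confirm that every column is complete. A minimal sketch, on a made-up frame (`toy` and its values are hypothetical), of reducing the same `isnull().sum()` result to a single total and a list of affected columns:

```python
import pandas as pd

# Hypothetical stand-in frame with one deliberately missing value.
toy = pd.DataFrame({
    "Email No.": ["Email 1", "Email 2", "Email 3"],
    "the": [0, 8, 0],
    "to": [0, 13, None],
    "Prediction": [0, 0, 1],
})

missing_per_column = toy.isnull().sum()            # same call as in the cell above
total_missing = int(missing_per_column.sum())      # one number for the whole frame
columns_with_gaps = missing_per_column[missing_per_column > 0].index.tolist()

print(total_missing)
print(columns_with_gaps)
```

A total of 0 on the real frame would confirm what the visible rows suggest.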

Splitting the dataset into training and test sets

In [7]: x = df.iloc[:,1:3001]
y = df.iloc[:,-1].values
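The positional slice `1:3001` keeps the 3000 word-count columns and skips `Email No.` (column 0) and `Prediction` (column 3001). As a sketch of an alternative that does not depend on column positions, the same selection can be written by name. The frame `toy` below is a hypothetical stand-in.

```python
import pandas as pd

# Hypothetical stand-in with the same first and last columns as the dataset.
toy = pd.DataFrame({
    "Email No.": ["Email 1", "Email 2"],
    "the": [0, 8],
    "to": [0, 13],
    "Prediction": [0, 1],
})

features = toy.drop(columns=["Email No.", "Prediction"])  # word counts only
labels = toy["Prediction"].to_numpy()                     # target labels

# Equivalent to the positional slice that skips the first and last column.
same_as_slice = features.equals(toy.iloc[:, 1:-1])
print(list(features.columns), labels.tolist(), same_as_slice)
```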

In [8]: x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2,
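The remaining arguments of the call above are not visible in the source. As a minimal sketch of an 80/20 split on synthetic data, the example below assumes `random_state=0` for reproducibility and `stratify` to keep the class ratio equal in both parts. Both arguments are assumptions, not necessarily what the notebook used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 5 count-like features, 30% positive labels.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 10, size=(100, 5))
y_demo = np.array([0] * 70 + [1] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,      # same fraction as in the notebook
    random_state=0,     # assumed value, fixes the shuffle
    stratify=y_demo,    # assumed, preserves the 70/30 class ratio
)

print(X_tr.shape, X_te.shape)
print(int(y_te.sum()), "positives in the test split")
```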

a) Using K-Nearest Neighbours


In [9]: knn = KNeighborsClassifier(n_neighbors=8)
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)
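The notebook fixes `n_neighbors=8` without showing how that value was chosen. One common way to pick it is cross-validation over a few candidates. This is a sketch on synthetic data, with a hypothetical candidate list, and not the procedure used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary problem standing in for the email features.
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

candidates = [3, 5, 8, 11]  # hypothetical values of k to compare
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": candidates},
    cv=5,                 # 5-fold cross-validation
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

best_k = search.best_params_["n_neighbors"]
print(best_k, search.best_score_)
```

On the real data the search would be run on `x_train`, `y_train` only, so the test split stays untouched until the final evaluation.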


Analyzing performance

In [10]: print("MSE: ", mean_squared_error(y_test, y_pred))
print("MAE: ", mean_absolute_error(y_test, y_pred))
print("RMSE: ", np.sqrt(mean_squared_error(y_test, y_pred)))
print("R2 Score: ", metrics.r2_score(y_test, y_pred))
print("Accuracy Score for KNN: ", accuracy_score(y_test, y_pred))

MSE: 0.12560386473429952
MAE: 0.12560386473429952
RMSE: 0.3544063553807966
R2 Score: 0.40780091899790494
Accuracy Score for KNN: 0.8743961352657005
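For 0/1 labels every wrong prediction contributes an error of exactly 1, so MSE and MAE both equal the misclassification rate, 1 minus accuracy. The figures above agree: 0.1256 + 0.8744 = 1. Regression metrics therefore add little here, and a confusion matrix with precision and recall says more about which kind of mistake is made. A minimal sketch on made-up labels (`y_true` and `y_hat` are hypothetical):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             mean_squared_error, precision_score, recall_score)

# Hypothetical ground truth and predictions, 1 = spam.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_hat = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

cm = confusion_matrix(y_true, y_hat)        # rows: true class, columns: predicted class
acc = accuracy_score(y_true, y_hat)
mse = mean_squared_error(y_true, y_hat)
precision = precision_score(y_true, y_hat)  # share of predicted spam that is spam
recall = recall_score(y_true, y_hat)        # share of actual spam that was caught

print(cm)
print(acc, mse, precision, recall)
print(np.isclose(mse, 1 - acc))             # MSE equals the error rate for 0/1 labels
```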

b) Using Support Vector Machine(SVM)


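In [11]: svc = SVC(C=1.0, gamma='auto', kernel='rbf')
# Fit on the training split only; the test split is kept for evaluation.
svc.fit(x_train, y_train)
y_pred = svc.predict(x_test)

Analyzing Performance

In [12]: print("MSE: ", mean_squared_error(y_test, y_pred))
print("MAE: ", mean_absolute_error(y_test, y_pred))
print("RMSE: ", np.sqrt(mean_squared_error(y_test, y_pred)))
print("R2 Score: ", metrics.r2_score(y_test, y_pred))
print("Accuracy Score for SVM: ", accuracy_score(y_test, y_pred))

Output as recorded in the source notebook, where the model was fitted with svc.fit(x_test, y_test). These figures therefore describe predictions on data the model had already seen and are not directly comparable with the KNN scores:

MSE: 0.07149758454106281
MAE: 0.07149758454106281
RMSE: 0.2673903224521464
R2 Score: 0.6629020615834228
Accuracy Score for SVM: 0.9285024154589372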
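An RBF kernel depends on distances between samples, so features with large ranges can dominate it. In this dataset the word counts vary widely: the `describe()` output shows one column peaking at 1898. One common remedy is to standardize the features inside a pipeline that is fitted on the training split only. The sketch below uses synthetic data, and the scaler is an addition that the notebook does not use.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary problem standing in for the email features.
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# The scaler learns mean and variance from the training split only,
# then the same transform is applied to the test split at predict time.
model = make_pipeline(StandardScaler(), SVC(C=1.0, gamma="auto", kernel="rbf"))
model.fit(X_tr, y_tr)

test_accuracy = model.score(X_te, y_te)   # accuracy on held-out data
print(test_accuracy)
```

Because the pipeline never sees `X_te` during fitting, the reported accuracy is a held-out estimate and can be compared fairly with a KNN model evaluated on the same split.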

