SMS Spam Filtering
AL2018
Outline
Introduction
Context
Data Exploration
Feature Extraction
The Bag-of-Words Representation
The Term Frequency-Inverse Document Frequency Weighting
Comparison of Several Classifiers
Considered Classifiers
Experimental Settings
Results
On Sensitivity and Specificity
Increasing the Sensitivity
Results
Online Learning Strategies
Motivations and Strategies
Experiments
Conclusion
Introduction
Context
Introduction
Data Exploration
[Figure: bar chart of the number of occurrences of the most represented words in the dataset: to, you, I, a, the, and, in, is, i, u, for, my, of, your, me, on, have, 2, that, it, are, call, or, be, at, with, not, will, get, can]
- Dataset composed of 4825 ham (86 %) and 747 spam (14 %)
- The most represented words in the dataset are stop-words: the most commonly used English words
- These uninformative words have to be removed
Introduction
Data Exploration – Ham
Introduction
Data Exploration – Spam
Feature Extraction
The Bag-of-Words Representation
Motivation
A document considered as a sequence of symbols cannot be used as input of a machine learning algorithm → a numerical representation is needed
The Bag-of-Words representation
- For a document composed of D words over a vocabulary of M words:

    x = Σ_{i=1}^{D} x_{d_i}

  where x_{d_i} ∈ {0, 1}^M is the indicator (one-hot) vector of the i-th word d_i
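The sum-of-one-hot-vectors formula above amounts to counting each vocabulary word in the document. A minimal sketch in Python (the vocabulary and message are illustrative, not taken from the dataset):

```python
from collections import Counter

def bow_vector(doc, vocab):
    """Bag-of-words: the sum of one-hot vectors over a fixed vocabulary,
    i.e. the count of each vocabulary word in the document."""
    counts = Counter(doc.lower().split())
    return [counts[w] for w in vocab]

vocab = ["free", "call", "now", "hello", "meeting"]
print(bow_vector("Call now for a FREE prize call", vocab))  # [1, 2, 1, 0, 0]
```

Words outside the vocabulary (here "for", "a", "prize") are simply dropped, which is also how stop-word removal takes effect.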
Feature Extraction
The Term Frequency-Inverse Document Frequency Weighting
Motivation
- Preferable to work with frequencies rather than raw occurrence counts
- Need for normalized features
Tf-idf weighting
- Term frequency of document n of size D_n in the collection:

    tf_n = 1 / D_n

- Inverse document frequency of word d_i in the collection of N documents:

    idf_{d_i} = log(N / df_{d_i}),   df_{d_i} = # documents containing d_i

- Tf-idf weighting of word d_i in document n:

    T(n, d_i) = tf_n × idf_{d_i}
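The three formulas above can be combined directly: summing the per-occurrence weight T(n, d_i) = tf_n × idf_{d_i} over a document yields relative-frequency × idf features. A sketch under those definitions (the toy documents are illustrative):

```python
import math

def tfidf(docs):
    """Tf-idf with the slide's definitions: tf_n = 1/D_n and
    idf_d = log(N / df_d), df_d = number of documents containing d."""
    N = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = {}
    for toks in tokenized:
        for w in set(toks):
            df[w] = df.get(w, 0) + 1
    idf = {w: math.log(N / c) for w, c in df.items()}
    out = []
    for toks in tokenized:
        tf_n = 1.0 / len(toks)          # uniform weight per occurrence
        vec = {}
        for w in toks:                   # accumulate T(n, d_i) per word
            vec[w] = vec.get(w, 0.0) + tf_n * idf[w]
        out.append(vec)
    return out

vecs = tfidf(["free call now", "call me"])
print(vecs[0])  # "call" gets weight 0: it appears in every document
```

Note that a word occurring in every document gets idf = log(N/N) = 0, which is exactly the stop-word suppression motivated earlier.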
Comparison of Several Classifiers
Considered Classifiers
Comparison of Several Classifiers
Experimental Settings
General remarks:
- No dimensionality reduction of the data
- Even though the data is imbalanced, we use a threshold of 0.5 for our Bayes plug-in estimator
- Dataset divided into a training set (80 %) and a test set (20 %)
- Hyper-parameter grid search using 10-fold cross-validation on the training set
Metrics:
- Misclassification error (ME):

    ME = (1/N_t) Σ_{i=1}^{N_t} 1(c(h_i) ≠ γ(i))

- Sensitivity (SE): probability of predicting spam given that it is spam

    SE = (1/#spam) Σ_{i=1}^{N_t} 1(c(h_i) = γ(i)) × 1(γ(i) = spam)

- Specificity (SP): probability of predicting ham given that it is ham

    SP = (1/#ham) Σ_{i=1}^{N_t} 1(c(h_i) = γ(i)) × 1(γ(i) = ham)
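The three metrics above are straightforward to compute from predicted and true labels. A minimal sketch (the label vectors are illustrative):

```python
def metrics(y_true, y_pred):
    """Misclassification error, sensitivity and specificity as defined above.
    Labels are the strings 'spam' and 'ham'."""
    n = len(y_true)
    me = sum(t != p for t, p in zip(y_true, y_pred)) / n
    n_spam = sum(t == "spam" for t in y_true)
    n_ham = n - n_spam
    se = sum(t == p == "spam" for t, p in zip(y_true, y_pred)) / n_spam
    sp = sum(t == p == "ham" for t, p in zip(y_true, y_pred)) / n_ham
    return me, se, sp

y_true = ["ham", "ham", "spam", "spam", "ham"]
y_pred = ["ham", "spam", "spam", "ham", "ham"]
print(metrics(y_true, y_pred))  # ME = 0.4, SE = 0.5, SP ≈ 0.667
```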
Comparison of Several Classifiers
Results
Misclassification error and row-normalized confusion matrix for each classifier (rows: true class, columns: predicted ham / spam):

Classifier   ME       ham→ham  ham→spam  spam→ham  spam→spam
AB           2.152 %  0.996    0.004     0.153     0.847
DT           2.870 %  0.988    0.012     0.153     0.847
LR-l1        2.063 %  0.993    0.007     0.122     0.878
LR-l2        1.614 %  0.999    0.001     0.130     0.870
LR           1.614 %  0.999    0.001     0.130     0.870
MLP          1.614 %  0.998    0.002     0.122     0.878
MNB          2.601 %  1.000    0.000     0.221     0.779
RF           2.601 %  0.998    0.002     0.206     0.794
SVM-L        1.794 %  0.996    0.004     0.122     0.878
SVM-R        1.614 %  0.998    0.002     0.122     0.878
SVM-S        1.614 %  0.998    0.002     0.122     0.878
kNN          4.574 %  1.000    0.000     0.389     0.611
On Sensitivity and Specificity
Increasing the Sensitivity
Motivations
- Optimizing a symmetric loss on an imbalanced dataset → good predictive performance on the majority class but poor results on the minority class
- One may want to increase sensitivity
Methods
- Decreasing the threshold of the Bayes classifier
- Resampling:
  - Downsampling of the majority class
  - Upsampling of the minority class
Proposed experiments
1. Random downsampling of the ham class to obtain a balanced dataset
2. Random upsampling with replacement of the spam class to obtain a balanced dataset
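The two proposed experiments can be sketched as follows; the message lists are stand-ins, and the helper name and seed are illustrative:

```python
import random

def balance(ham, spam, mode, seed=0):
    """Balance the two classes as in the proposed experiments: either randomly
    downsample the majority (ham) class, or upsample the minority (spam) class
    with replacement."""
    rng = random.Random(seed)
    if mode == "down":   # keep only as many ham messages as there are spam
        ham = rng.sample(ham, len(spam))
    elif mode == "up":   # draw spam with replacement up to the ham count
        spam = [rng.choice(spam) for _ in range(len(ham))]
    return ham, spam

ham = [f"ham message {i}" for i in range(100)]
spam = [f"spam message {i}" for i in range(15)]
h, s = balance(ham, spam, "down")
print(len(h), len(s))  # 15 15
```

Resampling must be applied to the training set only, so that the test set keeps the original class proportions the deployed filter will face.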
On Sensitivity and Specificity
Results – Downsampling the Majority Class
Misclassification error and row-normalized confusion matrix for each classifier (rows: true class, columns: predicted ham / spam):

Classifier   ME       ham→ham  ham→spam  spam→ham  spam→spam
AB           5.471 %  0.956    0.044     0.137     0.863
DT           6.996 %  0.932    0.068     0.084     0.916
LR-l1        3.318 %  0.970    0.030     0.053     0.947
LR-l2        2.601 %  0.979    0.021     0.061     0.939
LR           2.960 %  0.975    0.025     0.061     0.939
MLP          1.973 %  0.983    0.017     0.038     0.962
MNB          5.381 %  0.944    0.056     0.038     0.962
RF           2.063 %  0.989    0.011     0.092     0.908
SVM-L        3.049 %  0.973    0.027     0.053     0.947
SVM-R        2.511 %  0.980    0.020     0.061     0.939
SVM-S        3.139 %  0.970    0.030     0.038     0.962
kNN          3.587 %  0.974    0.026     0.107     0.893
On Sensitivity and Specificity
Results – Upsampling the Minority Class
Misclassification error and row-normalized confusion matrix for each classifier (rows: true class, columns: predicted ham / spam):

Classifier   ME       ham→ham  ham→spam  spam→ham  spam→spam
AB           1.794 %  0.997    0.003     0.130     0.870
DT           3.857 %  0.973    0.027     0.122     0.878
LR-l1        1.794 %  0.995    0.005     0.115     0.885
LR-l2        1.614 %  0.998    0.002     0.122     0.878
LR           1.704 %  0.997    0.003     0.122     0.878
MLP          1.435 %  0.997    0.003     0.099     0.901
MNB          1.973 %  0.984    0.016     0.046     0.954
RF           1.883 %  1.000    0.000     0.160     0.840
SVM-L        1.614 %  0.998    0.002     0.122     0.878
SVM-R        1.614 %  0.998    0.002     0.122     0.878
SVM-S        1.614 %  0.998    0.002     0.122     0.878
kNN          4.753 %  1.000    0.000     0.405     0.595
Online Learning Strategies
Motivations and Strategies
Motivations
- Spam content obviously changes over time: spammers try to fool spam filtering algorithms
- Useful to design algorithms able to adapt to new content labeled by users
- Main difficulty in designing an online strategy: the growing vocabulary size
Naive strategy
- Increasing vocabulary size, using all the words of the training set
- Not viable in the long term due to memory and computational issues
Windowed-time strategy
- Fixed vocabulary size, using only the most frequent words of the training set
- Training set windowed in time
- Should be quite efficient
- But requires frequent refitting
Online Learning Strategies
Experiments
Considerations
- We do not have access to time information
- Not possible to analyze the evolution over time

[Figure: online learning performance curves for LR-l2 and MNB]
Conclusion
- Several SMS spam filtering methods have been studied, from the feature extraction step to the classification
- State-of-the-art classifiers work well, with a misclassification error of less than 5 %
- Resampling methods can be used to counter class imbalance but impact the classification error → a trade-off is required
- Online strategies may have to be adapted to deal with the increasing vocabulary size
- More advanced feature extraction methods (e.g. word embeddings) and classification methods (e.g. deep neural networks) could be explored
THANK YOU FOR YOUR ATTENTION!