SMS Spam Filtering
AL2018
Outline
Introduction
Context
Data Exploration
Feature Extraction
The Bag-of-Words Representation
The Term Frequency-Inverse Document Frequency Weighting
Comparison of Several Classifiers
Considered Classifiers
Experimental Settings
Results
On Sensitivity and Specificity
Increasing the Sensitivity
Results
Online Learning Strategies
Motivations and Strategies
Experiments
Conclusion
Introduction
Context
Introduction
Data Exploration
[Figure: bar chart of the number of occurrences of the most represented words in the dataset: to, you, I, a, the, and, in, is, i, u, for, my, of, your, me, on, have, 2, that, it, are, call, or, be, at, with, not, will, get, can]
- Dataset composed of 4825 ham (86 %) and 747 spam (14 %)
- The most represented words in the dataset are stop-words: the most commonly used English words
- These uninformative words have to be removed
Introduction
Data Exploration – Ham
Introduction
Data Exploration – Spam
Feature Extraction
The Bag-of-Words Representation
Motivation
A document considered as a sequence of symbols cannot be used as input of a machine learning algorithm → a numerical representation is needed
The Bag-of-Words representation
- For a document composed of D words over a vocabulary of M words:

    x = Σ_{i=1}^{D} x_{d_i}

  where x_{d_i} ∈ {0, 1}^M is the indicator (one-hot) vector of the i-th word d_i
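The sum-of-one-hot-vectors formula above amounts to counting each vocabulary word in the document. A minimal sketch in Python (the vocabulary and message are illustrative, not taken from the dataset):

```python
from collections import Counter

def bow_vector(doc, vocab):
    """Bag-of-words: the sum of one-hot vectors over a fixed vocabulary,
    i.e. the count of each vocabulary word in the document."""
    counts = Counter(doc.lower().split())
    return [counts[w] for w in vocab]

vocab = ["free", "call", "now", "hello", "meeting"]
print(bow_vector("Call now for a FREE prize call", vocab))  # [1, 2, 1, 0, 0]
```

Words outside the vocabulary (here "for", "a", "prize") are simply dropped, which is also how stop-word removal takes effect.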
Feature Extraction
The Term Frequency-Inverse Document Frequency Weighting
Motivation
- Preferable to work with frequencies rather than raw occurrence counts
- Need for normalized features
Tf-idf weighting
- Term frequency of document n of size D_n in the collection:

    tf_n = 1 / D_n

- Inverse document frequency of word d_i in the collection of N documents:

    idf_{d_i} = log(N / df_{d_i}),   df_{d_i} = # documents containing d_i

- Tf-idf weighting of word d_i in document n:

    T(n, d_i) = tf_n × idf_{d_i}
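The three formulas above can be combined directly: summing the per-occurrence weight T(n, d_i) = tf_n × idf_{d_i} over a document yields relative-frequency × idf features. A sketch under those definitions (the toy documents are illustrative):

```python
import math

def tfidf(docs):
    """Tf-idf with the slide's definitions: tf_n = 1/D_n and
    idf_d = log(N / df_d), df_d = number of documents containing d."""
    N = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = {}
    for toks in tokenized:
        for w in set(toks):
            df[w] = df.get(w, 0) + 1
    idf = {w: math.log(N / c) for w, c in df.items()}
    out = []
    for toks in tokenized:
        tf_n = 1.0 / len(toks)          # uniform weight per occurrence
        vec = {}
        for w in toks:                   # accumulate T(n, d_i) per word
            vec[w] = vec.get(w, 0.0) + tf_n * idf[w]
        out.append(vec)
    return out

vecs = tfidf(["free call now", "call me"])
print(vecs[0])  # "call" gets weight 0: it appears in every document
```

Note that a word occurring in every document gets idf = log(N/N) = 0, which is exactly the stop-word suppression motivated earlier.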
Comparison of Several Classifiers
Considered Classifiers
Comparison of Several Classifiers
Experimental Settings
General remarks:
- No dimensionality reduction of the data
- Even though the data is imbalanced, we use a threshold of 0.5 for our Bayes plug-in estimator
- Dataset divided into a training set (80 %) and a test set (20 %)
- Hyper-parameter grid search using 10-fold cross-validation on the training set
Metrics:
- Misclassification error (ME):

    ME = (1/N_t) Σ_{i=1}^{N_t} 1(c(h_i) ≠ γ(i))

- Sensitivity (SE): probability of predicting spam given that it is spam

    SE = (1/#spam) Σ_{i=1}^{N_t} 1(c(h_i) = γ(i)) × 1(γ(i) = spam)

- Specificity (SP): probability of predicting ham given that it is ham

    SP = (1/#ham) Σ_{i=1}^{N_t} 1(c(h_i) = γ(i)) × 1(γ(i) = ham)
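The three metrics above are straightforward to compute from predicted and true labels. A minimal sketch (the label vectors are illustrative):

```python
def metrics(y_true, y_pred):
    """Misclassification error, sensitivity and specificity as defined above.
    Labels are the strings 'spam' and 'ham'."""
    n = len(y_true)
    me = sum(t != p for t, p in zip(y_true, y_pred)) / n
    n_spam = sum(t == "spam" for t in y_true)
    n_ham = n - n_spam
    se = sum(t == p == "spam" for t, p in zip(y_true, y_pred)) / n_spam
    sp = sum(t == p == "ham" for t, p in zip(y_true, y_pred)) / n_ham
    return me, se, sp

y_true = ["ham", "ham", "spam", "spam", "ham"]
y_pred = ["ham", "spam", "spam", "ham", "ham"]
print(metrics(y_true, y_pred))  # ME = 0.4, SE = 0.5, SP ≈ 0.667
```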
Comparison of Several Classifiers
Results
Misclassification error and row-normalized confusion matrix for each classifier (rows: true class, columns: predicted ham / spam):

Classifier   ME       ham→ham  ham→spam  spam→ham  spam→spam
AB           2.152 %  0.996    0.004     0.153     0.847
DT           2.870 %  0.988    0.012     0.153     0.847
LR-l1        2.063 %  0.993    0.007     0.122     0.878
LR-l2        1.614 %  0.999    0.001     0.130     0.870
LR           1.614 %  0.999    0.001     0.130     0.870
MLP          1.614 %  0.998    0.002     0.122     0.878
MNB          2.601 %  1.000    0.000     0.221     0.779
RF           2.601 %  0.998    0.002     0.206     0.794
SVM-L        1.794 %  0.996    0.004     0.122     0.878
SVM-R        1.614 %  0.998    0.002     0.122     0.878
SVM-S        1.614 %  0.998    0.002     0.122     0.878
kNN          4.574 %  1.000    0.000     0.389     0.611
On Sensitivity and Specificity
Increasing the Sensitivity
Motivations
- Optimizing a symmetric loss on an imbalanced dataset → good predictive performance on the majority class but poor results on the minority class
- One may want to increase sensitivity
Methods
- Decreasing the threshold of the Bayes classifier
- Resampling:
  - Downsampling of the majority class
  - Upsampling of the minority class
Proposed experiments
1. Random downsampling of the ham class to obtain a balanced dataset
2. Random upsampling with replacement of the spam class to obtain a balanced dataset
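The two proposed experiments can be sketched as follows; the message lists are stand-ins, and the helper name and seed are illustrative:

```python
import random

def balance(ham, spam, mode, seed=0):
    """Balance the two classes as in the proposed experiments: either randomly
    downsample the majority (ham) class, or upsample the minority (spam) class
    with replacement."""
    rng = random.Random(seed)
    if mode == "down":   # keep only as many ham messages as there are spam
        ham = rng.sample(ham, len(spam))
    elif mode == "up":   # draw spam with replacement up to the ham count
        spam = [rng.choice(spam) for _ in range(len(ham))]
    return ham, spam

ham = [f"ham message {i}" for i in range(100)]
spam = [f"spam message {i}" for i in range(15)]
h, s = balance(ham, spam, "down")
print(len(h), len(s))  # 15 15
```

Resampling must be applied to the training set only, so that the test set keeps the original class proportions the deployed filter will face.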
On Sensitivity and Specificity
Results – Downsampling the Majority Class
Misclassification error and row-normalized confusion matrix for each classifier (rows: true class, columns: predicted ham / spam):

Classifier   ME       ham→ham  ham→spam  spam→ham  spam→spam
AB           5.471 %  0.956    0.044     0.137     0.863
DT           6.996 %  0.932    0.068     0.084     0.916
LR-l1        3.318 %  0.970    0.030     0.053     0.947
LR-l2        2.601 %  0.979    0.021     0.061     0.939
LR           2.960 %  0.975    0.025     0.061     0.939
MLP          1.973 %  0.983    0.017     0.038     0.962
MNB          5.381 %  0.944    0.056     0.038     0.962
RF           2.063 %  0.989    0.011     0.092     0.908
SVM-L        3.049 %  0.973    0.027     0.053     0.947
SVM-R        2.511 %  0.980    0.020     0.061     0.939
SVM-S        3.139 %  0.970    0.030     0.038     0.962
kNN          3.587 %  0.974    0.026     0.107     0.893
On Sensitivity and Specificity
Results – Upsampling the Minority Class
Misclassification error and row-normalized confusion matrix for each classifier (rows: true class, columns: predicted ham / spam):

Classifier   ME       ham→ham  ham→spam  spam→ham  spam→spam
AB           1.794 %  0.997    0.003     0.130     0.870
DT           3.857 %  0.973    0.027     0.122     0.878
LR-l1        1.794 %  0.995    0.005     0.115     0.885
LR-l2        1.614 %  0.998    0.002     0.122     0.878
LR           1.704 %  0.997    0.003     0.122     0.878
MLP          1.435 %  0.997    0.003     0.099     0.901
MNB          1.973 %  0.984    0.016     0.046     0.954
RF           1.883 %  1.000    0.000     0.160     0.840
SVM-L        1.614 %  0.998    0.002     0.122     0.878
SVM-R        1.614 %  0.998    0.002     0.122     0.878
SVM-S        1.614 %  0.998    0.002     0.122     0.878
kNN          4.753 %  1.000    0.000     0.405     0.595
Online Learning Strategies
Motivations and Strategies
Motivations
- Spam content obviously changes over time: spammers try to fool spam filtering algorithms
- Useful to design algorithms able to adapt to new content labeled by users
- Main difficulty in designing an online strategy: the growing vocabulary size
Naive strategy
- Increasing vocabulary size, using all the words of the training set
- Not viable in the long term due to memory and computational issues
Windowed-time strategy
- Fixed vocabulary size, using only the most frequent words of the training set
- Training set windowed in time
- Should be quite efficient
- But requires frequent refitting
Online Learning Strategies
Experiments
Considerations
- We do not have access to time information
- Not possible to analyze the evolution over time

[Figure: online learning performance curves for LR-l2 and MNB]
Conclusion
- Several SMS spam filtering methods have been studied, from the feature extraction step to the classification
- State-of-the-art classifiers work well, with a misclassification error of less than 5 %
- Resampling methods can be used to counter class imbalance but impact the classification error → a trade-off is required
- Online strategies may have to be adapted to deal with the increasing vocabulary size
- More advanced feature extraction methods (e.g. word embeddings) and classification methods (e.g. deep neural networks) could be explored
THANK YOU FOR YOUR ATTENTION!