Sensors 22 07726 With Cover
Article
Ricardo Alejandro Manzano Sanchez , Marzia Zaman, Nishith Goel, Kshirasagar Naik and Rohit Joshi
Special Issue
Communication, Security, and Privacy in IoT
Edited by
Dr. Kien Nguyen and Dr. Xiaoyan Wang
https://doi.org/10.3390/s22207726
sensors
Article
Towards Developing a Robust Intrusion Detection Model Using
Hadoop–Spark and Data Augmentation for IoT Networks †
Ricardo Alejandro Manzano Sanchez 1, *,‡ , Marzia Zaman 2,‡ , Nishith Goel 1 , Kshirasagar Naik 3
and Rohit Joshi 2
Abstract: In recent years, anomaly detection and machine learning for intrusion detection systems
have been used to detect anomalies on Internet of Things networks. These systems rely on machine
and deep learning to improve the detection accuracy. However, the robustness of the model depends
on the number of datasamples available, quality of the data, and the distribution of the data classes.
In the present paper, we focused specifically on the amount of data and class imbalance since
both parameters are key in IoT due to the fact that network traffic is increasing exponentially. For
this reason, we propose a framework that uses a big data methodology with Hadoop–Spark to
train and test multi-class and binary classification with one-vs-rest strategy for intrusion detection
using the entire BoT IoT dataset. Thus, we evaluate all the algorithms available in Hadoop–Spark
in terms of accuracy and processing time. In addition, since the BoT IoT dataset used is highly
imbalanced, we also improve the accuracy for detecting minority classes by generating more
datasamples using a Conditional Tabular Generative Adversarial Network (CTGAN). In general,
our proposed model outperforms other published models including our previous model. Using our
proposed methodology, the F1-score of one of the minority classes, i.e., Theft attack, was improved
from 42% to 99%.
Keywords: IoT (internet of things) security; big data framework; imbalanced datasets; CTGAN;
hadoop-spark; BoT-IoT
Citation: Manzano Sanchez, R.A.; Zaman, M.; Goel, N.; Naik, K.; Joshi, R. Towards Developing a
Robust Intrusion Detection Model Using Hadoop–Spark and Data Augmentation for IoT Networks.
Sensors 2022, 22, 7726. https://doi.org/10.3390/s22207726
Academic Editors: Kien Nguyen and Xiaoyan Wang
Received: 15 September 2022
Accepted: 6 October 2022
Published: 12 October 2022
2. Related Work
This section is divided into three parts. The first part takes into account research
papers that use other multi-class classification algorithms to detect attacks using the short
version of the BoT-IoT dataset. The second part explains research papers that use sampling
methods to reduce the class imbalance problem. In the third part, we present some papers that
use a big data framework to create intrusion detection systems using the BoT-IoT dataset or
other similar datasets.
The following research papers consider many supervised machine learning algorithms
to detect attacks in IoT networks.
First, we summarize related work that uses the short version of the BoT-IoT dataset
to train an intrusion detection system using machine learning algorithms.
Kumar et al. [5] created a multi-class classification methodology to identify DoS, DDoS,
Reconnaissance, Theft attacks, and Normal network traffic. This methodology used feature
selection and multi-class classification algorithms. The authors used a hybrid approach
for feature selection in which they applied Pearson’s Correlation Coefficient, Random Forest
mean, and Gain Ratio to select features. Then, they joined the results using an
AND operation. They applied correntropy to measure the accuracy in distinguishing normal
and abnormal data samples. Finally, they trained and tested three classifiers, namely Random
Forest, XGBoost, and K-nearest neighbors (KNN). This work used the short version of the
dataset. The authors highlighted that the approach recognized Theft attacks with 93%
accuracy even though the number of samples of this class was the lowest. XGBoost showed
the best results, detecting Reconnaissance, DoS, and DDoS attacks with 100% accuracy.
Shafiq et al. [6] trained and tested five algorithms to detect anomalous behavior using
the BoT IoT dataset. This paper is different from others because the authors evaluated
how important accuracy, precision, TP rate, Recall, and training time were for selecting the
best algorithm. They used a bijective soft set approach to evaluate the best algorithm
considering these five factors. They concluded that the Naïve Bayes algorithm reached
98% accuracy, precision, TP rate, and Recall, and that the training time was around 4 s. The
authors used Weka to train and test the classifiers. This paper used the shorter version of the
BoT IoT dataset.
Soe et al. [15], in 2020, trained and tested a lightweight model to detect anomalous
behavior in IoT devices. This model was designed to run on Raspberry Pi and the authors
used the shorter version of the BoT-IoT dataset. To train and test the models, the authors
created three sub-datasets, each containing only one kind of attack plus normal
datasamples. It is necessary to highlight that they considered only DDoS, Theft, and
Reconnaissance attacks. Then, they extracted
the most important features for each subset using correlated-set threshold on gain-ratio
(CST-GR). Each subset had a different number of features. DDoS features were drate and total
number of bytes per destination IP while Theft features were state number, total number of packets
per protocol, and average rate per protocol per dport. The authors then trained four classifiers,
namely tree-based J48, Hoeffding Tree, logistic model tree, and random forest. The authors
reduced the number of DDoS and Reconnaissance datasamples since the full set did not fit in
the Raspberry Pi memory. The authors concluded that the model could detect all kinds of attacks
with an accuracy of over 99.3% in all the cases. Random Forest was the best algorithm to
detect all kinds of attacks. This paper has some drawbacks. First, when a new datasample
is the input, it has to pass through three feature extraction and evaluation stages. This
increases processing time and causes unnecessary usage of resources. In addition, the authors
down-sampled the number of datasamples for training. For this reason, some important
statistical information could be missed.
Bagui and Li [16] developed a framework using Artificial Neural Networks (ANN)
and different methods of resampling to obtain models to detect anomalies in IoT networks.
Since IoT network datasets are imbalanced, it is difficult to obtain a model which recognizes
the minority class with high accuracy. The authors used a Spark cluster and a standalone
computer to run their experiments. They used a compact version of the BoT IoT dataset.
Their models only tackled a classification problem in which they tried to identify the different
kinds of attacks and normal traffic. When the experiment was run in a Spark
cluster provisioned in AWS, they obtained the best Macro F1-score, around 58%, with
random oversampling as the sampling method.
Fatani et al. [17], in 2021, used an innovative feature selection approach called SWARM
intelligence with AQUILA optimizer to detect IoT attacks using the BoT-IoT dataset. This
methodology is composed of several stages. The first stage is feature extraction, in
which they used a Convolutional Neural Network to extract meaningful features from
the original raw data. They extracted the features generated by the last fully connected
layer, which had 64 neurons. Then, they ranked these features to select the most important
ones using AQUILA optimization. Finally, they used the final features to train a machine
learning algorithm. They used the shorter version of the BoT IoT dataset, reaching 99%
accuracy on the training and testing datasets. However, if we look at the confusion matrix,
the accuracy was 60.7% for the Normal class and 85.7% for Theft. The overall
accuracy was high since the dataset is highly imbalanced in nature. However, the model
could not generalize well to the minority classes.
We can conclude that the authors in references [5,6,15–17] can accurately detect the
attacks from the BoT-IoT dataset. Nonetheless, all of them use the shorter version of the dataset.
Thus, we cannot expect that these models will perform well with unseen data.
Next, we summarize some research papers that use sampling methods to reduce the
class imbalance of the dataset. In addition, we include some papers that use GAN to
generate new datasamples.
Zixu et al. [18], in 2020, developed a novel approach to recognize anomalous behavior
locally in each IoT device. They used a GAN to find the best data distribution representation
of the data using normal network traffic in each device. The GAN network consisted of a
generator and a discriminator. The input to the generator was random data with normal
distribution. The authors defined 100 features as an input to the network. The output
of the generator corresponds to the number of features which would be the input to the
discriminator. Since the authors defined 9 features (flag, state, mean, stddev, max, min, rate,
srate, drate), the output of the generator was 9. The generator network was composed of
2 hidden layers with 1024 and 256 neurons on each layer. The discriminator neural network
was symmetrical with the generator. After training the discriminator and generator locally,
the weights of the generator were sent to the central authority. This entity aggregated
the weights of each of the local neural networks. The central authority with a random
input generated new samples. These samples were passed through an autoencoder. The
main goal of the autoencoder was to learn a representation of the distribution of the data
using backpropagation. The autoencoder had two parts - the encoder and the decoder. The
encoder part encoded or reduced the size of the original input. On the other hand, the
decoder decoded the reduced input to create a vector with the original input. The error
was calculated with the predicted decoder output and the original input. Depending on
the error, the authors defined a threshold to identify benign and malicious traffic. After
creating the autoencoder model, this model was spread among all the nodes. This model
was able to discriminate between benign and malicious signals. The authors compared the
results with other anomaly detection techniques such as One-class SVM, Isolation Forest,
K-means clustering, and Local Outlier Factor. The results showed that the proposed model
achieved improved performance when compared to the other models.
Ferrag et al. [19] created a methodology to reduce the impact of imbalanced datasets in
anomaly detection for IoT networks. They proposed a model which consisted of three models. In
the beginning, two models ran in parallel. The first model only distinguished between normal
and malicious behavior. The second model labeled all rows of the training dataset as benign
or one of the different categories of attacks. The classification outputs of the two models
were appended to the dataset as features. Then, the third model was trained on the
features and the results of the two prior models. The classification algorithms used to build
the model were REP tree, JRIP, and Forest PA. The authors trained and tested the model
using the shorter version of the BoT-IoT dataset, reaching a low false alarm rate and a higher
detection rate.
Prabakaran et al. [20] proposed a methodology that used GAN to discriminate between
normal and malicious IoT traffic. The authors used the shorter version of the BoT IoT dataset
in their approach. In the beginning, they labeled all the rows as benign or attack to
create one dataset. Then, they created another dataset labeling all rows as benign or one
attack category. Finally, they normalized and joined both datasets. The final dataset was
used to train a GAN network. The authors changed the discriminator loss function to
reach a good performance of the model. They showed that the accuracy reached by the
discriminator was greater than that of other models such as Convolutional Neural
Network (CNN), autoencoder, KNN, MLP, ANN, and Decision Trees (DT). The accuracy
shown in the paper was around 92%.
Ullah and Qusay [21] developed one of the most complete methodologies that used
GAN networks for anomaly detection in IoT devices. In their methodology, the authors
generated more data samples from the minority class using one class conditional GAN.
In addition, they generated normal and anomalous data samples training a conditional
binary GAN network. To train the binary GAN network, they reduced the size of abnormal
data samples to have a balanced dataset. Finally, they used a multiclass classification
GAN network which consisted of multiple binary GAN networks. After generating the
new data samples using each GAN network from the three configurations, they trained a
feed-forward neural network with a deep architecture.
Although the papers cited in [18–21] proposed the best methodologies to solve the class
imbalance problem, the authors did not evaluate their models with the entire dataset.
Finally, we describe some research papers that have used big data frameworks
for intrusion detection.
Belouch et al. [22] used Apache Spark to train and test four classifiers using the UNSW-NB15
dataset for intrusion detection modeling. They concluded that Random Forest was the
most accurate algorithm with 97% accuracy and 5.69 s training time while Naïve Bayes
was the worst algorithm with 74.19% accuracy and 0.18 s training time. The authors used
the shorter version of the dataset with 257,340 records.
Haggag et al. [23] proposed the usage of the Spark platform for training and testing deep
learning models in a distributed way to detect intrusions. The authors used
the NSL-KDD dataset to train MLP, RNN, and LSTM deep learning models. In addition,
the authors added one stage called class imbalance handling using SMOTE. It is necessary
to highlight that Spark does not have deep learning capability. Therefore, the authors used
Elephas to train and test the deep learning models. To use Elephas, the input should have
3 dimensions. For this reason, the authors supplied the input to Elephas in RDD form. The authors
showed that the average F1-score was 81.37%.
Morfino et al. [24] proposed an approach to train and test machine learning models
in Spark to detect SYN/DOS attacks in IoT networks. They used MLlib to train binary
classifiers. The training data comprised around 2 million instances. The authors demonstrated
that Random Forest was the algorithm that provided the best accuracy, around 100%, with a
training time of 215 s. Our dataset is different since it contains more than 50 million
records.
The following paper is the most relevant work we found in which the researchers used
Hadoop–Spark to train and test on the entire BoT-IoT dataset.
Abushwereb et al. [25] used MLlib from Spark to train on the shorter and larger versions of the
BoT-IoT dataset. The authors proposed a methodology in which they removed duplicated
values and rows with missing and unknown values, normalized the data with min-max
normalization, and applied feature selection using chi-square. The authors then trained
the machine learning algorithms RF, DT, and NB using 70% of the data. Finally, they
evaluated the accuracy of the algorithms with the remaining 30% of the data. The framework
used by the authors was created on Google Cloud platform. The Hadoop–Spark cluster
consisted of eight VMs with an overall RAM of 16 GB. After training and testing on the
multi-class classification problem, the authors concluded that the overall F1-score was
77% for DT and 73% for RF. The F1-score decreased since Normal and Theft had far fewer
datasamples than the other classes. The authors indicated that their
model could detect Theft attacks with an F1-score of only 23% and Normal datasamples with 71.8%.
This reference presents an approach similar to that of the present paper; however, the accuracy
for the Theft and Normal classes is quite low.
3. Methodology
This section is divided into two subsections. Section 3.1 explains two methodologies
to train and test multi-class classification using multi-class and binary algorithms available
in Spark. Section 3.2 explains the data generation of the minority class using CTGAN.
3.1. Methodologies to Train and Test Multi-Class Classification Using Multi-Class and
Binary Algorithms
The proposed framework is composed of two main systems as we can see in Figure 1,
namely anomaly detection and machine learning for intrusion detection.
[Figure 1: flowchart. Input X → Anomaly detection → Is Normal? If yes, the datasample is
Normal; if no, Machine Learning for Intrusion detection determines the type of attack: DoS,
DDoS, Theft, or Reconnaissance.]
Figure 1. General framework.
that the sample belongs. In the present paper, we do not change the anomaly detection
system. We change the flowchart of the machine learning for intrusion detection system, which
trains and tests multi-class algorithms and multi-class classification built from binary
classification algorithms using a Hadoop–Spark cluster. In reference [13], we trained and tested
a Random Forest, which is a multi-class classification algorithm. It is called multi-class
because it handles more than two labels. In this work, we expand the research by evaluating
other algorithms that are implemented in Hadoop–Spark. However, some of these algorithms
are binary; thus, it is necessary to wrap a binary classifier within One vs. Rest (OVR)
to classify multiple classes. We create two methodologies depending on the type of
classifier available in Spark since each type requires different steps:
• Multi-class classification algorithms: Logistic Regression, Naïve Bayes, Decision Trees,
and Random Forest.
• Binary classification algorithms: Decision Trees, Logistic Regression, Gradient boosted
tree, SVM Linear, Naïve Bayes, and Random Forest.
Two methodologies are shown in Figure 2 to train and test multi-class classifiers in
Hadoop–Spark. The flowchart in Figure 2a shows feature selection, which
is the first stage in Methodology 1 and 2. Feature selection helps to reduce the number of
features in the dataset. The BoT-IoT dataset has around 29 features; thus, if we use all these
features to train a model, some features can introduce noise in the model. For this reason,
as was shown in reference [13], 8 features were selected. To find the best 8 features, we
use the flowchart in Figure 2a. The feature encoding stage transforms categorical
features into numerical ones. The vector assembler stage concatenates all the features and
labels into a Spark format. The scaler stage applies a standard scaler to each feature of
the dataset. The standard scaler uses the mean and variance to normalize each feature of the
dataset. Finally, the Random Forest algorithm is used to rank the features in the dataset.
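The scaling stage described above can be illustrated with a minimal sketch in plain Python. It is not the Spark implementation: the function name `standard_scale`, the use of the sample standard deviation, and the feature values are assumptions made only for illustration.

```python
from statistics import mean, stdev


def standard_scale(column):
    """Return (x - mean) / std for every value of one feature column."""
    mu = mean(column)
    sigma = stdev(column)  # sample standard deviation (an assumption of this sketch)
    if sigma == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


# Hypothetical values of a single numeric feature, for illustration only.
dur = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scaled = standard_scale(dur)
```

After scaling, the column has zero mean and unit standard deviation, so features measured on very different ranges contribute on a comparable scale.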
[Figure 2, panel a) Feature Selection: X → Feature encoding → Vector Assembler → Scaler →
Random Forest Feature Ranking → F.]
[Figure 2, second panel: F → Feature selection → Feature encoding → Vector Assembler → Scaler →
Multi-Class Classifiers → Multi-class Models.]
[Figure 2, panel c): Pipeline 1: F → Feature selection → Feature encoding → Vector Assembler →
Scaler. Pipeline 2: Rename features as “features” and labels as “label” → Define binary
classifier → Wrap binary classifier with OVR → Multi-class Models using binary classifiers.]
c) Methodology 2: Multi-class classification using binary classification algorithms in Spark.
use one-hot encoding since we can obtain very sparse arrays which cannot be processed
by some algorithms. Next, we apply the vector assembler to the features. In this stage, we
concatenate all 8 features in a single array. This stage is necessary for Spark. Then, we scale
the features using standard scaler normalization. This kind of normalization scales each
data sample according to the mean and standard deviation of the entire dataset. Finally,
we use four multi-class classifiers to train and test our algorithms which are explained
mathematically as follows:
Naïve Bayes Approach: It uses Bayes’ theorem to compute the conditional
probability distribution of each feature given each label. Each data sample is composed
of 8 features represented by a vector x = (x1, x2, x3, x4, x5, x6, x7, x8). We assume that the
features are independent of one another. Thus, the probability of a class Ck given
the features is given by the following formula [26]:

$$p(C_k \mid x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8) = \frac{1}{Z}\, p(C_k) \prod_{i=1}^{n=8} p(x_i \mid C_k) \tag{1}$$

where

$$Z = p(x) = \sum_{k} p(C_k)\, p(x \mid C_k) \tag{2}$$

The index k runs over the classes, of which there are 5 in our case. Since we use Multinomial
naive Bayes, we cannot pass negative values to the algorithm. For this reason, we use
min-max normalization. Min-max normalization transforms the features into the range 0 to 1.
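Equations (1) and (2) and the min-max step can be traced with a small worked sketch. It only evaluates the formulas on hand-picked numbers; the priors, the likelihood values, and the function names are hypothetical, and it does not reproduce how Spark estimates these probabilities.

```python
from math import prod


def naive_bayes_posterior(priors, likelihoods):
    """Evaluate Equation (1).

    priors[k] is p(C_k); likelihoods[k] lists p(x_i | C_k) for each feature i.
    """
    unnormalised = {k: priors[k] * prod(likelihoods[k]) for k in priors}
    z = sum(unnormalised.values())  # Z = p(x), Equation (2)
    return {k: value / z for k, value in unnormalised.items()}


def min_max(column):
    """Rescale one feature column into the range 0 to 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Hypothetical two-class, two-feature example (values chosen for illustration).
priors = {"Normal": 0.5, "DoS": 0.5}
likelihoods = {"Normal": [0.2, 0.5], "DoS": [0.6, 0.5]}
posterior = naive_bayes_posterior(priors, likelihoods)

non_negative = min_max([-3.0, 0.0, 5.0])
```

Dividing by Z makes the posteriors sum to one, and the min-max step removes the negative values that Multinomial naive Bayes cannot accept.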
Decision Trees: It is called a Decision Tree because it is made of many decision nodes. Each
decision node is built using a measurement of impurity. When a new data sample enters the
system, it passes through the nodes, each of which has a conditional statement, until it
reaches a leaf. Then the data sample is classified.
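As an illustration of an impurity measurement, the sketch below computes the Gini impurity of a set of labels and the weighted impurity of a candidate split. The function names and the toy labels are assumptions; only the choice of Gini is taken from the hyperparameters reported in this paper.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(left, right):
    """Size-weighted impurity of the two children produced by a split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Hypothetical labels reaching one node, before and after a perfect split.
mixed = gini(["DoS", "DoS", "Theft", "Theft"])
after_split = split_impurity(["DoS", "DoS"], ["Theft", "Theft"])
```

A split is useful when the weighted impurity of the children is lower than the impurity of the parent node, as in this toy case.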
Random Forest: A Random Forest algorithm is a set of decision trees in which a
decision is taken by all of them using a voting scheme [27]. Each decision tree is built with
a bootstrapped dataset and a subset of the features. We thus obtain a variety
of trees which can make better decisions than a single tree. The features that give the purest
splits will be at the top of each decision tree. It means that these features are more important
to distinguish between normal and malicious network traces. Therefore, the random forest can
rank the importance of the features.
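The three ingredients named above (bootstrapped rows, a random feature subset per tree, and a vote) can be sketched as follows. This is a simplified illustration, not Spark's implementation: the helper names, the subset size of 3, the seed, and the hard majority vote are assumptions; the feature names are the eight listed in this paper.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement to build one tree."""
    return rng.choices(rows, k=len(rows))


def random_feature_subset(feature_names, size, rng):
    """Pick the features a tree is allowed to use (without replacement)."""
    return rng.sample(feature_names, size)


def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
features = ["state", "proto", "bytes", "dport", "sbytes", "dur", "sum", "max"]
subset = random_feature_subset(features, 3, rng)
sample = bootstrap_sample(list(range(10)), rng)
winner = majority_vote(["DoS", "Theft", "DoS", "DDoS", "DoS"])
```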
Logistic Regression: Multinomial logistic regression finds the correct class among more
than two classes. The input matrix X is composed of N samples with M features each, so X has
dimension N × M. The idea is to find the values of the matrix W that best reproduce the labels
that we have as ground truth. In the beginning, the values of W are selected randomly. To
find the best values of W, we need a loss function and its gradient. We explain how
we can obtain the gradient. First, we calculate the product of X and W. We denote the
product as Z = XW. We apply the softmax function to each row of the new matrix Z. The
softmax function gives us the probability of each class given a sample. Thus, each row
sums to 1. Our problem is supervised since we have the
label for each sample. Thus, it is possible to find the likelihood function of Y given X.
$$p(Y_i \mid X_i, W) = P_{i,k=Y_i} = \mathrm{softmax}(X_{i,k=Y_i}) = \frac{\exp(x_i w_{k=Y_i})}{\sum_{k=0}^{C} \exp(x_i w_k)} \tag{3}$$

The formula above was for a single data sample. Considering all the data samples in
the dataset, we have the following formula

$$p(Y \mid X, W) = \prod_{i=1}^{N} \frac{\exp(x_i w_{k=Y_i})}{\sum_{k=0}^{C} \exp(x_i w_k)} \tag{4}$$

$$\nabla_{W_k} f(W) = \frac{1}{N} \sum_{i=1}^{N} \left( X_i^{T} I_{|Y_i = k|} - X_i^{T} \frac{\exp(x_i w_k)}{\sum_{k=0}^{C} \exp(x_i w_k)} \right) + 2\mu W \tag{5}$$
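The softmax probability of Equation (3) and the likelihood of Equation (4) can be evaluated with the short sketch below. It is only a worked illustration in plain Python: the function names, the zero-initialised weights, and the toy inputs are assumptions, and classes are indexed from 0.

```python
from math import exp


def softmax(scores):
    """Turn a row of scores into probabilities that sum to 1."""
    top = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_probabilities(x, W):
    """x holds M features; W has M rows and one column per class."""
    n_classes = len(W[0])
    scores = [sum(x[m] * W[m][k] for m in range(len(x))) for k in range(n_classes)]
    return softmax(scores)


def likelihood(X, Y, W):
    """Product over samples of the probability of the true class, Equation (4)."""
    result = 1.0
    for x, y in zip(X, Y):
        result *= class_probabilities(x, W)[y]
    return result


# Toy inputs: 2 samples, 2 features, 2 classes, all weights set to zero.
X = [[1.0, 2.0], [3.0, 4.0]]
Y = [0, 1]
W = [[0.0, 0.0], [0.0, 0.0]]
probs = class_probabilities(X[0], W)
value = likelihood(X, Y, W)
```

With all weights at zero every class is equally likely, so each sample contributes 1/2 and the likelihood of the two samples is 1/4; training moves W so that this product grows.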
3.1.3. Pipeline 1
The first pipeline, denoted as pre-processing, has the same stages as Methodology
1 to pre-process the input. The first stage is feature selection, in which we reduce the
number of features from 29 to 8. Then, we have feature encoding. This stage
transforms categorical features into numerical ones. The third stage of pre-processing is the
vector assembler. In this stage, we concatenate the 8 features into 1 vector. The final stage of
pre-processing is the standard scaler, in which we normalize all the features with the mean
and the standard deviation.
3.1.4. Pipeline 2
In this pipeline, we train and test 6 binary classifiers. In PySpark, we can wrap binary
classifiers into an estimator denoted as One vs. Rest (OVR). Then, we can use these binary
classifiers to do multi-class classification. We use the One vs. Rest approach to train and
test Support Vector Machine (SVM) Linear, Gradient Boosted tree classifier, Random Forest,
Decision Trees, Naïve Bayes, and Logistic Regression. In One vs. Rest, we choose one class
out of all the classes. We label this class as positive while all the remaining samples are
labeled as negative. For this reason, a model is created for each class. If we have five
classes, as in our case, we need to create five models. These five models make the decision
when a new sample is input into the system: the input is evaluated by
each classifier, and the class of the most confident classifier is defined as the output label.
The One vs. Rest estimator has some advantages, such as:
• Parallelism: We can train each classifier in a different node in the Hadoop–Spark cluster.
• Model interpretability: We can understand which factors affect each class separately.
However, class imbalance is one of
the biggest disadvantages of OVR. In this paper, we evaluate how class imbalance impacts
the accuracy of the models. In this section, we explain the rest of the classifiers that were
not explained in Methodology 1.
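The One vs. Rest procedure described above can be sketched in plain Python. This is not the PySpark estimator: the centroid-based scorer stands in for any binary classifier, and all function names and data points are hypothetical.

```python
def relabel(labels, positive):
    """Mark the chosen class as 1 (positive) and every other class as 0."""
    return [1 if y == positive else 0 for y in labels]


def train_centroid_scorer(X, y_bin):
    """Toy binary model (an assumption of this sketch).

    The confidence of a sample is the negative squared distance to the
    centroid of the positive class, so closer samples score higher.
    """
    positives = [x for x, y in zip(X, y_bin) if y == 1]
    centroid = [sum(col) / len(positives) for col in zip(*positives)]
    return lambda x: -sum((a - b) ** 2 for a, b in zip(x, centroid))


def train_ovr(X, labels):
    """Train one binary model per class."""
    return {c: train_centroid_scorer(X, relabel(labels, c)) for c in sorted(set(labels))}


def predict_ovr(models, x):
    """Return the class whose binary model is the most confident."""
    return max(models, key=lambda c: models[c](x))


# Hypothetical two-feature samples for three classes.
X = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0], [10.0, 0.0], [10.0, 1.0]]
labels = ["Normal", "Normal", "DoS", "DoS", "Theft", "Theft"]
models = train_ovr(X, labels)
prediction = predict_ovr(models, [9.0, 0.0])
```

Because each per-class model is trained independently of the others, the loop in `train_ovr` is the part that could be distributed across cluster nodes.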
Gradient boosted tree: A Gradient boosted tree makes a decision based on consecutive
decision trees used as weak classifiers. In the beginning, a base model is created to find the
residuals for each datasample in the dataset. The residual is calculated by subtracting the
predicted probability of the label from the original classification label. We
assume that we have binary classifiers. Then, a regression decision tree is built with the
feature inputs and the residuals. New residuals are calculated for each datasample. Then, a
new decision tree is trained using the features and the new residuals. This process continues
until the number of decision trees specified by the user is reached. After creating many trees,
when a new datasample is the input to be classified, every decision tree contributes a partial
decision to the final decision.
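The residual-fitting loop described above can be sketched on a single feature. This is a deliberately simplified illustration and not Spark's algorithm: each "tree" is a one-split regression stump whose leaf value is the mean residual, and the function names, the number of trees, and the data are assumptions. The learning rate of 0.1 matches the value reported in Table 1.

```python
from math import exp, log


def sigmoid(v):
    return 1.0 / (1.0 + exp(-v))


def residuals(labels, log_odds):
    """Residual = true label (0 or 1) minus the predicted probability."""
    return [y - sigmoid(f) for y, f in zip(labels, log_odds)]


def fit_stump(xs, targets):
    """Best one-threshold regression stump by squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def boost(xs, labels, n_trees=20, learning_rate=0.1):
    """Start from the base rate, then add one stump per round of residuals."""
    base = sum(labels) / len(labels)
    scores = [log(base / (1.0 - base))] * len(xs)
    for _ in range(n_trees):
        stump = fit_stump(xs, residuals(labels, scores))
        scores = [f + learning_rate * stump(x) for f, x in zip(scores, xs)]
    return [sigmoid(f) for f in scores]


# Hypothetical one-feature dataset with a binary label.
probabilities = boost([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1])
```

Each round nudges the predicted probabilities toward the labels, so the negative samples end below 0.5 and the positive samples above it.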
Support vector machine: This algorithm is different from the rest of the classification
algorithms since it tries to maximize the width of the gap between two categories. The
threshold is the main hyperplane that divides both classes.
Parallel to this hyperplane, there exist two additional hyperplanes which define the margin. The
main goal of the algorithm is to maximize the distance between both classes while minimizing
the classification loss. The advantage of this algorithm is that it can handle outliers and
admits misclassification. Thus, this algorithm can generalize better to unseen data.
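The trade-off between a wide margin and a small classification loss can be made concrete with the hinge loss. The sketch below is an illustration only: the function names, the exact scaling of the penalty term, and the data are assumptions, while the regularization value 0.1 matches the regParam reported in Table 1.

```python
def hinge_loss(w, b, X, y, reg_param=0.1):
    """Mean hinge loss plus an L2 penalty; labels in y must be -1 or +1."""
    margins = [
        yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        for xi, yi in zip(X, y)
    ]
    # A sample with margin >= 1 costs nothing; smaller margins are penalised.
    data_term = sum(max(0.0, 1.0 - m) for m in margins) / len(X)
    penalty = 0.5 * reg_param * sum(wj * wj for wj in w)  # scaling is an assumption
    return data_term + penalty


def margin_width(w):
    """Distance between the two hyperplanes parallel to the threshold."""
    return 2.0 / sum(wj * wj for wj in w) ** 0.5


# Hypothetical samples: the third one lies inside the margin.
w, b = [1.0, 0.0], 0.0
X = [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]]
y = [1, -1, 1]
loss = hinge_loss(w, b, X, y)
width = margin_width(w)
```

The third sample is on the correct side of the threshold but inside the margin, so it still adds to the loss; tolerating such samples is what lets the model accept outliers and some misclassification.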
Datasample distribution:
Normal = 9543
Theft = 1587
DoS = 33,005,194
DDoS = 38,532,480
Reconnaissance = 1,821,639
This dataset is highly imbalanced since Normal and Theft are minority classes with a proportion
of 1/4037 and 1/24280, respectively, compared with the number of DDoS datasamples. In
addition, the original dataset contains 32 network traffic features. In our previous work [13],
we applied feature selection with Random Forest, reducing the number of features to eight,
namely state, proto, bytes, dport, sbytes, dur, sum, and max. We concluded that eight features
are sufficient to avoid noise, reduce the training time, and keep an accuracy over 90%.
As we described in Figure 1, first we implemented anomaly detection using one-class SVM
in Hadoop–Spark to distinguish normal from malicious datasamples. We can conclude that
we can detect Normal and Theft attacks with 98.31% and 96.85% accuracy, respectively,
although we only selected two features for the training.
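The imbalance proportions quoted above follow directly from the class counts of the datasample distribution. The short check below reproduces them; the variable names are chosen for illustration.

```python
# Class counts taken from the datasample distribution reported in this paper.
counts = {
    "Normal": 9543,
    "Theft": 1587,
    "DoS": 33005194,
    "DDoS": 38532480,
    "Reconnaissance": 1821639,
}

total = sum(counts.values())

# How many DDoS datasamples exist for each datasample of a minority class.
ddos_per_normal = counts["DDoS"] / counts["Normal"]
ddos_per_theft = counts["DDoS"] / counts["Theft"]

# Share of the whole dataset taken by the two minority classes together.
minority_share = (counts["Normal"] + counts["Theft"]) / total
```

Truncating the two ratios gives 4037 and 24280, which are the denominators of the proportions 1/4037 and 1/24280.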
4.2. Train and Test Multi-Class Classification Using Multi-Class and Binary Algorithms Available
Before training binary and multi-class classification algorithms, we split the
entire dataset into 70% for training and 30% for testing with stratification. It means that
each class keeps the same proportion of datasamples in the training and testing
sets. The metric used to measure the accuracy of all the approaches is the F1-score since
this metric takes into consideration the precision and recall, in other words, the false alarm
rates. In addition, this metric is well suited to evaluating imbalanced datasets. We also chose
this metric for the purpose of comparison with other related work. As we described
in the methodology section, Spark has binary classification algorithms with OVR and
multi-class algorithms to train and test multi-class classification datasets. It is possible to
have the same algorithm available as multi-class or binary with OVR. To illustrate, we can
train and test a random forest using the multi-class algorithm or a binary random forest
wrapped with OVR in Spark. However, the training time is different. For this reason, we
consider two evaluation parameters: the training time and the accuracy. Table 1 shows the
values of the hyperparameters for each algorithm that were used in our experiments.
Algorithm               Hyper-Parameters
Random Forest           numTrees = 30, maxDepth = 30, Impurity = Gini
Decision Tree           maxDepth = 5, Impurity = Gini
Gradient Boosted Tree   maxDepth = 5, Learning_rate = 0.1, Impurity = variance
SVM Linear              regParam = 0.1, kernel = Linear, HingeLoss
Naive Bayes             N/A
Logistic Regression     elasticNetParam = 0.8, penalty = Elasticnet
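The 70/30 stratified split and the F1-score used in this section can be sketched in plain Python. This is not the Spark code used in the experiments: the function names, the seed, and the toy labels are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Split sample indices so each class keeps the same train/test proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(train_fraction * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return train, test


def f1_score(y_true, y_pred, positive):
    """Harmonic mean of precision and recall for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical imbalanced labels: 100 majority and 10 minority datasamples.
labels = ["DDoS"] * 100 + ["Theft"] * 10
train_idx, test_idx = stratified_split(labels)
theft_in_train = sum(1 for i in train_idx if labels[i] == "Theft")

score = f1_score(
    ["Theft", "Theft", "DoS", "DoS"],
    ["Theft", "DoS", "DoS", "Theft"],
    positive="Theft",
)
```

Splitting per class guarantees that even a very small class such as Theft appears in both the training and the testing sets.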
As we can see in Figure 4, the Random Forest algorithm reaches 93.55% accuracy,
which is 5% greater than the Gradient Boosted tree. In addition, the Decision Trees results are
at least 10% lower than those of Random Forest. Compared with the Naive Bayes and Logistic
Regression results, Random Forest is around 40% more accurate.
Figure 4. Accuracy for Multi-class vs. Binary classifiers with OVR in Spark.
After comparing the accuracy, we conclude that the Random Forest algorithm, both as
multi-class and as binary with OVR, provides us with the best accuracy. However,
as we can see in Figure 5, the training time for Random Forest when we use the binary classifier
with OVR is five times longer. Thus, we conclude that the multi-class Random Forest algorithm
provides the best accuracy and an acceptable training time. Since the BoT-IoT dataset is
extremely imbalanced, the overall accuracy is high, but the accuracy in identifying
minority classes is low. In the results found, the F1-score for detecting Theft attacks is
41.83%. We can conclude that the number of datasamples for Theft attacks is not enough to
create a robust model. This problem is solved in the next section in which we use CTGAN
to generate more datasamples of minority classes.
Figure 5. Training time for Multi-class vs. Binary classifiers with OVR in Spark.
Finally, we compared the CTGAN results with the ADASYN and SMOTE oversampling
methods. We generated the same number of minority datasamples as in experiment 2.
The results of the comparison are shown in Figure 7. We can conclude that when we
generate more data samples with CTGAN, the Theft detection increases by around 50%. We
can see that ADASYN is better than SMOTE at generating new datasamples. However,
CTGAN generates better datasamples than ADASYN and thereby provides the best
classification results.
Figure 7. F1 measure after Oversampling. Comparison among SMOTE, ADASYN, and CTGAN.
We compare the best model obtained in the present paper with other related work in
Table 2. As we can see, references [5,6,15,17] use the shorter version of the dataset; thus,
their F1-score for detecting DoS and DDoS is higher than that of our approach because the
quantity of datasamples is much smaller and the classifiers can determine better boundaries.
However, our methodology can detect Theft attacks better than all the other related work as
shown in Table 2. In addition, our approach can detect Normal datasamples better than the other
approaches except [5], with only 2% difference. As we described before, comparing the
accuracy of our approach with papers that use the shorter version of the BoT-IoT dataset
is not fair since the shorter version does not contain enough statistical characteristics to
represent the population.
| Ref | Normal [%] | DDoS [%] | DoS [%] | Reconnaissance [%] | Theft [%] | Algorithm | Dataset |
|---|---|---|---|---|---|---|---|
| Shafiq et al. [6] | 75 | 98 | 100 | 81 | 93 | NB | Short version BoT-IoT |
| Soe et al. [15] | - | 99.9 | - | 99.9 | 98.18 | Random Forest | Short version BoT-IoT |
| Kumar et al. [5] | 100 | 100 | 100 | 100 | 93 | XGBoost | Short version BoT-IoT |
| Fatani et al. [17] | 60.7 | 99 | 99 | 99 | 85.7 | Aquila optimizer (AQU) | Short version BoT-IoT |
| Abushwereb et al. [25] | 71.8 | 99.9 | 99.13 | 88.83 | 23.2 | MLIB(RF) | Large version BoT-IoT |
| Our approach | 98 | 94 | 93 | 99.86 | 99 | Random Forest | Entire BoT-IoT dataset |
We can make a fair comparison only with reference [25], since the authors of that work
use the entire BoT-IoT dataset and the Hadoop–Spark framework to process the data.
Our approach achieves a considerably higher F1-score for detecting Normal traffic and Theft
attacks. It detects Normal datasamples with an F1-score around 26.2 percentage points
higher than [25]. We highlight that our approach is about 4 times more accurate at
detecting Theft attacks than reference [25], since we use CTGAN to generate realistic datasamples
for augmentation. Our approach is somewhat weaker at detecting DDoS and DoS attack samples;
nonetheless, the difference is not significant. The overall F1-score of our work is
96.77%, while for reference [25] it is 76.57%. If we compare the overall accuracy of our work
with the other papers, excluding reference [15] because it does not evaluate Normal and DoS,
only reference [5] surpasses our approach, by 1.8%; however, it uses the short version of the
dataset. We cannot compare our approach with [25] in terms of computational efficiency,
since the authors used a pre-configured model in Google Cloud that had eight VMs. In our
case, we implemented the Hadoop–Spark cluster in our lab, which has only 3 workers with
2 cores and 8 GB of RAM.
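As an arithmetic check of the comparison above, the sketch below recomputes the overall figures from the per-class values in Table 2. It assumes that the overall F1-score is the unweighted mean of the five per-class F1-scores; the text does not state this, although the assumption reproduces the reported values. The names `table2` and `mean_f1` are illustrative.

```python
# Per-class F1-scores [%] copied from Table 2, in the order
# Normal, DDoS, DoS, Reconnaissance, Theft.
table2 = {
    "Kumar et al. [5]": [100, 100, 100, 100, 93],
    "Abushwereb et al. [25]": [71.8, 99.9, 99.13, 88.83, 23.2],
    "Our approach": [98, 94, 93, 99.86, 99],
}

def mean_f1(scores):
    """Unweighted (macro) mean of per-class F1-scores (assumed definition)."""
    return sum(scores) / len(scores)

overall = {name: round(mean_f1(scores), 2) for name, scores in table2.items()}

ours = table2["Our approach"]
ref25 = table2["Abushwereb et al. [25]"]

normal_gain = round(ours[0] - ref25[0], 1)   # difference in percentage points
theft_ratio = round(ours[4] / ref25[4], 2)   # ratio of Theft F1-scores
gap_to_ref5 = round(overall["Kumar et al. [5]"] - overall["Our approach"], 1)

for name, value in overall.items():
    print(f"{name}: overall F1 = {value}%")
print(f"Normal gain over [25]: {normal_gain} points")
print(f"Theft F1 ratio vs [25]: {theft_ratio}x")
print(f"Gap to [5]: {gap_to_ref5} points")
```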
Finally, we ran the experiment five times with different seeds to select the training and
testing datasets using stratification. The results are shown in Figure 8. As we can see, the
results do not change much compared with Table 2. Considering the size of this
dataset, this was expected.
Figure 8. F1 measure after running five experiments with different seeds.
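The repeated-seed procedure can be sketched as follows. This is a minimal stdlib illustration of a stratified train/test split, not the Spark code used in the experiments. The function name `stratified_split`, the 20% test fraction, and the class counts are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices so each class keeps the same test proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # the seed controls which samples land in the test set
        n_test = max(1, round(len(indices) * test_fraction))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test

# Hypothetical, heavily imbalanced label vector (not the real class counts).
labels = ["Normal"] * 50 + ["DDoS"] * 900 + ["Theft"] * 10

# Five runs with different seeds, mirroring the five repetitions described above.
splits = [stratified_split(labels, test_fraction=0.2, seed=s) for s in range(5)]
for s, (train, test) in enumerate(splits):
    print(s, dict(Counter(labels[i] for i in test)))
```

Each run draws different samples, but the per-class proportions in the test set stay fixed. With a large dataset this is consistent with the small variation seen in Figure 8.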
Author Contributions: R.A.M.S. implemented the methodology and experiments. M.Z. collaborated
in the design of the methodology and was project leader. R.J. collaborated in the related work. K.N.
and N.G. were the project leaders and set the basics for the experiments. All authors collaborated in
the manuscript revisions. All authors have read and agreed to the published version of the manuscript.
Funding: This research was partly funded by a grant from the Natural Sciences and Engineering
Research Council of Canada, received by Kshirasagar Naik. In addition, it was funded by Cistech
Limited and University of Waterloo.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.
Data Availability Statement: For the analysis, we use the BoT-IoT dataset available at
https://cloudstor.aarnet.edu.au/plus/s/umT99TnxvbpkkoE. The data generated after the analysis were not
published.
Conflicts of Interest: The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
References
1. Cisco. Cisco Annual Internet Report (2018–2023); White Paper; Cisco: San Francisco, CA, USA, 2020.
2. Hung, M. Leading the IoT. Technical Report; Gartner Research, 2017. Available online:
https://www.gartner.com/imagesrv/books/iot/iotEbook_digital.pdf (accessed on 14 September 2022).
3. Soe, Y.; Feng, Y.; Santosa, P.; Hartanto, R.; Sakurai, K. Rule Generation for Signature Based Detection Systems of Cyber Attacks in
IoT Environments. Bull. Netw. Comput. Syst. Softw. 2019, 8, 93–97.
4. Filus, K.; Domańska, J.; Gelenbe, E. Random neural network for lightweight attack detection in the iot. In Proceedings of
the Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems, Nice, France, 17–19
November 2020; Springer: Berlin, Germany, 2020; pp. 79–91.
5. Kumar, P.; Gupta, G.P.; Tripathi, R. Toward design of an intelligent cyber attack detection system using hybrid feature reduced
approach for iot networks. Arab. J. Sci. Eng. 2021, 46, 3749–3778. [CrossRef]
6. Shafiq, M.; Tian, Z.; Sun, Y.; Du, X.; Guizani, M. Selection of effective machine learning algorithm and Bot-IoT attacks traffic
identification for internet of things in smart city. Future Gener. Comput. Syst. 2020, 107, 433–442. [CrossRef]
7. Khraisat, A.; Gondal, I.; Vamplew, P.; Kamruzzaman, J.; Alazab, A. A Novel Ensemble of Hybrid Intrusion Detection System for
Detecting Internet of Things Attacks. Electronics 2019, 8, 1210. [CrossRef]
8. Shyam, R.; HB, B.G.; Kumar, S.; Poornachandran, P.; Soman, K. Apache spark a big data analytics platform for smart grid.
Procedia Technol. 2015, 21, 171–178. [CrossRef]
9. Koroniotis, N.; Moustafa, N.; Sitnikova, E.; Turnbull, B. Towards the development of realistic botnet dataset in the Internet of
Things for network forensic analytics: Bot-IoT dataset. Future Gener. Comput. Syst. 2019, 100, 779–796. [CrossRef]
10. Ibitoye, O.; Shafiq, O.; Matrawy, A. Analyzing Adversarial Attacks against Deep Learning for Intrusion Detection in IoT Networks.
In Proceedings of the 2019 IEEE Global Communications Conference (GLOBECOM), Waikoloa, HI, USA, 9–13 December 2019;
pp. 1–6. [CrossRef]
11. Alsamiri, J.; Alsubhi, K. Internet of things cyber attacks detection using machine learning. Int. J. Adv. Comput. Sci. Appl. 2019, 10.
[CrossRef]
12. Ferrag, M.A.; Maglaras, L. DeepCoin: A Novel Deep Learning and Blockchain-Based Energy Exchange Framework for Smart
Grids. IEEE Trans. Eng. Manag. 2020, 67, 1285–1297. [CrossRef]
13. Manzano Sanchez, R.; Goel, N.; Zaman, M.; Joshi, R.; Naik, K. Design of a Machine Learning Based Intrusion Detection
Framework and Methodology for IoT Networks. In Proceedings of the 2022 IEEE 12th Annual Computing and Communication
Workshop and Conference (CCWC), Virtual, 26–29 January 2022; pp. 0191–0198. [CrossRef]
14. Xu, L.; Skoularidou, M.; Cuesta-Infante, A.; Veeramachaneni, K. Modeling Tabular data using Conditional GAN. In Proceedings
of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Wallach, H., Larochelle,
H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Volume 32.
15. Soe, Y.N.; Feng, Y.; Santosa, P.I.; Hartanto, R.; Sakurai, K. Towards a lightweight detection system for cyber attacks in the IoT
environment using corresponding features. Electronics 2020, 9, 144. [CrossRef]
16. Bagui, S.; Li, K. Resampling imbalanced data for network intrusion detection datasets. J. Big Data 2021, 8, 1–41. [CrossRef]
17. Fatani, A.; Dahou, A.; Al-Qaness, M.A.; Lu, S.; Elaziz, M.A. Advanced feature extraction and selection approach using deep
learning and Aquila optimizer for IoT intrusion detection system. Sensors 2021, 22, 140. [CrossRef] [PubMed]
18. Zixu, T.; Liyanage, K.S.K.; Gurusamy, M. Generative adversarial network and auto encoder based anomaly detection in
distributed IoT networks. In Proceedings of the GLOBECOM 2020—2020 IEEE Global Communications Conference, Taipei,
Taiwan, 7–11 December 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–7.
19. Ferrag, M.A.; Maglaras, L.; Ahmim, A.; Derdour, M.; Janicke, H. Rdtids: Rules and decision tree-based intrusion detection system
for internet-of-things networks. Future Internet 2020, 12, 44. [CrossRef]
20. Prabakaran, P.; Mohana, R.; Kalaiselvi, S. Enhancing the Cyber Security Intrusion Detection based on Generative Adversarial
Network. Elem. Educ. Online 2021, 20, 7401.
21. Ullah, I.; Mahmoud, Q.H. A Framework for Anomaly Detection in IoT Networks Using Conditional Generative Adversarial
Networks. IEEE Access 2021, 9, 165907–165931. [CrossRef]
22. Belouch, M.; El Hadaj, S.; Idhammad, M. Performance evaluation of intrusion detection based on machine learning using Apache
Spark. Procedia Comput. Sci. 2018, 127, 1–6. [CrossRef]
23. Haggag, M.; Tantawy, M.M.; El-Soudani, M.M. Implementing a deep learning model for intrusion detection on apache spark
platform. IEEE Access 2020, 8, 163660–163672. [CrossRef]
24. Morfino, V.; Rampone, S. Towards near-real-time intrusion detection for IoT devices using supervised learning and apache spark.
Electronics 2020, 9, 444. [CrossRef]
25. Abushwereb, M. An accurate IoT intrusion detection framework using Apache Spark. Ph.D. Thesis, Princess Sumaya University
for Technology, Amman, Jordan, 2020.
26. Rish, I. An empirical study of the naive Bayes classifier. In Proceedings of the IJCAI 2001 Workshop on Empirical Methods in
Artificial Intelligence, Seattle, WA, USA, 4–6 August 2001; Volume 3, pp. 41–46.
27. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [CrossRef]
28. Xu, L.; Skoularidou, M.; Cuesta-Infante, A.; Veeramachaneni, K. Modeling tabular data using conditional gan. Adv. Neural Inf.
Process. Syst. 2019, 32, 7335–7345.
29. Brandt, J.; Lanzén, E. A Comparative Review of SMOTE and ADASYN in Imbalanced Data Classification. 2021. Available online:
https://www.diva-portal.org/smash/get/diva2:1519153/FULLTEXT01.pdf (accessed on 14 September 2022).