Anomaly Detection On IoT Network Intrusion Using Machine Learning
Zhipeng Liu
Department of Computer Science
North Carolina Agricultural and Technology State University
Greensboro, USA
zliu2@aggies.ncat.edu

Niraj Thapa
Dept. of Computational Science & Engineering
North Carolina Agricultural and Technology State University
Greensboro, USA
nthapa@aggies.ncat.edu

Addison Shaver
Department of Computer Science
North Carolina Agricultural and Technology State University
Greensboro, USA
awshaver@aggies.ncat.edu
Abstract- Enhancing the security of IoT networks is one of the most crucial issues facing the information technology community. With IoT devices being developed and deployed at large scale, enabling these devices to communicate securely without compromising performance is challenging. The challenge exists because most IoT devices are limited by power and are therefore constrained in computational ability. Consequently, encryption and authentication are difficult to apply to fend off malicious cyber-attacks. An Intrusion Detection System (IDS) logically becomes the forefront security solution. Anomaly-based network intrusion detection plays a major role in safeguarding networks against different malicious activities. In this paper, we apply different machine learning algorithms to efficiently detect anomalies on the IoT Network Intrusion Dataset. The results show promise, as we were able to achieve 99%-100% accuracy while maintaining high efficiency.

Keywords-IoT security, anomaly detection, intrusion detection system, machine learning

I. INTRODUCTION
For the past decade, the world has witnessed an astounding increase in the connected Internet-of-Things. IoT connects all kinds of smart devices through the internet, creating an integration that allows various applications to operate in both the physical and cyber worlds. These applications can range from a simple smart home appliance to sophisticated equipment for an industrial plant of important infrastructure. An IoT based smart grid could potentially consist of up to millions of nodes, which may invite a focused cyber-attack due to the large attack surface. Such a focused attack could shut down the electricity grid. Considering that most public utilities, transportation services, and home devices run on electricity, the consequences of such attacks would be catastrophic. The rule of thumb is that once a single device is compromised, the rest of the grid becomes vulnerable to cyberattacks as well. Even worse, the chain effect of such attacks can halt entire cities and hence lead to unimaginable financial and economic losses [1].

Statista estimated a staggering 30.73 billion connected IoT devices in 2020, and the number will double in less than 4 years [2]. Forbes published an article stating that the world currently has a 20% adoption rate for IoT devices and expects a jump to 80% in less than a decade [3], which means 20% of all services offered around the globe involve technology related to IoT. With the massive increase in IoT device usage, many researchers have raised concerns over information privacy and security for IoT users [4]. In 2016, a series of Distributed Denial-of-Service attacks [5] targeting IoT networks and websites such as Twitter, Netflix and PayPal put the world on notice about how pressing it is to develop a strong security solution. Developing an IDS for IoT environments to mitigate these malicious attacks is crucial. An IDS designed for IoT networks should be able to analyze data, generate instant responses in real time, and adapt to different technologies in the IoT environment.

Many researchers have shown promising results in detecting network intrusions [6][7][8][9][10]; however, only a limited number target their research on IoT scenarios [11][12][13], and even fewer conduct their research on datasets that originate from IoT networks [14]. This barrier continues to exist because of obvious challenges [15]. Network complexity increases with IoT networks since many manufacturers use various transmission protocols [16], which makes applying encryption and authentication schemes difficult. Furthermore, the lack of publicly available IoT-specific datasets makes it harder for researchers to conduct experiments. The datasets most commonly used by researchers to design new IDS are the NSL-KDD dataset and the DARPA dataset [17]. The problem with these two datasets is that neither of them was designed to resemble an IoT network. Another issue is that both datasets were created more than a decade ago; as a result, neither reflects the behavior of current networks nor contains new cyber-attacks such as the Mirai botnet [18].

Recently, machine learning algorithms have shown promising performance in detecting abnormal and malicious activities in IoT networks. Alrashdi et al. conducted their research on designing an anomaly detection IDS for smart cities [19]. For binary classification, they were able to achieve a 99% Accuracy rate and a 99% Precision rate on the UNSW-NB15 dataset, which may not be designed to resemble an IoT network. The authors in [19] made the assumption that the network traffic mimics IoT network traffic flows [12]. Another limitation of this research seems to be the application of only one classifier, Random Forest (RF).
Authorized licensed use limited to: Staffordshire University. Downloaded on February 14,2024 at 19:08:20 UTC from IEEE Xplore. Restrictions apply.
In [14], Bakhtiar, Pramukantoro and Nihri applied the J48 algorithm to create a lightweight IDS for IoT middleware. Although they achieved an anomaly detection accuracy of 100% using self-collected data, on average their IDS could capture only 73.52% of the transmitted packets, and the average rate of accurate intrusion alerts was only 18.05%. Their experiment was capable of detecting only Denial-of-Service attacks.

The contribution of this paper is two-fold:
• Explore a new dataset, namely the IoT Network Intrusion Dataset [20], designed specifically using IoT devices
• Conduct experiments using multiple machine learning approaches and compare the performances

The rest of the paper is organized as follows. Section 2 introduces our methodology. In Section 3, we present our findings with analysis. Section 4 concludes the paper and discusses future work.

II. METHODOLOGY
In this research, we develop an anomaly-based IDS, since we differentiate anomalies from normal traffic flows based on the underlying network behavior. An advantage of this approach is that when a zero-day attack occurs, the signature of the attack will not be recognized, but the subsequent network behavior will deviate from normal traffic patterns, allowing the IDS to detect the anomaly and react [21]. We have applied five commonly used machine learning approaches to the dataset for real-time and accurate anomaly detection. The approaches include Logistic Regression (LR), Support Vector Machine (SVM), K-nearest Neighbors (KNN), RF, and XGBoost. We apply binary classification by classifying anomalies as 0 and normal packets as 1.

A. Dataset Used
In this paper, we train and test our models on the IoT Network Intrusion Dataset, created by a group of researchers from the Hacking and Countermeasure Research Lab [20]. This publicly available dataset is designed specifically to simulate an IoT environment. The research group created the dataset using mainly two common smart home devices, the SKT NUGU Model NU 100 AI Speaker and the EZVIS Wi-Fi Camera Model C2C Mini O Plus with 1080p. The two devices were connected through Wi-Fi with other smart phones and laptops to create the data. [...] number of pcap files contained overwhelmingly more attack packets than normal packets. The file size for the packets ranges from 476.8 KB to 133 MB. A dataset description file with filter rules is provided to help distinguish the attack packets from the benign packets. The dataset consists of 5 main categories and 11 sub-categories. The 5 categories include a category of normal transmissions and 4 groups of cyber-attacks. The 4 attack groups cover the most commonly seen attacks on IoT networks, which are Scanning attacks, Man in the Middle (MITM) attacks, Denial of Service (DoS) attacks, and Mirai Botnet attacks. Furthermore, 11 sub-categories are derived from the 5 main categories, including 1 sub-category for normal transmission and 10 unique types of attack. Details of these categories are provided in Table I. Each pcap file contains 7 features. The first feature is the sequence number of the packet. The second feature is Time, which is the transmission duration of the packet. The third feature is the source IP address of the packet. The fourth feature is the destination IP address of the packet. The fifth feature is the protocol in which the packet is transmitted. The sixth feature is the length of the packet, measured in bytes. The seventh feature is Info, which contains further details related to the packet. For privacy purposes, the wireless header information was removed using the Aircrack-ng tool suite. Mirai Botnet attacks were simulated and injected from a laptop and disguised as if the packets originated from the IoT devices. All other attack types were simulated using tools such as Nmap. The dataset was last updated on September 20, 2019.

B. Pre-processing
In order to conduct our experiments, we need to pre-process the raw data files. Since the rules to filter attacks from normal transmissions were provided, we perform the following steps to process the data. Firstly, we load each pcap data file, apply the corresponding filtering rule, and export the attack data to a comma-separated values (csv) file and the normal data to a separate csv file. Then we label the attacks, combine the 2 csv files and remove redundant data. We repeat the process for all 42 pcap files. Lastly, we concatenate all files into one file and fill all missing values, which are the labels for normal transmissions. We created a label column named Class in which attack packets are labeled 0 and normal packets are labeled 1. Before processing, the dataset is approximately 1 GB in size. After concatenation, the resulting csv data file is 339 MB in size. The csv data file is eventually 130 MB in size after labeling and dropping redundant data. Fig. 1 and Fig. 2 demonstrate the details of this process.
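The pre-processing steps described in this section can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the authors' actual pipeline: the column names, the `Attack` label column, the helper names (`read_rows`, `process_capture`, `build_dataset`) and the toy rows are all assumptions, and "redundant data" is interpreted here as exact duplicate rows.

```python
import csv
import io

# Hypothetical column names mirroring the 7 packet features described in the text.
FEATURES = ["No.", "Time", "Source", "Destination", "Protocol", "Length", "Info"]


def read_rows(csv_text):
    """Parse one exported csv (given here as a string) into a list of dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def process_capture(attack_csv, normal_csv, attack_name):
    """Label the attack rows, combine them with the normal rows, drop exact duplicates."""
    attack_rows = read_rows(attack_csv)
    for row in attack_rows:
        row["Attack"] = attack_name       # only attack packets get a label here
    normal_rows = read_rows(normal_csv)   # label stays missing for normal packets
    seen, combined = set(), []
    for row in attack_rows + normal_rows:
        key = tuple(row.get(col) for col in FEATURES)
        if key not in seen:               # assumption: redundant = duplicate row
            seen.add(key)
            combined.append(row)
    return combined


def build_dataset(captures):
    """Concatenate all captures, fill missing labels, add the binary Class column."""
    dataset = []
    for attack_csv, normal_csv, attack_name in captures:
        dataset.extend(process_capture(attack_csv, normal_csv, attack_name))
    for row in dataset:
        row.setdefault("Attack", "Normal")                     # fill missing labels
        row["Class"] = 1 if row["Attack"] == "Normal" else 0   # 0 = attack, 1 = normal
    return dataset


# Toy stand-ins for the csv files exported from one pcap capture.
header = ",".join(FEATURES)
attack_csv = (header + "\n"
              "1,0.10,10.0.0.5,10.0.0.9,TCP,60,SYN\n"
              "1,0.10,10.0.0.5,10.0.0.9,TCP,60,SYN\n")   # duplicate on purpose
normal_csv = header + "\n2,0.25,10.0.0.9,10.0.0.1,DNS,74,Standard query\n"

dataset = build_dataset([(attack_csv, normal_csv, "Scanning")])
for row in dataset:
    print(row["No."], row["Attack"], row["Class"])
```

In this toy run the duplicated attack packet is kept once, and the normal packet receives the filled-in label and Class 1. A real pipeline would read the exported files from disk and loop over all 42 captures.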
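[Table I: dataset categories and packet counts. Legible rows: Mirai Botnet Bruteforce, 1,924; Mirai Botnet HTTP Flooding, 10,464; Mirai Botnet UDP Flooding and Mirai Botnet ACK Flooding (counts illegible).]
[Pre-processing flow chart. Legible steps: label the attacks; combine the 2 csv files and remove redundant data.]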
[Pre-processing flow chart, continued. Legible steps: add a new binary class column 'Class'; for binary classification experiments, drop the 'Attack' column; for multiclass classification experiments, drop the 'Class' column.]

Fig. 3. Screenshot of pcap file for Benign packets

Fig. 4. Head view of dataset after pre-processing (columns: No., Time, Source, Destination, Protocol, Length, Info, Class)

III. EXPERIMENTAL RESULTS & DISCUSSION
In this section, we present our experimental setup and results.

A. Experimental Setup
The experiments were conducted on our Cyber Identity and Biometrics (CIB) lab workstation with the following configuration. The operating system was Ubuntu 18.04.4 LTS 64 bit. The workstation had 251.4 GB of RAM, a 2 TB hard drive, 4 RTX8000 graphics cards and an Intel Xeon W-2195 CPU @ 2.30GHz with 36 cores.

B. Training and Testing
As seen in Table I, 1,756,276 packets were labeled as normal transmission and 1,229,718 packets were labeled as attack data. Therefore, out of a total of 2,985,994 packets, normal packets contribute 58.82% of the overall data and malicious packets make up the remaining 41.18%. We used the concatenated file as data input in all of our experiments. For training and testing, we applied an unbiased split of the dataset in a 3:1 ratio. We used 75% of the data for training and 25% of the data for testing.

We evaluated the performance of our models based on the confusion matrix and other measures such as accuracy, recall, F1 score, max iterations, and run time. Accuracy is the number of correct predictions, both positive and negative, divided by the total number of predictions made: Accuracy = (TP + TN) / (TP + TN + FP + FN), where TP, TN, FP and FN denote True Positives, True Negatives, False Positives and False Negatives. Recall is the percentage of actual positives that our model captures by labeling them as positive: Recall = TP / (TP + FN). Precision captures how precise a model is, i.e., out of those predicted positive, how many are actually positive: Precision = TP / (TP + FP). The F1 score is a balance between precision and recall. Accuracy may sometimes be largely driven by a large number of true negatives. As a result, we may overlook the importance of false negatives and false positives. Therefore, the F1 score is valuable for finding a balance between precision and recall when the class distribution is uneven: F1 = 2 * (Precision * Recall) / (Precision + Recall).

[Table fragment. Trials / CPU Time: 25.24 s, 25.09 s, 25.55 s]

C. Results & Discussion
The experiments were carried out in 15 trials, 3 trials for each approach. Tables II, III, IV, V and VI show the results. Since we classified attack packets as 0 and normal packets as 1, a Type I error occurs when we identify a packet as malicious when that packet is really a normal packet. A Type II error occurs when we accept a packet as normal when in fact that packet is an attack packet. The optimal scenario is one in which Type II errors are minimized.

As can be seen from Table II, the Logistic Regression approach achieved an accuracy of 86%, a Recall rate of 75% and an F1 score of 80%. The performance is relatively lower than that of the other approaches. The outcomes do not change much when the max iterations parameter is modified from trial to trial. Fig. 3 shows that the number of Type I errors is more than double the number of Type II errors, while both errors have a relatively high number of cases.

Fig. 5. Confusion Matrix for Logistic Regression approach
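The metric definitions and the Type I / Type II error convention used in this section can be checked with a short pure-Python sketch. It is illustrative only: the function names and the toy labels are made up, and it assumes that the attack class (label 0) is treated as the positive class, which matches the Type I and Type II descriptions in the text but is not stated explicitly there.

```python
def confusion_counts(y_true, y_pred, positive=0):
    """Count TP, TN, FP, FN, treating `positive` (attack = 0, an assumption) as positive."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1          # attack correctly flagged
        elif pred == positive:
            fp += 1          # Type I error: normal packet flagged as attack
        elif truth == positive:
            fn += 1          # Type II error: attack packet accepted as normal
        else:
            tn += 1          # normal packet correctly accepted
    return tp, tn, fp, fn


def metrics(tp, tn, fp, fn):
    """Accuracy, recall, precision and F1 as defined in the text."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}


# Toy labels: 0 = attack, 1 = normal (the labeling used in the paper).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
scores = metrics(tp, tn, fp, fn)
print("TP, TN, FP (Type I), FN (Type II):", tp, tn, fp, fn)
print(scores)
```

Keeping the counts separate from the ratios makes it easy to report Type I and Type II errors alongside the scores, which is how the results are discussed above.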
TABLE III. SUPPORT VECTOR MACHINE

Trials   Accuracy   F1     Recall   Max Itr       CPU Time
1st      0.70       0.42   0.50     1000          11 min 12 s
2nd      0.72       0.45   0.52     1000          12 min 24 s
3rd      0.79       0.71   0.78     [illegible]   11 min 21 s

Table IV and Fig. 5 display the results for the KNN approach. For the first two trials, we consistently obtained a rate of 99% for Accuracy, Recall and F1 score. As seen from the table, the runtimes were relatively short, with an average of 2 and a half minutes. Modifying the number of neighbors did not have much impact on the results.
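To make the KNN experiment concrete, the following pure-Python sketch applies a 3:1 train/test split, classifies packets by majority vote among the k nearest training points, and records CPU time for several values of k. It is a toy illustration under stated assumptions, not the authors' implementation: the two synthetic features (a packet length and a time value), the cluster parameters and the helper names are invented for the example.

```python
import random
import time
from collections import Counter


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle and split the rows 3:1 (75% training, 25% testing)."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]


def knn_predict(train, point, k):
    """Majority vote among the k training points closest to `point` (squared Euclidean)."""
    nearest = sorted(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], point)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Synthetic, well-separated toy data: (length, time) features; 0 = attack, 1 = normal.
rng = random.Random(42)
data = [((rng.gauss(60, 10), rng.gauss(0.01, 0.005)), 0) for _ in range(200)]
data += [((rng.gauss(500, 50), rng.gauss(0.5, 0.1)), 1) for _ in range(200)]
train, test = train_test_split(data)

results = {}
for k in (1, 3, 5):
    start = time.process_time()                      # CPU time, not wall-clock time
    correct = sum(knn_predict(train, x, k) == y for x, y in test)
    results[k] = (correct / len(test), time.process_time() - start)
    print(f"k={k}: accuracy={results[k][0]:.2f}, cpu_time={results[k][1]:.3f} s")
```

On data this cleanly separated, changing k barely affects the accuracy, which is in line with the observation above that the number of neighbors had little impact. Brute-force search like this scales poorly, so a dataset of millions of packets would call for an optimized library implementation.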
TABLE IV. K-NEAREST NEIGHBOURS
[Table body illegible.]

TABLE V. RANDOM FOREST
[Table body illegible; legible CPU Time entries: 19 min 3 s, 19 min 36 s, 19 min 57 s.]

[Unattributed table row fragment: 1st, 0.97, 0.95, 0.95.]

Fig. 9. Confusion Matrix for XGBoost approach
[...] grow as many levels as needed. The runtimes, however, were the highest among all trials.

The results from the XGBoost approach are found in Table VI and Fig. 7. We kept the default values for all parameters. The trials achieved a reasonable Accuracy rate of 97%, a Recall score of 96% and an F1 score of 96% while taking less computational effort. The fastest runtime recorded, as indicated in Table VI, was 10.8 seconds for trial 2.

Overall, even though the RF approach gave us the highest metric scores, it took the highest computational effort. The SVM approach required a comparably large amount of computational resources and yet returned barely satisfactory results. While the Logistic Regression approach consumed less computation power, the accuracy of this approach is not satisfactory. The KNN and XGBoost approaches performed very well in terms of accuracy and other measures. For real-time detection, the XGBoost approach is preferred to the others.

IV. CONCLUSION
In this paper, we attempted to enhance IoT security by experimenting with anomaly detection on the IoT Network Intrusion Dataset using multiple machine learning approaches. We were able to achieve high accuracies while maintaining high efficiencies on the IoT Network Intrusion Dataset. We obtained the second highest accuracy of 99% using KNN, with an average runtime of roughly 2 and a half minutes. XGBoost showed a promising result with an accuracy of 97% in just 10.8 seconds of run time. F1 scores obtained via the different machine learning algorithms were consistent with the accuracies. Using all features in the dataset, our preliminary experimental results show great promise, and we wish to extend our work from binary classification to multi-class classification. We will attempt to normalize the data to overcome the low accuracy we obtained from the LR approach. As future work, we aim to provide a secure IoT framework built with an efficient, reliable and convenient IDS for smart environments.

ACKNOWLEDGMENT
This work is supported by Cisco Systems, Inc. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of Cisco Systems, Inc.

REFERENCES
[1] Kimani, K., Oduol, V., & Langat, K. (2019). Cyber security challenges for IoT-based smart grid networks. International Journal of Critical Infrastructure Protection, 25, 36-49.
[2] Statista. (2019). Internet of Things (IoT) connected devices installed base worldwide from 2015 to 2025 (in billions). Retrieved from https://www.statista.com/statistics/471264/iot-number-of-connected-devices-worldwide/
[3] Columbus, L. (2018). 2018 Roundup Of Internet Of Things Forecasts And Market Estimates. Comput. Commun., 54 (2014), pp. 1-31.
[4] Shakdhe, A., Agrawal, S., & Yang, B. (2019, May). Security Vulnerabilities in Consumer IoT Applications. In 2019 IEEE 5th Intl Conference on Big Data Security on Cloud (BigDataSecurity) (pp. 1-6).
[5] Kolias, C., Kambourakis, G., Stavrou, A., & Voas, J. (2017). DDoS in the IoT: Mirai and other botnets. Computer, 50(7), 80-84.
[6] Saadi, C., & Chaoui, H. (2019, April). Proposed security by IDS-AM in Android system. In 2019 5th International Conference on Optimization and Applications (ICOA) (pp. 1-7). IEEE.
[7] Vilela, D. W., Lotufo, A. D. P., & Santos, C. R. (2018, July). Fuzzy ARTMAP Neural Network IDS Evaluation applied for real IEEE 802.11w data base. In 2018 International Joint Conference on Neural Networks (IJCNN) (pp. 1-7). IEEE.
[8] Mustafa, H., Xu, W., Sadeghi, A. R., & Schulz, S. (2016). End-to-end detection of caller ID spoofing attacks. IEEE Transactions on Dependable and Secure Computing, 15(3), 423-436.
[9] Satria, D., & Ahmadian, H. (2018). Designing Home Security Monitoring System Based Internet of Things (IoTs) Model. Jurnal Serambi Engineering, 3(1).
[10] Tong, Z., & Ying, H. (2018, June). Application of frequent item set mining algorithm in IDS based on Hadoop framework. In 2018 Chinese Control And Decision Conference (CCDC) (pp. 1908-1911). IEEE.
[11] Shurman, M. M., Khrais, R. M., & Yateem, A. A. (2019, December). IoT Denial-of-Service Attack Detection and Prevention Using Hybrid IDS. In 2019 International Arab Conference on Information Technology (ACIT) (pp. 252-254). IEEE.
[12] Alrashdi, I., Alqazzaz, A., Aloufi, E., Alharthi, R., Zohdy, M., & Ming, H. (2019, January). AD-IoT: anomaly detection of IoT cyberattacks in smart city using machine learning. In 2019 IEEE 9th Annual Computing and Communication Workshop and Conference (CCWC) (pp. 0305-0310). IEEE.
[13] Mishra, A., & Dixit, A. (2018, July). Resolving threats in iot: Id spoofing to ddos. In 2018 9th international conference on computing, communication and networking technologies (ICCCNT) (pp. 1-7). IEEE.
[14] Bakhtiar, F. A., Pramukantoro, E. S., & Nihri, H. (2019, March). A Lightweight IDS Based on J48 Algorithm for Detecting DoS Attacks on IoT Middleware. In 2019 IEEE 1st Global Conference on Life Sciences and Technologies (LifeTech) (pp. 41-42). IEEE.
[15] Tabassum, A., Erbad, A., & Guizani, M. (2019, June). A Survey on Recent Approaches in Intrusion Detection System in IoTs. In 2019 15th International Wireless Communications & Mobile Computing Conference (IWCMC) (pp. 1190-1197). IEEE.
[16] Benkhelifa, E., Welsh, T., & Hamouda, W. (2018). A critical review of practices and challenges in intrusion detection systems for IoT: Toward universal and resilient systems. IEEE Communications Surveys & Tutorials, 20(4), 3496-3509.
[17] Ferrag, M. A., Maglaras, L., Moschoyiannis, S., & Janicke, H. (2020). Deep learning for cyber security intrusion detection: Approaches, datasets, and comparative study. Journal of Information Security and Applications, 50, 102419.
[18] Fruhlinger, J. (2018). The Mirai botnet explained: How teen scammers and CCTV cameras almost brought down the internet. CSO Online. Available at: https://www.csoonline.com/article/3258748/security/the-mirai-botnet-explainedhow-teen-scammers-and-cctv-cameras-almost-brought-down-the-internet.html [Accessed 23/08/18].
[19] N. Moustafa and J. Slay, "UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set)," 2015 Military Communications and Information Systems Conference (MilCIS), Canberra, ACT, 2015, pp. 1-6.
[20] Hyunjae Kang, Dong Hyun Ahn, Gyung Min Lee, Jeong Do Yoo, Kyung Ho Park, and Huy Kang Kim, "IoT Network Intrusion Dataset," http://ocslab.hksecurity.net/Datasets/iot-network-intrusion-dataset, 2019.
[21] Jyothsna, V. V. R. P. V., Prasad, V. R., & Prasad, K. M. (2011). A review of anomaly based intrusion detection systems. International Journal of Computer Applications, 28(7), 26-3.