1 s2.0 S0925231220319032 Main
Neurocomputing
journal homepage: www.elsevier.com/locate/neucom
Article info

Article history:
Received 9 April 2020
Revised 8 June 2020
Accepted 9 July 2020
Available online 19 December 2020

Keywords:
Cybersecurity
Artificial Neural Network
Machine learning

Abstract

Context and rationale: Intrusion Detection, the ability to detect malware and other attacks, is a crucial aspect to ensure cybersecurity. So is the ability to identify this myriad of attacks.
Objective: Artificial Neural Networks (as well as other machine learning bio-inspired approaches) are an established and proven method of accurate classification. ANNs are extremely versatile: a wide range of setups can achieve significantly different classification results. The main objective and contribution of this paper is the evaluation of the way the hyperparameters can influence the final classification result.
Method and results: In this paper, a wide range of ANN setups is put to comparison. We have performed our experiments on two benchmark datasets, namely NSL-KDD and CICIDS2017.
Conclusions: The most effective arrangement achieves the multi-class classification accuracy of 99.909% on an established benchmark dataset.
© 2020 Elsevier B.V. All rights reserved.
Michał Choraś and M. Pawlicki, Neurocomputing 452 (2021) 705–715
https://doi.org/10.1016/j.neucom.2020.07.138
0925-2312/© 2020 Elsevier B.V. All rights reserved.
Corresponding author: M. Choraś. E-mail address: [email protected]

1. Introduction

Every single day societies, businesses and citizens in person are threatened by a wide array of cyberattacks including malware, worms, trojan horses, spyware, SQLI, XSS, ransomware [1] and a significant variety of other hazards. The proliferation of cyberthreats can suggest that these malicious instances have become part of the daily routine of the contemporary citizen. At the beginning of 2018, a banking malware geared towards Android targeted unsuspecting bank app users [2]. The new BankBot strain was altered to such a degree that it was deemed safe by the Google Play Store antivirus protection, even though BankBot is a well-known malware. The trojan operated under the guise of what was a benevolent application at first glance, but once set up on an Android device, it proceeded to appropriate the bank's access credentials.

A well recognised case of cross-site scripting (XSS) took place when eBay turned out to be vulnerable to attack [3]. In 2014, JavaScript code was being included in costly items' listings. The user only had to click a malicious, but benign-looking listing to have the script seize control of their browser, become redirected to a site that looked exactly like eBay, and have their credit card credentials stolen. A recent report [4] suggests that the vulnerability has not been addressed after all those years.

Early 2018 was marked by a security violation that affected over 150 million users of a popular fitness app, MyFitnessPal. The media coverage of the breach tried to pass the event as "just another day on the Internet" [5]. With the prevailing risk of both new and known cybersecurity threats, it is very tempting to just nod in agreement with that assertion.

The wide range of cybersecurity violations resulted in propelling research on an array of different detection methods. Two main trends of research and development emerged, namely signature-based and anomaly-based. Signature-based IDS (Intrusion Detection Systems) operate by utilising a storehouse of recognised attacks, while the anomaly-based methods form a model of 'normal' traffic and go into alert with every divergence from the model [6]. The black-hat society utilises numerous obfuscation techniques to deceive the signature-based detectors. According to a recent analysis, known malevolent software can be made absolutely undetectable for contemporary anti-malware applications [7]. Frequently, research concerning the use of neural networks in intrusion detection offers arbitrarily chosen topologies and hyperparameter setups, disregarding the immense influence the hyperparameters can have over the achieved accuracy. The first aim of this research is to highlight the untapped power hidden in setups that often remain untested. Secondly, the best possible
setup for the used intrusion detection benchmarks is sought after. Thirdly, the topology and hyperparameter mixes that come close to the best setup are noted to find the smallest topologies which require the least computational power. The last two points should be considered as the main motivation for this work, as the findings will be used to achieve better intrusion detection in a deployable system. The stated motivation suggests the following research question: what mix of hyperparameters and topology settings can provide the highest accuracy for an ANN classifier used on benchmark IDS datasets?

To achieve those abovementioned objectives, a hyperparameter tuning procedure is proposed and employed. The procedure, called gridsearch, searches through hyperparameter space, exhaustively generating candidates of the settings of the neural network and ranks them according to a chosen scoring function. The results of that procedure are gathered, listing the results the selected hyperparameter mixes have achieved for a given topology. The results highlight the importance of optimising the topology and hyperparameters for data in the cybersecurity domain. As noted in [8], hyperparameter optimisation is necessary to find the best possible fit for a given dataset. A range of hyperparameter optimisation methods exist, including methods employing evolutionary approaches, particle swarm optimisation and even random searches. This research used gridsearch for its transparency, and the possibility to guide the experiments into most promising areas manually [8].

This paper is structured as follows: in Section 2 an overview of related work in the domain with the emphasis on ANN for intrusion and malware detection is provided. There are parts of the ANN setup that cannot be inferred from data in the way parameters like weights and biases are. Those features of the ANN, among them the activation function, or the optimizer, have to be put in manually. They are collectively referred to as "hyperparameters" [9]. In Section 3, the proposed method based on ANN is described in detail, with the detailed description of the pipeline and chosen hyperparameters. In Section 4, the experimental setup and results are given, while the threats to validity and conclusions are placed thereafter in Sections 5 and 6, respectively.

2. Related work

2.1. Artificial neural networks in intrusion detection

Cybersecurity is an immensely broad topic, with different measures designed to counter different attack vectors [10]. The application of Artificial Neural Networks (ANN) for intrusion detection systems (IDS) and malware detection is hardly a new concept. There have been evaluations of the notion of using ANN to aid anomaly detection and malware detection as far back as 2009 [11]. In [12], the authors attempt to address the problems of overfitting, high memory consumption and high overhead of standard IDS/malware detection with a feed-forward ANN. Specifically, a 2-layered feed-forward ANN was recommended. The aforementioned problems were handled through a conjugated training function and validation dataset. The authors claim that their method achieves similar results to classical procedures, but with less computational overhead. The procedure was tested on the benchmark KDD'99 dataset. The conclusion of the paper states that less data is better because of the time the machine needs to crunch it. In [13], pruning of the ANN is evaluated as part of the optimisation of the network. It is basically the deletion of neural nodes of either the input or the hidden layers. This makes the ANN faster, as fewer computations have to be processed. In [14], an Artificial Neural Network also showed promise as an IDS when evaluated. In fact, the results were very encouraging.

In [15] Principal Component Analysis (PCA) is employed as a feature extractor, before feeding the data to the ANN, as opposed to providing the inputs directly from the dataset. As the article illustrates, this drops the memory requirements of the method significantly, along with the time of training necessary. The two evaluated methods displayed comparable results as far as the accuracy is concerned. This makes applying PCA clearly the better option. Using a Kernel PCA betters the training time of ANN, but uses significantly more memory than the traditional PCA. Both methods have similar accuracy measures, so the authors of [16] conclude that using a mix of different algorithms is preferable. There has been research on utilising Graphics Processing Units to accelerate ANN based IDS, since GPUs are a good fit for ANN computations. An increase in performance has been proven [17]. The authors of [18] evaluate an ANN with one hidden layer in comparison with a Support Vector Machine, a Naive Bayes and a C4.5 algorithm. The ANN achieves comparable, or better results in malware detection, but thanks to the simpler nature of a 3-layer ANN framework requires fewer computations than other tested algorithms. The experiments were performed on the NSL-KDD dataset, which is the current benchmark, and the successor of KDD'99.

In [19] the authors use a convolutional neural network to extract spatial features and Bi-directional long short-term memory to extract temporal features; they name their approach the deep hierarchical network model. The solution is used on the NSL-KDD and UNSW-NB15 datasets, the well known cybersecurity benchmarks. The training dataset is prepared with the use of one-sided selection and synthetic minority oversampling technique to first reduce noise in the benign class and oversample the minority classes, forming balanced datasets. The authors report 83.58% accuracy on NSL-KDD and 77.16% on UNSW-NB15. In [20] the authors aim to resolve the problem of high dimensionality and the amount of noise in cybersecurity data. To this end, they employ a combination of deep belief network (DBN) with feature-weighted support vector machine (WSVM). The DBN is trained with the use of an adaptive learning rate, and is used as a feature extractor. Then the features are inputted to a particle-swarm-optimized WSVM. The solution is tested on NSL-KDD, achieving the accuracy of 85.73% for binary classification. The authors of [21] propose a model they call BAT, which mixes a Bidirectional Long Short-term memory network with an attention mechanism. The attention mechanism is utilised to scan the features obtained by the BLSTM. This is inputted to a CNN with multiple convolutional layers, achieving a model that does not require feature engineering. The approach achieves 85.25% accuracy.

3. The proposed method based on artificial neural network

Artificial Neural Networks (ANN) are an all-purpose utility for modelling. The initial concept [22] found a variety of modifications, like Convolutional Neural Networks (CNN) [23], Radial Basis Function Networks [24], Radial Basis Probabilistic Function Neural Networks [25,26], Recurrent Neural Networks (RNN) [27] and many more. With a myriad of applications, including Natural Language Processing [28], Biometrics [29], finding polynomial roots [30–32], intrusion detection [33] and many more, they are an accepted and renowned tool for data mining, with classification, regression, clustering and time series analysis abilities. The basic assumption of an ANN is that it imitates, to a certain extent, the learning competencies of a biological neural network, stressing by principle the properties of neural networks found in human brains, although strongly streamlined [34].

The surprising modelling capacity of ANN in pattern recognition derives from its strong malleability as it fits to data. This extensive approximation capacity is markedly important when handling
real-world data, when the information is plentiful, but the patterns buried in it remain uncovered. The optimisation of the setup can play an important part in the results the setup achieves [35].

In an ANN, knowledge is gained through updating weights with consecutive batches of data instances. The algorithm can recognise the associations among the variables, as well as generalise in a way that allows for high performance on new, unforeseen data [36]. It is basically like fitting a line, or a plane, or a hyper-plane through a set [37].

An artificial neural network with a sole computational layer is dubbed a perceptron. It consists of an input and an output computational layer. After the data points are fed to the input layer, they are issued to the computational layer. The input layer contains $d$ nodes that represent $d$ features $X = [x_1 \ldots x_d]$ and edges of weight $W = [w_1 \ldots w_d]$. The output neuron computes $W \cdot X = \sum_{i=1}^{d} w_i x_i$. In case of the perceptron, the forecast is binary, and is delineated by the sign of the value that is the result of the output layer computation. To help deal with distribution imbalance, bias can be added.

The prediction $\hat{y}$ is the result of the following equation:

$$\hat{y} = \operatorname{sign}\{W \cdot X + b\} = \operatorname{sign}\Big\{\sum_{i=1}^{d} w_i x_i + b\Big\}$$

As seen in the equation, the sign is the activation function $\Phi(v)$. Numerous activation functions can be utilised in artificial neural networks with multiple hidden layers. For easier training, it is commonly either the Rectified Linear Unit (ReLU) or Hard Tanh in multilayered networks. The error of the regression can be indicated as the difference between the real-life test value and the predicted value, so $E(X) = y - \hat{y}$. If the error is not equal to 0, the weights should be amended. Thus, the purpose of the perceptron is to minimise the least-squares between $y$ and $\hat{y}$, for all data points in dataset $D$. This objective is named the loss function.

$$\sum_{(X,y) \in D} \big(y - \operatorname{sign}\{W \cdot X\}\big)^2$$

The loss function is defined over the whole dataset $D$, the weights $W$ are updated with the learning rate $\alpha$, and the algorithm iterates over the entire dataset until it converges. This algorithm was named stochastic gradient-descent, also expressed by:

$$W \Leftarrow W + \alpha E(X) X$$

[38]

A multi-layer neural network is created via multiple computational layers, also named the hidden layers. The title itself hints to the black-box character of those layers, as the computations are shrouded from the user's perspective. The data points are carried from the input layer to subsequent layers with computations at every stage, down to the output layer.

The aforementioned procedure is referred to as the feed-forward neural network [38]. The exact count of nodes in the foremost computational layer usually does not reach the count of nodes of the input layer. The particular number of neurons and the number of hidden layers depends on the intricacy of the necessary model and on the data [36]. While in some special cases utilising a fully-connected layer is the norm, the use of hidden layers with the count of neurons below that of the input's grants a loss in representation, which oftentimes increases the network's performance. This is very likely the result of getting rid of the noise in data [38].

A network built with too many neurons can display unwanted behaviour known as overfitting, also named overtraining. This particular phenomenon happens when the artificial neural network fitted the exact patterns found in the training dataset so tightly, that it experiences significant difficulty in performing on unforeseen data, as the approximation is not sufficiently generalised [38].

In this work the influence of the number of hidden layers as well as the number of neurons in those hidden layers on the ANN performance has been subjected to scrutiny (in addition to other hyperparameters). While they can be considered as topology or network structure, those aspects could also be considered hyperparameters.

3.1. The usage of backpropagation

Training a single-layer perceptron is straightforward: the loss function is a function of the weights. With multiple layers, the procedure becomes more demanding as it makes many layers of weights influence one another. Backpropagation calculates the Error Gradient as the sum of local-gradients over multiple paths to the output node [38]. The algorithm consists of two phases: the forward and the backward phase. In the forward phase, the data points are served to the input nodes, and one after another the results at consecutive layers are computed with the current weights. The result of this prediction is compared to the training instance. The backward phase uncovers the gradient of the loss function for all the weights. The gradients update the weights, starting from the output layer, stepping back all the way to the first layer. This weight updating process iterates over the training data; each iteration is called an epoch. ANNs can often necessitate thousands of those iterations to attain convergence.

3.2. The used technologies: TensorFlow and Keras

In this article TensorFlow has been used, which is a high performance, open source library, provided by the developers and engineers of the Google Brain team. It serves as a capable support for machine and deep learning, and it is currently implemented in an array of scientific and industry applications [39]. Keras, which operates on top of TensorFlow and a myriad of other machine learning libraries, brings an astounding speed of experimentation along with incomparable user experience. This is attained via modular, expandable design. Keras was brought into existence more as an interface than an autonomous library. Keras received full support in the TensorFlow library, and enables intuitive coding of both machine and deep learning procedures [40].

3.3. Improving the selected algorithms with hyperparameter optimization

One of the most important parts of the Artificial Neural Network design comes in the role of the activation function, as the effect it carries over the achievable results is straightforward. The network can accommodate diverse types of activation functions. The decision on the type of an activation function plays a crucial role especially in the multi-layer networks, as each layer can have its own non-linear activation function [38]. Each distinctive function can have a special influence on the results of the ANN, as well as how the ANN converges, and the comprehensive nature of the network. Out of a wide range of activation functions $\Phi(v)$, the four most often appearing in the current literature were selected:

Sigmoid
Hard Sigmoid
Rectified Linear Unit (ReLU)
Hyperbolic Tangent (tanh)

The optimal network setup is found by using a grid search procedure, which completes an all-encompassing search over the hyperparameter space. The grid search parameters included:
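As an illustration only, the exhaustive search described in this section can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the grid values are assumed from the activation functions, optimizers and batch sizes that appear in the result tables, and `toy_score` is a hypothetical stand-in for training a network and returning its accuracy (its numbers are arbitrary and are not experimental results).

```python
from itertools import product

# Assumed grid: values taken from the columns of the result tables.
GRID = {
    "activation": ["sigmoid", "hard_sigmoid", "relu", "tanh"],
    "optimizer": ["adam", "rmsprop", "SGD"],
    "batch_size": [10, 100],
}


def enumerate_grid(grid):
    """Yield every combination of the grid values as a dict."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def grid_search(grid, score_setup):
    """Score every candidate and return (score, setup) pairs, best first."""
    results = [(score_setup(setup), setup) for setup in enumerate_grid(grid)]
    results.sort(key=lambda item: item[0], reverse=True)
    return results


def toy_score(setup):
    """Hypothetical scorer; a real one would train the ANN and return accuracy."""
    base = {"tanh": 0.9, "relu": 0.8, "hard_sigmoid": 0.6, "sigmoid": 0.5}
    bonus = {"adam": 0.05, "rmsprop": 0.04, "SGD": 0.0}
    return (base[setup["activation"]] + bonus[setup["optimizer"]]
            - setup["batch_size"] / 10000)


ranked = grid_search(GRID, toy_score)
best_score, best_setup = ranked[0]
print(len(ranked), best_setup)
```

With four activation functions, three optimizers and two batch sizes the grid holds 24 candidates per topology, which matches the number of rows in each result table. Replacing `toy_score` with a function that builds, trains and evaluates a network turns the sketch into an actual search.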
Table 3
2 hidden layers, 25 neurons each, 300 epochs, NSL-KDD dataset.

Table 4
4 hidden layers, 25 neurons each, 300 epochs, NSL-KDD dataset.

Table 5
7 layers, 25 neurons, 300 epochs, NSL-KDD dataset.

Accuracy  Activation    Optimizer  Batch_size
0.968288  tanh          rmsprop    10
0.966866  tanh          adam       10
0.964502  relu          rmsprop    10
0.962138  relu          adam       10
0.958355  tanh          adam       100
0.956455  relu          adam       100
0.955988  relu          rmsprop    100
0.955514  tanh          rmsprop    100
0.95125   tanh          SGD        10
0.946037  relu          SGD        10
0.902984  tanh          SGD        100
0.884989  hard_sigmoid  adam       10
0.879798  sigmoid       rmsprop    10
0.876011  hard_sigmoid  rmsprop    10
0.870312  sigmoid       adam       10
0.8495    relu          SGD        100
0.849028  sigmoid       rmsprop    100
0.842881  hard_sigmoid  rmsprop    100
0.782832  hard_sigmoid  adam       100
0.640515  sigmoid       adam       100
0.497387  sigmoid       SGD        10
0.497387  hard_sigmoid  SGD        10
0.497387  sigmoid       SGD        100
0.497387  hard_sigmoid  SGD        100

The first table (Table 1) presents how the achieved accuracy fluctuated in the smallest of the evaluated topologies. The best result, which came from combining the sigmoid activation function, the ADAM optimizer and a batch size of 10, with 300 epochs, reached almost 99.9%. That constitutes the fourth best result in the summary table (Table 7).

The second table (Table 2) presents the accuracies and hyperparameter mixes for one hidden layer with 25 neural nodes. The number of epochs is 300. The best accuracy in this table exceeds 99.9%, and is at the same time the best of the accuracies attained in all of the experiments (Table 7). This particular mix of hyperparameters attracted the authors' attention. The hyperbolic tangent activation function is a sigmoidal function and is often omitted in contemporary neural networks (with ReLU being the go-to function). The newest deep learning models usually use micro-batches of 1 or 2. Thus, a mix of tanh, ADAM and batch size of 100 is a very interesting option, especially for such a small topology.

The third table (Table 3) presents the hyperparameter mixes for a topology of two hidden layers of 25 neurons each, trained for 300 epochs. The best result for this setup was almost exactly 99.9% with the ReLU activation function, ADAM optimizer and batch size of 10. It is worth noticing that the eight best results in this table exceed 99.9% accuracy, and all the results in this table and in the preceding tables (Tables 1 and 2) exceed 98.9%.

In Table 4 the finest accuracy exceeded 99.89%. In this table for the first time some of the accuracies fall below 90%, with the worst performer falling below 6%. This trend will continue for more complex topologies.

In Table 5 the best accuracy was 96.83%, with a mix of tanh, rmsprop and batch size of 10. Half of the setups did not reach 90%. One more time the hyperbolic tangent takes the first spot as one of the contributors of the best accuracy. In the summary table (Table 7) the tanh function appears in six out of ten different setups for ten best evaluated topologies.

The accuracies collected in Table 6 illustrate that increasing the depth of the neural network does not bring any additional benefit for NSL-KDD. The four worst results did not even reach 50% accuracy. The highest-scoring setup used tanh and ADAM. The ADAM optimizer appears in seven out of ten best setups in Table 7.

Out of the best performing topologies the highest scoring hyperparameter setups have been collected in Table 7. As noted in preceding paragraphs, the tanh activation function and the ADAM optimizer contribute to the best results more frequently than their counterparts. Rather surprisingly, the smaller setups lead to better results than the deeper architectures. The findings of this part of the research will be elaborated upon more fully in the statistical significance section.

In a batch of experiments, the algorithm has been applied to the CICIDS2017 dataset. The initial results found in Tables 8 and 9 were very encouraging, with the accuracy exceeding 99% (0.9936). The recall of 0.54 and f1-score of 0.70 in one of the classes signified that there might be a balancing problem in the dataset. A closer inspection revealed that there are over 43000 records of benign netflows in the set, but only slightly over 1300 attack records. This is shown in Table 8 in the 'Support' column. To counteract that, the majority class was randomly subsampled with the number of samples matching the sum of the attack records. The initial results for the balanced dataset are represented in Table 9. The accuracy of the procedure on the balanced dataset exceeded 97% over the test set, and 95% mean accuracy in the 10-fold cross-validation. The results of the setup over the full set are displayed in Table 13.

Table 6
10 layers, 25 neurons, NSL-KDD dataset.

Table 7
Summary of best hyperparameter setups for ten best topologies on NSL-KDD.

Table 8
CICIDS2017 initial results, Tuesday subset.

Table 9
Balanced CICIDS2017 initial results, Tuesday subset.

Table 10
The results of other ML methods over the CICIDS2017 dataset, Tuesday subset.

5.1. Comparison to other state of the art machine learning algorithms

In this section the performance of other ML approaches is presented. To place the performance of the illustrated approach in context, tests were performed using a Support Vector Machine with gaussian kernel (SVM), the Naive Bayes classifier and ADABoost. Table 10 illustrates the differences among the results the classifiers have achieved with the use of CICIDS2017 Tuesday data. In Table 12 the results the Naive Bayes classifier has achieved on the full CICIDS2017 dataset have been showcased, with the precision, recall and f1-score of each of the classes found in the dataset. The support column signifies the number of instances of the particular classes in the dataset. The table is further extended to show the results of other methods over the whole CICIDS2017
set. After examining the results of the classifiers over all the classes it is apparent that the samples contained in the Tuesday dataset constitute an easier task for ML classifiers than the dataset as a whole.

5.2. The influence of hyperparameter optimisation

Hyperparameter optimisation is performed on each of the setups. The gridsearch method evaluates each of the possible permutations of the selected hyperparameters. Namely, the used epochs count, the batch size, the optimiser and the activation function are consecutively permutated in order to achieve the highest accuracy. Tables 1–6 illustrate the way the accuracy fluctuates on various ANN setups. Table 4 displays the results of the gridsearch procedure of an ANN with 4 hidden layers, 25 neurons on each layer. Table 2 shows how gridsearch performed for a 1-hidden layer, 25-neural nodes ANN. The remaining Tables 3 and 1 illustrate gridsearch of ANN with 2 hidden layers and 25 neurons, and 1 hidden layer and 10 neurons, respectively. The optimal setup for the algorithm in the CICIDS2017 case has been established with the gridsearch procedure as well (Table 13). A sample of the results over the whole dataset can be seen in Tables 11–13. It is apparent that the results acquired with different parameter setups vary greatly, just as it was in the case of NSL-KDD.

Table 11
4 hidden layers, 25 neurons each, 30 epochs, CICIDS2017 dataset.

Accuracy  Activation    Batch Size  Optimizer
0.976983  relu          50          rmsprop
0.976742  relu          100         rmsprop
0.975176  relu          50          adam
0.974252  relu          100         adam
0.973408  hard_sigmoid  50          rmsprop
0.972645  hard_sigmoid  100         rmsprop
0.972444  hard_sigmoid  50          adam
0.971922  hard_sigmoid  100         adam
0.970597  relu          50          SGD
0.970516  sigmoid       50          adam
0.969793  sigmoid       50          rmsprop
0.96911   sigmoid       100         adam
0.965575  sigmoid       100         rmsprop
0.958184  relu          100         SGD
0.930227  sigmoid       50          SGD
0.708456  hard_sigmoid  50          SGD
0.585017  sigmoid       100         SGD
0.499819  hard_sigmoid  100         SGD

Table 13
4 hidden layers, 25 neurons each, CICIDS2017 dataset top setup.

Class         Precision  Recall  f1-score  Support
0             1.00       0.98    0.99      162154
1             0.50       0.63    0.56      196
2             1.00       1.00    1.00      12803
3             0.98       0.98    0.98      1029
4             0.90       0.99    0.95      23012
5             0.90       0.99    0.94      550
6             0.97       0.98    0.97      580
7             0.99       0.98    0.98      794
8             1.00       1.00    1.00      1
9             1.00       1.00    1.00      15880
10            0.97       0.49    0.65      590
11            0.59       0.23    0.33      301
12            0.00       0.00    0.00      4
13            0.80       0.03    0.06      130
Accuracy      0.977
Macro avg     0.83       0.73    0.74      218024
Weighted avg  0.99       0.98    0.98      218024

5.3. Statistical analysis

The standard and widely used Wilcoxon Signed-Rank Test was utilised to evaluate the statistical significance of this research. The best performing setups for NSL-KDD are gathered in Table 7. All those setups were tested against the top-ranking setup and the test did not find statistical significance (with p-values ranging from 0.1829 to 0.4764), leading to the conclusion that the topology of the neural network does not bear a significant impact on the accuracy if the hyperparameters are chosen correctly for this dataset. However, different hyperparameter setups impacted the results of the chosen topologies immensely. For example, the best and worst setups for the 4 layers and 25 neurons topology, when subjected to the Wilcoxon test, reject the null hypothesis with p = 0.0093, thus the top setup is clearly the better option.

5.4. Lessons learned

The gathered results and the investigation of the statistical significance of the results achieved by particular setups illustrated that neither increasing the depth of neural networks, nor inflating the number of neurons in layers contributes to increasing the ANN accuracy for the evaluated benchmark dataset, beyond a certain
Table 12
CICIDS2017 – reference classifiers.
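The paired comparison used in the statistical analysis (Section 5.3) can be sketched as follows. This is a minimal, dependency-free illustration of an exact two-sided Wilcoxon signed-rank test. It is not necessarily the implementation the authors used, and the per-fold accuracies in the usage example are hypothetical values made up for the sketch, not results from the experiments.

```python
from itertools import product


def average_ranks(values):
    """Rank values by absolute size (1 = smallest); ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: abs(values[i]))
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and abs(values[order[end + 1]]) == abs(values[order[start]])):
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def wilcoxon_signed_rank(first, second):
    """Return (statistic, exact two-sided p-value) for two paired samples."""
    # Pairs with a zero difference carry no information and are dropped.
    diffs = [a - b for a, b in zip(first, second) if a != b]
    if not diffs:
        raise ValueError("all paired differences are zero")
    ranks = average_ranks(diffs)
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    statistic = min(w_plus, w_minus)
    total = sum(ranks)
    # Under the null hypothesis each rank is equally likely to carry either
    # sign, so all 2^n sign patterns are enumerated (fine for a few folds).
    extreme = 0
    for signs in product((0, 1), repeat=len(ranks)):
        plus = sum(r for r, s in zip(ranks, signs) if s)
        if min(plus, total - plus) <= statistic:
            extreme += 1
    return statistic, extreme / 2 ** len(ranks)


# Hypothetical per-fold accuracies of two setups over 10 folds.
setup_a = [0.991, 0.992, 0.993, 0.994, 0.995, 0.996, 0.997, 0.998, 0.999, 0.990]
setup_b = [0.990, 0.990, 0.990, 0.990, 0.990, 0.990, 0.990, 0.990, 0.990, 0.980]
statistic, p_value = wilcoxon_signed_rank(setup_a, setup_b)
print(statistic, p_value)
```

A small p-value means the paired differences are unlikely to be centred on zero, so the null hypothesis of no difference between the two setups is rejected. The enumeration grows as 2^n, so for larger samples a library routine with a normal approximation would be the more practical choice.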
References

[1] G. McGraw, G. Morrisett, Attacking malicious code: a report to the infosec research council, IEEE Softw. 17 (5) (2000) 33–41, https://doi.org/10.1109/52.877857.
[2] A. Bielec, Analysis of a Polish BankBot. https://www.cert.pl/en/news/single/analysis-of-a-polish-bankbot/.
[3] L. Kelion, eBay redirect attack puts buyers' credentials at risk. http://www.bbc.com/news/technology-29241563.
[4] P. Mutton, Hackers still exploiting eBay's stored XSS vulnerabilities in 2017. https://news.netcraft.com/archives/2017/02/17/hackers-still-exploiting-ebays-stored-xss-vulnerabilities-in-2017.html.
[5] D. Lee, MyFitnessPal breach affects millions of Under Armour users. http://www.bbc.com/news/technology-43592470.
[6] N. Idika, A. Mathur, A Survey of Malware Detection Techniques, Purdue University.
[7] G. Canfora, A. Di Sorbo, F. Mercaldo, C.A. Visaggio, Obfuscation techniques against signature-based detection: a case study, Mobile Syst. Technol. Workshop (MST) 2015 (2015) 21–26, https://doi.org/10.1109/MST.2015.8.
[8] M. Feurer, F. Hutter, Hyperparameter Optimization, Springer International Publishing, Cham, 2019, pp. 3–33. doi:10.1007/978-3-030-05318-5_1.
[9] S. Skansi, Introduction to Deep Learning: From Logical Calculus to Artificial Intelligence, Springer, 2018.
[10] M. Choraś, R. Kozik, Machine learning techniques applied to detect cyber attacks on web applications, Logic J. IGPL 23 (1) (2015) 45–56, https://doi.org/10.1093/jigpal/jzu038.
[11] Y. Sani, A. Mohamedou, K. Ali, A. Farjamfar, M. Azman, S. Shamsuddin, An overview of neural networks use in anomaly intrusion detection systems, IEEE Student Conference on Research and Development (SCOReD) 2009 (2009) 89–92, https://doi.org/10.1109/SCORED.2009.5443289.
[12] F. Haddadi, S. Khanchi, M. Shetabi, V. Derhami, Intrusion detection and attack classification using feed-forward neural network, Second International Conference on Computer and Network Technology 2010 (2010) 262–266, https://doi.org/10.1109/ICCNT.2010.28.
[13] W. Gong, W. Fu, L. Cai, A neural network based intrusion detection data fusion model, in: 2010 Third International Joint Conference on Computational Science and Optimization, vol. 2, 2010, pp. 410–414. doi:10.1109/CSO.2010.62.
[14] I. Mukhopadhyay, M. Chakraborty, S. Chakrabarti, T. Chatterjee, Back propagation neural network approach to intrusion detection system,
[30] D.-S. Huang, H.H.-S. Ip, K.C.K. Law, Z. Chi, Zeroing polynomials using modified constrained neural network approach, IEEE Trans. Neural Networks 16 (3) (2005) 721–732.
[31] D.-S. Huang, H.H. Ip, Z. Chi, A neural root finder of polynomials based on root moments, Neural Comput. 16 (8) (2004) 1721–1762.
[32] D.-S. Huang, A constructive approach for finding arbitrary roots of polynomials by neural networks, IEEE Trans. Neural Networks 15 (2) (2004) 477–491.
[33] J. Ryan, M.-J. Lin, R. Miikkulainen, Intrusion detection with neural networks, in: Advances in Neural Information Processing Systems, 1998, pp. 943–949.
[34] O. Maimon, L. Rokach, Data Mining and Knowledge Discovery Handbook, second ed., 2010.
[35] D.-S. Huang, J.-X. Du, A constructive hybrid structure optimization methodology for radial basis probabilistic neural networks, IEEE Trans. Neural Networks 19 (12) (2008) 2099–2115.
[36] I.N. da Silva, D.H. Spatti, R.A. Flauzino, L.H.B. Liboni, S.F. dos Reis Alves, Artificial Neural Networks A Practical Course, 2017. doi:10.1007/978-3-319-43162-8.
[37] S. Bassis, A. Esposito, F.C. Morabito, E. Pasero, Adv. Neural Networks (2016), https://doi.org/10.1007/978-3-319-33747-0.
[38] C.C. Aggarwal, Neural Networks and Deep Learning a Textbook, 2018. doi:10.1007/978-3-319-94463-0.
[39] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G.S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, TensorFlow: large-scale machine learning on heterogeneous systems, software available from tensorflow.org (2015). https://www.tensorflow.org/.
[40] F. Chollet, et al., Keras, https://github.com/fchollet/keras (2015).
[41] I. Sharafaldin, A.H. Lashkari, A.A. Ghorbani, Toward generating a new intrusion detection dataset and intrusion traffic characterization, in: Proceedings of the 4th International Conference on Information Systems Security and Privacy – Volume 1: ICISSP, INSTICC, SciTePress, 2018, pp. 108–116. doi:10.5220/0006639801080116.
[42] R. Kohavi, A study of cross-validation and bootstrap for accuracy estimation and model selection, Ijcai (1995) 1137–1145.
[43] G. James, D. Witten, T. Hastie, R. Tibshirani, An introduction to statistical learning, in: Cluster Comput, 2018, 2013.
[44] P. Branco, L. Torgo, R. Ribeiro, Relevance-based evaluation metrics for multi-class imbalanced domains, in: Pacific-Asia Conference on Knowledge
International Conference on Recent Trends in Information Systems 2011
Discovery and Data Mining, Springer, 2017, pp. 698–710..
(2011) 303–308, https://doi.org/10.1109/ReTIS.2011.6146886.
[45] R. Kozik, M. Choraś, J. Keller, Balanced efficient lifelong learning (B-ELLA) for
[15] H.A. Sonawane, T.M. Pattewar, A comparative performance evaluation of
cyber attack detection, J. UCS 25 (1) (2019) 2–15, http://www.jucs.org/
intrusion detection based on neural network and pca, International Conference
jucs_25_1/balanced_efficient_lifelong_learning.
on Communications and Signal Processing (ICCSP) 2015 (2015) 0841–0845,
[46] M. Choraś, M. Pawlicki, R. Kozik, The feasibility of deep learning use for
https://doi.org/10.1109/ICCSP.2015.7322612.
adversarial model extraction in the cybersecurity domain, in: H. Yin, D.
[16] T.M. Pattewar, H.A. Sonawane, Neural network based intrusion detection using
Camacho, P. Tino, A.J. Tallón-Ballesteros, R. Menezes, R. Allmendinger (Eds.),
bayesian with pca and kpca feature extraction, in: 2015 IEEE International
Intelligent Data Engineering and Automated Learning – IDEAL 2019, Springer
Conference on Computer Graphics, Vision and Information Security (CGVIS),
International Publishing, Cham, 2019, pp. 353–360.
2015, pp. 83–88. doi: 10.1109/CGVIS.2015.7449898.
[47] M. Choraś, M. Pawlicki, D. Puchalski, R. Kozik, Machine learning – the results
[17] N.T.T. Van, T.N. Thinh, Accelerating anomaly-based ids using neural network
are not the only thing that matters! what about security, explainability and
on gpu, International Conference on Advanced Computing and Applications
fairness?, in: International Conference on Computer Recognition Systems,
(ACOMP) 2015 (2015) 67–74, https://doi.org/10.1109/ACOMP.2015.30.
Springer, 2020
[18] B. Subba, S. Biswas, S. Karmakar, A neural network based system for intrusion
[48] M. Pawlicki, M. Choraś, R. Kozik, Defending network intrusion detection
detection and attack classification, Twenty Second National Conference on
systems against adversarial evasion attacks, Fut. Gen. Comput. Syst. 110
Communication (NCC) 2016 (2016) 1–6, https://doi.org/10.1109/
(2020) 148–154, https://doi.org/10.1016/j.future.2020.04.013, http://
NCC.2016.7561088.
www.sciencedirect.com/science/article/pii/S0167739X20303368.
[19] K. Jiang, W. Wang, A. Wang, H. Wu, Network intrusion detection combined
[49] R. Kozik, M. Choraś, A. Flizikowski, M. Theocharidou, V. Rosato, E. Rome,
hybrid sampling with deep hierarchical network, IEEE Access 8 (2020) 32464–
Advanced services for critical infrastructures protection, J. Ambient Intell.
32476.
Human. Comput. 6 (6) (2015) 783–795.
[20] Y. Wu, W.W. Lee, Z. Xu, M. Ni, Large-scale and robust intrusion detection
[50] M. Szczepański, M. Choraś, M. Pawlicki, R. Kozik, Achieving explainability of
model combining improved deep belief network with feature-weighted svm,
intrusion detectionsystem by hybrid oracle-explainer approach, in:
IEEE Access 8 (2020) 98600–98611.
International Joint Conference on Neural Networks (IJCNN) 2020, IEEE, 2020..
[21] T. Su, H. Sun, J. Zhu, S. Wang, Y. Li, Bat: Deep learning methods on network
intrusion detection using nsl-kdd dataset, IEEE Access 8 (2020) 29575–29585.
[22] W.S. McCulloch, W. Pitts, A logical calculus of the ideas immanent in nervous
activity, Bull. Math. Biophys. 5 (4) (1943) 115–133. Michał Choraś holds the professor position at Univer-
[23] Y. Lecun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to sity of Science and Technology in Bydgoszcz (UTP)
document recognition, Proc. IEEE 86 (11) (1998) 2278–2324. where he is the Head of Teleinformatics Systems Divi-
[24] J. Moody, C.J. Darken, Fast learning in networks of locally-tuned processing sion and PATRAS Research Group. He has been involved
units, Neural Comput. 1 (2) (1989) 281–294.
in many EU projects (e.g. SocialTruth, CIPRNet, QRapids,
[25] D.-S. Huang, Radial basis probabilistic neural networks: Model and
INSPIRE). His interests include data science and pattern
application, Int. J. Pattern Recogn. Artif. Intell. 13 (07) (1999) 1083–1101.
recognition in several domains e.g. cyber security,
[26] D.-S. Huang, W.-B. Zhao, Determining the centers of radial basis probabilistic
neural networks by recursive orthogonal least square algorithms, Appl. Math. image processing, software engineering, prediction,
Comput. 162 (1) (2005) 461–473. correlation, biometrics and critical infrastructures pro-
[27] C. Goller, A. Kuchler, Learning task-dependent distributed representations by tection. Currently he coordinates H2020 SIMARGL and is
backpropagation through structure, in: Proceedings of International the Programme Leader (SAFAIR) in H2020 SPARTA. He is
Conference on Neural Networks (ICNN’96), vol. 1, 1996, pp. 347–352.. an author of over 250 reviewed scientific publications.
[28] Y. Goldberg, A primer on neural network models for natural language
processing, J. Artif. Intell. Res. 57 (2016) 345–420.
[29] L. Shang, D.-S. Huang, J.-X. Du, C.-H. Zheng, Palmprint recognition using fastica
algorithm and radial basis probabilistic neural network, Neurocomputing 69
(13–15) (2006) 1782–1786.
Michał Choraś and M. Pawlicki Neurocomputing 452 (2021) 705–715