Abstract
Insider threats have recently become one of the most urgent cybersecurity challenges facing numerous organizations, including public infrastructure companies, major federal agencies, and state and local governments. Our purpose is to find the most accurate machine learning (ML) model for detecting insider attacks. In practice, a classifier is often selected after repeated evaluation trials of candidate models, which can allow unseen data (the test set) to leak into the models and introduce bias. Overfitting then follows from the repeated training and hyperparameter tuning: the models perform well on the training set while failing to generalize effectively to unseen data. This study uses a validation data set together with hyperparameter tuning to prevent these issues and to choose the best of our candidate models. Furthermore, our approach ensures that the selected model does not merely memorize the local area network (LAN) threats represented in the NSL-KDD data set. Results are gathered and analyzed for the following classifiers: support vector machine (SVM), decision tree (DT), logistic regression (LR), adaptive boosting (AdaBoost), gradient boosting (GB), random forests (RFs), and extremely randomized trees (ERTs). After analyzing the findings, we conclude that the AdaBoost model is the most accurate, with accuracies of 99% for DoS, 99% for probe, 96% for access, and 97% for privilege, as well as AUC values of 0.992 for DoS, 0.986 for probe, 0.952 for access, and 0.954 for privilege.
1 Introduction
Insiders, such as employees, have legitimate access to an enterprise's resources in order to perform their job duties; as a result, detecting insider threats is one of the most difficult challenges facing security administrators [1, 2]. This study therefore employs a variety of supervised machine learning classifiers, under specific selection criteria, to find the most accurate classifier for predicting these insider threats, mainly LAN attacks from the NSL-KDD data set [3,4,5].
According to [6], 94% of organizations suffered insider data breaches in the last 12 months, and 84% encountered security incidents caused by nontechnical human error (Insider Data Breach Survey, 2021). Human error is thus the leading cause of disastrous insider data breaches, while malicious insiders remain the top concern of department heads, cited by 28% of respondents.
[7] published a report stating that insider threat incidents increased by 44% over the last two years, with costs climbing by more than a third to USD 15 million (Cost of Insider Threats Global Report, 2022). In addition, the cost of corporate credential theft has risen by 65% since 2020, from USD 2 million to USD 4 million today. Furthermore, the time to contain an insider threat incident increased from 77 to 85 days, implying that organizations are spending more on containment operations. When incidents take more than 90 days to settle, organizations incur an average annual cost of USD 17 million [5].
Data breach attacks are classified into several categories [8]: passive attacks, active attacks, close-in attacks, insider attacks, and distribution attacks. Insider attacks are among the most significant threats to information systems because of their impact on confidentiality, integrity, and availability (CIA), especially when they occur on a LAN. These attacks can harm business operations, reputations, and finances [9].
The purpose of the study is to find the most accurate classifier for identifying insider attacks that occur on LANs. Additionally, the significance of the study lies in detecting irregular 'attacked' LAN traffic by developing Python code that uses scikit-learn as the backend machine learning library and plots the charts with the Plotly, Seaborn, and Matplotlib frameworks. To eliminate bias, a random search algorithm (RSA) is used to tune the hyperparameters, with K-fold and stratified cross-validation methods applied to avoid overfitting.
This study is divided into four sections. Sections 1 and 2 summarize related articles and previous studies. Section 3 discusses the proposed framework. Finally, Sect. 4 analyzes the study's findings.
1.1 Tuning hyperparameters and risk minimization
Hyperparameters, also known as nuisance parameters, are values that must be specified outside of the training procedure, such as the decision tree classifier's criterion and maximum depth. Hyperparameter tuning is the process of determining the optimal hyperparameter values to pass as input to the estimator; in many learning problems, the hyperparameters effectively index the family of candidate models. Because the optimal hyperparameters for one data set are not necessarily the best for another, the settings must be adjusted for each task. Before candidate estimators are compared, the hyperparameters must be tuned to minimize the expected risk [10,11,12].
1.2 Avoid overfitting and model selection
Using the test data set in the model selection procedure can introduce overfitting, because unseen data leaks into the model when selection is based on the best test-set metrics. Likewise, measuring test performance on the training data set also causes overfitting. An overfitted model produces inaccurate predictions and cannot handle or generalize to new (unseen) input; as a result, the model may become useless [13,14,15,16].
As a solution, a technique called cross-validation (CV) is employed to mitigate overfitting. It is a powerful tool for developing and selecting ML models: it ensures the data is used reliably by the classifiers, and it avoids the underfitting that a single fixed split can cause through data division, a lack of samples, or insufficient learning of the model [13, 14].
Cross-validation randomly separates the training set into two logical parts: a training set and a validation set. Together with the existing test set, as provided with our NSL-KDD data set, we have three sets in total. Each set serves a different purpose: the training set teaches the model, the validation set supports model selection and checks generalization to new data, and the test set evaluates the model's final performance [13, 14].
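To make the three-way split concrete, the following minimal sketch (with synthetic stand-in data, since NSL-KDD loading is covered later) carves a validation set out of the training file while the test file stays reserved for the final evaluation:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the NSL-KDD training file (hypothetical data).
rng = np.random.default_rng(0)
X_full = rng.normal(size=(1000, 10))
y_full = rng.integers(0, 2, size=1000)

# Hold out a validation set from the training data; the NSL-KDD test file
# remains untouched until the final evaluation stage.
X_train, X_val, y_train, y_val = train_test_split(
    X_full, y_full, test_size=0.2, stratify=y_full, random_state=42)
```

Cross-validation (Sect. 3.6) generalizes this idea by rotating the validation fold across the training data.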
1.3 Background of the study
ML has proven to be an ideal solution for problems like anomaly detection and network intrusion detection [17, 18]. Therefore, supervised ML algorithms are used to solve the problem of the study, owing to their speed of response in detecting threats. Supervised ML algorithms are divided into two types [19]: classification algorithms and regression algorithms. Classification algorithms address this issue since they can distinguish between two or more classes (normal, attack); in the framework, the predicted outcomes are discrete class labels [20,21,22]. The following supervised ML classification algorithms are used: linear support vector machines (SVMs), decision trees (DTs), and logistic regression (LR).
In addition, ensemble algorithms aid in solving both classification and regression problems. The goal of ensemble techniques is to combine several prediction models and enhance the results. The following supervised ML classification ensemble algorithms are used: adaptive boosting (AdaBoost), gradient boosting (GB), extremely randomized trees (ERTs), and random forests (RFs) [22].
Insiders can significantly exhaust an organization's resources, resulting in huge financial and human losses. Because of this, insider activities on a LAN must be recognized and their impact on security policies (CIA) identified. Since an insider's motivation can be personal, political, or economic [17, 23, 24], a plan must be prepared to detect all possible disasters and security concerns. As a result, the study aims to characterize these risks on a LAN immediately and to support security administrators instantly by identifying the most accurate ML classifier.
There is a need to comprehend the link between insiders and their threats. Insiders can gain access rights to networks either legally or illegally. Legally, departments may grant each other access because of differences across departments, joint ventures, outsourcing, and the recruitment of temporary employees such as consultants; there are therefore different levels of authorization granted to these insiders [5]. Their threats involve the misuse of legitimate access rights. However, there are several types of insiders, each with its own procedures, risks, and data sets. The NSL-KDD data set targets anyone connected either internally or remotely to a LAN [4, 17].
2 Related articles
This section discusses previous studies and articles related to the study at hand. In [25], the theoretical obstacles to detecting insider threats are addressed, which helped define the research topic. The study also lists the existing insider threat data set types, which include emails, authentication, login, HTTP, and files but exclude insider attacks; this highlights the importance of our research contributions. [26] proves that a one-hot-encoding approach is capable of converting categorical features into new individual binary features on which classification models can be trained. [3] offers a review of existing insider threat approaches that use NSL-KDD to detect DoS attacks. [27] provides context-specific definitions of ML model hyperparameters, their impact on decision-making performance, and various approaches to obtaining optimal values. In [28], the sensitivity of hyperparameter adjustment for eliminating bias in performance prediction is explored. The researchers carry out a detailed investigation, but the results are unsatisfactory, and the classifier cannot be generalized to another test data set; as a result, additional research is needed to locate the best classifier during the creation of a machine learning system.

The problems of both overfitting and underfitting are presented in [14], along with their impact on model performance for decision-making and cross-validation methods as a solution. [29] extended and simplified significant CV approaches for developing a final ML-based model. The researcher stresses the importance of generalizing to unseen data to maximize the potential of predictive models and avoid overfitting, concluding that generalization to unseen data cannot be overlooked and that model building should not be limited to training and testing alone. [23] offers an in-depth examination of the NSL-KDD data set, analyzing the issues found in the KDD99 data set as well as evaluation metrics. [17] presents a comprehensive review and in-depth understanding of insider threats based on previously published articles and statistical data on both insiders and the methodologies employed to detect them. However, the reported results of supervised machine learning are disappointing, and most articles in the review focus on outside attacks from emails, HTTP, illegal file access, and devices while ignoring dangers from within the network. In contrast, this study is concerned with LAN insider attacks, since they are more common, easily motivated, and cause maximum damage.

[30] details the NSL-KDD data set features, the concerns observed in KDD99, and the attack type classifications. The researchers in [27] conduct a survey assessment of insider threat concerns. They state that gauging the extent of the insider threat is a complicated challenge, since it is usually difficult to distinguish between insiders and outsiders of a community operating within a LAN; furthermore, some insiders can initiate attacks from the outside, for example, an employee who has left the organization. The article discusses the challenge of identifying internal attacks that take place on the internal network. Therefore, the main motive behind the current study is to find the most accurate classifier that identifies these threats correctly.
[19] demonstrated supervised machine learning methods and the significance of classification models in separating anomalous behavior from normal traffic. [31] highlights insider threat aspects and the methods to confront these threats using either machine learning or non-machine learning techniques. In [32], zero-substitution is shown to be a solution for restoring missing values in data sets.
After surveying the above-listed studies, we conclude that ML techniques are the best solution for insider threat identification. Consequently, one of the reasons behind conducting our research is to locate the best ML technique to address the challenge of identifying insider threats.
3 Case study
In this section, the study focuses on clarifying the methodology utilized in this research paper. Figure 1 depicts the process framework which consists of seven stages: (1) collect data set, (2) preprocessing, (3) tuning models, (4) feature selection, (5) avoid overfitting, (6) training models, and (7) final evaluation. In the following subsections, each stage is explained in detail.
3.1 Data set description
MIT Lincoln Labs developed and managed the 1998 DARPA intrusion detection evaluation program, building a LAN that simulated a US Air Force LAN, conducting several attacks, and gathering raw TCP dump data. Data flowed from a source IP address to a destination IP address under a specific protocol, allowing normal and malicious connections to be distinguished [13, 17]. A connection was defined as a sequence of TCP packets transmitted during a certain interval. Afterward, MIT Lincoln Labs extracted features from the raw DARPA data and packaged them into the first ready-to-use version, known as KDD99. However, several issues were later discovered in the KDD99 data set [30, 33, 34]. Many of these were solved in the updated NSL-KDD data set, for example by removing redundant records, which reduces the size of the training and test sets and makes experiments easier and faster [21, 30, 35].
3.2 Data set analysis
The NSL-KDD data set contains 41 features and is divided into two files: a training data set file and a test data set file. The training data set file has 125,973 records, while the test data set file has 22,544 records [30].
As shown in Fig. 2, the Python code, using Pandas as the data analysis tool, classifies the feature data types into object (nominal), int64, and float64 [13, 19, 23]. Figure 2 also displays the counts of, and variation in, the unique values among the nominal features in the training and test data sets. Notably, the service feature has 70 unique values in the training data set but only 64 in the test data set, a discrepancy the researcher tackles in the preprocessing phase.
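As an illustration, a short pandas sketch along these lines reproduces the checks described above (the file names follow the common NSL-KDD distribution and are an assumption here; the official files carry no header row):

```python
import pandas as pd

# Hypothetical file names; the NSL-KDD files ship without a header row.
train = pd.read_csv("KDDTrain+.txt", header=None)
test = pd.read_csv("KDDTest+.txt", header=None)

print(train.dtypes.value_counts())               # object / int64 / float64 mix
print(train.select_dtypes("object").nunique())   # unique values per nominal feature
print(test.select_dtypes("object").nunique())    # e.g., fewer service values
```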
The class label contains five main categories of classifications [21, 30, 33]:

i. Normal: normal connections.

ii. DoS: denial-of-service, for example, Smurf.

iii. Probing: surveillance, such as port sweep.

iv. Access: unauthorized remote machine access, e.g., spying.

v. Privilege: unauthorized access to local superuser (root) privileges, e.g., Rootkit.
The probability distribution of the training data set differs from that of the test data set: the test data set deliberately contains more attack variety than the training set, so that the estimators must predict new offensives and the system simulates reality [21, 30, 33]. Figure 3 shows the class label sizes in the NSL-KDD training data set, whereas Fig. 4 illustrates the class label sizes in the NSL-KDD test data set.
3.3 Data preprocessing
This stage is one of the most critical phases of the machine learning approach. Figure 5 exhibits the data flow diagram (DFD) for the data preprocessing procedure. As shown in Fig. 5, we first separate the numerical and categorical features. Then, we repair the mismatch in the service feature between the training and test sets. Afterward, we apply transformation techniques to the categorical features. Finally, we perform scaling on all features before recombining them. Preprocessing aims to transform the raw data set into a useful format while ensuring that it is clean and noise-free so that the estimator's decisions are not affected [32]. The following section describes the preprocessing methods applied to the training and test data sets:
3.3.1 Data transformation
After checking the purity of the data, transformation techniques are applied because most machine learning models do not accept categorical features. First, categorical features are converted into numbers, a process known as 'encoding' [32]. There are four categorical features in the NSL-KDD data set: protocol type, service, flag, and the class label. After that, all features are standardized to carry the same weight so that the classifiers cannot favor values merely because of their greater magnitude. The data transformation methods used are one-hot-encoding and class label encoding.
One-hot-encoding is also known as dummy encoding. This process converts categorical features into binary ones and is carried out in two stages [26]. First, the unique values of each categorical feature are transformed into new binary features. Then, for each connection, the binary feature corresponding to the observed value is assigned 1 and the remainder are assigned 0. This methodology was used only for the categorical input features [36, 37], namely protocol type, service, and flag; the class label was handled differently.
As displayed in Fig. 2, there is a striking difference between the service feature values offered by the training data set and those in the test data set. To align the test data set with the training data set, a zero value compensates for the missing values [32, 38]. Table 1 exhibits samples of the protocol-type feature after applying dummy encoding.
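A minimal sketch of this encode-and-align step, using toy frames in place of the full NSL-KDD columns, could look as follows:

```python
import pandas as pd

# Toy stand-ins for the NSL-KDD categorical columns.
train = pd.DataFrame({"protocol_type": ["tcp", "udp", "icmp"],
                      "service": ["http", "ftp", "smtp"]})
test = pd.DataFrame({"protocol_type": ["tcp", "tcp", "udp"],
                     "service": ["http", "domain", "ftp"]})

train_enc = pd.get_dummies(train)
# Align the test set to the training columns; service values absent from
# the test file become all-zero columns, as described above.
test_enc = pd.get_dummies(test).reindex(columns=train_enc.columns, fill_value=0)
```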
Beyond the 38 original numeric features [39], the data set grows after encoding to a total of 122 features: the added features comprise three protocol-type features, 70 service features, and 11 flag features. Table 2 presents the complete number of features after encoding.
The class label contains sub-attacks that fall within the scope of five main categories, namely DoS, probe, access, privilege, and normal connections. Each attack is converted into a unique integer in the same class label column, as shown in Table 3. After converting all categorical attributes to integers, each attack group is handled individually, with normal traffic included alongside it so that the models can differentiate between regular and irregular connections (a sketch of this mapping follows below). Figure 6 depicts the magnitude of each attack type as well as normal traffic in both the training and test data sets.
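A hedged sketch of this label mapping, listing only a few representative sub-attacks (the full NSL-KDD label list is longer), might read:

```python
import pandas as pd

# Partial mapping from sub-attack labels to the five category integers.
category_of = {"normal": 0,
               "smurf": 1, "neptune": 1,            # DoS
               "portsweep": 2, "nmap": 2,           # Probe
               "spy": 3, "guess_passwd": 3,         # Access
               "rootkit": 4, "buffer_overflow": 4}  # Privilege

labels = pd.Series(["normal", "smurf", "portsweep", "spy", "rootkit"])
print(labels.map(category_of))  # -> 0, 1, 2, 3, 4
```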
3.3.2 Scaling
Feature scaling aims to place all features on the same scale, so that all features are treated as equally important [13]. Figure 7 depicts the data set before any scaling is applied. Two approaches are used: the robust scaler and standardization.
First, the robust scaler reduces the influence of outliers by subtracting the median and scaling the data according to the interquartile range [13, 40]. Figure 8 depicts the data set after applying the robust scaler. The following formula (1) is used:

\(X_{new} = \frac{X_i - X_{median}}{Q_3 - Q_1}\)  (1)

where \(X_{new}\) is the scaled value, \(X_i\) the original value, \(X_{median}\) the sample median, and \(Q_1\) and \(Q_3\) the first and third quartiles.
Second, standardization, implemented in Python as the standard scaler (SS), rescales the values so that the standard deviation equals 1 and the mean becomes 0 [13, 41]. Figure 9 displays the data after applying SS. The z-score Eq. (2) defines this SS:

\(z = \frac{X - \mu}{\sigma}\)  (2)

where \(X\) is the value, \(\mu\) the mean, and \(\sigma\) the standard deviation.
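In scikit-learn terms, the two scaling steps could be sketched as follows (a toy array stands in for the real features; in the study's pipeline the scalers would be fit on the training set and reused on the test set):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 10000.0]])  # one outlier

robust = RobustScaler().fit(X_train)        # Eq. (1): median and IQR
X_robust = robust.transform(X_train)

standard = StandardScaler().fit(X_robust)   # Eq. (2): z-score
X_scaled = standard.transform(X_robust)
```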
3.4 Tuning model
The tuning process is a matter of trial and error: the statistical ML model is trained repeatedly with different hyperparameter values [13, 14], and its performance on the validation set is then compared to determine which set of hyperparameters yields the most accurate model [6]. The main technique used for tuning the model is known as RSA.
RSA defines, for each hyperparameter, a statistical distribution from which values are randomly sampled and used to train the model. This increases the likelihood of quickly finding effective values for each hyperparameter [6, 12]. Table 4 depicts the results of the RSA: the optimal hyperparameters of each model, the hyperparameters most affecting the model's decision process, their data types, the default values at which each model operated, the start-end random ranges, and the chosen optimal values.
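A minimal sketch of RSA with scikit-learn's RandomizedSearchCV, using placeholder search ranges rather than the exact start-end values of Table 4, could look like this:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)  # stand-in data

# Placeholder distributions; Table 4 lists the actual ranges used.
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions={"criterion": ["gini", "entropy"],
                         "max_depth": randint(2, 30)},
    n_iter=20, cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_)
```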
In the following paragraphs, we describe the mathematical functions of the hyperparameters in our models. First, the linear SVM model employs two equations for the loss hyperparameter. The hinge is a cost function in which a margin, or distance from the classification border, is defined according to Eq. (3), and the squared hinge by Eq. (4) [42, 43], where \(t\) is the actual label, either +1 or -1, and \(y\) is the classifier's raw output:

\(\ell(y) = \max(0, 1 - t \cdot y)\)  (3)

\(\ell(y) = \max(0, 1 - t \cdot y)^2\)  (4)
Second, the criterion hyperparameter contains three arguments in the DT model: Gini, entropy, and log loss. The equations are as follows [14, 32, 43, 44]:
The Gini index determines the split for each feature and quantifies the impurity of \(D\). The following formula (5) defines it:

\(Gini(D) = 1 - \sum_{i=1}^{m} p_i^2\)  (5)

where \(p_i\) is the probability that a tuple in \(D\) belongs to class \(C_i\), estimated by \(|C_{i,D}|/|D|\); the sum is taken over the \(m\) classes.
Entropy is an information metric used to evaluate the impurity or uncertainty in a set of data, and it controls how a decision tree splits the data. With \(p_x\) denoting the probability of the \(x\)th class in the data set \(D\), where \(x = 1, 2, \ldots, n\), entropy is computed by formula (6):

\(Entropy(D) = -\sum_{x=1}^{n} p_x \log_2 p_x\)  (6)
Log loss is employed when predicting a Boolean (true or false) outcome with a likelihood ranging from certainly true (1) to certainly false (0). The log loss formula (7) is defined as:

\(LogLoss = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]\)  (7)

where \(N\) is the number of instances, \(y_i\) the true label, and \(p_i\) the model's predicted likelihood.
Third, the solver hyperparameter in the LR model has five options [45]. To begin with, Newton's approach approximates \(f(x)\) with a quadratic function around \(x_n\) in each iteration [46, 47]. Then, the limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm (L-BFGS) keeps only a few vectors and estimates the inverse Hessian matrix using gradient evaluations [46]. In addition, the library for large linear classification (liblinear) employs a coordinate descent (CD) approach, solving the optimization problem by sequential approximate minimization along coordinate directions [48]. Furthermore, stochastic average gradient descent (SAG) is an iterative gradient-descent optimization with an incremental aggregated gradient modification that reuses a random sample of prior gradient values; it suits large data sets because it can be processed quickly [49, 50]. Finally, SAGA is an extension of SAG that is considered an improved version with faster convergence [49, 50].
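For illustration, the five solver options map directly onto scikit-learn's LogisticRegression argument; a sketch on synthetic data, not the tuned model of Table 4:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, random_state=0)
for solver in ["newton-cg", "lbfgs", "liblinear", "sag", "saga"]:
    acc = LogisticRegression(solver=solver, max_iter=2000).fit(X, y).score(X, y)
    print(solver, round(acc, 3))
```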
Fourth, the criterion hyperparameter in the GB model evaluates the quality of a data split. It offers 'friedman_mse' for mean squared error (MSE) with Friedman's improvement score and 'squared_error' for plain mean squared error. The Friedman improvement score Eq. (8) and the MSE Eq. (9) are defined as [51, 52]:

\(Improvement = \frac{W_l W_r}{W_l + W_r} \left( \overline{y}_l - \overline{y}_r \right)^2\)  (8)

where \(W_l\) is the sum of weights of the left partition, \(W_r\) the sum of weights of the right partition, and \(\overline{y}_l\) and \(\overline{y}_r\) the left and right means;

\(MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - p_i)^2\)  (9)

where \(y_i\) is the \(i\)th observed value, \(p_i\) the corresponding predicted value, and \(n\) the number of observed values.
Fifth, the 'max_features' hyperparameter in the GB model caps the number of features considered for each individual tree. The first option, sqrt, takes the square root of the total number of features, Eq. (10) [51]:

\(max\_features = \sqrt{n\_features}\)  (10)

Another option is \(\log_2\), which takes the base-2 logarithm of the number of features, Eq. (11):

\(max\_features = \log_2(n\_features)\)  (11)
Sixth, the algorithm hyperparameter of the AdaBoost classifier offers two options: 'SAMME.R' and 'SAMME'. SAMME stands for stagewise additive modeling using a multi-class exponential loss function, and the R stands for real. SAMME employs a separate set of 'decision influence' weights (alphas), one per weak learner. SAMME.R, in contrast, assigns an equal weight to each weak learner and evaluates class likelihoods, which usually converges faster than SAMME [53,54,55,56]. The SAMME and SAMME.R predictions, Eqs. (12) and (13), take the form:

\(H(x) = \arg\max_{k} \sum_{t=1}^{T} \alpha_t \, \mathbb{1}\left[ h_t(x) = k \right]\)  (12)

\(H(x) = \arg\max_{k} \sum_{t=1}^{T} s_k^t(x)\)  (13)

where \(H(x)\) is the classification prediction, \(T\) the number of weak learners, \(\alpha_t\) the weight of weak learner \(t\), \(h_t(x)\) the prediction of weak learner \(t\), and \(s_k^t\) a class-probability multiplier.
3.5 Features selection
This section lists the features that are used throughout the training models. The following approaches are employed:
3.5.1 Univariate feature selection (UFS)
It is a statistical method that exploits the variance discrepancies among the features to derive a threshold value. This threshold determines the number of features that the recursive feature elimination method retains for training the models [19, 22, 57].
The 'f_classif' function in the scikit-learn ML framework measures variance using univariate statistical tests based on the analysis of variance (ANOVA) F value [19, 22, 58]. A greater F value arises when the variance between groups exceeds the variance within groups, indicating a higher likelihood that the observed difference is real rather than random; features whose variance does not distinguish the classes are excluded, and the most discriminative features are selected. This technique picked 13 features for each attack category, and the recursive feature elimination approach then used 13 as its threshold. The F statistic Eqs. (14) and (15) in one-way ANOVA are [19, 22, 58]:

\(F = \frac{MS_{between}}{MS_{within}}\)  (14)

\(MS_{between} = \frac{SS_{between}}{I - 1}, \qquad MS_{within} = \frac{SS_{within}}{n_T - I}\)  (15)

where MS is the mean square, SS the sum of squares, \(I\) the number of groups, and \(n_T\) the total sample size.
3.5.2 Recursive Feature Elimination (RFE)
It is a type of wrapper around feature selection algorithms. RFE seeks to identify acceptable feature subsets and operates immediately after the UFS technique. First, each model is implemented individually with its tuned hyperparameters. Then, all features are passed in to establish their relevance relative to one another, and the least important features are pruned. RFE recursively repeats this process on the reduced set until it reaches the requisite feature count, defined as the threshold by the UFS method [19, 22, 57]. Table 5 reports the features selected by RFE for each model and attack category based on UFS's threshold.
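The two-stage selection can be sketched with scikit-learn as follows (synthetic data and a decision tree as the wrapped estimator are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=40, random_state=0)

# Stage 1 (UFS): ANOVA F-scores rank the features; the study derives a
# threshold of 13 from this step.
ufs = SelectKBest(f_classif, k=13).fit(X, y)

# Stage 2 (RFE): recursively drop the least important features until the
# UFS threshold of 13 remains.
rfe = RFE(DecisionTreeClassifier(random_state=0),
          n_features_to_select=13).fit(X, y)
print(rfe.support_.sum())  # -> 13 retained features
```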
3.6 Cross-validation
As previously indicated, cross-validation [13, 14, 22] is an efficient instrument for designing and choosing ML models. It is employed in the study to avoid overfitting. The following methods, which are part of cross-validation, are used in the study.
3.6.1 K-Fold CV
The original training data set is divided into K equal-sized folds (subsamples) via random sampling. The model is trained on K-1 folds and validated on the remaining fold, repeating the process and recording the arithmetic mean and standard deviation of the evaluation measures over the K partitions [6, 22]. Tables 6, 7, 8, 9, 10, 11, and 12 show the outcomes (accuracy, recall, and area under the curve) as K-fold CV mean ± standard deviation across folds for each model, with the better results highlighted in bold in Table 10.
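A short sketch of the K-fold procedure with scikit-learn (synthetic data; the study's tables report the corresponding mean ± standard deviation per model):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
scores = cross_val_score(AdaBoostClassifier(random_state=0), X, y,
                         cv=KFold(n_splits=10, shuffle=True, random_state=0),
                         scoring="accuracy")
print(f"{scores.mean():.3f} +/- {scores.std():.3f}")
```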
3.6.2 Stratified K-fold CV
It is the same as K-fold CV but uses stratified sampling to avoid two issues: the purely random sampling of the K-fold CV method and class imbalance in the data set. Each stratum keeps nearly the same class proportions as the original data set, so every fold contains the same ratio of normal and attack samples. Consequently, whichever criteria are used to evaluate them, the findings are consistent across all folds [13, 22, 59]. Table 13 illustrates the results of stratified K-fold CV applied to the model that achieved the best K-fold CV results. The stratified K-fold CV approach delivers good results for the AdaBoost model, confirming that the above-mentioned concerns are addressed.
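Swapping in StratifiedKFold preserves the class ratio in every fold; a sketch on a deliberately imbalanced synthetic set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, weights=[0.8], random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(AdaBoostClassifier(random_state=0), X, y, cv=cv)
print(scores.mean())
```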
3.7 Training models
This stage covers training the machine learning algorithms on the training data set. Algorithm (1) demonstrates the implementation phase of the framework, which is written in Python and uses the scikit-learn framework as the backend ML tool to analyze the predictions and find the best model for assessing normal and abnormal behavior on a LAN.
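A condensed sketch of the training loop behind Algorithm (1), with default constructors standing in for the tuned hyperparameters of Table 4, could be:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

X_train, y_train = make_classification(n_samples=500, random_state=0)

models = {"SVM": LinearSVC(), "DT": DecisionTreeClassifier(),
          "LR": LogisticRegression(max_iter=2000),
          "AdaBoost": AdaBoostClassifier(),
          "GB": GradientBoostingClassifier(),
          "RF": RandomForestClassifier(), "ERT": ExtraTreesClassifier()}

fitted = {name: model.fit(X_train, y_train) for name, model in models.items()}
```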
3.8 Final evaluation
The final evaluation is performed using the test set. The primary principles for testing the models are their capacity to adjust appropriately to new, previously unobserved data and the model's quality, which is determined via evaluation measures. Performance estimators are derived from the confusion matrix (CM), which visualizes the prediction results [22, 23, 33, 39] through four rates, as indicated in Table 14. Each column of the CM represents the number of predicted values, while each row represents the number of actual values.
TP = True Positive (Normal Traffic Predicted as Normal).
TN = True Negative (Malicious Traffic Predicted as Malicious).
FP = False Positive (Malicious Traffic Predicted as Normal).
FN = False Negative (Normal Traffic Predicted as Malicious).
Our research findings reveal that the AdaBoost model achieves the highest accuracy, as exhibited in Fig. 10, which shows the CMs of the experiment's predicted versus actual values, based on the above-mentioned rates, for the AdaBoost model.
Accuracy (Acc), recall (Rec) or true-positive rate [14, 23, 33], and the area under the receiver operating characteristic curve (AUC-ROC) are three essential assessment metrics derived from the rates listed above. The accuracy score denotes the proportion of true-positive and true-negative predictions among all predictions made by the model, Eq. (16):

\(Acc = \frac{TP + TN}{TP + TN + FP + FN}\)  (16)

The recall [14, 23, 33, 39] of the ML model indicates its ability to capture the proportion of true positives that are correctly classified, Eq. (17), while the AUC-ROC indicates how often positive predictions are ranked higher than negative predictions. The ROC curve plots the false-positive rate (FPR), Eq. (18), on the x-axis versus the TPR on the y-axis, and Eqs. (17) and (18) are used to compute the AUC-ROC [13, 14, 22, 39]:

\(Rec = TPR = \frac{TP}{TP + FN}\)  (17)

\(FPR = \frac{FP}{FP + TN}\)  (18)

Table 15 and Fig. 11 show the findings for the most accurate model (AdaBoost).
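The three metrics follow directly from the confusion matrix; a toy computation with scikit-learn (illustrative values only, not the study's results; 1 marks the positive class, which is normal traffic per Table 14):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             recall_score, roc_auc_score)

y_true = [0, 0, 1, 1, 1]             # toy labels
y_pred = [0, 1, 1, 1, 0]             # toy predictions
y_score = [0.1, 0.6, 0.8, 0.9, 0.4]  # toy positive-class probabilities

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(accuracy_score(y_true, y_pred))   # Eq. (16)
print(recall_score(y_true, y_pred))     # Eq. (17), the TPR
print(roc_auc_score(y_true, y_score))   # area under the TPR-vs-FPR curve
```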
Figure 12 describes the AUC-ROC evaluation for attack detection by the AdaBoost model. It accurately identifies attack samples, with AUC values of 0.992 for DoS (7410 of 7460 samples), 0.986 for probe (2374 of 2421 samples), 0.952 for access (2677 of 2885 samples), and 0.954 for privilege (62 of 67 samples).
3.8.1 Comparison with related works
We compared our proposed method with existing related works that used the NSL-KDD data set. Three metrics were used to compare performance: recall, AUC (TPR vs. FPR), and accuracy. Our study produces superior results with the AdaBoost model across all attack categories. For example, our DoS detection reaches a recall of 99.3% and an AUC of 0.992, with overall accuracy across all attack categories of 98.5%. In contrast, the best results of [60] were a recall of 96.5%, an AUC of 0.980, and an overall accuracy of 94%. Table 16 displays all the results.
4 Conclusion and future work
This study aims to determine the most accurate ML classifier for detecting LAN attacks. The research findings demonstrate that the AdaBoost model has the highest classification accuracy for both insider attacks and normal traffic behavior, with 99% for DoS, 98% for probe, 96% for access, and 97% for privilege, as well as AUC values of 0.992 for DoS, 0.986 for probe, 0.952 for access, and 0.954 for privilege. The study is carried out on the publicly accessible NSL-KDD data set, with AUC results that surpass previous approaches on this data set thanks to the strategies used to remove noise from the data, the choice of relevant features, the tuning of hyperparameters, and the minimization of bias. As a future recommendation, the techniques used in this study might be integrated into firewall configurations to identify insider threats, assisting cybersecurity specialists in keeping the work environment secure and minimizing risks.
Data availability
The data set supporting the conclusions of this article is available in the University of New Brunswick repository. Here is the hyperlink to a data set: http://205.174.165.80/CICDataset/NSL-KDD/Dataset/
Abbreviations
NSL: Network Security Laboratory
KDD: Knowledge Discovery in Databases
ML: Machine learning
DoS: Denial-of-service
LAN: Local area network
SVM: Support vector machine
DT: Decision tree
LR: Logistic regression
AdaBoost: Adaptive boost
GB: Gradient boosting
RFs: Random forests
ERTs: Extremely randomized trees
CV: Cross-validation
CIA: Confidentiality, integrity, availability
HTTP: Hypertext Transfer Protocol
MIT: Massachusetts Institute of Technology
US: United States
DFD: Data flow diagram
TCP: Transmission control protocol
UDP: User datagram protocol
ICMP: Internet control message protocol
SS: Standard scaler
RSA: Random search algorithm
SAG: Stochastic average gradient
CD: Coordinate descent
MSE: Mean squared error
SAMME: Stagewise additive modeling with a multi-class exponential loss function
SAMME.R: Stagewise additive modeling with a multi-class exponential loss function, real
L-BFGS: Limited-memory Broyden–Fletcher–Goldfarb–Shanno
UFS: Univariate feature selection
ANOVA: Analysis of variance
RFE: Recursive feature elimination
TP: True positive
TN: True negative
FP: False positive
FN: False negative
TPR: True-positive rate
FPR: False-positive rate
Acc: Accuracy
Rec: Recall
CM: Confusion matrix
AUC: Area under the curve
ROC: Receiver operating characteristic
References
1. Cybersecurity and Infrastructure Security Agency (2022) Insider threat mitigation. CISA. https://www.cisa.gov/insider-threat-mitigation. Accessed 20 Aug 2022
2. Yuan S, Wu X (2021) Deep learning for insider threat detection: review, challenges, and opportunities. Comput Secur. https://doi.org/10.1016/j.cose.2021.102221
3. Kim A, Oh J, Ryu J, Lee K (2020) A review of insider threat detection approaches with IoT perspective. IEEE Access, special section on secure communication for the next generation 5G and IoT networks. https://doi.org/10.1109/ACCESS.2020.2990195
4. Parveen P et al (2011) Insider threat detection using stream mining and graph mining. In: IEEE third international conference on privacy, security, risk and trust and IEEE third international conference on social computing. https://doi.org/10.1109/PASSAT/SocialCom.2011.211
5. Elmrabit N et al (2020) Insider threat risk prediction based on Bayesian network. Comput Secur. https://doi.org/10.1016/j.cose.2020.101908
6. Egress (2021) 94% of organizations suffer data breaches. Egress. https://www.egress.com/newsroom/94-percent-of-organisations-have-suffered-insider-data-breaches. Accessed 9 Apr 2022
7. Proofpoint (2022) 2022 Ponemon cost of insider threats global report. Proofpoint. https://protectera.com.au/wp-content/uploads/2022/03/The-Cost-of-Insider-Threats-2022-Global-Report.pdf. Accessed 30 Apr 2022
8. Dastres R, Soori M (2021) A review in recent development of network threats and security measures. Int J Inf Sci Comput Eng 15(1). https://hal.science/hal-03128076
9. Korotka MS, Yin LR, Basu SC (2014) Information assurance technical framework: an end user perspective. J Inf Priv Secur. https://doi.org/10.1080/15536548.2005.10855759
10. Lei J (2019) Cross-validation with confidence. J Am Stat Assoc. https://doi.org/10.1080/01621459.2019.1672556
11. Probst P, Boulesteix AL, Bischl B (2019) Tunability: importance of hyperparameters of machine learning algorithms. J Mach Learn Res 20(1):1934–1965
12. Esmaeili A et al (2023) Agent-based collaborative random search for hyperparameter tuning and global function optimization. Systems. https://doi.org/10.3390/systems11050228
13. Montesinos López OA, Montesinos López A, Crossa J (2022) General elements of genomic selection and statistical learning, preprocessing tools for data preparation, and overfitting, model tuning, and evaluation of prediction performance. In: Multivariate statistical machine learning methods for genomic prediction. Springer, Cham, pp 25–139. https://doi.org/10.1007/978-3-030-89010-0
14. Zhou ZH (2021) Model selection and evaluation. In: Machine learning, 1st edn. Springer, Singapore, pp 25–55. https://doi.org/10.1007/978-981-15-1967-3
15. Yates LA (2021) Parsimonious model selection using information theory: a modified selection rule. Ecology. https://doi.org/10.1002/ecy.3475
16. Yates LA (2022) Cross validation for model selection: a review with examples from ecology. Ecol Monogr. https://doi.org/10.1002/ecm.1557
17. Al-Mhiqani MN, Ahmad R, Zainal Abidin Z, Yassin W, Hassan A, Abdulkareem KH, Ali NS, Yunos Z (2020) A review of insider threat detection: classification, machine learning techniques, datasets, open challenges, and recommendations. Appl Sci. https://doi.org/10.3390/app10155208
18. Kim A et al (2019) SoK: a systematic review of insider threat detection. J Wirel Mob Netw Ubiquitous Comput Dependable Appl. https://doi.org/10.22667/JOWUA.2019.12.31.046
19. Sarker IH (2021) Machine learning: algorithms, real-world applications and research directions. SN Comput Sci 2(3):160. https://doi.org/10.1007/s42979-021-00592-x
20. Bin Sarhan B, Altwaijry N (2023) Insider threat detection using machine learning approach. Appl Sci. https://doi.org/10.3390/app13010259
21. Abualkibash M (2019) Intrusion detection system classification using different machine learning algorithms on KDD-99 and NSL-KDD datasets - a review paper. Int J Comput Sci Inf Technol. https://doi.org/10.5121/ijcsit.2019.11306
22. Müller AC, Guido S (2017) Introduction to machine learning with Python: a guide for data scientists. O'Reilly Media, Sebastopol, CA
23. Xu W, Jang-Jaccard J, Singh A, Wei Y, Sabrina F (2021) Improving performance of autoencoder-based network anomaly detection on NSL-KDD dataset. IEEE Access. https://doi.org/10.1109/ACCESS.2021.3116612
24. Alsowail RA, Al-Shehari T (2022) Techniques and countermeasures for preventing insider threats. PeerJ Comput Sci. https://doi.org/10.7717/peerj-cs.938
25. Yuan S, Wu X (2021) Deep learning for insider threat detection: review, challenges and opportunities. Comput Secur 104:102221. https://doi.org/10.1016/j.cose.2021.102221
26. Scikit-learn (2019) sklearn.preprocessing.OneHotEncoder. Scikit-learn. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html. Accessed 5 May 2022
27. Homoliak I, Toffalini F, Guarnizo J, Elovici Y, Ochoa M (2019) Insight into insiders and IT: a survey of insider threat taxonomies, analysis, modeling, and countermeasures. ACM Comput Surv. https://doi.org/10.1145/3303771
28. Schratz P, Muenchow J, Iturritxa E, Richter J, Brenning A (2019) Hyperparameter tuning and performance assessment of statistical and machine-learning algorithms using spatial data. Ecol Model. https://doi.org/10.1016/j.ecolmodel.2019.06.002
29. Berrar D (2019) Cross-validation. Encycl Bioinform Comput Biol. https://doi.org/10.1016/b978-0-12-809633-8.20349-x
30. Ngueajio MK, Washington G, Rawat DB, Ngueabou Y (2023) Intrusion detection systems using support vector machines on the KDDCUP'99 and NSL-KDD datasets: a comprehensive survey. Intell Syst Appl. https://doi.org/10.1007/978-3-031-16078-3_42
31. Oladimeji TO, Ayo CK, Adewumi SE (2019) Review on insider threat detection techniques. J Phys Conf Ser. https://doi.org/10.1088/1742-6596/1299/1/012046
32. Han J, Kamber M, Pei J (2011) Getting to know your data and data preprocessing. In: Data mining: concepts and techniques, 3rd edn. Morgan Kaufmann, San Francisco, pp 39–124. https://doi.org/10.1016/C2009-0-61819-5
33. Yin C, Zhu Y, Fei J, He X (2017) A deep learning approach for intrusion detection using recurrent neural networks. IEEE Access. https://doi.org/10.1109/ACCESS.2017.2762418
34. Özgür A, Erdem H (2016) A review of KDD99 dataset usage in intrusion detection and machine learning between 2010 and 2015. PeerJ Preprints. https://doi.org/10.7287/peerj.preprints.1954v1
35. Liu L, Chen C, Zhang J, De Vel O, Xiang Y (2019) Insider threat identification using the simultaneous neural learning of multi-source logs. IEEE Access. https://doi.org/10.1109/access.2019.2957055
36. Zeng C, Lu H, Chen K, Wang R, Tao J (2023) Synthetic minority with cutmix for imbalanced image classification. Intell Syst Appl. https://doi.org/10.1007/978-3-031-16078-3_37
37. Wang Q, Yang G, Wang L, Fu J, Liu X (2023) SR-IDS: a novel network intrusion detection system based on self-taught learning and representation learning. In: Artificial neural networks and machine learning – ICANN 2023. https://doi.org/10.1007/978-3-031-44213-1_46
38. Zhang A, Lipton ZC, Li M, Smola AJ (2022) Linear neural networks. In: Dive into deep learning, 1st edn, pp 87–128
39. Moon SA (2020) Feature selection methods simultaneously improve the detection accuracy and model building time of machine learning classifiers. Symmetry. https://doi.org/10.3390/sym12091424
40. Scikit-learn (2023) sklearn.preprocessing.RobustScaler. Scikit-learn. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html?highlight=robust#sklearn.preprocessing.RobustScaler.fit. Accessed 15 May 2022
41. Scikit-learn (2022) Preprocessing data. Scikit-learn. https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing. Accessed 17 May 2022
42. Luo J, Qiao H, Zhang B (2021) Learning with smooth hinge losses. Neurocomputing. https://doi.org/10.1016/j.neucom.2021.08.060
43. Géron A (2017) Support vector machines. In: Hands-on machine learning with Scikit-Learn and TensorFlow: concepts, tools, and techniques to build intelligent systems, 1st edn. O'Reilly Media, Sebastopol, CA, pp 145–166
44. Manzali Y, Chahhou M, El Mohajir M (2017) Impure decision trees for AUC and log loss optimization. IEEE Xplore. https://doi.org/10.1109/WITS.2017.7934675
45. Scikit-learn (2014) sklearn.linear_model.LogisticRegression. Scikit-learn. https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html. Accessed 25 Oct 2023
46. Wicht D, Schneider M, Böhlke T (2019) On quasi-Newton methods in fast Fourier transform-based micromechanics. Int J Numer Methods Eng. https://doi.org/10.1002/nme.6283
47. Wang C, Sun D, Toh KC (2010) Solving log-determinant optimization problems by a Newton-CG primal proximal point algorithm. SIAM J Optim. https://doi.org/10.1137/090772514
48. Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: a library for large linear classification. J Mach Learn Res 9:1871–1874
49. Defazio A, Bach F, Lacoste-Julien S (2014) SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. arXiv (Cornell University). https://doi.org/10.48550/arxiv.1407.0202
50. Chen A, Chen B, Chai X, Rui B, Li H (2017) A novel stochastic stratified average gradient method: convergence rate and its complexity. arXiv (Cornell University). https://doi.org/10.48550/arxiv.1710.07783
51. Scikit-learn (2009) sklearn.ensemble.GradientBoostingClassifier. Scikit-learn. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html. Accessed 10 Oct 2023
52. Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Ann Stat. https://doi.org/10.1214/aos/1013203451
53. Scikit-learn (2023) sklearn.ensemble.AdaBoostClassifier. Scikit-learn. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html. Accessed 12 Oct 2023
54. Hastie T, Rosset S, Zhu J, Zou H (2009) Multi-class AdaBoost. Stat Interface. https://doi.org/10.4310/sii.2009.v2.n3.a8
55. Ferrario A, Hämmerli R (2019) On boosting: theory and applications. Soc Sci Res Netw. https://doi.org/10.3929/ethz-b-000383242
56. oneDAL (2023) AdaBoost multiclass classifier. oneDAL. https://oneapi-src.github.io/oneDAL/daal/algorithms/boosting/adaboost-multiclass.html. Accessed 20 Oct 2023
57. Scikit-learn (2019) Feature selection. Scikit-learn. https://scikit-learn.org/stable/modules/feature_selection.html. Accessed 18 May 2022
58. Chen T, Xu M, Tu J, Wang H, Niu X (2018) Relationship between omnibus and post-hoc tests: an investigation of performance of the F test in ANOVA. Shanghai Arch Psychiatry. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5925602/
59. Scikit-learn (2009) Cross-validation: evaluating estimator performance. Scikit-learn. https://scikit-learn.org/stable/modules/cross_validation.html. Accessed 22 May 2022
60. Wang Z, Zeng Y, Liu Y, Li D (2021) Deep belief network integrating improved kernel-based extreme learning machine for network intrusion detection. IEEE Access. https://doi.org/10.1109/ACCESS.2021.3051074
Funding
Open access funding provided by The Science, Technology & Innovation Funding Authority (STDF) in cooperation with The Egyptian Knowledge Bank (EKB).
Author information
Authors and Affiliations
Contributions
MF designed the study, performed the computations, interpreted the data, and wrote the manuscript. RHS and NH encouraged MF to investigate the problem of the study, supervised the study, and contributed to the final manuscript.
Corresponding author
Ethics declarations
Conflict of interest
Not applicable.
Ethical approval
There is not any ethical conflict.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Farouk, M., Sakr, R.H. & Hikal, N. Identifying the most accurate machine learning classification technique to detect network threats. Neural Comput & Applic 36, 8977–8994 (2024). https://doi.org/10.1007/s00521-024-09562-9