Abstract
Insider threats have recently become one of the most urgent cybersecurity challenges facing numerous organizations, including public infrastructure companies, major federal agencies, and state and local governments. Our purpose is to find the most accurate machine learning (ML) model for detecting insider attacks. In practice, a classifier is often selected after repeated evaluation trials of candidate models, which can allow unseen data (the test set) to leak into the models and introduce bias. Overfitting then follows from the repeated training and hyperparameter tuning: the models perform well on the training set while failing to generalize effectively to unseen data. This study uses a validation data set together with hyperparameter tuning to prevent these issues and to choose the best of our candidate models. Furthermore, our approach ensures that the selected model does not merely memorize the local area network (LAN) threats represented in the NSL-KDD data set. Results are gathered and analyzed for the following classifiers: support vector machine (SVM), decision tree (DT), logistic regression (LR), adaptive boosting (AdaBoost), gradient boosting (GB), random forests (RFs), and extremely randomized trees (ERTs). After analyzing the findings, we conclude that the AdaBoost model is the most accurate, with accuracies of 99% for DoS, 99% for probe, 96% for access, and 97% for privilege, as well as AUC values of 0.992 for DoS, 0.986 for probe, 0.952 for access, and 0.954 for privilege.
1 Introduction
Insiders, such as employees, have legitimate access to an enterprise's resources in order to perform their job duties; as a result, detecting insider threats is one of the most difficult challenges facing security administrators [1, 2]. This study therefore employs a variety of supervised machine learning classifiers, under specific selection criteria, to find the most accurate classifier for predicting these insider threats, mainly LAN attacks from the NSL-KDD data set [3,4,5].
According to [6], 94% of organizations suffered insider data breaches in the last 12 months, and 84% encountered security incidents caused by nontechnical human error (Insider Data Breach Survey, 2021). Human error is thus the leading cause of disastrous insider data breaches, while malicious insiders remain the top concern of department heads, cited by 28% of respondents.
[7] published a report stating that insider threat incidents increased by 44% over the last two years, with costs climbing by more than a third to USD 15 million (Cost of Insider Threats Global Report, 2022). In addition, the cost of corporate credential theft has risen by 65% since 2020, from USD 2 million to USD 4 million today. Furthermore, the time to contain an insider threat incident increased from 77 to 85 days, implying that organizations are spending more on containment operations. When incidents take more than 90 days to settle, organizations incur an average annual cost of USD 17 million [5].
Data breach attacks are classified into several categories [8]: passive attacks, active attacks, close-in attacks, insider attacks, and distribution attacks. Insider attacks are among the most significant threats to information systems because of their impact on confidentiality, integrity, and availability (CIA), especially when they occur on a LAN. These attacks can harm business operations, reputations, and finances [9].
The purpose of the study is to find the most accurate classifier for identifying insider attacks that occur on LANs. Additionally, the significance of the study lies in detecting irregular 'attacked' LAN traffic by developing Python code that uses scikit-learn as the backend machine learning library and plots the charts with the Plotly, Seaborn, and Matplotlib frameworks. To eliminate bias, a random search algorithm (RSA) is used to tune the hyperparameters, with K-fold and stratified cross-validation methods applied to avoid overfitting.
This study is divided into four sections. Sections 1 and 2 summarize related articles and previous studies. Section 3 discusses the proposed framework. Finally, Sect. 4 analyzes the study's findings.
1.1 Tuning hyperparameters and risk minimization
Hyperparameters, also known as nuisance parameters, are values that must be specified outside of the training procedure, such as the decision tree classifier's criterion and maximum depth. Hyperparameter tuning is the process of determining the optimal hyperparameter values to pass as input to the estimator; in many learning problems, the hyperparameters effectively index the family of candidate models. Because the optimal hyperparameters for one data set are not necessarily the best for another, the settings must be adjusted for each task. Before candidate estimators are compared, the hyperparameters must be tuned to minimize the expected risk [10,11,12].
1.2 Avoid overfitting and model selection
Using the test data set in the model selection procedure can introduce overfitting, because unseen data leaks into the model when selection is based on the best test-set metrics. Likewise, measuring test performance on the training data set also causes overfitting. An overfitted model produces inaccurate predictions and cannot handle or generalize to new (unseen) input; as a result, the model may become useless [13,14,15,16].
As a solution, a technique called cross-validation (CV) is employed to mitigate overfitting. It is a powerful tool for developing and selecting ML models: it ensures the data is used reliably by the classifiers, and it avoids the underfitting that a single fixed split can cause through data division, a lack of samples, or insufficient learning of the model [13, 14].
Cross-validation randomly separates the training set into two logical parts: a training set and a validation set. Together with the existing test set, as provided with our NSL-KDD data set, we have three sets in total. Each set serves a different purpose: the training set teaches the model, the validation set supports model selection and checks generalization to new data, and the test set evaluates the model's final performance [13, 14].
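To make the three-way split concrete, the following minimal sketch (with synthetic stand-in data, since NSL-KDD loading is covered later) carves a validation set out of the training file while the test file stays reserved for the final evaluation:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the NSL-KDD training file (hypothetical data).
rng = np.random.default_rng(0)
X_full = rng.normal(size=(1000, 10))
y_full = rng.integers(0, 2, size=1000)

# Hold out a validation set from the training data; the NSL-KDD test file
# remains untouched until the final evaluation stage.
X_train, X_val, y_train, y_val = train_test_split(
    X_full, y_full, test_size=0.2, stratify=y_full, random_state=42)
```

Cross-validation (Sect. 3.6) generalizes this idea by rotating the validation fold across the training data.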
1.3 Background of the study
ML has proven to be an ideal solution for problems like anomaly detection and network intrusion detection [17, 18]. Therefore, supervised ML algorithms are used to solve the problem of the study, owing to their speed of response in detecting threats. Supervised ML algorithms are divided into two types [19]: classification algorithms and regression algorithms. Classification algorithms address this issue since they can distinguish between two or more classes (normal, attack); in the framework, the predicted outcomes are discrete class labels [20,21,22]. The following supervised ML classification algorithms are used: linear support vector machines (SVMs), decision trees (DTs), and logistic regression (LR).
In addition, ensemble algorithms aid in solving both classification and regression problems. The goal of ensemble techniques is to combine several prediction models and enhance the results. The following supervised ML classification ensemble algorithms are used: adaptive boosting (AdaBoost), gradient boosting (GB), extremely randomized trees (ERTs), and random forests (RFs) [22].
Insiders can significantly exhaust an organization's resources, resulting in huge financial and human losses. Because of this, insider activities on a LAN must be recognized and their impact on security policies (CIA) identified. Since an insider's motivation can be personal, political, or economic [17, 23, 24], a plan must be prepared to detect all possible disasters and security concerns. As a result, the study aims to characterize these risks on a LAN immediately and to support security administrators instantly by identifying the most accurate ML classifier.
There is a need to comprehend the link between insiders and their threats. Insiders can gain access rights to networks either legally or illegally. Legally, departments may grant each other access because of differences across departments, joint ventures, outsourcing, and the recruitment of temporary employees such as consultants; there are therefore different levels of authorization granted to these insiders [5]. Their threats involve the misuse of legitimate access rights. However, there are several types of insiders, each with its own procedures, risks, and data sets. The NSL-KDD data set targets anyone connected either internally or remotely to a LAN [4, 17].
2 Related articles
This section discusses previous studies and articles related to the study at hand. In [25], the theoretical obstacles to detecting insider threats are addressed, which helped define the research topic. The study also lists the existing insider threat data set types, which include emails, authentication, login, HTTP, and files but exclude insider attacks; this highlights the importance of our research contributions. [26] proves that a one-hot-encoding approach is capable of converting categorical features into new individual binary features on which classification models can be trained. [3] offers a review of existing insider threat approaches that use NSL-KDD to detect DoS attacks. [27] provides context-specific definitions of ML model hyperparameters, their impact on decision-making performance, and various approaches to obtaining optimal values. In [28], the sensitivity of hyperparameter adjustment for eliminating bias in performance prediction is explored. The researchers carry out a detailed investigation, but the results are unsatisfactory, and the classifier cannot be generalized to another test data set; as a result, additional research is needed to locate the best classifier during the creation of a machine learning system.

The problems of both overfitting and underfitting are presented in [14], along with their impact on model performance for decision-making and cross-validation methods as a solution. [29] extended and simplified significant CV approaches for developing a final ML-based model. The researcher stresses the importance of generalizing to unseen data to maximize the potential of predictive models and avoid overfitting, concluding that generalization to unseen data cannot be overlooked and that model building should not be limited to training and testing alone. [23] offers an in-depth examination of the NSL-KDD data set, analyzing the issues found in the KDD99 data set as well as evaluation metrics. [17] presents a comprehensive review and in-depth understanding of insider threats based on previously published articles and statistical data on both insiders and the methodologies employed to detect them. However, the reported results of supervised machine learning are disappointing, and most articles in the review focus on outside attacks from emails, HTTP, illegal file access, and devices while ignoring dangers from within the network. In contrast, this study is concerned with LAN insider attacks, since they are more common, easily motivated, and cause maximum damage.

[30] details the NSL-KDD data set features, the concerns observed in KDD99, and the attack type classifications. The researchers in [27] conduct a survey assessment of insider threat concerns. They state that gauging the extent of the insider threat is a complicated challenge, since it is usually difficult to distinguish between insiders and outsiders of a community operating within a LAN; furthermore, some insiders can initiate attacks from the outside, for example, an employee who has left the organization. The article discusses the challenge of identifying internal attacks that take place on the internal network. Therefore, the main motive behind the current study is to find the most accurate classifier that identifies these threats correctly.
[19] demonstrated supervised machine learning methods and the significance of classification models in separating anomalous behavior from normal traffic. [31] highlights insider threat aspects and the methods to confront these threats using either machine learning or non-machine learning techniques. In [32], zero-substitution is shown to be a solution for restoring missing values in data sets.
After surveying the above-listed studies, we conclude that ML techniques are the best solution for insider threat identification. Consequently, one of the reasons behind conducting our research is to locate the best ML technique to address the challenge of identifying insider threats.
3 Case study
In this section, the study focuses on clarifying the methodology utilized in this research paper. Figure 1 depicts the process framework which consists of seven stages: (1) collect data set, (2) preprocessing, (3) tuning models, (4) feature selection, (5) avoid overfitting, (6) training models, and (7) final evaluation. In the following subsections, each stage is explained in detail.
3.1 Data set description
MIT Lincoln Labs developed and managed the 1998 DARPA intrusion detection evaluation program, building a LAN that simulated a US Air Force LAN, conducting several attacks, and gathering raw TCP dump data. Data flowed from a source IP address to a destination IP address under a specific protocol, allowing normal and malicious connections to be distinguished [13, 17]. A connection was defined as a sequence of TCP packets transmitted during a certain interval. Afterward, MIT Lincoln Labs extracted features from the raw DARPA data and packaged them into the first ready-to-use version, known as KDD99. However, several issues were later discovered in the KDD99 data set [30, 33, 34]. Many of these were solved in the updated NSL-KDD data set, for example by removing redundant records, which reduces the size of the training and test sets and makes experiments easier and faster [21, 30, 35].
3.2 Data set analysis
The NSL-KDD data set contains 41 features and is divided into two files: a training data set file and a test data set file. The training data set file has 125,973 records, while the test data set file has 22,544 records [30].
As shown in Fig. 2, the Python code, using Pandas as the data analysis tool, classifies the feature data types into object (nominal), int64, and float64 [13, 19, 23]. Figure 2 also displays the counts of, and variation in, the unique values among the nominal features in the training and test data sets. Notably, the service feature has 70 unique values in the training data set but only 64 in the test data set, a discrepancy the researcher tackles in the preprocessing phase.
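As an illustration, a short pandas sketch along these lines reproduces the checks described above (the file names follow the common NSL-KDD distribution and are an assumption here; the official files carry no header row):

```python
import pandas as pd

# Hypothetical file names; the NSL-KDD files ship without a header row.
train = pd.read_csv("KDDTrain+.txt", header=None)
test = pd.read_csv("KDDTest+.txt", header=None)

print(train.dtypes.value_counts())               # object / int64 / float64 mix
print(train.select_dtypes("object").nunique())   # unique values per nominal feature
print(test.select_dtypes("object").nunique())    # e.g., fewer service values
```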
The class label contains five main categories of classifications [21, 30, 33]:

i. Normal: normal connections.

ii. DoS: denial-of-service, for example, Smurf.

iii. Probing: surveillance, such as port sweep.

iv. Access: unauthorized remote machine access, e.g., spying.

v. Privilege: unauthorized access to local superuser (root) privileges, e.g., Rootkit.
The probability distribution of the training data set differs from that of the test data set: the test data set deliberately contains more attack variety than the training set, so that the estimators must predict new offensives and the system simulates reality [21, 30, 33]. Figure 3 shows the class label sizes in the NSL-KDD training data set, whereas Fig. 4 illustrates the class label sizes in the NSL-KDD test data set.
3.3 Data preprocessing
This stage is one of the most critical phases of the machine learning approach. Figure 5 exhibits the data flow diagram (DFD) for the data preprocessing procedure. As shown in Fig. 5, we first separate the numerical and categorical features. Then, we repair the mismatch in the service feature between the training and test sets. Afterward, we apply transformation techniques to the categorical features. Finally, we perform scaling on all features before recombining them. Preprocessing aims to transform the raw data set into a useful format while ensuring that it is clean and noise-free so that the estimator's decisions are not affected [32]. The following section describes the preprocessing methods applied to the training and test data sets:
3.3.1 Data transformation
After checking the purity of the data, transformation techniques are applied because most machine learning models do not accept categorical features. First, categorical features are converted into numbers, a process known as 'encoding' [32]. There are four categorical features in the NSL-KDD data set: protocol type, service, flag, and the class label. After that, all features are standardized to carry the same weight so that the classifiers cannot favor values merely because of their greater magnitude. The data transformation methods used are one-hot-encoding and class label encoding.
One-hot-encoding is also known as dummy encoding. This process converts categorical features into binary ones and is carried out in two stages [26]. First, the unique values of each categorical feature are transformed into new binary features. Then, for each connection, the binary feature corresponding to the observed value is assigned 1 and the remainder are assigned 0. This methodology was used only for the categorical input features [36, 37], namely protocol type, service, and flag; the class label was handled differently.
As displayed in Fig. 2, there is a striking difference between the service feature values offered by the training data set and those in the test data set. To align the test data set with the training data set, a zero value compensates for the missing values [32, 38]. Table 1 exhibits samples of the protocol-type feature after applying dummy encoding.
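A minimal sketch of this encode-and-align step, using toy frames in place of the full NSL-KDD columns, could look as follows:

```python
import pandas as pd

# Toy stand-ins for the NSL-KDD categorical columns.
train = pd.DataFrame({"protocol_type": ["tcp", "udp", "icmp"],
                      "service": ["http", "ftp", "smtp"]})
test = pd.DataFrame({"protocol_type": ["tcp", "tcp", "udp"],
                     "service": ["http", "domain", "ftp"]})

train_enc = pd.get_dummies(train)
# Align the test set to the training columns; service values absent from
# the test file become all-zero columns, as described above.
test_enc = pd.get_dummies(test).reindex(columns=train_enc.columns, fill_value=0)
```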
Beyond the 38 original numeric features [39], the data set grows after encoding to a total of 122 features: the added features comprise three protocol-type features, 70 service features, and 11 flag features. Table 2 presents the complete number of features after encoding.
The class label contains sub-attacks that fall within the scope of five main categories, namely DoS, probe, access, privilege, and normal connections. Each attack is converted into a unique integer in the same class label column, as shown in Table 3. After converting all categorical attributes to integers, each attack group is handled individually, with normal traffic included alongside it so that the models can differentiate between regular and irregular connections (a sketch of this mapping follows below). Figure 6 depicts the magnitude of each attack type as well as normal traffic in both the training and test data sets.
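A hedged sketch of this label mapping, listing only a few representative sub-attacks (the full NSL-KDD label list is longer), might read:

```python
import pandas as pd

# Partial mapping from sub-attack labels to the five category integers.
category_of = {"normal": 0,
               "smurf": 1, "neptune": 1,            # DoS
               "portsweep": 2, "nmap": 2,           # Probe
               "spy": 3, "guess_passwd": 3,         # Access
               "rootkit": 4, "buffer_overflow": 4}  # Privilege

labels = pd.Series(["normal", "smurf", "portsweep", "spy", "rootkit"])
print(labels.map(category_of))  # -> 0, 1, 2, 3, 4
```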
3.3.2 Scaling
Feature scaling aims to place all features on the same scale, so that all features are treated as equally important [13]. Figure 7 depicts the data set before any scaling is applied. Two approaches are used: the robust scaler and standardization.
First, the robust scaler reduces the influence of outliers by subtracting the median and scaling the data according to the interquartile range [13, 40]. Figure 8 depicts the data set after applying the robust scaler. The following formula (1) is used:

\(X_{new} = \frac{X_i - X_{median}}{Q_3 - Q_1}\)  (1)

where \(X_{new}\) is the scaled value, \(X_i\) the original value, \(X_{median}\) the sample median, and \(Q_1\) and \(Q_3\) the first and third quartiles.
Second, standardization, implemented in Python as the standard scaler (SS), rescales the values so that the standard deviation equals 1 and the mean becomes 0 [13, 41]. Figure 9 displays the data after applying SS. The z-score Eq. (2) defines this SS:

\(z = \frac{X - \mu}{\sigma}\)  (2)

where \(X\) is the value, \(\mu\) the mean, and \(\sigma\) the standard deviation.
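In scikit-learn terms, the two scaling steps could be sketched as follows (a toy array stands in for the real features; in the study's pipeline the scalers would be fit on the training set and reused on the test set):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 10000.0]])  # one outlier

robust = RobustScaler().fit(X_train)        # Eq. (1): median and IQR
X_robust = robust.transform(X_train)

standard = StandardScaler().fit(X_robust)   # Eq. (2): z-score
X_scaled = standard.transform(X_robust)
```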
3.4 Tuning model
The tuning process is a matter of trial and error: the statistical ML model is trained repeatedly with different hyperparameter values [13, 14], and its performance on the validation set is then compared to determine which set of hyperparameters yields the most accurate model [6]. The main technique used for tuning the model is known as RSA.
RSA defines, for each hyperparameter, a statistical distribution from which values are randomly sampled and used to train the model. This increases the likelihood of quickly finding effective values for each hyperparameter [6, 12]. Table 4 depicts the results of the RSA: the optimal hyperparameters of each model, the hyperparameters most affecting the model's decision process, their data types, the default values at which each model operated, the start-end random ranges, and the chosen optimal values.
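A minimal sketch of RSA with scikit-learn's RandomizedSearchCV, using placeholder search ranges rather than the exact start-end values of Table 4, could look like this:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)  # stand-in data

# Placeholder distributions; Table 4 lists the actual ranges used.
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions={"criterion": ["gini", "entropy"],
                         "max_depth": randint(2, 30)},
    n_iter=20, cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_)
```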
In the following paragraphs, we describe the mathematical functions of the hyperparameters in our models. First, the linear SVM model employs two equations for the loss hyperparameter. The hinge is a cost function in which a margin, or distance from the classification border, is defined according to Eq. (3), and the squared hinge by Eq. (4) [42, 43], where \(t\) is the actual label, either +1 or -1, and \(y\) is the classifier's raw output:

\(\ell(y) = \max(0, 1 - t \cdot y)\)  (3)

\(\ell(y) = \max(0, 1 - t \cdot y)^2\)  (4)
Second, the criterion hyperparameter contains three arguments in the DT model: Gini, entropy, and log loss. The equations are as follows [14, 32, 43, 44]:
The Gini index determines the split for each feature and quantifies the impurity of \(D\). The following formula (5) defines it:

\(Gini(D) = 1 - \sum_{i=1}^{m} p_i^2\)  (5)

where \(p_i\) is the probability that a tuple in \(D\) belongs to class \(C_i\), estimated by \(|C_{i,D}|/|D|\); the sum is taken over the \(m\) classes.
Entropy is an information metric used to evaluate the impurity or uncertainty in a set of data, and it controls how a decision tree splits the data. With \(p_x\) denoting the probability of the \(x\)th class in the data set \(D\), where \(x = 1, 2, \ldots, n\), entropy is computed by formula (6):

\(Entropy(D) = -\sum_{x=1}^{n} p_x \log_2 p_x\)  (6)
Log loss is employed when predicting a Boolean (true or false) outcome with a likelihood ranging from certainly true (1) to certainly false (0). The log loss formula (7) is defined as:

\(LogLoss = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]\)  (7)

where \(N\) is the number of instances, \(y_i\) the true label, and \(p_i\) the model's predicted likelihood.
Third, the solver hyperparameter in the LR model has five options [45]. To begin with, Newton's approach approximates \(f(x)\) with a quadratic function around \(x_n\) in each iteration [46, 47]. Then, the limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm (L-BFGS) keeps only a few vectors and estimates the inverse Hessian matrix using gradient evaluations [46]. In addition, the library for large linear classification (liblinear) employs a coordinate descent (CD) approach, solving the optimization problem by sequential approximate minimization along coordinate directions [48]. Furthermore, stochastic average gradient descent (SAG) is an iterative gradient-descent optimization with an incremental aggregated gradient modification that reuses a random sample of prior gradient values; it suits large data sets because it can be processed quickly [49, 50]. Finally, SAGA is an extension of SAG that is considered an improved version with faster convergence [49, 50].
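For illustration, the five solver options map directly onto scikit-learn's LogisticRegression argument; a sketch on synthetic data, not the tuned model of Table 4:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, random_state=0)
for solver in ["newton-cg", "lbfgs", "liblinear", "sag", "saga"]:
    acc = LogisticRegression(solver=solver, max_iter=2000).fit(X, y).score(X, y)
    print(solver, round(acc, 3))
```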
Fourth, the criterion hyperparameter in the GB model evaluates the quality of a data split. It offers 'friedman_mse' for mean squared error (MSE) with Friedman's improvement score and 'squared_error' for plain mean squared error. The Friedman improvement score Eq. (8) and the MSE Eq. (9) are defined as [51, 52]:

\(Improvement = \frac{W_l W_r}{W_l + W_r} \left( \overline{y}_l - \overline{y}_r \right)^2\)  (8)

where \(W_l\) is the sum of weights of the left partition, \(W_r\) the sum of weights of the right partition, and \(\overline{y}_l\) and \(\overline{y}_r\) the left and right means;

\(MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - p_i)^2\)  (9)

where \(y_i\) is the \(i\)th observed value, \(p_i\) the corresponding predicted value, and \(n\) the number of observed values.
Fifth, the 'max_features' hyperparameter in the GB model caps the number of features considered for each individual tree. The first option, sqrt, takes the square root of the total number of features, Eq. (10) [51]:

\(max\_features = \sqrt{n\_features}\)  (10)

Another option is \(\log_2\), which takes the base-2 logarithm of the number of features, Eq. (11):

\(max\_features = \log_2(n\_features)\)  (11)
Sixth, the algorithm hyperparameter of the AdaBoost classifier offers two options: 'SAMME.R' and 'SAMME'. SAMME stands for stagewise additive modeling using a multi-class exponential loss function, and the R stands for real. SAMME employs a separate set of 'decision influence' weights (alphas), one per weak learner. SAMME.R, in contrast, assigns an equal weight to each weak learner and evaluates class likelihoods, which usually converges faster than SAMME [53,54,55,56]. The SAMME and SAMME.R predictions, Eqs. (12) and (13), take the form:

\(H(x) = \arg\max_{k} \sum_{t=1}^{T} \alpha_t \, \mathbb{1}\left[ h_t(x) = k \right]\)  (12)

\(H(x) = \arg\max_{k} \sum_{t=1}^{T} s_k^t(x)\)  (13)

where \(H(x)\) is the classification prediction, \(T\) the number of weak learners, \(\alpha_t\) the weight of weak learner \(t\), \(h_t(x)\) the prediction of weak learner \(t\), and \(s_k^t\) a class-probability multiplier.
3.5 Features selection
This section lists the features that are used throughout the training models. The following approaches are employed:
3.5.1 Univariate feature selection (UFS)
It is a statistical method that exploits the variance discrepancies among the features to derive a threshold value. This threshold determines the number of features that the recursive feature elimination method retains for training the models [19, 22, 57].
The 'f_classif' function in the scikit-learn ML framework measures variance using univariate statistical tests based on the analysis of variance (ANOVA) F value [19, 22, 58]. A greater F value arises when the variance between groups exceeds the variance within groups, indicating a higher likelihood that the observed difference is real rather than random; features whose variance does not distinguish the classes are excluded, and the most discriminative features are selected. This technique picked 13 features for each attack category, and the recursive feature elimination approach then used 13 as its threshold. The F statistic Eqs. (14) and (15) in one-way ANOVA are [19, 22, 58]:

\(F = \frac{MS_{between}}{MS_{within}}\)  (14)

\(MS_{between} = \frac{SS_{between}}{I - 1}, \qquad MS_{within} = \frac{SS_{within}}{n_T - I}\)  (15)

where MS is the mean square, SS the sum of squares, \(I\) the number of groups, and \(n_T\) the total sample size.
3.5.2 Recursive Feature Elimination (RFE)
It is a type of wrapper around feature selection algorithms. RFE seeks to identify acceptable feature subsets and operates immediately after the UFS technique. First, each model is implemented individually with its tuned hyperparameters. Then, all features are passed in to establish their relevance relative to one another, and the least important features are pruned. RFE recursively repeats this process on the reduced set until it reaches the requisite feature count, defined as the threshold by the UFS method [19, 22, 57]. Table 5 reports the features selected by RFE for each model and attack category based on UFS's threshold.
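The two-stage selection can be sketched with scikit-learn as follows (synthetic data and a decision tree as the wrapped estimator are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=40, random_state=0)

# Stage 1 (UFS): ANOVA F-scores rank the features; the study derives a
# threshold of 13 from this step.
ufs = SelectKBest(f_classif, k=13).fit(X, y)

# Stage 2 (RFE): recursively drop the least important features until the
# UFS threshold of 13 remains.
rfe = RFE(DecisionTreeClassifier(random_state=0),
          n_features_to_select=13).fit(X, y)
print(rfe.support_.sum())  # -> 13 retained features
```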
3.6 Cross-validation
As previously indicated, cross-validation [13, 14, 22] is an efficient instrument for designing and choosing ML models. It is employed in the study to avoid overfitting. The following methods, which are part of cross-validation, are used in the study.
3.6.1 K-Fold CV
The original training data set is divided into K equal-sized folds (subsamples) via random sampling. The model is trained on K-1 folds and validated on the remaining fold, repeating the process and recording the arithmetic mean and standard deviation of the evaluation measures over the K partitions [6, 22]. Tables 6, 7, 8, 9, 10, 11, and 12 show the outcomes (accuracy, recall, and area under the curve) as K-fold CV mean ± standard deviation across folds for each model, with the better results highlighted in bold in Table 10.
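A short sketch of the K-fold procedure with scikit-learn (synthetic data; the study's tables report the corresponding mean ± standard deviation per model):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
scores = cross_val_score(AdaBoostClassifier(random_state=0), X, y,
                         cv=KFold(n_splits=10, shuffle=True, random_state=0),
                         scoring="accuracy")
print(f"{scores.mean():.3f} +/- {scores.std():.3f}")
```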
3.6.2 Stratified K-fold CV
It is the same as K-fold CV but uses stratified sampling to avoid two issues: the purely random sampling of the K-fold CV method and class imbalance in the data set. Each stratum keeps nearly the same class proportions as the original data set, so every fold contains the same ratio of normal and attack samples. Consequently, whichever criteria are used to evaluate them, the findings are consistent across all folds [13, 22, 59]. Table 13 illustrates the results of stratified K-fold CV applied to the model that achieved the best K-fold CV results. The stratified K-fold CV approach delivers good results for the AdaBoost model, confirming that the above-mentioned concerns are addressed.
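Swapping in StratifiedKFold preserves the class ratio in every fold; a sketch on a deliberately imbalanced synthetic set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, weights=[0.8], random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(AdaBoostClassifier(random_state=0), X, y, cv=cv)
print(scores.mean())
```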
3.7 Training models
This stage covers training the machine learning algorithms on the training data set. Algorithm (1) demonstrates the implementation phase of the framework, which is written in Python and uses the scikit-learn framework as the backend ML tool to analyze the predictions and find the best model for assessing normal and abnormal behavior on a LAN.
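A condensed sketch of the training loop behind Algorithm (1), with default constructors standing in for the tuned hyperparameters of Table 4, could be:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

X_train, y_train = make_classification(n_samples=500, random_state=0)

models = {"SVM": LinearSVC(), "DT": DecisionTreeClassifier(),
          "LR": LogisticRegression(max_iter=2000),
          "AdaBoost": AdaBoostClassifier(),
          "GB": GradientBoostingClassifier(),
          "RF": RandomForestClassifier(), "ERT": ExtraTreesClassifier()}

fitted = {name: model.fit(X_train, y_train) for name, model in models.items()}
```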
3.8 Final evaluation
The final evaluation is performed using the test set. The primary principles for testing the models are their capacity to adjust appropriately to new, previously unobserved data and the model's quality, which is determined via evaluation measures. Performance estimators are derived from the confusion matrix (CM), which visualizes the prediction results [22, 23, 33, 39] through four rates, as indicated in Table 14. Each column of the CM represents the number of predicted values, while each row represents the number of actual values.
TP = True Positive (Normal Traffic Predicted as Normal).
TN = True Negative (Malicious Traffic Predicted as Malicious).
FP = False Positive (Malicious Traffic Predicted as Normal).
FN = False Negative (Normal Traffic Predicted as Malicious).
Our research findings reveal that the AdaBoost model achieves the highest accuracy, as exhibited in Fig. 10, which shows the CMs of the experiment's predicted versus actual values, based on the above-mentioned rates, for the AdaBoost model.
Accuracy (Acc), recall (Rec) or true-positive rate [14, 23, 33], and the area under the receiver operating characteristic curve (AUC-ROC) are three essential assessment metrics derived from the rates listed above. The accuracy score denotes the proportion of true-positive and true-negative predictions among all predictions made by the model, Eq. (16):

\(Acc = \frac{TP + TN}{TP + TN + FP + FN}\)  (16)

The recall [14, 23, 33, 39] of the ML model indicates its ability to capture the proportion of true positives that are correctly classified, Eq. (17), while the AUC-ROC indicates how often positive predictions are ranked higher than negative predictions. The ROC curve plots the false-positive rate (FPR), Eq. (18), on the x-axis versus the TPR on the y-axis, and Eqs. (17) and (18) are used to compute the AUC-ROC [13, 14, 22, 39]:

\(Rec = TPR = \frac{TP}{TP + FN}\)  (17)

\(FPR = \frac{FP}{FP + TN}\)  (18)

Table 15 and Fig. 11 show the findings for the most accurate model (AdaBoost).
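The three metrics follow directly from the confusion matrix; a toy computation with scikit-learn (illustrative values only, not the study's results; 1 marks the positive class, which is normal traffic per Table 14):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             recall_score, roc_auc_score)

y_true = [0, 0, 1, 1, 1]             # toy labels
y_pred = [0, 1, 1, 1, 0]             # toy predictions
y_score = [0.1, 0.6, 0.8, 0.9, 0.4]  # toy positive-class probabilities

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(accuracy_score(y_true, y_pred))   # Eq. (16)
print(recall_score(y_true, y_pred))     # Eq. (17), the TPR
print(roc_auc_score(y_true, y_score))   # area under the TPR-vs-FPR curve
```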
Figure 12 describes the AUC-ROC evaluation for attack detection by the AdaBoost model. It accurately identifies attack samples, with AUC values of 0.992 for DoS (7410 of 7460 samples), 0.986 for probe (2374 of 2421 samples), 0.952 for access (2677 of 2885 samples), and 0.954 for privilege (62 of 67 samples).
3.8.1 Comparison with related works
We compared our proposed method with existing related works that used the NSL-KDD data set. Three metrics were used to compare performance: recall, AUC (TPR vs. FPR), and accuracy. Our study produces superior results with the AdaBoost model across all attack categories. For example, our DoS detection reaches a recall of 99.3% and an AUC of 0.992, with overall accuracy across all attack categories of 98.5%. In contrast, the best results of [60] were a recall of 96.5%, an AUC of 0.980, and an overall accuracy of 94%. Table 16 displays all the results.
4 Conclusion and future work
This study aims to determine the most accurate ML classifier for detecting LAN attacks. The research findings demonstrate that the AdaBoost model has the highest classification accuracy for both insider attacks and normal traffic behavior, with 99% for DoS, 98% for probe, 96% for access, and 97% for privilege, as well as AUC values of 0.992 for DoS, 0.986 for probe, 0.952 for access, and 0.954 for privilege. The study is carried out on the publicly accessible NSL-KDD data set, with AUC results that surpass previous approaches on this data set thanks to the strategies used to remove noise from the data, the choice of relevant features, the tuning of hyperparameters, and the minimization of bias. As a future recommendation, the techniques used in this study might be integrated into firewall configurations to identify insider threats, assisting cybersecurity specialists in keeping the work environment secure and minimizing risks.
Data availability
The data set supporting the conclusions of this article is available in the University of New Brunswick repository. Here is the hyperlink to a data set: http://205.174.165.80/CICDataset/NSL-KDD/Dataset/
Abbreviations
NSL: Network Security Laboratory
KDD: Knowledge Discovery in Databases
ML: Machine learning
DoS: Denial-of-service
LAN: Local area network
SVM: Support vector machine
DT: Decision tree
LR: Logistic regression
AdaBoost: Adaptive boost
GB: Gradient boosting
RFs: Random forests
ERTs: Extremely randomized trees
CV: Cross-validation
CIA: Confidentiality, integrity, availability
HTTP: Hypertext Transfer Protocol
MIT: Massachusetts Institute of Technology
US: United States
DFD: Data flow diagram
TCP: Transmission control protocol
UDP: User datagram protocol
ICMP: Internet control message protocol
SS: Standard scaler
RSA: Random search algorithm
SAG: Stochastic average gradient
CD: Coordinate descent
MSE: Mean squared error
SAMME: Stagewise additive modeling with a multi-class exponential loss function
SAMME.R: Stagewise additive modeling with a multi-class exponential loss function, real
L-BFGS: Limited-memory Broyden–Fletcher–Goldfarb–Shanno
UFS: Univariate feature selection
ANOVA: Analysis of variance
RFE: Recursive feature elimination
TP: True positive
TN: True negative
FP: False positive
FN: False negative
TPR: True-positive rate
FPR: False-positive rate
Acc: Accuracy
Rec: Recall
CM: Confusion matrix
AUC: Area under the curve
ROC: Receiver operating characteristic
References
1. Cybersecurity and Infrastructure Security Agency (2022) Insider threat mitigation. CISA. https://www.cisa.gov/insider-threat-mitigation. Accessed 20 Aug 2022
2. Yuan S, Wu X (2021) Deep learning for insider threat detection: review, challenges, and opportunities. Comput Secur. https://doi.org/10.1016/j.cose.2021.102221
3. Kim A, Oh J, Ryu J, Lee K (2020) A review of insider threat detection approaches with IoT perspective. IEEE Access, special section on secure communication for the next generation 5G and IoT networks. https://doi.org/10.1109/ACCESS.2020.2990195
4. Parveen P et al (2011) Insider threat detection using stream mining and graph mining. In: IEEE third international conference on privacy, security, risk and trust and IEEE third international conference on social computing. https://doi.org/10.1109/PASSAT/SocialCom.2011.211
5. Elmrabit N et al (2020) Insider threat risk prediction based on Bayesian network. Comput Secur. https://doi.org/10.1016/j.cose.2020.101908
6. Egress (2021) 94% of organizations suffer data breaches. Egress. https://www.egress.com/newsroom/94-percent-of-organisations-have-suffered-insider-data-breaches. Accessed 9 Apr 2022
7. Proofpoint (2022) 2022 Ponemon cost of insider threats global report. Proofpoint. https://protectera.com.au/wp-content/uploads/2022/03/The-Cost-of-Insider-Threats-2022-Global-Report.pdf. Accessed 30 Apr 2022
8. Dastres R, Soori M (2021) A review in recent development of network threats and security measures. Int J Inf Sci Comput Eng 15(1). https://hal.science/hal-03128076
9. Korotka MS, Yin LR, Basu SC (2014) Information assurance technical framework: an end user perspective. J Inf Priv Secur. https://doi.org/10.1080/15536548.2005.10855759
10. Lei J (2019) Cross-validation with confidence. J Am Stat Assoc. https://doi.org/10.1080/01621459.2019.1672556
11. Probst P, Boulesteix AL, Bischl B (2019) Tunability: importance of hyperparameters of machine learning algorithms. J Mach Learn Res 20(1):1934–1965
12. Esmaeili A et al (2023) Agent-based collaborative random search for hyperparameter tuning and global function optimization. Systems. https://doi.org/10.3390/systems11050228
13. Montesinos López OA, Montesinos López A, Crossa J (2022) General elements of genomic selection and statistical learning, preprocessing tools for data preparation, and overfitting, model tuning, and evaluation of prediction performance. In: Multivariate statistical machine learning methods for genomic prediction. Springer, Cham, pp 25–139. https://doi.org/10.1007/978-3-030-89010-0
14. Zhou ZH (2021) Model selection and evaluation. In: Machine learning, 1st edn. Springer, Singapore, pp 25–55. https://doi.org/10.1007/978-981-15-1967-3
15. Yates LA (2021) Parsimonious model selection using information theory: a modified selection rule. Ecology. https://doi.org/10.1002/ecy.3475
16. Yates LA (2022) Cross validation for model selection: a review with examples from ecology. Ecol Monogr. https://doi.org/10.1002/ecm.1557
17. Al-Mhiqani MN, Ahmad R, Zainal Abidin Z, Yassin W, Hassan A, Abdulkareem KH, Ali NS, Yunos Z (2020) A review of insider threat detection: classification, machine learning techniques, datasets, open challenges, and recommendations. Appl Sci. https://doi.org/10.3390/app10155208
18. Kim A et al (2019) SoK: a systematic review of insider threat detection. J Wirel Mob Netw Ubiquitous Comput Dependable Appl. https://doi.org/10.22667/JOWUA.2019.12.31.046
19. Sarker IH (2021) Machine learning: algorithms, real-world applications and research directions. SN Comput Sci 2(3):160. https://doi.org/10.1007/s42979-021-00592-x
20. Bin Sarhan B, Altwaijry N (2023) Insider threat detection using machine learning approach. Appl Sci. https://doi.org/10.3390/app13010259
21. Abualkibash M (2019) Intrusion detection system classification using different machine learning algorithms on KDD-99 and NSL-KDD datasets - a review paper. Int J Comput Sci Inf Technol. https://doi.org/10.5121/ijcsit.2019.11306
22. Müller AC, Guido S (2017) Introduction to machine learning with Python: a guide for data scientists. O'Reilly Media, Sebastopol, CA
23. Xu W, Jang-Jaccard J, Singh A, Wei Y, Sabrina F (2021) Improving performance of autoencoder-based network anomaly detection on NSL-KDD dataset. IEEE Access. https://doi.org/10.1109/ACCESS.2021.3116612
24. Alsowail RA, Al-Shehari T (2022) Techniques and countermeasures for preventing insider threats. PeerJ Comput Sci. https://doi.org/10.7717/peerj-cs.938
25. Yuan S, Wu X (2021) Deep learning for insider threat detection: review, challenges and opportunities. Comput Secur 104:102221. https://doi.org/10.1016/j.cose.2021.102221
26. Scikit-learn (2019) sklearn.preprocessing.OneHotEncoder. Scikit-learn. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html. Accessed 5 May 2022
27. Homoliak I, Toffalini F, Guarnizo J, Elovici Y, Ochoa M (2019) Insight into insiders and IT: a survey of insider threat taxonomies, analysis, modeling, and countermeasures. ACM Comput Surv. https://doi.org/10.1145/3303771
28. Schratz P, Muenchow J, Iturritxa E, Richter J, Brenning A (2019) Hyperparameter tuning and performance assessment of statistical and machine-learning algorithms using spatial data. Ecol Model. https://doi.org/10.1016/j.ecolmodel.2019.06.002
29. Berrar D (2019) Cross-validation. Encycl Bioinform Comput Biol. https://doi.org/10.1016/b978-0-12-809633-8.20349-x
30. Ngueajio MK, Washington G, Rawat DB, Ngueabou Y (2023) Intrusion detection systems using support vector machines on the KDDCUP'99 and NSL-KDD datasets: a comprehensive survey. Intell Syst Appl. https://doi.org/10.1007/978-3-031-16078-3_42
31. Oladimeji TO, Ayo CK, Adewumi SE (2019) Review on insider threat detection techniques. J Phys Conf Ser. https://doi.org/10.1088/1742-6596/1299/1/012046
32. Han J, Kamber M, Pei J (2011) Getting to know your data and data preprocessing. In: Data mining: concepts and techniques, 3rd edn. Morgan Kaufmann, San Francisco, pp 39–124. https://doi.org/10.1016/C2009-0-61819-5
33. Yin C, Zhu Y, Fei J, He X (2017) A deep learning approach for intrusion detection using recurrent neural networks. IEEE Access. https://doi.org/10.1109/ACCESS.2017.2762418
34. Özgür A, Erdem H (2016) A review of KDD99 dataset usage in intrusion detection and machine learning between 2010 and 2015. PeerJ Preprints. https://doi.org/10.7287/peerj.preprints.1954v1
35. Liu L, Chen C, Zhang J, De Vel O, Xiang Y (2019) Insider threat identification using the simultaneous neural learning of multi-source logs. IEEE Access. https://doi.org/10.1109/access.2019.2957055
36. Zeng C, Lu H, Chen K, Wang R, Tao J (2023) Synthetic minority with cutmix for imbalanced image classification. Intell Syst Appl. https://doi.org/10.1007/978-3-031-16078-3_37
37. Wang Q, Yang G, Wang L, Fu J, Liu X (2023) SR-IDS: a novel network intrusion detection system based on self-taught learning and representation learning. In: Artificial neural networks and machine learning – ICANN 2023. https://doi.org/10.1007/978-3-031-44213-1_46
38. Zhang A, Lipton ZC, Li M, Smola AJ (2022) Linear neural networks. In: Dive into deep learning, 1st edn, pp 87–128
39. Moon SA (2020) Feature selection methods simultaneously improve the detection accuracy and model building time of machine learning classifiers. Symmetry. https://doi.org/10.3390/sym12091424
40. Scikit-learn (2023) sklearn.preprocessing.RobustScaler. Scikit-learn. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html?highlight=robust#sklearn.preprocessing.RobustScaler.fit. Accessed 15 May 2022
41. Scikit-learn (2022) Preprocessing data. Scikit-learn. https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing. Accessed 17 May 2022
42. Luo J, Qiao H, Zhang B (2021) Learning with smooth hinge losses. Neurocomputing. https://doi.org/10.1016/j.neucom.2021.08.060
43. Géron A (2017) Support vector machines. In: Hands-on machine learning with Scikit-Learn and TensorFlow: concepts, tools, and techniques to build intelligent systems, 1st edn. O'Reilly Media, Sebastopol, CA, pp 145–166
44. Manzali Y, Chahhou M, El Mohajir M (2017) Impure decision trees for AUC and log loss optimization. IEEE Xplore. https://doi.org/10.1109/WITS.2017.7934675
45. Scikit-learn (2014) sklearn.linear_model.LogisticRegression. Scikit-learn. https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html. Accessed 25 Oct 2023
46. Wicht D, Schneider M, Böhlke T (2019) On quasi-Newton methods in fast Fourier transform-based micromechanics. Int J Numer Methods Eng. https://doi.org/10.1002/nme.6283
47. Wang C, Sun D, Toh KC (2010) Solving log-determinant optimization problems by a Newton-CG primal proximal point algorithm. SIAM J Optim. https://doi.org/10.1137/090772514
48. Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: a library for large linear classification. J Mach Learn Res 9:1871–1874
49. Defazio A, Bach F, Lacoste-Julien S (2014) SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. arXiv (Cornell University). https://doi.org/10.48550/arxiv.1407.0202
50. Chen A, Chen B, Chai X, Rui B, Li H (2017) A novel stochastic stratified average gradient method: convergence rate and its complexity. arXiv (Cornell University). https://doi.org/10.48550/arxiv.1710.07783
51. Scikit-learn (2009) sklearn.ensemble.GradientBoostingClassifier. Scikit-learn. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html. Accessed 10 Oct 2023
52. Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Ann Stat. https://doi.org/10.1214/aos/1013203451
53. Scikit-learn (2023) sklearn.ensemble.AdaBoostClassifier. Scikit-learn. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html. Accessed 12 Oct 2023
54. Hastie T, Rosset S, Zhu J, Zou H (2009) Multi-class AdaBoost. Stat Interface. https://doi.org/10.4310/sii.2009.v2.n3.a8
55. Ferrario A, Hämmerli R (2019) On boosting: theory and applications. Soc Sci Res Netw. https://doi.org/10.3929/ethz-b-000383242
56. oneDAL (2023) AdaBoost multiclass classifier. oneDAL. https://oneapi-src.github.io/oneDAL/daal/algorithms/boosting/adaboost-multiclass.html. Accessed 20 Oct 2023
57. Scikit-learn (2019) Feature selection. Scikit-learn. https://scikit-learn.org/stable/modules/feature_selection.html. Accessed 18 May 2022
58. Chen T, Xu M, Tu J, Wang H, Niu X (2018) Relationship between omnibus and post-hoc tests: an investigation of performance of the F test in ANOVA. Shanghai Arch Psychiatry. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5925602/
59. Scikit-learn (2009) Cross-validation: evaluating estimator performance. Scikit-learn. https://scikit-learn.org/stable/modules/cross_validation.html. Accessed 22 May 2022
60. Wang Z, Zeng Y, Liu Y, Li D (2021) Deep belief network integrating improved kernel-based extreme learning machine for network intrusion detection. IEEE Access. https://doi.org/10.1109/ACCESS.2021.3051074
Funding
Open access funding provided by The Science, Technology & Innovation Funding Authority (STDF) in cooperation with The Egyptian Knowledge Bank (EKB).
Author information
Authors and Affiliations
Contributions
MF designed the study, performed the computations, interpreted the data, and wrote the manuscript. RHS and NH encouraged MF to investigate the problem of the study, supervised the study, and contributed to the final manuscript.
Corresponding author
Ethics declarations
Conflict of interest
Not applicable.
Ethical approval
There is not any ethical conflict.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Farouk, M., Sakr, R.H. & Hikal, N. Identifying the most accurate machine learning classification technique to detect network threats. Neural Comput & Applic 36, 8977–8994 (2024). https://doi.org/10.1007/s00521-024-09562-9