Paper 127-A Comprehensive Analysis of Network Security Attack Classification


(IJACSA) International Journal of Advanced Computer Science and Applications,

Vol. 15, No. 4, 2024

A Comprehensive Analysis of Network Security Attack Classification using Machine Learning Algorithms
Abdulaziz Saeed Alqahtani1, Osamah A. Altammami2, Mohd Anul Haq3*
Department of Computer Science-College of Computer and Information Sciences, Majmaah University,
Al Majmaah, 11952, Saudi Arabia1, 2
College of Business Administration, Majmaah University, Al Majmaah, 11952, Saudi Arabia3

*Corresponding Author

Abstract—As internet usage and connected devices continue to proliferate, the concern for network security among individuals, businesses, and governments has intensified. Cybercriminals exploit these opportunities through various attacks, including phishing emails, malware, and DDoS attacks, leading to disruptions, data exposure, and financial losses. In response, this study investigates the effectiveness of machine learning algorithms for enhancing intrusion detection systems in network security. Our findings reveal that Random Forest demonstrates superior performance, achieving 90% accuracy and balanced precision-recall scores. KNN exhibits robust predictive capabilities, while Logistic Regression delivers commendable accuracy, precision, and recall. However, Naive Bayes exhibits slightly lower performance compared to other algorithms. The study underscores the significance of leveraging advanced machine learning techniques for accurate intrusion detection, with Random Forest emerging as a promising choice. Future research directions include refining models and exploring novel approaches to further enhance network security.

Keywords—Machine learning; cyber security; intrusion detection; network security

I. INTRODUCTION
In recent years, cyber-attacks have become more sophisticated and frequent, posing significant challenges to cybersecurity efforts. As organizations increasingly rely on interconnected networks for their operations, they are exposed to a greater risk of malicious activities. Traditional security methods, such as firewalls and antivirus software, while still valuable, are struggling to keep pace with the evolving tactics of cybercriminals [1]. These attacks can take various forms, from relatively simple phishing emails to complex malware and DDoS attacks, resulting in operational disruptions, data breaches, and financial losses [2]. To effectively combat these threats, security professionals need to adopt more advanced techniques for threat detection and mitigation [3]. Machine learning (ML) algorithms offer a promising solution by leveraging data analysis to identify patterns and anomalies indicative of malicious activity [4]. By automating threat detection and response processes, ML can help organizations bolster their network security defenses in the face of evolving cyber threats.

A. Research Objectives and Motivation
The main objective of this paper is to conduct a comprehensive examination of network security attack classification using ML algorithms. By exploring various ML techniques and evaluating their applicability to network security, the research aims to enhance precision and efficiency in identifying and categorizing network attacks [4]. The motivation behind this research lies in the critical need for adaptive and intelligent security measures to counter the dynamic tactics employed by cybercriminals [5].

B. Consequences of Cyber-Attacks
The introduction also underscores the significant consequences of successful cyber-attacks, ranging from financial losses to reputational damage and legal ramifications [6]. This highlights the importance of enhancing security measures to safeguard sensitive data, ensure uninterrupted operations, and maintain trust in digital systems [7].

C. Transition to Proactive Security Strategies
Furthermore, the integration of ML into network security protocols facilitates a transition from reactive to proactive security strategies [8]. By preemptively addressing potential threats, organizations can enhance overall resilience and security posture.

This paper will include a detailed comparative analysis with state-of-the-art methods, including recent advancements in deep learning applied to intrusion detection. Additionally, recent research in deep learning for intrusion detection will be reviewed to identify advancements and opportunities for improvement. This comprehensive comparison will enhance the credibility and relevance of the research findings.

This study is structured to first explore the existing landscape of network security and the challenges posed by cyber-attacks. It will then delve into the application of ML algorithms in enhancing threat detection and response processes. Following this, the paper will evaluate the strengths and limitations of existing network intrusion detection systems, proposing innovative ML solutions to address emerging challenges. Finally, it will provide recommendations for developing stronger, more flexible, and smarter security systems to combat cyber threats effectively in today's digital age.

www.ijacsa.thesai.org

II. RELATED WORKS
This review of the existing literature offers an in-depth examination of the present state of research in the classification of network security attacks through the application of machine learning algorithms.

A. Network Security Attack Classification
Traditional cybersecurity methods rely on predefined rules and signatures to detect and mitigate threats, but they struggle to keep up with the rapidly evolving tactics of cybercriminals. This limitation [9] has prompted a shift towards more adaptive and intelligent systems, leading to the exploration of machine learning techniques. One examination of machine learning algorithms focuses on their crucial role in intelligent data analysis and automation within the cybersecurity field [10]. The authors highlight [11] the ability of these algorithms to extract valuable insights from diverse cyber data sources, demonstrating their relevance in real-world scenarios and illustrating how data-driven intelligence contributes to proactive cybersecurity measures [12]. Furthermore, their analysis [13] explores current methodologies, their practical implications, and emerging research directions, aiming to provide a comprehensive understanding of the current state of machine learning in cybersecurity and its potential for transformative advancements, in line with the goals of our research.

B. Machine Learning in Network Security
Machine learning's role in network security extends far beyond just threat detection. It encompasses prevention, response, and recovery aspects as well. By leveraging machine learning, organizations can build systems that continuously adapt to emerging threats, effectively fortifying their defenses against evolving attack patterns [14]. This adaptability is particularly crucial in an environment where cyber threats are constantly evolving in sophistication and evasiveness. Furthermore, a recent study introduces a comprehensive taxonomy of security threats, evaluating the potential of artificial intelligence (AI), including machine learning, to address a wide range of challenges. This study [15] represents the first exhaustive examination of AI solutions across various security types and threats. It covers lessons learned, current contributions, future directions, open issues, and strategies for effectively countering advanced security threats [16]. This holistic approach underscores the significance of integrating machine learning techniques into network security frameworks to combat the diverse and evolving landscape of cyber threats effectively.

C. Existing Machine Learning Approaches
In addition to supervised learning methods like Support Vector Machines (SVM) and Random Forests, unsupervised learning approaches, particularly anomaly detection, have gained prominence in the realm of network security. Unlike supervised methods that rely on labeled datasets to classify attacks, anomaly detection techniques can identify deviations from normal network behavior without predefined attack signatures. This makes them particularly useful for detecting novel and previously unseen threats that may not be captured by traditional rule-based systems.

For instance, research conducted by [17] on intrusion detection exemplifies the application of machine learning in enhancing security measures. By leveraging machine learning algorithms, researchers have demonstrated the effectiveness of these techniques in discerning malicious activities within network traffic. This study showcases the potential of machine learning to augment traditional security measures by providing a more adaptive and proactive approach to threat detection and mitigation [18].

Furthermore, the exploration of machine learning approaches in network security continues to evolve, with researchers investigating new algorithms and methodologies to address emerging challenges. As cyber threats become increasingly sophisticated and diverse, the integration of machine learning techniques holds promise for enhancing the resilience of network defenses and mitigating the impact of cyber-attacks.

D. Feature Extraction
The success of machine learning models in network security heavily relies on the selection and extraction of relevant features. Features can include traffic patterns, packet content, and behavioral analysis [19]. The process of feature selection is critical in optimizing the performance of the machine learning model, as irrelevant or redundant features can lead to decreased accuracy and increased computational overhead. Researchers in [20] have explored various feature selection techniques to identify the most informative features for attack classification. The study in [21] employs machine learning models and feature selection techniques to detect DDoS attacks in SDN, achieving its best accuracy (98.3%) with KNN.

Feature engineering is a critical step in the data preprocessing pipeline, aimed at transforming raw data into a format that enhances the performance of machine learning models. It encompasses various techniques, including feature extraction and feature selection, to optimize the dataset for analysis. Given our dataset's high dimensionality with 49 features, effective dimensionality reduction was essential to streamline the analysis and mitigate computational complexity. To achieve this, we opted for PCA as a feature extraction technique. PCA transforms the original features into a reduced set of principal components, capturing the dataset's essential variance while preserving valuable information. Unlike feature selection techniques, which may exclude potentially informative features, PCA retains underlying patterns and structures in the data. This approach not only enhances computational efficiency but also maintains the integrity of the dataset. Explained variance analysis revealed that 10 principal components accounted for 90% of the dataset's variance, striking a balance between variance coverage and computational complexity in our study.

E. Related Articles and Cybersecurity Majors
Table I summarizes the reviewed literature, listing the areas each previous study covers together with its key findings and reported accuracy values.


TABLE I. LITERATURE REVIEW

Cite | Security Threat & Attacks | Detection & Mitigation | Incident Response | Standards & Policy | Key Important Findings
[22] | ✓ | ✓ | ✓ | ✓ | Early identification and detection of TTPs using supervised machine learning.
[23] | ✓ | ✗ | ✓ | ✓ | Use ML & DL algorithms.
[24] | ✓ | ✓ | ✓ | ✓ | Highlights the ease with which DDoS attacks can be executed using a network of infected bots under the control of a single botmaster.
[25] | ✓ | ✓ | ✗ | ✗ | Addresses the significant security concern of email phishing attacks in cloud computing.
[26] | ✓ | ✗ | ✓ | ✗ | With an attack taxonomy and threat model, organizations can enhance their ability to anticipate, detect, and respond to cyber threats.
[27] | ✓ | ✓ | ✓ | ✗ | Proposed ABRC exhibits significant performance improvement compared to existing deep learning techniques for cyber-attack detection.
[28] | ✓ | ✓ | ✓ | ✓ | ML & QML for attacks; calculate precision and recall despite decreased accuracy post-attack. Inter-model susceptibility to crafted adversarial samples underscores the need for robust defense strategies. Future research will delve deeper into model performance and resilience against attacks.
[29] | ✓ | ✗ | ✗ | ✓ | Developed technique achieves 99.7% accuracy in multi-class classification for intrusion detection, surpassing existing algorithms significantly. Demonstrates the efficiency of auto-tuned hyper-parameters and dataset improvements in enhancing detection capabilities.
[30] | ✓ | ✓ | ✓ | ✓ | Intrusion detection model achieves 96.00% accuracy, outperforming other neural network models, with stable training and test times. Data transmission security performance shows over 80% data message delivery rate, less than 10% message leakage and packet loss rates, and stable average delay around 350 milliseconds. The model ensures high security and prediction accuracy, serving as an experimental basis for enhancing safety in smart city rail transit systems.
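
As an illustration of the PCA-based feature extraction described in Section II-D (a reduced set of principal components covering about 90% of the variance), the following is a minimal sketch rather than the authors' implementation. It assumes scikit-learn and uses a synthetic 49-column matrix in place of the real dataset, so the number of retained components will differ from the 10 reported in the study.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a 49-feature dataset (assumption: the real features
# would first be loaded and numerically encoded).
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 8))      # 8 hidden factors drive the data
mixing = rng.normal(size=(8, 49))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 49))

# PCA is sensitive to feature scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# A float n_components asks scikit-learn to keep the smallest number of
# components whose cumulative explained variance exceeds that fraction.
pca = PCA(n_components=0.90, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

cumulative = np.cumsum(pca.explained_variance_ratio_)
n_kept = pca.n_components_
print(f"components kept: {n_kept}, variance covered: {cumulative[-1]:.3f}")
```

Inspecting `cumulative` is one way to carry out the explained variance analysis mentioned in the text: it shows how much variance each additional component contributes.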
F. Research Gap Analysis
Addressing the identified research gaps holds paramount importance in advancing our understanding and fortification against privacy attacks in the realm of machine learning. Existing studies exhibit a propensity to focus on specific machine learning models, leaving a critical void in comprehending privacy threats across a broader spectrum of techniques. Furthermore, the proposed attack taxonomy provides a foundational framework, yet there exists a gap in grasping the nuanced impact of different adversarial knowledge levels on the severity of privacy attacks. Bridging this gap demands a contextual exploration of privacy attacks within real-world machine learning applications, considering the diversity of domains and their unique challenges. The scarcity of longitudinal studies underscores the need for a dynamic perspective, tracking the evolution of privacy attacks over time. Lastly, the burgeoning landscape of emerging machine learning paradigms, such as federated learning and edge computing, lacks adequate attention in current research, necessitating a focused effort to understand and mitigate privacy attacks in these evolving contexts. Addressing these gaps promises significant implications, fostering the development of more resilient privacy-preserving machine learning models, implementing enhanced security measures, and cultivating a holistic comprehension of privacy risks for the continued advancement of secure and ethical machine learning applications.

III. METHODS
The research methodology for this study adopts a quantitative approach, leveraging empirical data to draw objective conclusions and make generalizations about the relationships between variables. Quantitative research is deemed suitable for this investigation as it enables the measurement and analysis of numerical data, providing a statistical foundation for evaluating ML in network security attack classification.

A. System Design
The system design, shown in Fig. 1, analyzes network security attacks using a subset of the UNSW-NB15 dataset. It comprises two main steps aimed at enhancing the accuracy of attack detection. In the initial step, data preprocessing is conducted, involving both standardization and normalization to ensure uniformity in the dataset. Given the dataset's high-dimensional nature, some features may be irrelevant or redundant, potentially impacting the accuracy of attack detection negatively. To address this issue, a feature selection process is implemented to identify and retain only the most relevant subset of features, effectively eliminating useless and noisy elements from the multidimensional dataset. Moreover, class imbalance is recognized as a potential challenge in the dataset. To mitigate this, specific measures are taken to balance the representation of different attack categories, ensuring that the classifiers are trained on a more equitable distribution of data. In the second step, various classifiers are trained using the selected and refined features. These classifiers are designed to detect all categories of attacks, thereby aiming for maximum accuracy in the identification of security threats. The utilization of multiple classifiers allows for a comprehensive assessment of the dataset, considering the nuanced characteristics of different attacks. Finally, the model's performance is evaluated using key


metrics such as accuracy, precision, recall, and F1 score. These measures provide a thorough understanding of how well the classifiers perform in correctly identifying and classifying network security attacks. The combination of these performance metrics ensures a comprehensive evaluation, taking into account various aspects of the model's effectiveness. In summary, the proposed methodology begins with meticulous data preprocessing, addressing issues of standardization, normalization, and feature selection. It then tackles the challenge of class imbalance before training classifiers to detect diverse attack categories. The evaluation phase employs a set of performance metrics to gauge the overall effectiveness of the framework in accurately identifying and classifying network security threats. This methodological approach provides a systematic and robust foundation for analyzing network security attack data.

Fig. 1. System design.

B. Data Collection
1) Data sources: Network traffic and attack data will be sourced from Kaggle. This includes both real-world and simulated datasets to ensure a comprehensive evaluation.

The UNSW-NB15 dataset, crafted by researchers in 2015, stands as a comprehensive resource specifically tailored to address advanced network intrusion techniques. Comprising an extensive collection of 2.5 million records, this dataset provides a rich and diverse landscape for the study of network security threats. The dataset encapsulates the complexity of modern cyber threats by encompassing 49 distinct features, facilitating a nuanced analysis of network activity. The 49 features in the dataset capture various aspects of network traffic, creating a multidimensional representation of cyber activities. These features serve as essential variables for understanding and classifying different types of network security attacks. Researchers and practitioners benefit from the detailed and granular information embedded in each record, enabling a thorough exploration of advanced intrusion techniques. One noteworthy characteristic is its nine different classes of attack families, each representing a unique category of network security threat. These classes encompass a wide spectrum of attack methodologies, providing a holistic view of the diverse challenges faced in contemporary cybersecurity. The dataset employs two label values for classification, normal and attack, enabling the categorization of network activities into either benign or malicious classes. The dataset's utility extends further (shown in Fig. 2). The UNSW-NB15 dataset serves as a vital resource in the field of cybersecurity research, offering a rich and diverse collection of network activity records that enable in-depth investigations into advanced intrusion techniques and the development of effective security solutions. Its comprehensive nature and well-defined class structure make it an invaluable tool for researchers, practitioners, and educators alike in advancing the understanding and mitigation of network security threats.

2) Data preprocessing: Raw data will undergo preprocessing to handle missing values, normalize features, and address any anomalies. This step is crucial for the effective application of machine learning algorithms.

a) Data standardizations: Data standardization, also known as z-score normalization, is a crucial preprocessing step in data analysis, particularly when working with machine learning algorithms sensitive to input feature scales. This process transforms the values of different variables to a common scale, ensuring that no particular feature dominates the learning process due to differences in their original scales. By rescaling the variables to have a mean of 0 and a standard deviation of 1, standardizing the data aids in maintaining consistency and improving algorithm performance. The formula is:

$$Z = \frac{X - M}{\sigma}$$

Here:
• Z is the standardized value
• X is the original value of the variable
• M is the mean of the variable
• σ is the standard deviation of the variable.

b) Data normalization: Data normalization is a preprocessing method employed to adjust numerical variables to a standardized range, usually between 0 and 1. This practice aims to ensure that all variables equally contribute to the analysis, preventing any single feature with larger magnitudes from dominating. One frequently used technique for normalization is min-max scaling, which normalizes a variable as follows:

$$X_{normalized} = \frac{X - X_{min}}{X_{max} - X_{min}}$$

C. Machine Learning (ML) Classification Algorithm
Machine learning classification algorithms are computational tools created to classify input data into predefined categories or labels by analyzing their underlying patterns and characteristics. These algorithms learn from labeled training data, identifying patterns and relationships to predict the class labels of new instances. Various classification


algorithms, each with unique methodologies such as rule-based decision-making or probabilistic modeling, are utilized to effectively categorize data points into different classes. These algorithms are versatile tools used for tasks like detecting spam, recognizing images, and diagnosing medical conditions. Their performance is typically assessed using metrics such as accuracy, precision, recall, and F1 score, ensuring their efficacy across diverse applications.

This is a classification problem where we want to detect whether there is an attack or not.

1) KNN: K-nearest neighbors (KNN) is an ML algorithm proficient in both classification and regression tasks. Unlike traditional methods, KNN doesn't undergo a conventional training phase but rather memorizes the entire training dataset. During prediction, it relies on the proximity of data points within the feature space [31]. To classify a new data point, KNN computes distances, often employing Euclidean distance, from the point to all other instances in the training set. The k-nearest neighbors, identified by the smallest distances, then engage in a majority voting mechanism to allocate the class to the new data point. Alternatively, a weighted voting system can be utilized, granting closer neighbors greater influence. In regression tasks, KNN forecasts the target value through averaging (or weighted averaging) the target values of the k-nearest neighbors. The selection of the hyper-parameter 'k' is pivotal, as it shapes the algorithm's sensitivity and generalization capability. KNN showcases its adaptability across various domains like image recognition and recommendation systems. Nonetheless, its performance hinges on meticulous 'k' selection, the choice of a distance metric, and understanding the dataset's traits [32]. Employing efficient data structures such as KD-trees can enhance scalability, while thoughtful parameter tuning ensures its efficacy across diverse contexts.

2) Random forest: Random forest is an ensemble learning method widely used for classification and regression tasks, particularly in intrusion detection. The algorithm operates through bootstrapped sampling, creating diverse subsets of the dataset by randomly selecting instances with replacement and training individual decision trees on these subsets. Key to its robustness is the random selection of features at each node split during tree construction, preventing overemphasis on specific features. In classification, random forest employs a majority voting mechanism, aggregating predictions from multiple trees to make the final decision [33]. This approach not only yields high accuracy but also enhances the model's resilience to noise and variability. The algorithm's adaptability and effectiveness make it a valuable tool in cybersecurity and various other domains.

3) Naïve bayes: Naive Bayes is a probabilistic classification algorithm based on Bayes' theorem with the naive assumption of feature independence. It is particularly effective for text classification and spam filtering. The algorithm calculates the probability of a given instance belonging to a specific class by considering the conditional probabilities of each feature given the class. Despite its simplicity, Naive Bayes often performs well and is computationally efficient. The naive assumption simplifies calculations, making it suitable for high-dimensional datasets. Despite its success, Naive Bayes might struggle with correlated features, which violate the independence assumption. Nevertheless, its speed, simplicity, and respectable performance in various applications make it a popular choice for tasks involving categorical or text-based data.

4) Logistic regression: Despite its name, logistic regression is a widely used linear model for binary and multiclass classification problems. The logistic (sigmoid) function transforms the output into a range between 0 and 1, interpreting it as the probability of the positive class. The algorithm optimizes its parameters through maximum likelihood estimation. Regularization techniques like L1 or L2 regularization can be applied to prevent overfitting. Logistic regression is interpretable, and its coefficients provide insights into feature importance. It is suitable for linearly separable problems but may struggle with complex relationships. Ensemble methods like random forest often outperform logistic regression on more intricate datasets, but its simplicity, interpretability, and efficiency make it a valuable tool in various classification tasks.

D. Evaluation Metrics
Evaluation metrics serve as the compass for navigating the landscape of machine learning model performance. Accuracy, the bedrock metric, quantifies the model's overall correctness. Precision zooms in on the model's ability to avoid false positives, while recall captures its ability to find all actual positive instances. The F1 score combines precision and recall into a single metric, striking a balance between precision-oriented and recall-oriented scenarios. The confusion matrix breaks down a model's predictions into true positives, true negatives, false positives, and false negatives. These metrics collectively illuminate the many facets of a model's effectiveness, providing practitioners with a versatile toolkit to gauge and enhance performance across diverse applications, as shown in Fig. 2.

Fig. 2. Performance evaluation.
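
To make the four classifiers and the evaluation metrics above concrete, here is a minimal sketch using scikit-learn. It is not the authors' code: the data is synthetic (`make_classification`), the hyperparameters (k = 5, 100 trees, `max_iter=1000`) are illustrative assumptions, and `GaussianNB` is one possible Naive Bayes variant, since the text does not say which one was used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem standing in for normal (0) vs. attack (1).
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Distance-based and linear models get a scaler; trees and NB do not need one.
models = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="binary")
    # For binary labels, ravel() yields TN, FP, FN, TP in that order.
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision, "recall": recall, "f1": f1,
        "tn": tn, "fp": fp, "fn": fn, "tp": tp,
    }
    print(f"{name}: acc={results[name]['accuracy']:.3f} "
          f"prec={precision:.3f} rec={recall:.3f} f1={f1:.3f}")
```

Accuracy can be recovered from the confusion matrix as (TP + TN) / (TP + TN + FP + FN), which is a quick consistency check on any reported scores.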


IV. IMPLEMENTATION
The experimental setup encompasses the selection and preparation of datasets, the configuration of machine learning algorithms, and the establishment of a controlled environment for rigorous testing. The UNSW-NB15 dataset, consisting of 2.5 million records with 49 features, was chosen for its relevance to advanced network intrusion techniques. To ensure a diverse representation of attacks, the dataset was partitioned into training and testing sets.

A. Tools and Techniques
This experimentation involved the implementation of various machine learning algorithms to evaluate their performance in network security attack classification. Python, leveraging popular libraries such as scikit-learn and TensorFlow, served as the primary programming language for algorithm implementation. The choice of algorithms includes decision trees, support vector machines, neural networks, and ensemble methods, each configured with appropriate hyperparameters.

B. Implementation
The machine learning algorithms were implemented using a modular and scalable approach, allowing for easy integration of new algorithms and flexibility in experimenting with different configurations. The Jupyter notebook was version-controlled using Git to track changes and ensure reproducibility.

1) Import dataset: In the dataset preparation phase, the UNSW-NB15 datasets were employed, consisting of both a training set (UNSW_NB15_training-set.csv) and a testing set (UNSW_NB15_testing-set.csv). The dataset was loaded into a Python environment using the pandas library. The training set, as read from the UNSW_NB15_training-set.csv file, comprised 82,332 records, while the testing set, obtained from the UNSW_NB15_testing-set.csv file, included 175,341 records. To verify the integrity of the dataset and ensure the appropriate division between training and testing data, the lengths of the training and testing sets were checked: the training set exhibited a length of 82,332 records and the testing set comprised 175,341.

2) Data visualization: The data visualization code utilizes the seaborn library to create informative plots depicting the distribution of attacks and normal traffic in both the training and testing sets. The first two count plots in the top row display the overall distribution of labels (attack or normal) in the training and testing datasets. Meanwhile, the bottom row illustrates the distribution of attack categories in both sets, with the order specified based on the frequency of attack categories. These visualizations provide a clear overview of the class distribution and the prevalence of different attack categories within the datasets. Such insights are crucial for understanding the imbalance between attack and normal instances and guide subsequent steps in the analysis, such as addressing class imbalances and selecting appropriate evaluation metrics for machine learning models, as shown in Fig. 3.

Fig. 3. Data visualization.

Next, the distribution of classes in the target variable showcases the balance or imbalance between normal and attack instances in Fig. 4. This is crucial for assessing the dataset's class distribution and potential class imbalance, which can impact machine learning model training.

Fig. 4. Distribution of classes.

The visualization in Fig. 5 presents a correlation heatmap, offering a comprehensive overview of the numerical features' relationships. This heatmap aids in identifying potential multicollinearity and understanding feature interdependencies.

Fig. 5. Correlation heat map.

In Fig. 6, a boxplot of the sttl feature is depicted, showcasing its distribution across different classes. This graphical representation allows for a quick assessment of the feature's potential discriminative power in distinguishing between normal and attack instances. Together, these visualizations contribute to a holistic understanding of the dataset's characteristics, guiding subsequent steps in the analysis and model development.
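
The numbers behind these plots (class counts, the correlation matrix, and the per-class spread of `sttl`) can be computed with pandas before any plotting. The sketch below is illustrative only: it builds a small synthetic frame instead of reading the CSV files, the column names `sttl`, `dur`, and `spkts` follow the feature names mentioned in the text, and the 0 = normal / 1 = attack coding of `label` and all values are assumptions.

```python
import numpy as np
import pandas as pd

# Small synthetic frame; values are made up for illustration.
rng = np.random.default_rng(42)
n = 200
label = rng.integers(0, 2, size=n)          # assumed coding: 0 normal, 1 attack
df = pd.DataFrame({
    "sttl": np.where(label == 1, 254, 31) + rng.integers(-2, 3, size=n),
    "dur": rng.exponential(1.0, size=n),
    "spkts": rng.poisson(10, size=n),
    "label": label,
})

# Class distribution: the data behind a count plot.
class_counts = df["label"].value_counts()

# Pairwise Pearson correlations: the data behind a correlation heatmap.
corr = df.corr()

# Per-class summary of sttl: the quartiles a boxplot would draw.
sttl_by_class = df.groupby("label")["sttl"].describe()

print(class_counts)
print(corr.round(2))
print(sttl_by_class[["25%", "50%", "75%"]])
```

In this toy data `sttl` separates the two classes almost perfectly by construction; on the real dataset the same summaries indicate how discriminative a feature actually is.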


Correlation matrices were employed to explore relationships between features, aiding in the identification of potential multicollinearity. Visualization of categorical variables and the distribution of attack categories within them further enriched our understanding of the dataset's structure. The EDA process also encompassed data cleaning and preprocessing steps, addressing missing values and encoding categorical variables. Scatter plots and density plots were generated to visualize relationships between numeric features, facilitating the detection of patterns, as shown in Fig. 7 and Fig. 8.
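A correlation matrix can also be scanned programmatically for candidate multicollinearity. The sketch below is one possible way to do this with pandas; the helper name `highly_correlated_pairs`, the 0.9 threshold, and the toy columns `a`, `b`, and `c` are assumptions made for illustration and are not taken from the paper.

```python
import pandas as pd


def highly_correlated_pairs(df, threshold=0.9):
    """List feature pairs whose absolute Pearson correlation meets the threshold."""
    corr = df.select_dtypes(include="number").corr()
    cols = list(corr.columns)
    pairs = []
    # Walk the upper triangle only, so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = float(corr.iloc[i, j])
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(r, 3)))
    return pairs


# Toy numeric features: b is a linear function of a, c is only weakly related.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [3, 5, 7, 9, 11],
    "c": [5, 1, 4, 2, 3],
})
pairs = highly_correlated_pairs(toy)
print(pairs)
```

Pairs flagged this way are candidates for closer inspection, or for dropping one of the two features, before model training.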

Fig. 6. Boxplot of sttl by class.

3) Data preprocessing: The data preprocessing phase involved a comprehensive examination and cleaning of the dataset. The initial step focused on identifying and handling missing values, and the results showed that there were no null values in any of the features. This indicates a well-maintained dataset without missing information, ensuring the integrity of the subsequent analysis. Furthermore, a closer look at the categorical variables, including proto, service, state, and attack_cat, revealed the nature of these attributes. The attack_cat variable, which represents the attack category, is a crucial element for classification tasks. The categorical variables were encoded appropriately for machine learning algorithms, and their unique values and distribution were inspected. This preprocessing step ensures that the dataset is ready for model training, with categorical variables appropriately handled and missing values addressed. The cleanliness and encoding of categorical variables contribute to the robustness of the subsequent machine learning analysis and enhance the interpretability of the results.

Fig. 7. Correlation matrix for test data.
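The null-value check and the inspection of categorical variables can be expressed in a few lines of pandas. The following is a minimal sketch under stated assumptions: the helper name `summarize_quality` and the toy values (for example `tcp`, `FIN`, `Generic`) are illustrative and do not come from the paper.

```python
import pandas as pd


def summarize_quality(df, categorical_cols):
    """Return the total number of nulls and the unique values per categorical column."""
    total_nulls = int(df.isnull().sum().sum())
    unique_values = {c: sorted(df[c].unique().tolist()) for c in categorical_cols}
    return total_nulls, unique_values


# Toy records for illustration only.
toy = pd.DataFrame({
    "proto": ["tcp", "udp", "tcp"],
    "service": ["http", "dns", "-"],
    "state": ["FIN", "INT", "FIN"],
    "attack_cat": ["Normal", "Generic", "Normal"],
    "dur": [0.12, 0.00, 1.50],
})
total_nulls, unique_values = summarize_quality(
    toy, ["proto", "service", "state", "attack_cat"]
)
print(total_nulls, unique_values)
```

A total of zero confirms that no imputation is needed, and the unique-value lists show how many columns a one-hot encoding of each categorical variable would produce.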
Also, in [36], data preprocessing focuses on numeric variables and involves a thorough exploration of statistical summaries. Initially, it showcases a selection of numeric features from the dataset, such as id, dur, spkts, and others, highlighting their characteristics. Subsequently, statistical summaries are generated, including the count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum values for each numeric variable. This summary provides valuable insights into the distribution and variability of these features. Furthermore, an additional exploration of the UNSW-NB15 testing-set CSV file is conducted to understand its structure and dimensions, revealing that it contains 1,000 rows and 45 columns. This step is crucial for gaining an overview of the testing set, which will be utilized in the subsequent stages of model evaluation.

Fig. 8. Correlation matrix for train data.
4) EDA: The dataset, comprising features related to network security attack classification, was initially examined for its dimensions and the presence of relevant attributes. Descriptive statistics were computed to understand the distribution and variability of numeric variables, and class distribution analysis provided insights into the balance between normal traffic and different attack categories.

5) Testing and training: The initial split was performed based on the index, with the first 175,341 records designated for training and the remaining 82,332 records for testing. The labels were extracted and assigned to y_train and y_test for training and testing, respectively. Subsequently, the label column was dropped from the feature sets. To standardize the feature values, a min-max scaler was applied. It is crucial to note that the scaler was fitted only on the training data to avoid data leakage. The training data (X_train) was transformed using the fitted scaler, and the testing data (X_test) was scaled accordingly. The final dataset dimensions were confirmed, showcasing 175,341 samples for training, each comprising 196 features, and 82,332 samples for testing. Additionally, categorical columns such as proto, state, and service underwent one-hot encoding for inclusion in the analysis. These preprocessing steps ensure that the machine learning models are trained and tested on standardized and appropriately formatted data.

C. ML Model Classifications
1) Random forest: In Fig. 9, the RF classification algorithm was implemented on a network security attack dataset, achieving an accuracy of 90%. The classification report details the model's performance in distinguishing normal and attack instances, with precision, recall, and F1 score metrics providing insights. For normal instances (label 0), precision and recall are 0.77 and 0.98, resulting in an F1 score of 0.86. For attack instances (label 1), precision, recall, and F1 score are 0.99, 0.86, and 0.92, respectively. The weighted average F1 score is 0.90, indicating balanced performance. The confusion matrix shows the model correctly identifying 54,699 normal instances and 102,950 attack instances while misclassifying 1,301 normal instances as attacks and 16,391 attack instances as normal. Despite these misclassifications, the 90% accuracy highlights the random forest model's robustness in network security attack classification, contributing valuable insights to the research.

Fig. 9. Confusion matrix of random forest.

2) KNN: In Fig. 10, the K-Nearest Neighbors (KNN) classification algorithm was implemented with k=5 on the network security attack dataset, yielding an accuracy of 87%. The classification report provides detailed insights into the model's performance, showcasing its ability to discriminate between normal and attack instances.
For normal instances (label 0), the precision and recall are 0.72 and 0.96, respectively, resulting in an F1-score of 0.82. For attack instances (label 1), the precision, recall, and F1-score are 0.98, 0.83, and 0.90, respectively. The weighted average F1-score is reported as 0.87, indicating a balanced performance across both classes.
The confusion matrix further shows the classifier's effectiveness, correctly identifying 53,638 instances of normal traffic and 98,636 instances of attacks. However, the model misclassified 2,362 normal instances as attacks and 20,705 attack instances as normal. Despite these misclassifications, the overall accuracy of 87% underscores the robustness of the KNN model in distinguishing between benign and malicious network activities.

Fig. 10. Confusion matrix for KNN.

These findings contribute valuable insights to the research, emphasizing the efficacy of the K-Nearest Neighbors algorithm in the context of network security attack classification.
3) Naïve bayes: The Gaussian Naive Bayes classification algorithm was employed for network security attack classification, resulting in an accuracy of 79%. The classification report reveals that the model achieved a precision of 62% and a recall of 87% for normal instances (label 0), yielding an F1-score of 0.72. For attack instances (label 1), the precision is higher at 92% while the recall is lower at 75%, contributing to an F1-score of 0.83. The weighted average F1-score is reported as 0.80, as shown in Fig. 11.
The confusion matrix provides additional insights, indicating that the model correctly identified 48,706 instances of normal traffic and 89,663 instances of attacks. However, there were misclassifications, with 7,294 normal instances being erroneously identified as attacks and 29,678 attack instances mistakenly labeled as normal. Despite these challenges, the Gaussian Naive Bayes algorithm demonstrates a commendable accuracy, emphasizing its suitability for network security attack detection in this context.
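The per-class figures quoted for the Random Forest, KNN, and Gaussian Naive Bayes classifiers follow directly from the reported confusion-matrix counts. The sketch below recomputes them in plain Python; the helper name `binary_report` and the argument order are our own choices for illustration, not the authors' code. In each case the four counts sum to 175,341 evaluated records.

```python
def binary_report(tn, fp, fn, tp):
    """Per-class precision, recall, F1 and overall accuracy from a 2x2 matrix.

    tn: normal predicted as normal     fp: normal flagged as attack
    fn: attack labeled as normal       tp: attack predicted as attack
    """
    def prf(correct, predicted_total, actual_total):
        precision = correct / predicted_total
        recall = correct / actual_total
        f1 = 2 * precision * recall / (precision + recall)
        return precision, recall, f1

    return {
        "normal": prf(tn, tn + fn, tn + fp),
        "attack": prf(tp, tp + fp, tp + fn),
        "accuracy": (tn + tp) / (tn + fp + fn + tp),
    }


# Counts as reported in the text: (tn, fp, fn, tp).
counts = {
    "Random Forest": (54699, 1301, 16391, 102950),
    "KNN": (53638, 2362, 20705, 98636),
    "Gaussian NB": (48706, 7294, 29678, 89663),
}
reports = {name: binary_report(*c) for name, c in counts.items()}

for name, rep in reports.items():
    normal = ", ".join(f"{v:.2f}" for v in rep["normal"])
    attack = ", ".join(f"{v:.2f}" for v in rep["attack"])
    print(f"{name}: accuracy={rep['accuracy']:.2f} normal=({normal}) attack=({attack})")
```

Rounded to two decimals, the recomputed values agree with the precision, recall, F1, and accuracy figures stated for these three classifiers.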


Fig. 11. Confusion matrix for naive bayes.

This analysis in Fig. 11 contributes valuable findings to the research, highlighting the strengths and limitations of the Gaussian Naive Bayes classifier in the domain of network security.
4) Logistic regression: The Logistic Regression classifier was implemented for network security attack classification, yielding an accuracy of 87.25%. The precision, recall, and F1-score for normal instances (label 0) are 74%, 92%, and 82%, respectively. For attack instances (label 1), the classifier achieved higher precision (96%) and slightly lower recall (85%), resulting in an F1 score of 90.06%. The weighted average F1 score stands at 88%, indicating a balanced performance across both classes.
The confusion matrix and classification report in Fig. 12 provide detailed insights into the classifier's performance. It correctly identified 51,592 instances of normal traffic and 101,719 instances of attacks. However, there were misclassifications, with 8,408 normal instances being erroneously identified as attacks and 17,622 attack instances mistakenly labeled as normal.

V. RESULTS
The outcomes of the machine learning models, including accuracy, precision, recall, and F1 score, will be systematically analyzed. A comparative study will be conducted to identify the algorithm that best suits the requirements of network security attack detection. Additionally, insights gained from the analysis will be used to draw meaningful conclusions about the performance of each algorithm in handling diverse patterns present in the network traffic data.

TABLE II. RESULTS OF CLASSIFICATION MODELS WITHOUT FEATURE SELECTION

Classifier            Accuracy  Precision  Recall  F1
Random Forest         0.90      0.92       0.90    0.90
K-Nearest Neighbors   0.87      0.90       0.87    0.87
Naive Bayes           0.79      0.83       0.79    0.80
Logistic Regression   0.87      0.85       0.96    0.90

In the result analysis (Table II), the Random Forest classifier demonstrated superior performance with a high accuracy of 90%, effectively balancing precision and recall at 0.92 and 0.90, respectively. K-Nearest Neighbors (KNN) showcased strong predictive capabilities with an accuracy of 87% and a well-balanced precision-recall trade-off at 0.90 and 0.87. Naive Bayes exhibited a decent accuracy of 79%, with a precision of 0.83 and a balanced F1-Score of 0.80. Logistic Regression delivered an accuracy of 87%, with a commendable precision of 0.85 and a high recall of 0.96, resulting in a robust F1-Score of 0.90.

Fig. 13. ML classifiers metrics comparisons.

These classifiers play a crucial role in our network security context, offering effective means of identifying and classifying instances of security attacks based on the observed performance metrics in Fig. 13. The Random Forest model, in particular, emerges as a promising choice for its overall high accuracy and balanced precision-recall scores, making it well-suited for robust intrusion detection in network security applications.
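A comparison of the kind summarized in Table II can be sketched end to end with scikit-learn. The example below is an illustration under stated assumptions, not the authors' pipeline: it uses synthetic records rather than the real dataset, the column names and split sizes are chosen for the demonstration, and all hyperparameters other than k=5 for KNN are library defaults or arbitrary choices. It follows the preprocessing order described under Testing and training: an index-based split, one-hot encoding, and a min-max scaler fitted on the training data only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for traffic records (NOT data from the paper).
rng = np.random.default_rng(0)
n = 600
label = rng.integers(0, 2, size=n)
frame = pd.DataFrame({
    "dur": rng.exponential(1.0, size=n) + label,
    "sttl": rng.normal(60, 10, size=n) + 40 * label,
    "proto": rng.choice(["tcp", "udp", "arp"], size=n),
    "label": label,
})

# Index-based split: first rows for training, remaining rows for testing.
train, test = frame.iloc[:400], frame.iloc[400:]
y_train, y_test = train["label"].to_numpy(), test["label"].to_numpy()

# One-hot encode, then align the test columns to the training columns.
X_train = pd.get_dummies(train.drop(columns="label"), columns=["proto"], dtype=float)
X_test = pd.get_dummies(test.drop(columns="label"), columns=["proto"], dtype=float)
X_test = X_test.reindex(columns=X_train.columns, fill_value=0.0)

# Fit the scaler on training data only, then reuse it for the test data.
scaler = MinMaxScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

results = {}
for name, model in models.items():
    pred = model.fit(X_train_s, y_train).predict(X_test_s)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, pred, average="weighted", zero_division=0
    )
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

for name, m in results.items():
    print(f"{name:<20} " + "  ".join(f"{k}={v:.2f}" for k, v in m.items()))
```

Because the scaler's minimum and maximum come from the training rows alone, scaled test values may fall slightly outside the [0, 1] range, which is expected and avoids leaking test-set statistics into training.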
Fig. 12. Confusion matrix for logistic regression.

A. Discussions
Building upon the result analysis, the discussion section will explore the implications of the findings in the context of network security. Consideration will be given to the practicality, efficiency, and robustness of the algorithms. The discussion will also address potential challenges and limitations observed during the analysis, providing a comprehensive perspective on the feasibility of deploying these algorithms in real-world scenarios. Furthermore, comparisons with existing literature and benchmarks will be made to contextualize the significance of the results.

Deep learning has substantially advanced intrusion detection, offering high accuracy and efficiency. The authors of [12] introduced the Principal Component-based Convolution Neural Network (PCCNN) approach for IDS, specifically targeting DoS and DDoS attacks on IoT devices. This approach reports accuracies of 99.34% for binary and 99.13% for multiclass classification on the NSL-KDD dataset. Utilizing an architecture of 13 layers of Sequential 1-D CNN and feature reduction through Principal Component Analysis (PCA), it shows promise for IoT intrusion detection.

Furthermore, the IDSGT-DNN framework, presented by [37], elevates cloud security by integrating an attacker-defender mechanism using game theory and deep neural networks. This framework outperforms traditional methods in accuracy, detection rate, and various metrics on the CICIDS-2017 dataset. Remarkably, the defender's detection rate spans from 0 to 0.99, with gains strategically set at -5, 0, and 5. While the present study may not achieve the accuracies of the PCCNN approach (99.34% for binary and 99.13% for multiclass) and the IDSGT-DNN framework presented in previous works, it excels in computational efficiency. Our machine learning classifiers, Random Forest (RF) with an accuracy of 0.90, K-Nearest Neighbors (KNN) at 0.87, Naive Bayes at 0.79, and Logistic Regression (LR) also at 0.87, demonstrate competitive performance. Importantly, these classifiers deliver these results in significantly less time, underscoring the trade-off between accuracy and computational speed in intrusion detection systems.

Additionally, the Random Forest model showed promising results, achieving a commendable balance between precision and recall. K-Nearest Neighbors demonstrated strong predictive capabilities, aligning with its suitability for identifying patterns in network traffic. Although Naive Bayes presented a lower accuracy, its performance remains consistent with the algorithm's inherent assumptions. Logistic Regression emerged as a reliable choice, showcasing a balanced precision-recall trade-off. Collectively, our findings contribute to the existing body of research by highlighting the effectiveness of these classifiers in the specific context of intrusion detection, offering valuable insights for the development of robust and accurate network security systems. The performance of the proposed methodology is compared with existing approaches in Table III, highlighting the advancements achieved.

TABLE III. COMPARISON WITH EXISTING APPROACHES

Paper       Classifier      Accuracy  Precision  Recall
[34]        SGD             80%       82.1%      82.1%
This study  Random forest   90%       90.2%      90%
[35]        Neural network  87%       87.2%      87.8%
[3]         XGBoost         88%       88.3%      88.8%
[8]         SVM             76%       77%        77%
[23]        Random forest   80%       81%        8.9%
[14]        KNN             82%       82%        82%

B. Limitations
Though our study presented promising results, it is crucial to recognize the limitations. The effectiveness of machine learning models heavily relies on the dataset's quality and representativeness. The utilization of the UNSW-NB15 dataset in our research may not adequately cover all real-world network traffic scenarios and variations. The chosen features and preprocessing techniques could impact model performance, suggesting further exploration of feature engineering methods to improve classifier efficacy. The selection of classifiers was based on established algorithms, but future research could investigate new approaches or DL methods for better outcomes. Evaluation metrics mainly focused on accuracy, precision, recall, and F1 score, potentially overlooking variations in performance among different attack types. These restrictions highlight the importance of continuous refinement and exploration in intrusion detection to combat evolving cyber threats effectively.

The ML models used in the present investigation have been practically implemented and tested using the real intrusion detection dataset, which is recognized for its relevance to real-world network intrusion scenarios. This approach leverages the dataset to demonstrate the models' practical applicability in a real-world network environment. By conducting experiments on the dataset, the effectiveness of the models in detecting a variety of attacks, including novel and sophisticated ones, was evaluated. This hands-on validation allows for the identification of operational challenges and fine-tuning of the models for improved performance in real-world scenarios. The practical testing provides valuable insights into the models' robustness, scalability, and applicability, thereby reinforcing their effectiveness and reliability in real-world network intrusion detection applications. Future research could consider novel approaches or DL methods for better results [38]. Evaluation metrics focused on overall accuracy, precision, recall, and F1 score, neglecting performance variations across different attack types. These limitations highlight the importance of continuous refinement and exploration in intrusion detection to address evolving cyber threats.
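One way to examine the variation across attack types that the overall metrics leave out is to compute the detection rate (recall) separately for each attack category. The sketch below is a minimal illustration in plain Python; the helper name `recall_by_category`, the category names, and the toy predictions are assumptions and do not represent results from this study.

```python
from collections import defaultdict


def recall_by_category(categories, y_pred):
    """Fraction of attack records in each category that were flagged as attack (1)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for cat, pred in zip(categories, y_pred):
        totals[cat] += 1
        hits[cat] += int(pred == 1)
    return {cat: hits[cat] / totals[cat] for cat in totals}


# Toy attack records only (true label is 1 for every row); values are illustrative.
cats = ["Generic", "Generic", "Exploits", "Exploits",
        "Fuzzers", "Fuzzers", "Fuzzers", "Fuzzers"]
pred = [1, 1, 1, 0, 0, 0, 1, 0]

per_category = recall_by_category(cats, pred)
overall = sum(pred) / len(pred)
print(f"overall recall={overall:.2f}", per_category)
```

In this toy case an overall recall of 0.50 hides a spread from 1.00 down to 0.25 across categories, which is the kind of variation a single aggregate score cannot show.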


VI. CONCLUSIONS
The detailed examination and discussion of the outcomes provide valuable insights into the effectiveness of different machine learning classifiers for detecting intrusions in network security. The Random Forest classifier showed the best performance, with high accuracy, precision, recall, and F1 score. K-Nearest Neighbors and Logistic Regression also had good results, while Naive Bayes had a slightly lower performance. These results highlight the importance of using advanced machine-learning techniques for accurate intrusion detection. Choosing the right algorithm based on the specific characteristics of the cybersecurity task is crucial. However, it's important to recognize the limitations, and future research should focus on improving models, exploring new approaches, and incorporating more data to enhance the strength and applicability of intrusion detection systems. In conclusion, this study adds to the conversation on strengthening cybersecurity defenses through machine learning methods. In concluding this study, it is essential to highlight future research possibilities for advancing intrusion detection and network security. One potential avenue is to enhance existing models through hyperparameter tuning and ensemble methods. Moreover, utilizing diverse datasets could broaden the adaptability of models to various network scenarios. Exploring deep learning approaches like neural networks could help uncover complex patterns in network traffic data. Additionally, addressing limitations like dataset reliance and biases may require more comprehensive datasets and real-world scenarios for model validation. Lastly, research could focus on creating hybrid models that combine multiple classifiers' strengths for increased resilience.

REFERENCES
[1] S.-W. M. M. S. R. A. M. R. M. M. a. M. H. Lee, "Towards secure intrusion detection systems using deep learning techniques: Comprehensive analysis and review," Journal of Network and Computer Applications 187, 2021.
[2] H. a. G. K. Alqahtani, "Machine learning for enhancing transportation security: A comprehensive analysis of electric and flying vehicle systems," Engineering Applications of Artificial Intelligence 129, 2024.
[3] T. S. S. C. D. T. D. C. a. M. A. K. Saranya, "Performance analysis of machine learning algorithms in intrusion detection system: A review," Procedia Computer Science 171, 2020.
[4] M. D. M. a. K. R. Karthikeyan, "Firefly algorithm based WSN-IoT security enhancement with machine learning for intrusion detection," Scientific Reports 14, no. 1, 2024.
[5] X. X. Z. Z. Y. L. Y. B. Z. Q. L. a. X. L. Xiao, "A comprehensive analysis of website fingerprinting defenses on Tor," Computers & Security 136, 2024.
[6] K. M. M. M. K. F. K. a. I. G. Aygul, "Benchmark of machine learning algorithms on transient stability prediction in renewable rich power grids under cyber-attacks," Internet of Things 25, 2024.
[7] G. MeeraGandhi, "Machine learning approach for attack prediction and classification using supervised learning algorithms," Int. J. Comput. Sci. Commun 1, no. 2, pp. 247-250, 2010.
[8] M. a. M. M. Zamani, "Machine learning techniques for intrusion detection," arXiv preprint arXiv:1312.2177, 2013.
[9] Z. G. A. Y. Y. a. Y. L. Sun, "Optimized machine learning enabled intrusion detection system for internet of medical things," Franklin Open 6, 2024.
[10] A. L. a. E. G. Buczak, "A survey of data mining and machine learning methods for cyber security intrusion detection," IEEE Communications surveys & tutorials 18, no. 2, 2015.
[11] X.-S. Y. S. S. N. D. A. Joshi, "Fourth International Congress on Information and Communication Technology," ICICT 2019, London, Volume 2, 2019.
[12] M. A. M. A. R. K. a. T. A.-H. Haq, "Development of PCCNN-Based Network Intrusion Detection System for EDGE Computing," Computers, Materials & Continua 71, no. 1, 2022.
[13] I. H. Sarker, "Machine learning for intelligent data analysis and automation in cybersecurity: current and future prospects," Annals of Data Science 10, no. 6, 2023.
[14] A. A. A. D. V. a. S. S. Mahfouz, "Ensemble classifiers for network intrusion detection using a novel network attack dataset," Future Internet 12, no. 11, 2020.
[15] M. S. T. Z. H. S. U. R. G. A. a. Z. H. A. Waqas, "The role of artificial intelligence and machine learning in wireless networks security: Principle, practice and challenges," Artificial Intelligence Review 55, no. 7, 2022.
[16] H. M. a. P. S. Prachi, "Intrusion detection using machine learning and feature selection," International Journal of Computer Network and Information security 11, no. 4, 2019.
[17] A. a. M. A. R. Alotaibi, "Enhancing the Sustainability of Deep-Learning-Based Network Intrusion Detection Classifiers against Adversarial Attacks," Sustainability 15, no. 12, 2023.
[18] D. M. A. F. A. A. R. a. R. M. M. Musleh, "Intrusion Detection System Using Feature Extraction with Machine Learning Algorithms in IoT," Journal of Sensor and Actuator Networks 12, no. 2, 2023.
[19] M. A. a. M. A. R. K. Haq, "DNNBoT: Deep neural network-based botnet detection and classification," Computers, Materials & Continua 71, no. 1, 2022.
[20] E. S. R. R. N. Z. A. A. A. H. J. M. N. S. S. M. I. E. a. B. A. M. Alomari, "Malware detection using deep learning and correlation-based feature selection," Symmetry 15, no. 1, 2023.
[21] H. O. P. a. A. C. Polat, "Detecting DDoS attacks in software-defined networks through feature selection methods and machine learning models," Sustainability 12, no. 3, 2020.
[22] M. H. U. R. S. A. R. M. A. R. F. R. a. I. A. Imran, "A performance overview of machine learning-based defense strategies for advanced persistent threats in industrial control systems," Computers & Security 134, 2023.
[23] Z. M. M. I. a. M. N. H. Azam, "Comparative analysis of intrusion detection systems and machine learning based model analysis through decision tree," IEEE, 2023.
[24] M. A. S. M. a. M. A. Al-Shareeda, "DDoS attacks detection using machine learning and deep learning techniques: Analysis and comparison," Bulletin of Electrical Engineering and Informatics 12, no. 2, 2023.
[25] U. A. R. A. H. A. S. M. B. A. a. A. A. Butt, "Cloud-based email phishing attack using machine and deep learning algorithm," Complex & Intelligent Systems 9, no. 3, 2023.
[26] M. a. S. G. Rigaki, "A survey of privacy attacks in machine learning," ACM Computing Surveys 56, no. 4, 2023.
[27] B. M. M. AlShahrani, "Classification of cyber-attack using Adaboost regression classifier and securing the network," Turkish Journal of Computer and Mathematics Education (TURCOMAT) 12, no. 10, 2021.
[28] M. S. H. S. I. I. M. D. H. M. A. K. V. C. a. R. V. Akter, "Exploring the Vulnerabilities of Machine Learning and Quantum Machine Learning to Adversarial Attacks using a Malware Dataset: A Comparative Analysis," arXiv preprint arXiv:2305.19593, 2023.
[29] H. Z. S. Y. C. a. H. B. Xu, "A data-driven approach for intrusion and anomaly detection using automated machine learning for the Internet of Things," Soft Computing 27, no. 19, pp. 14469-14481, 2023.
[30] Z. X. X. L. C. S. S. a. Z. W. Wang, "Intrusion detection and network information security based on deep learning algorithm in urban rail transit management system," IEEE Transactions on Intelligent Transportation Systems 24, no. 2, 2023.
[31] K. Alnowaiser, "Improving Healthcare Prediction of Diabetic Patients Using KNN Imputed Features and Tri-Ensemble Model," IEEE Access, 2023.


[32] A. R. X. Y. C. a. M. G. Huang, "Research on multi-label user classification of social media based on ML-KNN algorithm," Technological Forecasting and Social Change 188, 2023.
[33] B. D. J. A. a. S. H. L. He, "Assessment of tunnel blasting-induced overbreak: A novel metaheuristic-based random forest approach," Tunnelling and Underground Space Technology 133, 2023.
[34] G. a. G. K. Kocher, "Analysis of Machine Learning Algorithms with Feature Selection for Intrusion Detection Using UNSW-NB15 Dataset," Available at SSRN 3784406, 2021.
[35] M. S. L. a. K. G. K. Beechey, "Evidential classification for defending against adversarial attacks on network traffic," Information Fusion 92, pp. 115-126, 2023.
[36] G. MeeraGandhi, "Machine Learning Approach for Attack Prediction and Classification using supervised learning algorithms," Int. J. Comput. Sci. Commun 1, no. 2, 2010.
[37] E. Balamurugan, A. Mehbodniya, E. Kariri, K. Yadav, A. Kumar, and M. Anul Haq, "Network optimization using defender system in cloud computing security based intrusion detection system with game theory deep neural network (IDSGT-DNN)," Pattern Recognit. Lett., vol. 156, pp. 142–151, 2022, doi: https://doi.org/10.1016/j.patrec.2022.02.013.
[38] H. Mohd Anul, "DBoTPM : A Deep Neural Network-Based Botnet," Electronics, vol. 12, no. 1159, pp. 1–14, 2023.
