
PREDICTION OF CYBER-ATTACKS USING DATA SCIENCE TECHNIQUE


Vinoth Kumar G, Mrs. G. Dhivya
Department of Computer Application
Karpagam Academy of Higher Education, Coimbatore-641 021, Tamil Nadu
[email protected], [email protected]

Abstract - A cyber-attack is an attack, via cyberspace, that targets an enterprise's use of cyberspace for the purpose of disrupting, disabling, destroying, or maliciously controlling a computing environment or infrastructure, or that destroys the integrity of data or steals controlled information. The state of cyberspace portends uncertainty for the future Internet and its rapidly growing number of users. New paradigms add further concerns, with big data collected through device sensors divulging large amounts of information that can be used for targeted attacks. Although a plethora of existing approaches, models and algorithms have provided the basis for cyber-attack prediction, there is a need to consider new models and algorithms based on data representations other than task-specific techniques. A non-linear information processing architecture can, however, be adapted to learn the different data representations of network traffic in order to classify the type of network attack. In this paper, we model cyber-attack prediction as a classification problem: the networking sector has to predict the type of network attack from a given dataset using machine learning techniques. The dataset is analysed with supervised machine learning techniques (SMLT) to capture several kinds of information, such as variable identification, uni-variate analysis, bi-variate and multi-variate analysis, and missing value treatment. A comparative study between machine learning algorithms is carried out to determine which algorithm is the most accurate in predicting the type of cyber-attack. We classify four types of attack: DoS, R2L, U2R and Probe. The results show the effectiveness of the proposed machine learning technique, with the models compared on accuracy, precision, recall, F1 score, sensitivity, specificity and entropy.

Index Terms - List key index terms here. No more than 5.
I. INTRODUCTION

Domain overview

Data Science

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data, and to apply that knowledge and those actionable insights across a broad range of application domains. Data science can be defined as a blend of mathematics, business acumen, tools, algorithms and machine learning techniques, all of which help us find the hidden insights or patterns in raw data that can be of major use in forming big business decisions.

Machine Learning

The majority of practical machine learning uses supervised learning. Supervised learning is where we have input variables (X) and an output variable (y) and use an algorithm to learn the mapping function from the input to the output, y = f(X). The goal is to approximate the mapping function so well that, when we have new input data (X), we can predict the output variable (y) for that data. Supervised machine learning techniques include logistic regression, multi-class classification, decision trees and support vector machines. Supervised learning requires that the data used to train the algorithm is already labelled with the correct answers. Supervised learning problems can be further grouped into classification problems. A classification problem has as its goal the construction of a succinct model that can predict the value of the dependent attribute from the attribute variables; the difference between the two tasks is that the dependent attribute is numerical for regression and categorical for classification. A classification model attempts to draw conclusions from observed values: given one or more inputs, it tries to predict the value of one or more outcomes. A classification problem is when the output variable is a category, such as "red" or "blue".
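To make the supervised setting above concrete, the short sketch below (an illustration only, written in Python with scikit-learn, using a toy feature matrix X and label vector y that are not from the paper's dataset) fits a classifier to labelled examples and predicts labels for new inputs:

    # A minimal supervised-learning sketch: learn y = f(X) from labelled data.
    # The tiny X and y below are toy placeholders, not the KDD Cup 99 data.
    from sklearn.tree import DecisionTreeClassifier

    X = [[0, 1], [1, 1], [0, 0], [1, 0]]            # input variables
    y = ["attack", "attack", "normal", "normal"]    # labels (the "correct answers")

    model = DecisionTreeClassifier()
    model.fit(X, y)                                 # learn the mapping f
    print(model.predict([[1, 1], [0, 0]]))          # predict y for new X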
II. REVIEW OF LITERATURE SURVEY

A literature review is a body of text that aims to review the critical points of current knowledge on, and/or methodological approaches to, a particular topic. It uses secondary sources and discusses published information in a particular subject area, sometimes restricted to a certain time period. Its ultimate goal is to bring the reader up to date with the current literature on a topic, and it forms the basis for another goal, such as future research that may be needed in the area; it precedes a research proposal and may be just a simple summary of sources. Usually, it has an organizational pattern and combines both summary and synthesis. A summary is a recap of the important information in a source, while a synthesis is a re-organization, or reshuffling, of that information. It might give a new interpretation of old material, combine new with old interpretations, or trace the intellectual progression of the field, including major debates. Depending on the situation, the literature review may evaluate the sources and advise the reader on the most pertinent or relevant of them.

Loan default trends have long been studied from a socio-economic standpoint. Most economics surveys rely on empirical modelling of these complex systems in order to predict the loan default rate for a particular individual. The use of machine learning for such tasks is a trend that is now being observed, and a number of surveys help in understanding the past and present perspectives on such prediction tasks.
Title: Distributed Secure Cooperative Control Under Denial-of-Service Attacks From Multiple Adversaries

Author: Wenying Xu, Guoqiang Hu

Year: 2019

• This paper investigated the distributed secure control of multiagent systems under DoS attacks, focusing on the jointly adverse impact of distributed DoS attacks from multiple adversaries. In this scenario, two kinds of communication schemes, namely sampled-data and event-triggered communication schemes, are discussed, and a fully distributed control protocol is then developed to guarantee satisfactory asymptotic consensus. This protocol has strong robustness and high scalability; its design does not involve any global information, and its efficiency has been proved. For the event-triggered case, two effective dynamical event conditions are designed and implemented in a fully distributed way, and both of them exclude Zeno behaviour. Finally, a simulation example is provided to verify the effectiveness of the theoretical analysis. The authors' future research topics focus on fully distributed event/self-triggered control for linear/nonlinear multiagent systems to gain a better understanding of fully distributed control.

Existing System:

The authors first proposed applying contrastive self-supervised learning to the anomaly detection problem on attributed networks. Their method, CoLA, mainly consists of three components: contrastive instance pair sampling, a GNN-based contrastive learning model, and multi-round sampling-based anomaly score computation. The model captures the relationship between each node and its neighbouring structure and uses an anomaly-related objective to train the contrastive learning model. The authors believe the proposed framework opens a new opportunity to extend self-supervised learning and contrastive learning to further graph anomaly detection applications. The multi-round scores predicted by the contrastive learning model are further used to evaluate the abnormality of each node with statistical estimation. The approach has a training phase and an inference phase: in the training phase, the contrastive learning model is trained with sampled instance pairs in an unsupervised fashion; after that, the anomaly score for each node is obtained in the inference phase.

Proposed System:

The proposed model is to build a machine learning model for anomaly detection. Anomaly detection is an important technique for recognizing fraud, suspicious activity, network intrusion, and other abnormal events that may have great significance but are difficult to detect. The machine learning model is built by applying proper data science techniques, such as variable identification, that is, identifying the dependent and independent variables. Visualisation of the data is then carried out to gain insights into the data. The model is built on the previous dataset, from which the algorithms learn and are trained; different algorithms are used for better comparison. Finally, the performance metrics are calculated and compared.
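As a rough illustration of this comparison step (a sketch only, not the authors' actual implementation), the Python code below assumes scikit-learn and a hypothetical pre-processed feature matrix X with attack-class labels y, and compares three classifiers on accuracy, precision, recall and F1 score:

    # Illustrative comparison of classifiers on hold-out data (X and y are assumed to exist).
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Decision Tree": DecisionTreeClassifier(),
        "Random Forest": RandomForestClassifier(n_estimators=100),
    }

    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        print(name,
              "acc=%.3f" % accuracy_score(y_test, pred),
              "prec=%.3f" % precision_score(y_test, pred, average="macro"),
              "rec=%.3f" % recall_score(y_test, pred, average="macro"),
              "f1=%.3f" % f1_score(y_test, pred, average="macro"))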
III. METHODOLOGY

Data science techniques provide a powerful tool for predicting and preventing cyber-attacks, as they allow us to identify patterns in large datasets and use them to develop models that can detect and predict cyber-attacks. In this section, we discuss the use of data science techniques for the prediction of cyber-attacks.

Data mining techniques, such as association rule mining, clustering and classification, can be used to identify patterns in large datasets and to develop models that detect and predict cyber-attacks. Association rule mining can be used to identify patterns in the data that indicate the presence of a malicious attack. Clustering algorithms can be used to group similar records together, and classification algorithms can be used to classify records into different classes, such as malicious or non-malicious.

Machine learning techniques, such as deep learning, can also be used to develop predictive models that detect and predict cyber-attacks. Deep learning models can be trained on large datasets to identify patterns that indicate the presence of a malicious attack. The models can then be used to predict future attacks and to take preventive action.

Tools:

Anaconda is a free and open-source distribution of the Python and R programming languages for scientific computing (data science, machine learning applications, large-scale data processing, predictive analytics, etc.) that aims to simplify package management and deployment. Package versions are managed by the package management system Conda. The Anaconda distribution is used by over 12 million users and includes more than 1,400 popular data-science packages suitable for Windows, Linux and macOS, together with the Conda package and virtual environment manager and the Anaconda Navigator graphical interface, eliminating the need to learn to install each library independently. Open-source packages can be individually installed from the Anaconda repository with the conda install command, or with the pip install command that ships with Anaconda; pip packages provide many of the features of conda packages and in most cases the two can work together. Custom packages can be made using the conda build command and shared with others by uploading them to Anaconda Cloud, PyPI or other repositories. The default installation of Anaconda2 includes Python 2.7 and Anaconda3 includes Python 3.7; however, new environments can be created that include any version of Python packaged with conda.
Methods:

The KDD Cup 99 data set contains normal traffic and 22 attack types, described by 41 features, and every generated traffic pattern ends with a label of either 'normal' or a type of 'attack' for subsequent analysis. A variety of attacks enter the network over a period of time, and these attacks are classified into the following four main classes.

➢ Denial of Service (DoS)

➢ User to Root (U2R)

➢ Remote to User (R2L)

➢ Probing

Denial of Service:

Denial of Service is a class of attacks where an attacker makes some computing or memory resource too busy or too full to handle legitimate requests, denying legitimate users access to a machine. The different ways to launch a DoS attack are:

➢ by abusing the computer's legitimate features,

➢ by targeting implementation bugs,

➢ by exploiting misconfiguration of the system.

DoS attacks are classified based on the services that an attacker renders unavailable to legitimate users.

User to Root:

In a User to Root attack, an attacker starts with access to a normal user account on the system and gains root access. Regular programming mistakes and environment assumptions give the attacker the opportunity to exploit a vulnerability and obtain root access.

Remote to User:

In a Remote to User attack, an attacker sends packets to a machine over a network and exploits the machine's vulnerability to gain local access as a user illegally. There are different types of R2L attacks, and the most common attack in this class is carried out using social engineering.

Probing:

Probing is a class of attacks where an attacker scans a network to gather information in order to find known vulnerabilities. An attacker with a map of the machines and services available on a network can use this information to look for exploits. There are different types of probes: some abuse the computer's legitimate features and some use social engineering techniques. This class of attacks is the most common because it requires very little technical expertise.
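As a rough illustration of how the 22 raw KDD Cup 99 labels could be grouped into these four classes, the sketch below (assuming pandas, a hypothetical local CSV copy of the dataset with a 'label' column, and a representative rather than exhaustive label-to-class mapping) adds a four-class target column:

    # Illustrative grouping of KDD Cup 99 labels into the four attack classes plus 'normal'.
    # The file path, column layout and mapping below are assumptions made for this sketch.
    import pandas as pd

    attack_class = {
        "neptune": "DoS", "smurf": "DoS", "back": "DoS", "teardrop": "DoS",
        "buffer_overflow": "U2R", "rootkit": "U2R",
        "guess_passwd": "R2L", "warezclient": "R2L",
        "ipsweep": "Probe", "portsweep": "Probe", "nmap": "Probe", "satan": "Probe",
        "normal": "normal",
    }

    df = pd.read_csv("kddcup99.csv")            # hypothetical local copy with a 'label' column
    df["label"] = df["label"].str.rstrip(".")   # raw labels often carry a trailing dot
    df["attack_class"] = df["label"].map(attack_class).fillna("other")
    print(df["attack_class"].value_counts())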


IV. BLOCK DIAGRAM

1. Architecture Diagram (figure)

2. Work Flow Diagram (figure): Source Data → Data Processing and Cleaning → Training Dataset / Testing Dataset → Classification ML Algorithms → Best Model by Accuracy → Finding Network Attack → Website

3. Use Case Diagram (figure)

4. Class Diagram (figure)

5. Activity Diagram (figure)


V. RESULTS AND DISCUSSION

Data Pre-processing:

Pre-processing refers to the transformations applied to the data before feeding it to the algorithm. Data pre-processing is a technique used to convert raw data into a clean data set; whenever data is gathered from different sources, it is collected in a raw format that is not feasible for analysis. To achieve better results from the applied machine learning model, the data has to be in a proper form. Some machine learning models need information in a specified format; for example, the Random Forest algorithm does not support null values, so null values have to be managed in the original raw data set before the algorithm is run. Another consideration is that the data set should be formatted in such a way that more than one machine learning and deep learning algorithm can be executed on it.
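A minimal sketch of this kind of null-value handling (assuming pandas and the hypothetical data frame df from the loading sketch above):

    # Illustrative duplicate and null-value handling before model training (df is assumed).
    print(df.isnull().sum())                      # count missing values per column
    df = df.drop_duplicates()                     # remove duplicate records
    df = df.dropna(subset=["attack_class"])       # drop rows whose target label is missing
    df = df.fillna(df.median(numeric_only=True))  # impute remaining numeric gaps with the median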
Variable Identification / Data Validation Process:

Validation techniques in machine learning are used to get the error rate of the machine learning (ML) model, which can be considered close to the true error rate on the dataset. If the data volume is large enough to be representative of the population, validation techniques may not be needed; however, in real-world scenarios we work with samples of data that may not be a true representative of the population of the given dataset. This step finds the missing values and duplicate values and describes the data type of each variable, whether it is a float or an integer. The validation sample is used to provide an unbiased evaluation of a model fit on the training dataset while tuning the model hyper-parameters. The evaluation becomes more biased as skill on the validation dataset is incorporated into the model configuration. The validation set is used to evaluate a given model frequently, and machine learning engineers use this data to fine-tune the model hyper-parameters. Data collection, data analysis, and the process of addressing data content, quality, and structure can add up to a time-consuming to-do list. During the process of data identification, it helps to understand the data and its properties; this knowledge helps in choosing which algorithm to use to build the model. For example, time series data can be analysed by regression algorithms, while classification algorithms can be used to analyse discrete data (for example, by showing the data type format of the given dataset).

(Figure: Percentage level of protocol type)

Data Validation / Cleaning / Preparing Process:

The library packages are imported and the given dataset is loaded. Variable identification is analysed through the data shape and data types, and missing and duplicate values are evaluated. A validation dataset is a sample of data held back from training the model that is used to give an estimate of model skill while tuning the model; there are procedures that can be followed to make the best use of validation and test datasets when evaluating models. Data cleaning and preparation include renaming columns of the given dataset, dropping columns, and so on, in order to carry out the uni-variate, bi-variate and multi-variate analysis. The steps and techniques for data cleaning will vary from dataset to dataset. The primary goal of data cleaning is to detect and remove errors and anomalies to increase the value of the data in analytics and decision making.

Exploratory Data Analysis and Visualization:

Data visualization is an important skill in applied statistics and machine learning. Statistics focuses on quantitative descriptions and estimations of data, while data visualization provides an important suite of tools for gaining a qualitative understanding. This can be helpful when exploring and getting to know a dataset, and it can help with identifying patterns, corrupt data, outliers, and much more. With a little domain knowledge, data visualizations can be used to express and demonstrate key relationships in plots and charts that are more visceral to stakeholders than measures of association or significance. Data visualization and exploratory data analysis are whole fields in themselves and reward a deeper dive.
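A brief sketch of this identification and validation step (again assuming pandas/scikit-learn and the hypothetical df with 'label' and 'attack_class' columns from the earlier sketches):

    # Illustrative variable identification and train/validation split (df is assumed).
    import pandas as pd
    from sklearn.model_selection import train_test_split

    print(df.shape)                  # number of rows and columns
    print(df.dtypes)                 # data type of each variable (int, float, object)
    print(df.duplicated().sum())     # number of duplicate records

    X = pd.get_dummies(df.drop(columns=["label", "attack_class"]))  # independent variables, one-hot encoded
    y = df["attack_class"]                                          # dependent variable
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)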


Sometimes data does not make sense until it is looked at in a visual form, such as charts and plots. Being able to quickly visualize data samples is an important skill both in applied statistics and in applied machine learning. This involves discovering the many types of plots needed when visualizing data in Python and how to use them to better understand the data:

➢ how to chart time series data with line plots and categorical quantities with bar charts;

➢ how to summarize data distributions with histograms and box plots;

➢ how to summarize the relationship between variables with scatter plots.
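A minimal plotting sketch along these lines (assuming matplotlib and the hypothetical df; the column names are placeholders chosen for illustration):

    # Illustrative exploratory plots (column names are assumptions for this sketch).
    import matplotlib.pyplot as plt

    df["attack_class"].value_counts().plot(kind="bar")     # categorical quantities as a bar chart
    plt.title("Records per attack class")
    plt.show()

    df["src_bytes"].plot(kind="hist", bins=50)              # distribution of a single feature
    plt.title("Distribution of src_bytes")
    plt.show()

    df.plot(kind="scatter", x="src_bytes", y="dst_bytes")   # relationship between two features
    plt.title("src_bytes vs dst_bytes")
    plt.show()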
Many machine learning algorithms are sensitive to the range and distribution of attribute values in the input data. Outliers in the input data can skew and mislead the training process of machine learning algorithms, resulting in longer training times, less accurate models and ultimately poorer results. Even before predictive models are prepared on training data, outliers can result in misleading representations and, in turn, misleading interpretations of the collected data. Outliers can skew the summary distribution of attribute values in descriptive statistics, such as the mean and standard deviation, and in plots such as histograms and scatter plots, compressing the body of the data. Finally, outliers can represent examples of data instances that are relevant to the problem, such as anomalies in the case of fraud detection and computer security.

A model cannot simply be fitted on the training data with the claim that it will work accurately on real data. For this, we must ensure that the model has learned the correct patterns from the data and has not picked up too much noise. Cross-validation is a technique in which we train the model using a subset of the data set and then evaluate it using the complementary subset of the data set.
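A short cross-validation sketch under the same assumptions (scikit-learn, with the hypothetical encoded features X and labels y from the earlier sketches):

    # Illustrative 5-fold cross-validation of a classifier (X and y are assumed).
    from sklearn.model_selection import cross_val_score
    from sklearn.ensemble import RandomForestClassifier

    scores = cross_val_score(RandomForestClassifier(n_estimators=100), X, y, cv=5, scoring="accuracy")
    print("fold accuracies:", scores)
    print("mean accuracy:", scores.mean())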
VI. ALGORITHM EXPLANATION

In machine learning and statistics, classification is a supervised learning approach in which the computer program learns from the data input given to it and then uses this learning to classify new observations. The data set may simply be bi-class (for example, identifying whether a person is male or female, or whether a mail is spam or non-spam) or it may be multi-class. Some examples of classification problems are speech recognition, handwriting recognition, biometric identification and document classification. In supervised learning, algorithms learn from labelled data; after understanding the data, the algorithm determines which label should be given to new data by recognising patterns and associating those patterns with the unlabelled new data.

Logistic Regression:

Logistic regression is a statistical method for analysing a data set in which there are one or more independent variables that determine an outcome. The outcome is measured with a dichotomous variable (one in which there are only two possible outcomes). The goal of logistic regression is to find the best-fitting model to describe the relationship between the dichotomous characteristic of interest (the dependent, response or outcome variable) and a set of independent (predictor or explanatory) variables. Logistic regression is a machine learning classification algorithm used to predict the probability of a categorical dependent variable. In logistic regression, the dependent variable is a binary variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.); in other words, the logistic regression model predicts P(Y=1) as a function of X. The assumptions of logistic regression are:

➢ Binary logistic regression requires the dependent variable to be binary.

➢ For a binary regression, factor level 1 of the dependent variable should represent the desired outcome.

➢ Only the meaningful variables should be included.

➢ The independent variables should be independent of each other; that is, the model should have little or no multicollinearity.

➢ The independent variables should be linearly related to the log odds.

➢ Logistic regression requires quite large sample sizes.
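A minimal logistic regression sketch (assuming scikit-learn and the hypothetical training/validation split from the earlier sketches; scikit-learn's implementation also handles the multi-class case directly):

    # Illustrative logistic regression classifier (X_train, y_train, X_val, y_val are assumed).
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    lr = LogisticRegression(max_iter=1000)
    lr.fit(X_train, y_train)
    print("validation accuracy:", accuracy_score(y_val, lr.predict(X_val)))
    print("class probabilities for first record:", lr.predict_proba(X_val[:1]))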
Decision Tree:

The decision tree is one of the most powerful and popular algorithms. The decision-tree algorithm falls under the category of supervised learning algorithms, and it works for both continuous and categorical output variables. The assumptions of a decision tree are:

• At the beginning, we consider the whole training set as the root.

• Attributes are assumed to be categorical for information gain; for the Gini index, attributes are assumed to be continuous.

• Records are distributed recursively on the basis of attribute values.

• Statistical methods are used for ordering attributes as the root or as an internal node.

A decision tree builds classification or regression models in the form of a tree structure. It breaks down a data set into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. A decision node has two or more branches, and a leaf node represents a classification or decision. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data. The tree utilizes an if-then rule set that is mutually exclusive and exhaustive for classification. The rules are learned sequentially using the training data, one rule at a time; each time a rule is learned, the tuples covered by the rules are removed. This process is continued on the training set until a termination condition is met. The tree is constructed in a top-down, recursive, divide-and-conquer manner. All the attributes should be categorical; otherwise, they should be discretized in advance. Attributes at the top of the tree have more impact on the classification, and they are identified using the information gain concept. A decision tree can easily be over-fitted, generating too many branches, and may reflect anomalies due to noise or outliers.
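A corresponding decision tree sketch (same assumptions as the sketches above; the depth limit is an illustrative way of curbing the over-fitting just mentioned):

    # Illustrative decision tree classifier (X_train, y_train, X_val, y_val are assumed).
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    dt = DecisionTreeClassifier(criterion="entropy", max_depth=10)  # information-gain splits, limited depth
    dt.fit(X_train, y_train)
    print("validation accuracy:", accuracy_score(y_val, dt.predict(X_val)))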

Random Forest:

Random forests, or random decision forests, are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of over-fitting to their training set. Random forest is a type of supervised machine learning algorithm based on ensemble learning. Ensemble learning is a type of learning in which different types of algorithms, or the same algorithm multiple times, are joined to form a more powerful prediction model. The random forest algorithm combines multiple algorithms of the same type, i.e. multiple decision trees, resulting in a forest of trees, hence the name "random forest". The random forest algorithm can be used for both regression and classification tasks.

The following are the basic steps involved in performing the random forest algorithm:

➢ Pick N random records from the dataset.

➢ Build a decision tree based on these N records.

➢ Choose the number of trees you want in your algorithm and repeat steps 1 and 2.

➢ In the case of a regression problem, for a new record, each tree in the forest predicts a value for Y (the output), and the final value is calculated by taking the average of all the values predicted by all the trees in the forest. In the case of a classification problem, each tree in the forest predicts the category to which the new record belongs, and the new record is finally assigned to the category that wins the majority vote.
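A random forest sketch under the same assumptions as the earlier algorithm sketches:

    # Illustrative random forest classifier (X_train, y_train, X_val, y_val are assumed).
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report

    rf = RandomForestClassifier(n_estimators=100, random_state=42)  # 100 trees, majority vote for the class
    rf.fit(X_train, y_train)
    print(classification_report(y_val, rf.predict(X_val)))          # per-class precision, recall and F1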
VII. CONCLUSION

The analytical process started from data cleaning and processing, missing value treatment and exploratory analysis, and finished with model building and evaluation. The best accuracy on the public test set, that is, the highest accuracy score, is found by comparing each algorithm across all types of network attack, so that future predictions are made with the best-performing model for each connection. This gives some insight into diagnosing the network attack associated with each new connection. The aim was to present a prediction model, with the aid of artificial intelligence, that improves over human accuracy and provides scope for early detection. It can be inferred from this model that such analysis and the use of machine learning techniques are useful in developing prediction models that can help the networking sector shorten the long process of diagnosis and eradicate human error.

