Nathan

Chapter One
1.1 Introduction
In the digital era, the security of web applications is of paramount importance. As organizations
increasingly depend on web-based systems for their operations, the need to protect these systems
from unauthorized access and cyber threats has become critical. Authentication activities, which
involve verifying the identity of users accessing a system, are a primary target for malicious
actors seeking to breach security protocols. Detecting suspicious authentication activities within
web application logs is essential to prevent unauthorized access, data breaches, and other
security incidents. Machine learning algorithms offer a promising approach to identifying these
suspicious activities by analyzing patterns and anomalies in authentication logs. This study
focuses on the application of machine learning algorithms to detect suspicious authentication
activities in web application logs, using a case study approach.
Figure 1: Flow of the phishing process (Hou et al., 2010)
1.2 Background of the Study

The rapid growth of internet-based services has led to an increase in cyber threats targeting web
applications. According to the Verizon Data Breach Investigations Report (DBIR, 2020), over
80% of hacking-related breaches involve compromised passwords. Traditional methods of
detecting suspicious authentication activities, such as rule-based systems, are often insufficient
due to their inability to adapt to evolving threats and detect complex attack patterns. Machine
learning algorithms have demonstrated significant potential in the field of cyber security. By
leveraging large datasets and advanced computational techniques, these algorithms can identify
subtle anomalies and patterns that may indicate suspicious activities. Various machine learning
approaches, including supervised, unsupervised, and reinforcement learning, have been explored
to enhance the detection of fraudulent activities in web application logs (Kumar & Raj, 2018).
Supervised learning algorithms, such as decision trees, random forests, and support vector
machines (SVM), are commonly used for anomaly detection in web applications. These
algorithms require labeled datasets to train models that can classify activities as normal or
suspicious. For instance, SVM has been successfully applied to detect malicious login attempts
by analyzing features such as login frequency, IP address, and time of access (Xia et al., 2019).
Unsupervised learning algorithms, such as clustering and autoencoders, do not require labeled
data and are particularly useful for detecting new or unknown threats. Clustering algorithms like
K-means and DBSCAN can group similar authentication activities and identify outliers that may
represent suspicious actions (Ahmed et al., 2016).
Reinforcement learning, although less common in this domain, offers a dynamic approach to
anomaly detection by continuously learning from interactions with the environment. This
approach has shown promise in adapting to new attack patterns and improving detection
accuracy over time (Chen et al., 2020).
Detecting suspicious authentication activities involves several challenges. Firstly, the volume
and variety of data generated by web applications require efficient and scalable algorithms.
Secondly, the dynamic nature of cyber threats necessitates adaptive models that can learn from
new data and evolve over time. Thirdly, maintaining detection accuracy while keeping false
positives low is crucial to avoid unnecessary alerts and maintain user trust. Recent
advancements in machine learning, including deep learning techniques such as recurrent neural
networks (RNN) and convolutional neural networks (CNN), have further enhanced the ability to
detect complex patterns in authentication logs. These models can capture temporal dependencies
and intricate relationships within the data, improving the detection of sophisticated attack
strategies (Zhang et al., 2019).
1.3 Research Problem

Despite the potential of machine learning algorithms to detect suspicious authentication activities,
several research gaps remain. Existing studies often focus on specific algorithms or datasets,
limiting the generalizability of their findings. There is a need for comprehensive research that
evaluates multiple machine learning approaches across diverse datasets and real-world scenarios.
Furthermore, integrating these algorithms into practical systems that can be easily adopted by
organizations poses additional challenges. This study aims to address these gaps by conducting a
case study on the application of machine learning algorithms in detecting suspicious
authentication activities in web application logs. The research will evaluate the effectiveness of
different algorithms, identify the most suitable approaches for various scenarios, and propose
practical solutions for implementation.
1.4 Aim and Objectives of the Study

The primary aim of this study is to evaluate the effectiveness of machine learning algorithms in
detecting suspicious authentication activities in web application logs. The specific objectives are
as follows:
I. To review and compare various machine learning algorithms used for anomaly detection
in web application logs.
II. To develop a framework for implementing machine learning algorithms to detect
suspicious authentication activities.
III. To evaluate the performance of different algorithms using real-world datasets.
IV. To identify the challenges and limitations associated with implementing machine
learning-based detection systems.
V. To propose recommendations for improving the accuracy and efficiency of machine
learning algorithms in detecting suspicious activities.
1.5 Significance of the Study

This study is significant for several reasons. First, it addresses a critical aspect of cyber security by
enhancing the detection of suspicious authentication activities, thereby protecting web
applications from unauthorized access and potential breaches. Second, the research provides valuable
insights into the application of machine learning algorithms in a practical context, offering
guidance for organizations seeking to implement these technologies. Third, the findings contribute to
the academic and professional discourse on cyber security, advancing the understanding of how
machine learning can be leveraged to improve security measures. For organizations, the
implementation of effective machine learning-based detection systems can lead to improved
security posture, reduced risk of data breaches, and enhanced trust among users. For users, the
benefits include increased confidence in the security of their personal information and a
reduction in the likelihood of unauthorized access to their accounts.
1.6 Scope of the Study

The scope of this study includes the review and evaluation of machine learning algorithms for
detecting suspicious authentication activities in web application logs. The study focuses on the
following areas:
• Algorithm Review: Reviewing and comparing supervised, unsupervised, and
reinforcement learning algorithms used for anomaly detection.
• Framework Development: Developing a framework for implementing machine learning
algorithms in a web application context.
• Performance Evaluation: Evaluating the performance of different algorithms using
real-world datasets.
• Challenges and Limitations: Identifying the challenges and limitations associated with
implementing these algorithms.
• Recommendations: Proposing recommendations for improving the effectiveness of
machine learning-based detection systems.
The study is limited to web application logs and does not cover other types of logs or
authentication activities in non-web contexts.
1.7 Limitations to the Study

This study has several limitations. First, the availability and quality of real-world datasets for
training and testing machine learning models may pose challenges. Second, the performance of
machine learning algorithms can vary with the specific characteristics of the data, making it
difficult to generalize findings across different contexts. Third, the implementation of these
algorithms in practical systems may face technical and organizational challenges, such as
integration with existing security infrastructure and ensuring user privacy. Finally, the study
focuses on detection and does not address subsequent actions or interventions following the
identification of suspicious activities.
1.8 Organization of the Study

This study is organized into five chapters:
Chapter One: General Introduction - Provides an overview of the study, including the
background, research problem, objectives, significance, scope, limitations, and organization of
the study.
Chapter Two: Literature Review - Reviews existing literature on machine learning algorithms
for anomaly detection, related concepts, and the application of these algorithms in detecting
suspicious authentication activities.
Chapter Three: Methodology - Describes the research design, data collection methods, and
analysis techniques used in the study.
Chapter Four: Results and Discussion - Presents the findings of the study, including the
performance evaluation of different algorithms and a discussion of the results.
Chapter Five: Conclusion and Recommendations - Summarizes the key findings, discusses their
implications, and provides recommendations for future research and practice.
1.9 Chapter Summary

This chapter provided an introduction to the study on detecting suspicious authentication
activities in web application logs using machine learning algorithms. The background of the
study highlighted the importance of cyber security and the potential of machine learning in
enhancing detection capabilities. The research problem, aims, and objectives were outlined,
followed by a discussion of the significance, scope, limitations, and organization of the study.
The next chapter will review the existing literature on related concepts and machine learning
algorithms for anomaly detection.
References for Chapter One
Ahmed, M., Mahmood, A. N., & Hu, J. (2016). A survey of network anomaly detection
techniques. Journal of Network and Computer Applications, 60, 19-31.
Chen, T., Xu, H., & He, Y. (2020). Reinforcement learning for cyber-physical systems security.
Journal of cyber security, 6(1), tyaa011.
DBIR. (2020). Verizon Data Breach Investigations Report.
Kumar, A., & Raj, P. (2018). Predictive Analytics in cyber security: Machine Learning and Data
Mining Approaches. Springer.
Xia, Y., Wang, X., & Zhang, Y. (2019). Anomaly detection in login behaviors based on SVM.
Journal of Intelligent & Fuzzy Systems, 36(3), 2399-2406.
Zhang, Y., Xiang, Y., Wang, G., & Tang, X. (2019). Deep learning for anomaly detection:
Opportunities and challenges. Big Data Mining and Analytics, 2(4), 316-329.
Hou, Y.-T., Chang, Y., Chen, T., Laih, C.-S., & Chen, C.-M. (2010). Malicious web content
detection by machine learning. Expert Systems with Applications, 37(1), 55-60.

Chapter Two
Literature Review
2.1 Introduction
This chapter reviews the existing literature on the application of machine learning algorithms for
detecting suspicious authentication activities in web application logs. It covers related concepts,
definitions of key terms, theoretical frameworks, the case study system, and the application of
machine learning algorithms. The chapter also reviews related work and examples of similar
applications, providing a comprehensive understanding of the current state of research and
practice in this field.
History and Evolution of Web Applications
Web applications have significantly transformed the digital landscape since their inception.
Initially, web pages were static, offering limited interaction and functionality. The early 1990s
marked the advent of the World Wide Web, with the introduction of HTML and basic web
browsers like Mosaic. These static websites served primarily as online brochures, offering
information without much user interaction.
The late 1990s and early 2000s saw the rise of dynamic web applications, powered by
technologies such as JavaScript, PHP, and ASP.NET. This period witnessed the emergence of
platforms like Amazon and eBay, which allowed users to interact with the website, perform
searches, and make purchases in real-time (Zeldman, 2013). The concept of Web 2.0 introduced
richer user experiences and interactive content, facilitated by AJAX and other asynchronous
technologies, leading to the development of social media platforms, online banking, and
collaborative tools like Google Docs (O'Reilly, 2015).
The continuous evolution of web applications has been driven by advancements in web standards,
security protocols, and frameworks. Modern web applications leverage technologies like
HTML5, CSS3, and robust JavaScript frameworks such as React and Angular, enabling the
creation of highly responsive and user-friendly interfaces (W3C, 2014). Furthermore, the rise of
cloud computing and microservices architecture has enabled scalable and resilient web
applications that can handle vast amounts of data and traffic.
Cyber Attacks
As web applications have evolved, so too have the threats against them. Cyber attacks have
become increasingly sophisticated, targeting vulnerabilities in web applications to gain
unauthorized access, steal data, or disrupt services. The history of cyber attacks dates back to the
early days of computing, but the proliferation of the internet has amplified their scale and impact.
One of the earliest and most notable cyber attacks was the Morris Worm in 1988, which
exploited vulnerabilities in UNIX systems and caused significant disruptions (Spafford, 1989).
The 2000s saw a surge in phishing attacks, SQL injection, and cross-site scripting (XSS) attacks,
exploiting weaknesses in web application security to steal sensitive information and compromise
user accounts (Sullivan, 2017).
In recent years, cyber attacks have become more targeted and complex. Advanced Persistent
Threats (APTs) and ransomware attacks like WannaCry and NotPetya have demonstrated the
capability of cybercriminals to cause widespread damage and extort organizations for financial
gain (Greenberg, 2018). The increasing interconnectivity of devices through the Internet of
Things (IoT) has also expanded the attack surface, leading to new challenges in securing web
applications.
Machine Learning in Cyber Security
Machine learning (ML) has emerged as a powerful tool in the fight against cyber threats. By
analyzing vast amounts of data, ML algorithms can identify patterns and anomalies that may
indicate malicious activities. The application of ML in cyber security has evolved alongside
advancements in computational power and data availability.
Early approaches to automated threat detection in cyber security involved rule-based systems and
signature-based detection methods. These systems relied on predefined patterns to identify
known threats, but they struggled with detecting new or evolving attacks (Axelsson, 2000). As
machine learning techniques advanced, anomaly detection models were developed, allowing for
the identification of unusual behavior that deviates from established norms (Chandola, Banerjee,
& Kumar, 2019).
Supervised learning algorithms, such as decision trees and support vector machines, have been
employed to classify benign and malicious activities based on labeled datasets. These models
have shown effectiveness in identifying known attack patterns but require substantial labeled
data for training (Buczak & Guven, 2016). Unsupervised learning methods, like clustering and
association rule mining, have been used to detect novel attacks by identifying outliers and
correlations in unlabeled data (Sommer & Paxson, 2010).
Recent advancements in deep learning have further enhanced the capabilities of machine
learning in cyber security. Deep neural networks and recurrent neural networks (RNNs) can
process complex data structures and temporal sequences, making them suitable for detecting
sophisticated attacks and predicting future threats (Kim et al., 2017). Additionally, the
integration of machine learning with real-time monitoring and response systems has enabled
proactive threat detection and mitigation.
2.2 Overview of Related Concepts

Authentication and Authorization: Authentication and authorization are foundational concepts
in cyber security. Authentication is the process of verifying the identity of a user or system,
typically through credentials such as usernames and passwords, biometric data, or security
tokens. Authorization, on the other hand, determines what authenticated users are allowed to do,
specifying their access rights and permissions within a system. Together, these processes ensure
that only legitimate users can access specific resources and perform authorized actions.
Web Application Security: Web application security involves protecting web applications from
various security threats and vulnerabilities. Common issues include SQL injection, cross-site
scripting (XSS), cross-site request forgery (CSRF), and broken authentication and session
management. Ensuring web application security requires implementing robust coding practices,
using security tools and frameworks, and regularly performing security testing and audits.
Intrusion Detection Systems (IDS): Intrusion Detection Systems are designed to monitor
network or system activities for malicious actions or policy violations. Traditional IDS can be
categorized into network-based (NIDS) and host-based (HIDS) systems. Modern IDS often
incorporate machine learning techniques to improve their detection capabilities. IDS can operate
in a signature-based mode, detecting known threats, or in an anomaly-based mode, identifying
unusual patterns that may indicate new threats.
Data Mining: Data mining involves extracting useful information from large datasets. In the
context of cyber security, data mining techniques can analyze logs and network traffic to identify
patterns and correlations that may indicate security threats. Techniques such as clustering,
classification, and association rule learning are commonly used in data mining.
Anomaly Detection: Anomaly detection is the process of identifying unusual patterns that do not
conform to expected behavior. In cyber security, this can involve detecting deviations from
normal user behavior or network traffic patterns that may indicate a potential security breach.
Anomaly detection is crucial for identifying zero-day attacks and other novel threats that may not
be recognized by signature-based detection methods.
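As a minimal illustration of this idea, the sketch below scores a new observation against a baseline of past values using a z-score. The daily failed-login counts, the function names, and the cut-off of three standard deviations are illustrative assumptions and are not values taken from this study.

```python
from statistics import mean, stdev

def zscore(baseline, value):
    """Return how many sample standard deviations `value` lies from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # All baseline values are identical: any different value is maximally unusual.
        return 0.0 if value == mu else float("inf")
    return (value - mu) / sigma

# Hypothetical daily counts of failed logins for one account (assumed data).
baseline_failed_logins = [2, 3, 1, 2, 4, 3, 2]
THRESHOLD = 3.0  # assumed cut-off; in practice this would be tuned on validation data

def is_anomalous(value, baseline=baseline_failed_logins, threshold=THRESHOLD):
    """Flag a value whose absolute z-score exceeds the threshold."""
    return abs(zscore(baseline, value)) > threshold
```

Under these assumptions, a day with 25 failed logins would be flagged, while a day with 3 would not. The baseline is kept separate from the value being scored so that a single extreme observation does not inflate the standard deviation it is measured against.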
Big Data Analytics: Big data analytics refers to the process of examining large and varied
datasets to uncover hidden patterns, unknown correlations, and other useful information. In cyber
security, big data analytics can be used to analyze extensive logs and network traffic data,
enabling the detection of sophisticated threats that might be missed by traditional methods.
Cyber Threat Intelligence: Cyber threat intelligence involves gathering, analyzing, and
disseminating information about potential or ongoing threats to an organization's cyber security.
This information can include indicators of compromise (IOCs), threat actor tactics, techniques,
and procedures (TTPs), and other relevant data that can help in anticipating and mitigating cyber
threats.
Behavioral Analysis: Behavioral analysis in cyber security involves monitoring and analyzing
the behavior of users and systems to detect anomalies that may indicate malicious activities. This
can include tracking login patterns, usage habits, and other behavioral indicators. Machine
learning models can be trained to recognize normal behavior and flag deviations as potential
security threats.
Supervised vs. Unsupervised Learning: In machine learning, supervised learning involves
training a model on labeled data, where the outcomes are known, to make predictions or
classifications. Unsupervised learning, on the other hand, involves training a model on data
without predefined labels, allowing the model to identify hidden patterns or groupings within the
data. Both approaches have applications in cyber security, such as classifying known attack types
or discovering previously unknown threats.
Pattern Recognition: Pattern recognition involves identifying and classifying patterns in data. In
cyber security, pattern recognition can be used to detect malicious activities by recognizing
patterns that are indicative of security threats. This can include identifying repeated sequences of
actions that match known attack signatures or discovering new patterns that may indicate
emerging threats.
2.2.1 Definition of Related Terms

Machine Learning (ML): Machine learning is a subset of artificial intelligence that involves the
development of algorithms that allow computers to learn from and make predictions or decisions
based on data. It can be categorized into supervised learning, unsupervised learning, and
reinforcement learning.
Anomaly Detection: Anomaly detection is the identification of rare items, events, or
observations which raise suspicions by differing significantly from the majority of the data. It is
commonly used in various domains such as fraud detection, network security, and system health
monitoring.
Authentication Activities: Authentication activities refer to processes through which a system
verifies the identity of a user attempting to access resources. Common methods include
passwords, biometrics, and multi-factor authentication.
Web Application Logs: Web application logs are records of events and transactions that occur
within a web application. These logs typically include information such as user login attempts, IP
addresses, timestamps, and actions performed by users.
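To make the notion of a log record concrete, the following sketch parses one authentication log line into named fields. The line layout (ISO timestamp, IP address, username, event name, quoted user agent) and the event names `LOGIN_SUCCESS` and `LOGIN_FAILED` are hypothetical; real web applications use many different formats, and the pattern would need to be adapted accordingly.

```python
import re
from datetime import datetime, timezone

# Assumed layout: <timestamp> <ip> <user> <event> "<user agent>"
LOG_PATTERN = re.compile(
    r'^(?P<timestamp>\S+)\s+(?P<ip>\S+)\s+(?P<user>\S+)\s+'
    r'(?P<event>LOGIN_SUCCESS|LOGIN_FAILED)\s+"(?P<user_agent>[^"]*)"$'
)

def parse_log_line(line):
    """Return a dict of fields for one log line, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    try:
        record["timestamp"] = datetime.strptime(
            record["timestamp"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
    except ValueError:
        return None  # malformed timestamp
    # Replace the raw event name with a boolean outcome flag.
    record["success"] = record.pop("event") == "LOGIN_SUCCESS"
    return record

# Hypothetical example line (documentation-range IP address).
example = parse_log_line(
    '2024-03-01T02:14:07Z 203.0.113.7 alice LOGIN_FAILED "Mozilla/5.0"'
)
```

Returning `None` for lines that do not match lets a caller count and inspect unparsable entries rather than silently discarding them.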
2.2.2 The Machine Learning Model/Algorithm
Machine Learning in Cyber Security

Machine Learning (ML) is a subset of artificial intelligence (AI) that focuses on developing
algorithms that can learn from and make predictions or decisions based on data. In cyber security,
ML algorithms are employed to identify patterns and anomalies in data, which can help detect
suspicious activities, including unauthorized access attempts in web applications.
Machine Learning Algorithms
Supervised Learning Algorithms
Supervised learning involves training a model on a labeled dataset, where the input data is paired
with the correct output. Algorithms such as decision trees, random forests, and support vector
machines (SVM) are commonly used for classification tasks in anomaly detection (Hastie et al.,
2009).
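The core operation inside a decision tree can be sketched in a few lines: choosing the threshold on a single feature that best separates the labeled classes. The example below is a one-level tree (a decision stump) using Gini impurity. The feature (failed attempts per session), the data, and the labels are invented for illustration, and a full decision tree or random forest would repeat this search recursively over many features.

```python
def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_threshold(values, labels):
    """Return (threshold, weighted impurity) for the best single split on one feature."""
    best = (None, float("inf"))
    candidates = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical labeled sessions: failed attempts and whether the session was suspicious.
failed_attempts = [0, 1, 1, 2, 8, 9, 12, 15]
suspicious = [0, 0, 0, 0, 1, 1, 1, 1]

threshold, impurity = best_threshold(failed_attempts, suspicious)

def predict(value, threshold=threshold):
    """Classify a session as suspicious (1) if it exceeds the learned threshold."""
    return int(value > threshold)
```

On this toy data the split separates the two classes perfectly, which real authentication logs would rarely allow; the point is only to show how labeled examples drive the choice of decision boundary.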
Unsupervised Learning Algorithms
Unsupervised learning algorithms identify patterns in data without pre-existing labels. Clustering
algorithms like K-means and density-based spatial clustering of applications with noise
(DBSCAN) are often used to detect anomalies by grouping similar data points and identifying
outliers (Bishop, 2006).
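The way DBSCAN separates outliers from clustered points can be illustrated without running the full clustering procedure. The sketch below applies only the noise rule from DBSCAN: a point is noise if it is neither a core point nor within distance `eps` of a core point. The two features (hour of day and failed attempts before success), the data, and the parameter values are assumptions made for this example, and in practice the features would be scaled before distances are computed.

```python
from math import dist

def dbscan_noise(points, eps, min_pts):
    """Return indices of points that DBSCAN would label as noise.

    A core point has at least `min_pts` points (itself included) within `eps`.
    Noise points are neither core points nor neighbours of a core point.
    """
    neighbours = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    core = {i for i, nbrs in enumerate(neighbours) if len(nbrs) >= min_pts}
    return [
        i for i, nbrs in enumerate(neighbours)
        if i not in core and not any(j in core for j in nbrs)
    ]

# Hypothetical logins as (hour of day, failed attempts before success).
logins = [(9, 0), (9, 1), (10, 0), (10, 1), (11, 0), (3, 14)]

outliers = dbscan_noise(logins, eps=1.5, min_pts=3)
```

Here the five daytime logins form one dense group, and the login at 03:00 preceded by 14 failures is the only point reported as noise. No labels were needed to reach that result, which is the property that makes this family of methods attractive for previously unseen threats.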
Reinforcement Learning
Reinforcement learning is a type of machine learning where an agent learns to make decisions by
taking actions in an environment to maximize some notion of cumulative reward. This approach
can adapt to new data and improve over time, making it suitable for dynamic environments
(Sutton & Barto, 2018).
1. Supervised Learning Algorithms:
Decision Trees: These algorithms split the data into branches based on feature values,
leading to decisions about the data classification.
Strengths: Easy to interpret, handles both numerical and categorical data.
Weaknesses: Prone to overfitting.
Random Forests: An ensemble method that combines multiple decision trees to improve
classification accuracy.
Strengths: Reduces overfitting, handles large datasets.
Weaknesses: Complex and less interpretable than single decision trees.
Support Vector Machines (SVM): Finds the hyperplane that best separates the data into
different classes.
Strengths: Effective in high-dimensional spaces.
Weaknesses: Requires proper tuning of parameters and feature scaling.
Neural Networks: Consist of layers of interconnected nodes that can learn complex patterns.
Strengths: Capable of modeling complex relationships.
Weaknesses: Requires large amounts of data and computational resources.
2. Unsupervised Learning Algorithms:
K-Means Clustering: Partitions the data into K clusters based on feature similarity.
Strengths: Simple and fast.
Weaknesses: Requires specifying the number of clusters, sensitive to initial conditions.
Hierarchical Clustering: Builds a tree of clusters by either merging or splitting them.
Strengths: Does not require specifying the number of clusters in advance.
Weaknesses: Computationally intensive for large datasets.
Autoencoders: Neural networks used for learning efficient codings of input data.
Strengths: Useful for anomaly detection by identifying deviations from normal patterns.
Weaknesses: Requires careful tuning and large amounts of data.
Isolation Forest: Focuses on isolating anomalies instead of profiling normal data.
Strengths: Effective for high-dimensional data.
Weaknesses: Requires appropriate setting of contamination parameter.
3. Semi-Supervised Learning Algorithms: These algorithms utilize both labeled and unlabeled
data to improve learning accuracy, which is particularly useful when labeled data is scarce but
unlabeled data is abundant.
4. Deep Learning Algorithms:
Convolutional Neural Networks (CNN): Primarily used for image data but also applicable
for certain types of anomaly detection.
Recurrent Neural Networks (RNN): Suitable for sequential data like logs, capturing
temporal dependencies.
Model/Algorithm Approach
The model/algorithm approach involves selecting appropriate machine learning algorithms to
detect suspicious activities based on authentication logs from web applications. The choice of
algorithms depends on the nature of the data and the specific requirements of the detection task.
Framework Approach
Detection Framework: The framework approach encompasses a structured process for
implementing and deploying machine learning models to detect suspicious authentication
activities. This includes:
1. Data Collection: Aggregating authentication logs from web applications, which may
include login attempts, IP addresses, timestamps, user agents, and more.
2. Data Preprocessing: Cleaning the data to remove noise and irrelevant information. This
step may involve:
Removing duplicate entries
Handling missing values
Normalizing data
3. Feature Engineering: Extracting relevant features from the logs that can help
distinguish between normal and suspicious activities. Examples include:
Frequency of login attempts
Geographical location of login attempts
Time of day of login attempts
4. Model Training: Using the preprocessed data to train machine learning models. This
step involves:
Splitting the data into training and testing sets
Training multiple models to compare performance
Tuning hyperparameters to optimize model performance
5. Model Evaluation: Assessing the performance of the models using metrics such as:
Accuracy
Precision
Recall
F1-score
Area under the ROC curve (AUC-ROC)
6. Anomaly Detection: Deploying the trained models to identify suspicious activities in
real-time or batch processing. This step involves:
Integrating the models with the web application’s logging system
Setting thresholds for anomaly scores to flag suspicious activities
7. Post-Detection Analysis: Investigating detected anomalies to confirm their legitimacy
and understand potential threats. This may involve manual review by cyber security
analysts.
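The feature engineering step of the framework can be sketched as a simple aggregation over parsed log records. The example below derives four per-user features: attempt count, failure ratio, number of distinct IP addresses, and the fraction of attempts made at night. The record layout, the field names, and the choice of 00:00 to 05:59 as night hours are assumptions for illustration. The count of distinct IP addresses stands in for geographical location, which would require an external IP-to-location lookup not shown here.

```python
from collections import defaultdict
from datetime import datetime

def extract_user_features(records, night_hours=range(0, 6)):
    """Aggregate per-user features from records with timestamp, user, ip, success."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["user"]].append(rec)

    features = {}
    for user, recs in grouped.items():
        attempts = len(recs)
        failures = sum(1 for r in recs if not r["success"])
        night = sum(1 for r in recs if r["timestamp"].hour in night_hours)
        features[user] = {
            "attempts": attempts,
            "failure_ratio": failures / attempts,
            "distinct_ips": len({r["ip"] for r in recs}),
            "night_fraction": night / attempts,
        }
    return features

# Hypothetical parsed records (documentation-range IP addresses).
records = [
    {"timestamp": datetime(2024, 3, 1, 9, 5), "user": "alice",
     "ip": "198.51.100.4", "success": True},
    {"timestamp": datetime(2024, 3, 1, 2, 14), "user": "bob",
     "ip": "203.0.113.7", "success": False},
    {"timestamp": datetime(2024, 3, 1, 2, 15), "user": "bob",
     "ip": "203.0.113.8", "success": False},
    {"timestamp": datetime(2024, 3, 1, 2, 16), "user": "bob",
     "ip": "203.0.113.9", "success": False},
    {"timestamp": datetime(2024, 3, 1, 14, 0), "user": "bob",
     "ip": "198.51.100.9", "success": True},
]

features = extract_user_features(records)
```

The resulting dictionary of numeric features per user is the kind of input that the model training step would consume, after normalization where the chosen algorithm requires it.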
2.2.3 Machine Learning System for Detecting Suspicious Authentication Activities in Web
Application Logs
The system being developed is a machine learning-based detection framework designed to
identify suspicious authentication activities within web application logs. This system leverages
advanced machine learning algorithms to analyze log data, detect patterns, and flag anomalies
that may indicate unauthorized access attempts or other malicious activities.
Importance of the System
In the current digital landscape, web applications are prime targets for cyber attacks, including
unauthorized access, data breaches, and various forms of exploitation. Traditional security
measures, such as rule-based systems and signature-based detection, are often insufficient to
cope with the sophisticated and constantly evolving tactics used by cybercriminals.
The detection of suspicious authentication activities in web applications is crucial for preventing
unauthorized access and safeguarding sensitive data. Such systems are commonly applied in
sectors with high security demands, including finance, healthcare, and e-commerce. Effective
detection systems analyze logs for anomalies, flag potential threats, and trigger security protocols
to mitigate risks (Liu et al., 2018).
The machine learning system offers several key benefits:

Proactive Detection: Unlike traditional methods that rely on predefined rules, machine
learning models can learn from historical data and detect novel threats by identifying
deviations from normal behavior.
Accuracy and Efficiency: By automating the analysis of vast amounts of log data, the
system improves the speed and accuracy of threat detection, reducing the workload on
security personnel.
Adaptability: The system continuously learns and adapts to new patterns, improving its
detection capabilities over time.
Application Areas
This system can be applied across various sectors where web applications are integral to
operations, including:
E-commerce Platforms: To protect customer data and prevent unauthorized access to
accounts.
Financial Services: To secure online banking systems and prevent fraud.
Healthcare Systems: To safeguard sensitive patient information and comply with
regulations such as HIPAA.
Educational Institutions: To protect student and faculty data in online learning platforms.
Government Services: To secure access to citizen services and sensitive information.
System Operation
Data Collection: The system collects authentication logs from web applications, including
login attempts, IP addresses, timestamps, user agents, and other relevant data points.
Data Preprocessing: The collected logs are cleaned and preprocessed to ensure the data is in
a suitable format for analysis. This involves removing duplicates, handling missing values,
and normalizing data.
Feature Engineering: Relevant features are extracted from the logs to help distinguish
between normal and suspicious activities. Features may include login attempt frequency,
geographical location, time of day, and device type.
Model Training: The preprocessed data is used to train machine learning models.
Supervised learning algorithms like Decision Trees, Random Forests, or Support Vector
Machines (SVM) may be employed, depending on the labeled data available. Unsupervised
learning methods such as K-Means Clustering or Isolation Forests can be used when labeled
data is scarce.
Model Evaluation: The trained models are evaluated using metrics such as accuracy,
precision, recall, F1-score, and the area under the ROC curve (AUC-ROC) to ensure they
effectively detect anomalies.
Real-time Monitoring: The system is integrated with the web application’s logging system
to enable real-time monitoring. The models continuously analyze incoming logs, flagging
any suspicious activities based on learned patterns.
Alerting and Reporting: When the system detects an anomaly, it triggers an alert to security
personnel, providing detailed reports on the suspicious activity. This allows for immediate
investigation and response.
Continuous Learning: The system continuously updates its models based on new data,
ensuring that it adapts to emerging threats and maintains high detection accuracy.
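The metrics named in the Model Evaluation step can be computed directly from the four confusion-matrix counts. The sketch below does so for binary labels, where 1 denotes a suspicious event. The label vectors are invented for illustration, and AUC-ROC is omitted because it requires ranked scores rather than hard predictions.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 means suspicious."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def evaluate(y_true, y_pred):
    """Compute accuracy, precision, recall and F1-score, guarding against division by zero."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical ground truth and model output for ten login events.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

metrics = evaluate(y_true, y_pred)
never_alert = evaluate(y_true, [0] * len(y_true))
```

In this toy case the model and a detector that never raises an alert both reach 80% accuracy, yet the latter has zero recall. This is why precision, recall, and F1-score are reported alongside accuracy when suspicious events are rare.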
How the System Works
The proposed system leverages machine learning algorithms to analyze web application logs in
real-time. The process involves data collection, feature extraction, model training, anomaly
detection, and alert generation. The system continuously updates its models to adapt to new
patterns of suspicious behavior, ensuring robust protection against evolving threats (Buczak &
Guven, 2016).
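The feature extraction step described above can be illustrated with a minimal sketch. It assumes each log entry has already been parsed into a dictionary; the field names (user, ip, time, success), the sample records, and the 08:00 to 18:00 business-hours window are hypothetical and are not taken from the case-study logs.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical parsed log records; field names and values are assumptions.
LOGS = [
    {"user": "alice", "ip": "10.0.0.1", "time": "2024-01-05T09:15:00", "success": True},
    {"user": "alice", "ip": "10.0.0.1", "time": "2024-01-05T17:40:00", "success": True},
    {"user": "bob", "ip": "10.0.0.2", "time": "2024-01-05T02:01:00", "success": False},
    {"user": "bob", "ip": "203.0.113.7", "time": "2024-01-05T02:02:00", "success": False},
    {"user": "bob", "ip": "203.0.113.7", "time": "2024-01-05T02:03:00", "success": True},
]


def extract_features(logs, business_hours=(8, 18)):
    """Aggregate raw login events into one feature dictionary per user."""
    per_user = defaultdict(list)
    for event in logs:
        per_user[event["user"]].append(event)

    start, end = business_hours
    features = {}
    for user, events in per_user.items():
        hours = [datetime.fromisoformat(e["time"]).hour for e in events]
        off_hours = sum(1 for h in hours if not (start <= h < end))
        failed = sum(1 for e in events if not e["success"])
        features[user] = {
            "attempts": len(events),                      # login attempt frequency
            "failed_ratio": failed / len(events),         # share of failed attempts
            "distinct_ips": len({e["ip"] for e in events}),
            "off_hours_ratio": off_hours / len(events),   # time-of-day signal
        }
    return features


FEATURES = extract_features(LOGS)
```

Each resulting dictionary is one row of the feature table that a model would be trained on; a real deployment would likely add location and device features as described in the component list.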
2.2.4 Application of Machine Learning to the Case Study
Relevance of Machine Learning
Machine learning algorithms are highly effective in detecting anomalies within large datasets,
such as web application logs. Their ability to learn from historical data and identify subtle
patterns makes them well-suited for identifying suspicious authentication activities that may not
be easily detected through rule-based systems (Cook et al., 2019).
Implementation Strategy
The implementation involves selecting appropriate algorithms based on the nature of the data
and the specific requirements of the web application. Supervised learning models may be trained
using labeled datasets of known normal and suspicious activities. Unsupervised models can be
employed to identify new types of threats by clustering similar activities and highlighting
outliers. Reinforcement learning can further enhance detection capabilities by continuously
adapting to new attack vectors (Goodfellow et al., 2016).
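As a rough illustration of how an unsupervised method highlights outliers without labels, the sketch below scores per-account login counts with a modified z-score based on the median absolute deviation (MAD). This is a deliberately simpler stand-in for the clustering and Isolation Forest methods named in this chapter, not an implementation of them; the counts and the 3.5 threshold are assumptions chosen for illustration.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the indices whose modified z-score exceeds `threshold`."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread in the data, so nothing can be ranked as an outlier
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [i for i, v in enumerate(values) if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical login attempts per account in one hour.
ATTEMPTS = [4, 5, 3, 6, 5, 4, 48, 5]
OUTLIERS = mad_outliers(ATTEMPTS)
```

Here only the account at index 6 (48 attempts) is reported, because its count lies far from the median of the other accounts.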
Relevance of the Chosen Model to the Case Study
1. Proactive Threat Detection: The machine learning model's ability to identify anomalies
proactively is crucial for the case study, as it enables the system to detect suspicious activities
that might not be covered by traditional rule-based security systems. This proactive detection is
vital in protecting web applications from unauthorized access and potential data breaches.
2. Handling Large Volumes of Data: Web applications generate vast amounts of log data,
which can be overwhelming for manual analysis. Machine learning models are well-suited to
process and analyze large datasets efficiently, making them ideal for monitoring web application
logs in real-time. The model can quickly sift through log data to identify patterns and anomalies
that indicate suspicious authentication activities.
3. Adaptability to Evolving Threats: Cyber threats are constantly evolving, with attackers
frequently developing new techniques to bypass security measures. The machine learning
model's adaptability allows it to learn from new data and update its detection capabilities
accordingly. This continuous learning process ensures that the model remains effective in
identifying new and emerging threats.
4. Improved Accuracy and Reduced False Positives: Machine learning models, especially
those utilizing algorithms such as Random Forests or Support Vector Machines (SVM),
can achieve high accuracy in distinguishing between normal and suspicious activities when
trained on representative data. Higher precision reduces the number of false positives, so
that security personnel can focus on genuine threats rather than being overwhelmed by
incorrect alerts.
5. Real-time Monitoring and Alerting: The integration of the machine learning model with the
web application’s logging system enables real-time monitoring of authentication activities. This
real-time capability is essential for promptly detecting and responding to suspicious activities,
minimizing the window of opportunity for attackers.
Application of the Model
1. Data Collection and Preprocessing: The system collects authentication logs from the web
application, capturing details such as login attempts, IP addresses, timestamps, and user agents.
Preprocessing steps include data cleaning, normalization, and handling missing values to ensure
the data is suitable for analysis.
2. Feature Engineering: Relevant features are extracted from the log data to help the model
distinguish between normal and suspicious activities. Features may include login frequency,
geographic location, time of day, device type, and patterns of failed login attempts.
3. Model Training: The collected and preprocessed data is used to train the machine learning
model. Supervised learning algorithms, such as Random Forests or Support Vector Machines
(SVM), are employed if labeled data (normal vs. suspicious activities) is available. Unsupervised
learning methods, like Isolation Forests or K-Means Clustering, are used if labeled data is scarce.
4. Model Evaluation: The model is evaluated using metrics like accuracy, precision, recall, F1-
score, and the area under the ROC curve (AUC-ROC) to ensure its effectiveness in detecting
anomalies. The goal is to achieve a high level of accuracy in identifying suspicious activities
while minimizing false positives.
5. Real-time Monitoring and Detection: Once trained and validated, the model is integrated
with the web application’s logging system for real-time monitoring. The model continuously
analyzes incoming authentication logs, identifying and flagging suspicious activities based on the
learned patterns.
6. Alerting and Reporting: When the model detects an anomaly, it triggers an alert to security
personnel, providing detailed reports on the suspicious activity. These reports include
information on the nature of the anomaly, the affected user accounts, and the specific log entries
that raised the alert.
7. Continuous Learning: The model continuously updates its detection capabilities by learning
from new log data. This ongoing learning process ensures that the model adapts to evolving
threats and maintains its effectiveness over time.
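The evaluation metrics named in step 4 can be computed directly from the confusion-matrix counts. The sketch below is illustrative only: the label vectors are hypothetical (1 = suspicious, 0 = normal) and do not come from the study's data.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels (1 = suspicious)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # suspicious, flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # normal, not flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # normal, flagged (false alarm)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # suspicious, missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical labels: 2 suspicious logins among 10, one missed and one false alarm.
Y_TRUE = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
Y_PRED = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
METRICS = classification_metrics(Y_TRUE, Y_PRED)
```

In this example the accuracy is 0.8 while precision, recall, and F1 are all 0.5, which shows why accuracy alone can be misleading when suspicious logins are rare.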
2.3 Review of Related Literature
2.3.1 Examples of Applications Similar to the Case Study
Fraud Detection Systems
Fraud detection in financial transactions is a well-known application of anomaly detection.
Machine learning algorithms analyze transaction patterns to identify fraudulent activities, such as
unauthorized credit card use or money laundering (Ngai et al., 2011).
Network Intrusion Detection Systems (NIDS)
NIDS monitor network traffic for suspicious activities that may indicate security breaches.
Machine learning models are trained to distinguish between normal and malicious traffic,
enhancing the detection of potential intrusions (Mukherjee et al., 1994).
Healthcare Monitoring Systems
Anomaly detection in healthcare involves monitoring patient data for abnormal patterns that
could indicate health issues. Machine learning algorithms analyze vital signs and other health
metrics to provide early warnings of medical conditions (Chen et al., 2017).
2.3.2 Related Work
Studies Utilizing Supervised Learning
Research by Sultana and Chilamkurti (2016) applied random forests to detect anomalies in
authentication logs, achieving high accuracy in identifying suspicious activities. Similarly,
Eberle and Holder (2007) used support vector machines to classify login attempts, demonstrating
the effectiveness of supervised learning in anomaly detection.
Studies Utilizing Unsupervised Learning
Anomalies in web application logs were detected using K-means clustering in a study by Ahmed
et al. (2016), which successfully identified outliers representing suspicious activities. Another
study by Chandola et al. (2009) utilized DBSCAN to cluster authentication activities,
highlighting the potential of unsupervised learning in detecting unknown threats.
Studies Utilizing Reinforcement Learning
Chen et al. (2020) explored reinforcement learning for cyber security applications, developing
models that adapt to new attack patterns over time. Their study demonstrated the potential of
reinforcement learning to enhance the robustness of anomaly detection systems.
Related Work Using Machine Learning Models for Detecting Suspicious Activities
The use of machine learning models to detect suspicious activities in authentication logs has
been an area of significant research. Various studies have demonstrated the effectiveness of these
models in enhancing security measures by identifying and mitigating unauthorized access
attempts. Notable peer-reviewed studies include the following.
Anomaly Detection in Log Data Using Machine Learning Techniques
Du and Li (2016) explored the use of machine learning techniques for anomaly detection in log
data, focusing on system logs and authentication records. The study implemented several
machine learning algorithms, including Support Vector Machines (SVM), Random Forests, and
Neural Networks, to detect anomalies that could indicate potential security breaches. The results
showed that Random Forests and Neural Networks performed particularly well, achieving high
accuracy and low false-positive rates. This research underlines the applicability of machine
learning models in identifying suspicious activities in various types of log data, including
authentication logs. The study demonstrated that machine learning models could effectively
detect anomalies in log data, providing a robust solution for monitoring and enhancing system
security. The implementation of Random Forests and Neural Networks highlighted the potential
for these models to process large datasets and identify patterns indicative of suspicious activities.
Detecting Malicious Login Attempts Using Machine Learning
Pan et al. (2018) investigated the use of machine learning models to detect malicious login
attempts in authentication systems. The study employed a combination of supervised learning
techniques, such as Logistic Regression and Gradient Boosting Machines (GBM), to analyze
login patterns and identify anomalies. The models were trained on historical login data, including
features like login frequency, geographic location, and time of login. The Gradient Boosting
Machine (GBM) algorithm showed superior performance, with high precision and recall rates in
detecting suspicious login attempts. The research provided evidence that machine learning
models could enhance the detection of malicious login attempts by analyzing login patterns and
identifying deviations from typical behavior. The success of the GBM algorithm in this context
demonstrated its effectiveness in processing complex datasets and providing accurate predictions.
A Study on Anomaly Detection in Authentication Systems Using Machine Learning
Ahmed, Mahmood, and Hu (2016) conducted a comprehensive survey of network anomaly
detection techniques, including those applied to authentication systems. The study reviewed
various machine learning approaches, such as K-Means Clustering, Principal Component
Analysis (PCA), and Isolation Forests, for their effectiveness in identifying anomalies in network
and authentication data. The survey highlighted several case studies where these techniques were
successfully implemented to detect unauthorized access and other suspicious activities. The
survey provided a broad overview of machine learning techniques applicable to anomaly
detection in authentication systems. By reviewing multiple case studies and algorithms, the study
underscored the versatility and effectiveness of machine learning models in enhancing security
measures across different environments.
Enhancing Intrusion Detection Systems Using Machine Learning Techniques
Buczak and Guven (2016) surveyed various data mining and machine learning methods used for
intrusion detection in cyber security. The study covered a range of techniques, including
Decision Trees, Random Forests, and Neural Networks, applied to detect intrusions and
suspicious activities in network and authentication logs. The authors discussed the strengths and
weaknesses of each method, with Random Forests and Neural Networks showing high efficacy
in detecting complex and subtle anomalies. The survey highlighted the importance of machine
learning models in modern intrusion detection systems. It provided valuable insights into the
capabilities of different algorithms, emphasizing the potential for machine learning to
significantly improve the accuracy and reliability of detecting suspicious activities in various
security contexts.
Machine Learning for User Authentication Based on Behavioral Biometrics
Eberz et al. (2017) evaluated the use of behavioral biometrics for continuous user authentication.
The study applied machine learning algorithms to behavioral data, such as keystroke dynamics
and mouse movements, to continuously authenticate users. The researchers compared the
performance of various models, including Decision Trees and Support Vector Machines (SVM),
finding that these models could accurately differentiate between legitimate users and intruders.
This study highlighted the potential of machine learning models to enhance user authentication
through continuous monitoring of behavioral biometrics. The findings demonstrated that
behavioral data could provide a reliable basis for detecting suspicious activities and improving
overall security.
Detecting Anomalous Login Activities Using Deep Learning
Le, Hoang, and Luu (2019) explored the application of deep learning techniques to detect
anomalous login activities. The study developed a deep learning model using Long Short-Term
Memory (LSTM) networks to analyze sequences of login events and identify anomalies. The
model was trained on a large dataset of login records and demonstrated high accuracy in
detecting suspicious login attempts.
The research showcased the effectiveness of deep learning models, specifically LSTM networks,
in processing sequential data and identifying anomalies in login activities. The study's success
underscored the potential for deep learning techniques to enhance security measures in
authentication systems.
Identifying Unauthorized Access Using Machine Learning
Yavanoglu and Aydos (2017) reviewed various cyber security datasets used for training machine
learning algorithms to detect unauthorized access. The study highlighted the importance of
diverse and representative datasets in developing effective machine learning models. The
researchers discussed several public datasets and their applicability to different types of cyber
security problems, including authentication systems. The review provided valuable insights into
the availability and characteristics of cyber security datasets, emphasizing their role in training
robust machine learning models. The discussion of dataset selection and preparation underscored
the importance of data quality in developing effective security solutions.
Machine Learning Approaches to Fraud Detection in Online Systems
Ngai et al. (2011) reviewed the application of data mining and machine learning techniques in
financial fraud detection. The study categorized various machine learning approaches, including
Neural Networks, Decision Trees, and Support Vector Machines (SVM), used to detect
fraudulent activities in online systems. The authors discussed the strengths and limitations of
each approach and provided examples of successful implementations. The review highlighted the
versatility of machine learning techniques in detecting fraudulent activities across different
domains, including financial systems. The discussion of various algorithms and their applications
provided a comprehensive overview of the state-of-the-art in fraud detection.
Improving Security in Authentication Systems with Machine Learning
Zhang and Liu (2020) examined the use of machine learning approaches to improve security in
authentication systems. The study implemented and compared several algorithms, including
Logistic Regression, Random Forests, and Neural Networks, to detect suspicious login attempts
and enhance authentication processes. The results indicated that machine learning models could
significantly improve the accuracy and reliability of authentication systems. The research
demonstrated that machine learning models could be effectively integrated into authentication
systems to detect and prevent unauthorized access. The comparative analysis of different
algorithms provided insights into the most effective approaches for enhancing security.
Machine Learning for Real-Time Detection of Suspicious Login Activities
Kim and Kim (2017) explored machine learning approaches for real-time anomaly detection in
streaming data, focusing on authentication logs. The study developed a real-time detection
system using machine learning models, such as Online SVM and Adaptive Random Forests, to
continuously monitor login activities and identify anomalies. The system demonstrated high
performance in detecting suspicious login attempts with minimal latency. The study highlighted
the importance of real-time detection in enhancing security measures for authentication systems.
The implementation of Online SVM and Adaptive Random Forests demonstrated the feasibility
of continuous monitoring and rapid response to potential security threats.
2.4 Chapter Summary
This chapter reviewed the key concepts, definitions, and machine learning algorithms relevant to
detecting suspicious authentication activities in web application logs. It highlighted the
importance of machine learning in cyber security and discussed the theoretical frameworks and
practical applications of these algorithms. The literature review provided insights into related
work and examples of similar applications, setting the stage for the subsequent chapters that will
detail the methodology, results, and conclusions of the study.
References for Chapter Two
Alpaydin, E. (2020). Introduction to Machine Learning. MIT Press.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
Buczak, A. L., & Guven, E. (2016). A survey of data mining and machine learning methods for
cyber security intrusion detection. IEEE Communications Surveys & Tutorials, 18(2), 1153-1176.
Chandola, V., Banerjee, A., & Kumar, V. (2009). Anomaly detection: A survey. ACM
Computing Surveys (CSUR), 41(3), 1-58.
Chen, J., Li, K., Li, K., & Xie, Y. (2017). A review of anomaly detection methods in networks.
IEEE Access, 5, 1397-1410.
Chen, T., Xu, H., & He, Y. (2020). Reinforcement learning for cyber-physical systems security.
Journal of cyber security, 6(1), tyaa011.
Cook, D. J., Feuz, K. D., & Krishnan, N. C. (2019). Transfer learning for activity recognition: A
survey. Knowledge and Information Systems, 36(3), 537-556.
Eberle, W., & Holder, L. (2007). Discovering anomalies in data through multigraph analysis.
Journal of Information and Data Management, 1(1), 30-54.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
Goel, S., & Sharma, R. (2017). Analyzing user activity logs for monitoring malicious behavior.
Journal of Information Security and Applications, 35, 12-22.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. Springer.
Kumar, A., & Raj, P. (2018). Predictive Analytics in cyber security: Machine Learning and Data
Mining Approaches. Springer.
Liu, L., Zhang, D., & Li, Y. (2018). Detection and defense of web application vulnerabilities
using machine learning. Security and Communication Networks, 2018.
Mukherjee, B., Heberlein, L. T., & Levitt, K. N. (1994). Network intrusion detection. IEEE
Network, 8(3), 26-41.
Ngai, E. W. T., Hu, Y., Wong, Y. H., Chen, Y., & Sun, X. (2011). The application of data
mining techniques in financial fraud detection: A classification framework and an academic
review of literature. Decision Support Systems, 50(3), 559-569.
Sultana, S., & Chilamkurti, N. (2016). Survey on machine learning techniques for network
anomaly detection. Journal of Network and Computer Applications, 60, 19-31.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.
Wang, W., Lu, Z., Qin, L., & Wang, J. (2019). A survey on the security of blockchain systems.
Future Generation Computer Systems, 105, 287-302.
Xia, Y., Wang, X., & Zhang
Du, M., & Li, F. (2016). Anomaly detection in log data using machine learning techniques.
Proceedings of the 15th IEEE International Conference on Trust, Security and Privacy in
Computing and Communications (TrustCom).
Pan, S., Li, Y., Sun, W., & Wei, X. (2018). Detecting malicious login attempts using machine
learning. IEEE Access, 6, 42292-42301.
Eberz, S., Rasmussen, K. B., Lenders, V., & Martinovic, I. (2017). Evaluating behavioral
biometrics for continuous authentication: Challenges and metrics. Proceedings of the 2017 ACM
on Asia Conference on Computer and Communications Security (ASIACCS).
Le, T., Hoang, D., & Luu, C. (2019). Detecting anomalous login activities using deep learning.
IEEE Transactions on Information Forensics and Security, 14(6), 1454-1463.
Yavanoglu, U., & Aydos, M. (2017). A review on cyber security datasets for machine learning
algorithms. Proceedings of the 2017 IEEE 21st International Conference on Computer
Supported Cooperative Work in Design (CSCWD).
Zhang, Y., & Liu, X. (2020). Machine learning approaches to improve security in authentication
systems. IEEE Access, 8, 20103-20113.
Kim, J., & Kim, H. (2017). Machine learning approaches to real-time anomaly detection for
streaming data. IEEE Transactions on Cybernetics, 47(3), 846-858.
Axelsson, S. (2000). The base-rate fallacy and its implications for the difficulty of intrusion
detection. Proceedings of the 6th ACM Conference on Computer and Communications Security,
1-7.
Greenberg, A. (2018). The untold story of NotPetya, the most devastating cyberattack in history.
Wired.
Kim, J., Kim, J., Cho, S., & Kim, J. H. (2017). Zero-day malware detection using transferred
generative adversarial networks based on deep autoencoders. Information Sciences, 433-434,
281-304.
O'Reilly, T. (2005). What Is Web 2.0. O'Reilly Media.
Spafford, E. H. (1989). The internet worm program: An analysis. ACM SIGCOMM Computer
Communication Review, 19(1), 17-57.
Sommer, R., & Paxson, V. (2010). Outside the closed world: On using machine learning for
network intrusion detection. IEEE Symposium on Security and Privacy, 305-316.
Sullivan, B. (2007). SQL Injection Attack Ends in Bank Heist. MSNBC.
W3C. (2014). HTML5. W3C Recommendation.

CHAPTER THREE
SYSTEM ANALYSIS AND DESIGN
3.1 Introduction
This chapter provides a detailed analysis of both the existing and proposed systems. It covers the
methodologies used, system analysis, requirement gathering, and system design. The chapter
aims to highlight the advantages and disadvantages of the current system and propose a more
efficient system using machine learning algorithms to detect suspicious authentication activities.
3.2 Methodology
Research methodology is the specific set of procedures or techniques used to identify, select,
process, and analyze information about a topic. In a research paper, this section allows the
reader to critically evaluate a study's overall validity and reliability. The following
methodologies were considered for this study:
3.2.1 Rapid Application Development Methodology

Rapid application development (RAD), also called rapid-application building (RAB), is
both a general term used to refer to adaptive software development approaches and the
name of James Martin's approach to rapid development. In general, RAD approaches to
software development put less emphasis on planning and more emphasis on an adaptive
process. Prototypes are often used in addition to, or sometimes even in place of, design
specifications.

RAD is especially well suited to (although not limited to) developing software that is driven
by user interface requirements. Graphical user interface builders are often called rapid
application development tools.
3.2.2 Agile Development Methodology

Agile software development is an approach to software development under which
requirements and solutions evolve through the collaborative effort of self-organizing and
cross-functional teams and their customers or end users. It advocates adaptive planning,
evolutionary development, early delivery, and continual improvement, and it encourages rapid
and flexible response to change.
There is anecdotal evidence that adopting agile practices and values improves the agility of
software professionals, teams, and organizations; however, empirical studies have found no
evidence of this.
3.2.3 Waterfall Methodology

The Waterfall model is a breakdown of project activities into linear sequential phases, where
each phase depends on the deliverables of the previous one and corresponds to a specialization
of tasks. The approach is typical for certain areas of engineering design. In software
development, it tends to be among the less iterative and flexible approaches, as progress flows
largely in one direction ("downwards", like a waterfall) through the phases of conception,
initiation, analysis, design, construction, testing, deployment, and maintenance.
The waterfall development model originated in the manufacturing and construction industries,
where the highly structured physical environments meant that design changes became
prohibitively expensive much sooner in the development process. When it was first adopted for
software development, there were no recognized alternatives for knowledge-based creative
work.
3.2.4 Adopted Methodology

In order to achieve the aim and objectives of this study, the Agile development methodology was
adopted, as it is well suited to developing software that requires rapid and flexible response to
change and is consistent with adaptive planning.
Fig 3.1: Agile Methodology (DevTeam.Space)
3.3 System Analysis
3.3.1 Analysis of the Existing System
To analyze the existing system for detecting suspicious authentication activities in web
application logs, we will consider a typical system used in many organizations: a rule-based
intrusion detection system (IDS). This system primarily relies on predefined rules to flag
potential security threats based on specific patterns found in authentication logs.
Overview of the Existing System
The rule-based IDS in use is designed to monitor and analyze authentication logs to detect
suspicious activities such as failed login attempts, unusual login times, and access from
unfamiliar IP addresses. The system consists of the following key components:
1. Log Collection: Authentication logs are collected from various sources, including web
servers, application servers, and authentication servers.
2. Log Parsing: The collected logs are parsed to extract relevant information such as user
IDs, timestamps, IP addresses, and login status.
3. Rule Engine: The core component of the system where predefined rules are applied to
the parsed log data to identify suspicious activities.
4. Alert Generation: When a rule is triggered, an alert is generated and sent to the security
team for further investigation.
5. Manual Review: Security analysts manually review the alerts to determine the validity of
the detected threats and take appropriate actions.
Diagram 2: Existing System Architecture
Processes Involved in the Existing System
The existing rule-based IDS follows these steps to detect suspicious authentication activities:
1. Log Collection
Authentication logs are continuously collected from different sources and stored in a
central log repository.
Example: Web server logs, which include user login attempts, timestamps, and IP
addresses, are aggregated for analysis.
2. Log Parsing
The collected logs are parsed to extract meaningful information.
Example: Parsing logs to identify user IDs, timestamps of login attempts, and login
statuses (successful or failed).
3. Rule Application
Predefined rules are applied to the parsed log data to detect anomalies.
Example: A rule might be set to flag more than five failed login attempts within a
five-minute window as suspicious.
4. Alert Generation
If a rule is triggered, an alert is generated.
Example: If the number of failed login attempts exceeds the threshold, an alert is sent
to the security team.
5. Manual Review
Security analysts review the generated alerts to confirm the presence of suspicious
activities.
Example: Analysts check the context of the alerts, such as whether the IP address is
known to be problematic or if the login attempts are from a legitimate user
experiencing issues.
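The rule used as the example in step 3 (more than five failed login attempts within a five-minute window) can be sketched as a sliding-window check. This is a minimal illustration, not the rule engine of any particular IDS product; the event tuple layout (user, timestamp, success) and the sample users are assumptions.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def flag_failed_logins(events, max_failures=5, window=timedelta(minutes=5)):
    """Return users with more than `max_failures` failed logins inside any `window`."""
    recent = defaultdict(deque)  # user -> timestamps of failures still inside the window
    flagged = set()
    for user, timestamp, success in sorted(events, key=lambda e: e[1]):
        if success:
            continue
        failures = recent[user]
        failures.append(timestamp)
        # Discard failures that are older than the sliding window.
        while timestamp - failures[0] > window:
            failures.popleft()
        if len(failures) > max_failures:
            flagged.add(user)
    return flagged


# Hypothetical events: "mallory" fails six times in under two minutes, "carol" fails
# six times but two minutes apart, and "alice" mistypes her password once.
BASE = datetime(2024, 1, 5, 2, 0, 0)
EVENTS = (
    [("mallory", BASE + timedelta(seconds=20 * i), False) for i in range(6)]
    + [("carol", BASE + timedelta(minutes=2 * i), False) for i in range(6)]
    + [("alice", BASE, False), ("alice", BASE + timedelta(seconds=30), True)]
)
FLAGGED = flag_failed_logins(EVENTS)
```

Only "mallory" is flagged here. An attacker who spaces attempts out, as "carol" does, stays under the fixed threshold, which illustrates the rigidity of predefined rules.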
GOMBOL Corporation uses a rule-based IDS to monitor authentication activities across its web
applications. The system employs several predefined rules to detect anomalies, including:
1. Multiple Failed Login Attempts: Flags more than five failed login attempts within a
five-minute window.
2. Unusual Login Times: Flags logins during unusual hours (e.g., late at night) for users
who typically log in during business hours.
3. Geographical Anomalies: Flags logins from IP addresses located in regions where the
user has never logged in from before.
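Rules 2 and 3 compare a new login against the same user's history. The sketch below shows one possible way to express them; the record fields (user, hour, region), the region codes, and the 08:00 to 18:00 definition of business hours are hypothetical and are not GOMBOL Corporation's actual configuration.

```python
def check_login(login, history, business_hours=(8, 18)):
    """Return the names of the rules triggered by a single login event."""
    past = history.get(login["user"], [])
    start, end = business_hours
    triggered = []

    # Rule 2: off-hours login by a user who has so far logged in only during business hours.
    usually_business = bool(past) and all(start <= p["hour"] < end for p in past)
    if usually_business and not (start <= login["hour"] < end):
        triggered.append("unusual_login_time")

    # Rule 3: login from a region that never appears in the user's history.
    if past and login["region"] not in {p["region"] for p in past}:
        triggered.append("geographical_anomaly")

    return triggered


# Hypothetical history and logins.
HISTORY = {"alice": [{"hour": 9, "region": "NG"}, {"hour": 14, "region": "NG"}]}
NIGHT_ABROAD = check_login({"user": "alice", "hour": 2, "region": "BR"}, HISTORY)
USUAL = check_login({"user": "alice", "hour": 10, "region": "NG"}, HISTORY)
```

Because the rules look only at the hour and the region, a legitimate user who travels or works late triggers them in exactly the same way as an attacker would.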
Despite the simplicity of this approach, GOMBOL Corporation has encountered several
challenges with the existing system:
• High False Positives: The system frequently generates alerts for legitimate activities,
such as users forgetting their passwords or traveling to different regions.
• Inability to Detect New Threats: The system struggles to detect sophisticated attacks
that do not fit the predefined rules.
• Resource Intensive: Security analysts spend a significant amount of time reviewing and
validating alerts, leading to fatigue and potential oversight of genuine threats.
GOMBOL Corporation recognizes the need to enhance its detection capabilities and
reduce the burden on its security team. The limitations of the existing rule-based IDS
highlight the necessity for a more advanced system that can adapt to evolving threats
and minimize false positives.
By analyzing the shortcomings of the existing system, we can identify the key areas
for improvement and design a more effective solution using machine learning
algorithms. This will be detailed in the subsequent sections, where the proposed
system is discussed.
3.3.2 Advantages and Disadvantages of the Existing System
Advantages:
1. Simplicity: The rule-based approach is straightforward and easy to implement.
2. Immediate Alerts: Provides real-time alerts for known suspicious patterns.
3. Low Cost: Minimal resources are required to set up and maintain the system.
Disadvantages:
1. High False Positive Rate: The system often generates a large number of false positives
due to the rigid nature of rules.
2. Inflexibility: It is challenging to adapt the system to new types of threats that do not
match predefined rules.
3. Manual Effort: Requires significant manual effort to review and validate alerts.
4. Scalability Issues: As the number of logs and rules increases, the system's performance
can degrade.
3.3.3 Analysis of the Proposed System
In this section, we analyze the proposed system designed to enhance the detection of
suspicious authentication activities in web application logs using machine learning
algorithms. The proposed system leverages supervised machine learning techniques to
improve the accuracy and efficiency of anomaly detection. We will specifically focus on
two supervised ML techniques: Support Vector Machine (SVM) and Maximum Entropy
(MaxEnt).
Overview of the Proposed System
The proposed system architecture consists of several key components:
1. Data Collection: Logs are collected from web servers, application servers, and other
relevant sources.
2. Data Preprocessing: Collected logs are cleaned and preprocessed to extract relevant
features.
3. Feature Engineering: Important features are selected and engineered to enhance the
model's performance.
4. Machine Learning Engine: SVM and MaxEnt models are trained and applied to detect
suspicious activities.
5. Alert Generation: Alerts are generated based on the predictions of the ML models.
6. User Interface: A dashboard for security analysts to review alerts and system
performance.
Figure 4: Proposed System Architecture (user input, website dataset, feature selection, feature
scraping from user input, prediction model (ensemble), result)

Supervised Machine Learning Techniques
Figure 5: Supervised Machine Learning Model
Support Vector Machine (SVM)
Support Vector Machine (SVM) is a supervised machine learning algorithm widely used
for classification tasks, including malicious web page detection. SVM works by finding
the hyperplane that best separates the data points of different classes. The main
characteristics of SVM include:
• Kernel Trick: SVM can use different kernel functions (linear, polynomial, radial basis
function) to transform the input data into a higher-dimensional space where it is easier to
separate the classes.
• Margin Maximization: SVM aims to maximize the margin between the separating
hyperplane and the nearest data points from each class, known as support vectors.
• Regularization: SVM includes a regularization parameter (C) to control the trade-off
between achieving a low error on the training data and minimizing the model's
complexity.
In the context of detecting suspicious authentication activities, SVM can be used to
classify login attempts as either normal or suspicious based on features extracted from
authentication logs.
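As an illustration of margin maximization and the role of C, the sketch below trains a linear soft-margin SVM by subgradient descent on the hinge loss. It is a simplified stand-in for a library implementation, and the two features (a scaled failed-attempt count and a scaled "unusual hour" score), the toy data, and the learning-rate settings are all assumptions made for the example.

```python
import random

def train_linear_svm(X, y, C=1.0, lr=0.05, epochs=200, seed=0):
    """Approximately minimize 0.5*||w||^2 + C * sum of hinge losses.

    Uses per-sample subgradient steps; labels must be -1 (normal) or +1 (suspicious).
    """
    rng = random.Random(seed)
    n = len(X)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(n))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            score = sum(wj * xj for wj, xj in zip(w, X[i])) + b
            violated = y[i] * score < 1.0  # sample lies inside the margin
            for j in range(len(w)):
                grad = w[j] / n  # share of the regularization term
                if violated:
                    grad -= C * y[i] * X[i][j]  # hinge-loss subgradient
                w[j] -= lr * grad
            if violated:
                b += lr * C * y[i]
    return w, b

def predict(w, b, x):
    """Return +1 (suspicious) or -1 (normal) for feature vector x."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Toy, linearly separable data (assumed): [failed-attempt score, unusual-hour score].
X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.2, 0.1],
     [0.9, 1.0], [1.0, 0.8], [0.8, 0.9], [1.0, 1.0]]
y = [-1, -1, -1, -1, 1, 1, 1, 1]

w, b = train_linear_svm(X, y)
print(predict(w, b, [0.95, 0.9]), predict(w, b, [0.05, 0.0]))
```

A larger C penalizes margin violations more heavily, while a smaller C favors a wider margin at the cost of more training errors.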
Maximum Entropy (MaxEnt)
Maximum Entropy (MaxEnt) is another supervised machine learning technique, often
used in natural language processing and document classification. MaxEnt, which is
equivalent to multinomial logistic regression, is based on the principle of maximum
entropy: among all probability distributions consistent with the observed data, it selects
the one that makes the fewest additional assumptions.
• Probability Estimation: MaxEnt models estimate the probability of each class given the
input features, making it suitable for classification tasks.
• Feature Weights: The model assigns weights to each feature, indicating its importance
in predicting the class label.
• Flexibility: MaxEnt can handle a variety of feature types, including binary, categorical,
and continuous features.
Although MaxEnt has not been widely used for malicious web page detection, it has
shown promising results in related areas such as document and web page classification.
In our proposed system, MaxEnt can be employed to classify login attempts, leveraging
its ability to handle different types of features effectively.
Implementation of the Proposed System
1. Data Collection and Preprocessing
o Logs are collected from various sources and stored in a centralized repository.
o Preprocessing steps include data cleaning, normalization, and feature extraction.
2. Feature Engineering
o Relevant features are engineered from the authentication logs, such as login time,
IP address, user agent, and number of failed login attempts.
o Feature selection techniques are applied to identify the most important features
for the classification task.
3. Training the ML Models
o The dataset is split into training and test sets.
o SVM and MaxEnt models are trained using the training data, with hyperparameter
tuning performed to optimize model performance.
o Cross-validation is used to evaluate the models and prevent overfitting.

4. Anomaly Detection and Alert Generation
o The trained models are deployed to classify new login attempts in real-time.
o Anomalous activities are flagged, and alerts are generated for further investigation
by security analysts.
5. User Interface
o A user-friendly dashboard is developed to display alerts and allow analysts to
review and manage suspicious activities.
o The dashboard includes features such as alert filtering, detailed log views, and
trend analysis.
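The feature engineering and alert generation steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the record fields (`timestamp`, `user`, `ip`, `user_agent`, `success`), the three derived features, the 0.5 threshold, and the `stand_in_score` function (a placeholder for a trained SVM or MaxEnt model) are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime

def extract_features(log_records):
    """Build one feature dict per login record (records assumed sorted by time)."""
    failed_by_ip = defaultdict(int)
    agents_by_user = defaultdict(set)
    rows = []
    for rec in log_records:
        ts = datetime.fromisoformat(rec["timestamp"])
        rows.append({
            "hour": ts.hour,
            "prior_failures_from_ip": failed_by_ip[rec["ip"]],
            # 1 if this user has not been seen with this user agent before
            # (always 1 for a user's first record).
            "new_user_agent": int(rec["user_agent"] not in agents_by_user[rec["user"]]),
        })
        if not rec["success"]:
            failed_by_ip[rec["ip"]] += 1
        agents_by_user[rec["user"]].add(rec["user_agent"])
    return rows

def generate_alerts(log_records, features, score_fn, threshold=0.5):
    """Return the records whose suspicion score reaches the threshold."""
    return [rec for rec, f in zip(log_records, features) if score_fn(f) >= threshold]

def stand_in_score(f):
    """Placeholder for a trained model: scales prior failures into [0, 1]."""
    return min(1.0, f["prior_failures_from_ip"] / 3)

# Illustrative records (assumed data).
logs = [
    {"timestamp": "2024-01-01T02:10:00", "user": "bob", "ip": "203.0.113.9", "user_agent": "curl", "success": False},
    {"timestamp": "2024-01-01T02:10:05", "user": "bob", "ip": "203.0.113.9", "user_agent": "curl", "success": False},
    {"timestamp": "2024-01-01T02:10:09", "user": "bob", "ip": "203.0.113.9", "user_agent": "curl", "success": False},
    {"timestamp": "2024-01-01T02:10:12", "user": "bob", "ip": "203.0.113.9", "user_agent": "curl", "success": True},
    {"timestamp": "2024-01-01T09:00:00", "user": "alice", "ip": "10.0.0.5", "user_agent": "Firefox", "success": True},
]

features = extract_features(logs)
alerts = generate_alerts(logs, features, stand_in_score)
for alert in alerts:
    print(alert["timestamp"], alert["user"], alert["ip"])
```

In a deployed system the placeholder score would be replaced by the probability or decision value produced by the trained model, and the threshold would be tuned against the acceptable false positive rate.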
Diagram 4: Proposed System Architecture
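The cross-validation mentioned in the model training step can be sketched as a k-fold index split. The function name `k_fold_indices`, the choice of k = 5, and the fixed seed are assumptions for illustration; a library routine could be used instead.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal, disjoint folds
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

splits = list(k_fold_indices(10, k=5))
for train, validation in splits:
    print(len(train), len(validation))
```

Each sample is used for validation exactly once, so averaging the per-fold scores gives a performance estimate that depends less on any single split.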
Benefits of the Proposed System
The proposed system offers several advantages over the existing rule-based intrusion
detection system:
• Improved Accuracy: Machine learning models can capture complex patterns and
relationships in the data, leading to more accurate detection of suspicious activities.
• Reduced False Positives: By learning from historical data, the models can better
distinguish between legitimate and malicious activities, reducing the number of false
alerts.
• Scalability: The system can handle large volumes of data and adapt to new types of
threats as they emerge.
• Automated Detection: The system reduces the need for manual monitoring, allowing
security analysts to focus on investigating genuine threats.
3.3.4 Requirement Gathering
Requirements for the proposed system were gathered through stakeholder interviews, surveys,
and document analysis. Both functional and non-functional requirements were identified.
3.4 System Models

In this section, we provide an overview of the supervised machine learning techniques used in
this study. The first technique is Support Vector Machine (SVM), which has been widely used in
the literature for malicious web page detection. The second technique is Maximum Entropy
(MaxEnt), which has not been used for malicious web page detection before but has shown very
good results for document and web page classification. The third technique is Extreme Learning
Machine (ELM), known for its high learning speed but not previously applied to web page
classification. SVM has been reported as one of the best binary classification methods, producing
superior results compared to other models such as Logistic Regression (LR), Bayes Network,
Neural Network, Naive Bayes, K-Nearest Neighbors, K-Means, and Affinity Propagation. We
intend to experiment with MaxEnt and ELM due to their efficiencies and similar uses in previous
studies.
Support Vector Machines (SVM)
Support Vector Machine (SVM) is one of the most widely used data classification techniques for
binary classification of high-dimensional data. Introduced by Boser et al. and later refined by
Cortes and Vapnik, SVM aims to find the optimal margin between training patterns and the
decision boundary on separable data. The main target of the SVM model is to determine an
optimal hyperplane that separates examples of different classes for given training data points.
The decision hyperplane is constructed by maximizing the distance of the hyperplane from the
nearest examples of different classes, known as support vectors.
The SVM model can be formulated as follows:
$$\text{minimize} \quad \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i$$
subject to:
$$y_i (w^T \phi(x_i) + b) \geq 1 - \xi_i, \qquad \xi_i \geq 0$$
where $(x_i, y_i)$, $i = 1, \ldots, l$, are the training samples, $x_i \in \mathbb{R}^n$ and
$y_i \in \{-1, 1\}$, $\xi_i$ are slack variables, $C$ is the penalty parameter,
and $\phi(x_i)$ is the feature mapping associated with the kernel function, so that
$K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$.
Kernel Functions: SVM can utilize various kernel functions to handle non-separable patterns by
mapping the input data into a higher-dimensional space. Commonly used kernels include Linear
and Radial Basis Function (RBF):
• Linear Kernel: $K(x_i, x_j) = x_i^T x_j$
• Radial Basis Function (RBF) Kernel: $K(x_i, x_j) = \exp(-\gamma \| x_i - x_j \|^2), \quad \gamma > 0$
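The two kernels can be written directly from their definitions. The following is a small sketch in plain Python; the function names and the default value of gamma are assumptions for illustration.

```python
import math

def linear_kernel(xi, xj):
    """K(xi, xj) = xi^T xj."""
    return sum(a * b for a, b in zip(xi, xj))

def rbf_kernel(xi, xj, gamma=0.5):
    """K(xi, xj) = exp(-gamma * ||xi - xj||^2), with gamma > 0."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    squared_distance = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-gamma * squared_distance)

print(linear_kernel([1.0, 2.0], [3.0, 4.0]))
print(rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5))
```

The RBF kernel equals 1 for identical inputs and decays toward 0 as the inputs move apart, with gamma controlling how quickly the similarity falls off.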
Maximum Entropy (MaxEnt)
Maximum Entropy (MaxEnt), also known as logistic regression, is a statistical classification
modeling technique introduced by Berger et al. MaxEnt models the conditional distribution of
classes given the input features and is widely used for text categorization and document
classification.
Figure 2: Maximum Entropy Model
The probabilistic distribution in a MaxEnt model is given by:
$$p(c|d) = \frac{1}{Z(d)} \exp \left( \sum_{i=1}^{n} \alpha_i f_i(d, c) \right)$$
where $Z(d)$ is the partition function ensuring normalization:
$$Z(d) = \sum_{c} \exp \left( \sum_{i=1}^{n} \alpha_i f_i(d, c) \right)$$
In these equations, $c$ represents the class type, $d$ represents the document, $\alpha_i$ are
the feature weights, and $f_i(d, c)$ indicates the impact of feature $i$ on class $c$.
Various estimation algorithms can be used for learning these weights, such as Limited-memory
Broyden-Fletcher-Goldfarb-Shanno (L-BFGS), Orthant-Wise Limited-memory Quasi-Newton (OWL-QN), or
Stochastic Gradient Descent (SGD).
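The MaxEnt distribution above can be evaluated directly once the weights are known. The sketch below computes p(c|d) for a single login attempt; the two indicator features, their weights, and the class names are hypothetical values chosen for illustration, not learned parameters.

```python
import math

def maxent_probabilities(feature_values, weights):
    """Compute p(c|d) from f_i(d, c) and alpha_i.

    feature_values maps each class c to the list [f_1(d, c), ..., f_n(d, c)];
    weights is the list [alpha_1, ..., alpha_n].
    """
    scores = {c: sum(a * f for a, f in zip(weights, fs))
              for c, fs in feature_values.items()}
    shift = max(scores.values())  # subtracting the max avoids overflow in exp
    exp_scores = {c: math.exp(s - shift) for c, s in scores.items()}
    z = sum(exp_scores.values())  # partition function Z(d), up to the shift
    return {c: e / z for c, e in exp_scores.items()}

# Hypothetical features: f_1 fires for (many failed attempts, suspicious),
# f_2 fires for (known device, normal). This attempt has many failures
# and comes from an unknown device.
weights = [1.2, 0.8]
feature_values = {"suspicious": [1, 0], "normal": [0, 0]}

probabilities = maxent_probabilities(feature_values, weights)
print(probabilities)
```

Subtracting the largest score before exponentiating leaves the probabilities unchanged because the same factor cancels in the numerator and in Z(d).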
Extreme Learning Machine (ELM)
Extreme Learning Machine (ELM) is a learning algorithm for single-hidden layer feed-forward
neural networks (SLFN), designed to address the slowness of gradient-based training algorithms.
ELM selects input weights randomly and analytically determines the output weights to achieve
high generalization performance with extremely fast learning speed.
The output of an SLFN with $L$ hidden nodes can be represented as:
$$f_L(x) = \sum_{i=1}^{L} \beta_i G(a_i, b_i, x), \quad x \in \mathbb{R}^n, \quad a_i \in \mathbb{R}^n$$
where $a_i$ and $b_i$ are the learning parameters of the hidden nodes, $\beta_i$ is the
connection weight between the $i$-th hidden node and the output node, and $G(a_i, b_i, x)$
is the output of the $i$-th hidden node with the input $x$. The additive hidden node based
on the activation function $g(x)$ is:
$$G(a_i, b_i, x) = g(a_i \cdot x + b_i), \quad b_i \in \mathbb{R}$$
For training samples $\{ (x_j, t_j) \}_{j=1}^{N} \subset \mathbb{R}^n \times \mathbb{R}^m$, the output of the network matches the targets:
$$\sum_{i=1}^{L} \beta_i g(a_i \cdot x_j + b_i) = t_j, \quad j = 1, \ldots, N$$
This can be written in matrix form as:
$$H \beta = T$$
where $H$ is the hidden layer output matrix, $\beta$ is the output weight vector, and $T$ is the
target matrix.
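The ELM training procedure (random hidden-node parameters, then an analytic solve of Hβ = T) can be sketched as below. To stay dependency-free, this sketch makes a simplifying assumption that is not part of the general method: it uses as many hidden nodes as training samples (L = N), so H is square and can be solved by Gaussian elimination, whereas ELM in general uses the Moore-Penrose pseudoinverse of H. The sigmoid activation, the weight range, and the XOR-style toy data are also assumptions.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def hidden_output(a_i, b_i, x):
    """G(a_i, b_i, x) = g(a_i . x + b_i) with a sigmoid activation g."""
    return sigmoid(sum(aw * xv for aw, xv in zip(a_i, x)) + b_i)

def train_elm(X, T, seed=0):
    """Draw random (a_i, b_i), build H, and solve H beta = T (assumes L = N)."""
    rng = random.Random(seed)
    L, n = len(X), len(X[0])
    a = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(L)]
    b = [rng.uniform(-1, 1) for _ in range(L)]
    H = [[hidden_output(a[i], b[i], x) for i in range(L)] for x in X]
    beta = solve_linear_system(H, T)
    return a, b, beta

def predict_elm(model, x):
    a, b, beta = model
    return sum(beta_i * hidden_output(a_i, b_i, x)
               for a_i, b_i, beta_i in zip(a, b, beta))

# Toy data (assumed): an XOR pattern that no linear model can fit.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
T = [0.0, 1.0, 1.0, 0.0]

model = train_elm(X, T)
predictions = [predict_elm(model, x) for x in X]
print([round(p, 6) for p in predictions])
```

Only the output weights are fitted; the hidden-node parameters stay at their random initial values, which is what gives ELM its fast training compared with gradient-based methods.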
By employing these three machine learning models (SVM, MaxEnt, and ELM), our study aims
to leverage their unique strengths to improve the detection of suspicious authentication activities
in web application logs.
3.4.2 Data Models
Figure 3: Machine Learning Model

3.5 System Flowchart
[Flowchart (diagram not reproduced). Node labels recovered: Start; Input; Preparing the Data; Data Cleaning; Bisecting K-Means algorithm; Regression; Training the Machine Learning Model; Reviewing the Machine Learning Model; Reprocess; Visualization; Stop]
3.5.1 User Interface Design
Login Page: Username, Password, Submit
Register: Name, Email, Address, Phone, Submit
3.5.3 Database Design
