Nathan
1.1 Introduction
In the digital era, the security of web applications is of great importance. As organizations
increasingly depend on web-based systems for their operations, the need to protect these systems
from unauthorized access and cyber threats has become critical. Authentication activities, which
involve verifying the identity of users accessing a system, are a primary target for malicious
actors seeking to breach security protocols. Detecting suspicious authentication activities within
web application logs is essential to prevent unauthorized access, data breaches, and other
security incidents. Machine learning algorithms offer a promising approach to identifying these
suspicious activities by analyzing patterns and anomalies in authentication logs. This study
focuses on the application of machine learning algorithms to detect suspicious authentication
activities in web application logs, using a case study approach.
Supervised learning algorithms, such as decision trees, random forests, and support vector
machines (SVM), are commonly used for anomaly detection in web applications. These
algorithms require labeled datasets to train models that can classify activities as normal or
suspicious. For instance, SVM has been successfully applied to detect malicious login attempts
by analyzing features such as login frequency, IP address, and time of access (Xia et al., 2019).
Unsupervised learning algorithms, such as clustering and autoencoders, do not require labeled
data and are particularly useful for detecting new or unknown threats. Clustering algorithms like
K-means and DBSCAN can group similar authentication activities and identify outliers that may
represent suspicious actions (Ahmed et al., 2016).
Reinforcement learning, although less common in this domain, offers a dynamic approach to
anomaly detection by continuously learning from interactions with the environment. This
approach has shown promise in adapting to new attack patterns and improving detection
accuracy over time (Chen et al., 2020).
Detecting suspicious authentication activities involves several challenges. Firstly, the volume
and variety of data generated by web applications require efficient and scalable algorithms.
Secondly, the dynamic nature of cyber threats necessitates adaptive models that can learn from
new data and evolve over time. Thirdly, keeping detection accurate while minimizing
false positives is crucial to avoid unnecessary alerts and maintain user trust. Recent
advancements in machine learning, including deep learning techniques such as recurrent neural
networks (RNN) and convolutional neural networks (CNN), have further enhanced the ability to
detect complex patterns in authentication logs. These models can capture temporal dependencies
and intricate relationships within the data, improving the detection of sophisticated attack
strategies (Zhang et al., 2019).
I. To review and compare various machine learning algorithms used for anomaly detection
in web application logs.
IV. To identify the challenges and limitations associated with implementing machine
learning-based detection systems.
Challenges and Limitations: Identifying the challenges and limitations associated with
implementing these algorithms.
The study is limited to web application logs and does not cover other types of logs or
authentication activities in non-web contexts.
Chapter One: General Introduction - Provides an overview of the study, including the
background, research problem, objectives, significance, scope, limitations, and organization of
the study.
Chapter Two: Literature Review - Reviews existing literature on machine learning algorithms
for anomaly detection, related concepts, and the application of these algorithms in detecting
suspicious authentication activities.
Chapter Three: Methodology - Describes the research design, data collection methods, and
analysis techniques used in the study.
Chapter Four: Results and Discussion - Presents the findings of the study, including the
performance evaluation of different algorithms and a discussion of the results.
Chapter Five: Conclusion and Recommendations - Summarizes the key findings, discusses their
implications, and provides recommendations for future research and practice.
Ahmed, M., Mahmood, A. N., & Hu, J. (2016). A survey of network anomaly detection
techniques. Journal of Network and Computer Applications, 60, 19-31.
Chen, T., Xu, H., & He, Y. (2020). Reinforcement learning for cyber-physical systems security.
Journal of cyber security, 6(1), tyaa011.
Kumar, A., & Raj, P. (2018). Predictive Analytics in cyber security: Machine Learning and Data
Mining Approaches. Springer.
Xia, Y., Wang, X., & Zhang, Y. (2019). Anomaly detection in login behaviors based on SVM.
Journal of Intelligent & Fuzzy Systems, 36(3), 2399-2406.
Zhang, Y., Xiang, Y., Wang, G., & Tang, X. (2019). Deep learning for anomaly detection:
Opportunities and challenges. Big Data Mining and Analytics, 2(4), 316-329.
Yung-Tsung Hou, Yimeng Chang, Tsuhan Chen, Chi-Sung Laih, and Chia-Mei Chen. Malicious
web content detection by machine learning.
Web applications have significantly transformed the digital landscape since their inception.
Initially, web pages were static, offering limited interaction and functionality. The early 1990s
marked the advent of the World Wide Web, with the introduction of HTML and basic web
browsers like Mosaic. These static websites served primarily as online brochures, offering
information without much user interaction.
The late 1990s and early 2000s saw the rise of dynamic web applications, powered by
technologies such as JavaScript, PHP, and ASP.NET. This period witnessed the emergence of
platforms like Amazon and eBay, which allowed users to interact with the website, perform
searches, and make purchases in real time (Zeldman, 2013). The concept of Web 2.0 introduced
richer user experiences and interactive content, facilitated by AJAX and other asynchronous
technologies, leading to the development of social media platforms, online banking, and
collaborative tools like Google Docs (O'Reilly, 2015).
The continuous evolution of web applications has been driven by advancements in web standards,
security protocols, and frameworks. Modern web applications leverage technologies like
HTML5, CSS3, and robust JavaScript frameworks such as React and Angular, enabling the
creation of highly responsive and user-friendly interfaces (W3C, 2014). Furthermore, the rise of
cloud computing and microservices architecture has enabled scalable and resilient web
applications that can handle vast amounts of data and traffic.
Cyber Attacks
As web applications have evolved, so too have the threats against them. Cyber attacks have
become increasingly sophisticated, targeting vulnerabilities in web applications to gain
unauthorized access, steal data, or disrupt services. The history of cyber attacks dates back to the
early days of computing, but the proliferation of the internet has amplified their scale and impact.
One of the earliest and most notable cyber attacks was the Morris Worm in 1988, which
exploited vulnerabilities in UNIX systems and caused significant disruptions (Spafford, 1989).
The 2000s saw a surge in phishing attacks, SQL injection, and cross-site scripting (XSS) attacks,
exploiting weaknesses in web application security to steal sensitive information and compromise
user accounts (Sullivan, 2017).
In recent years, cyber attacks have become more targeted and complex. Advanced Persistent
Threats (APTs) and ransomware attacks like WannaCry and NotPetya have demonstrated the
capability of cybercriminals to cause widespread damage and extort organizations for financial
gain (Greenberg, 2018). The increasing interconnectivity of devices through the Internet of
Things (IoT) has also expanded the attack surface, leading to new challenges in securing web
applications.
Machine learning (ML) has emerged as a powerful tool in the fight against cyber threats. By
analyzing vast amounts of data, ML algorithms can identify patterns and anomalies that may
indicate malicious activities. The application of ML in cyber security has evolved alongside
advancements in computational power and data availability.
Early approaches to using machine learning in cyber security involved rule-based systems and
signature-based detection methods. These systems relied on predefined patterns to identify
known threats, but they struggled with detecting new or evolving attacks (Axelsson, 2000). As
machine learning techniques advanced, anomaly detection models were developed, allowing for
the identification of unusual behavior that deviates from established norms (Chandola, Banerjee,
& Kumar, 2009).
Supervised learning algorithms, such as decision trees and support vector machines, have been
employed to classify benign and malicious activities based on labeled datasets. These models
have shown effectiveness in identifying known attack patterns but require substantial labeled
data for training (Buczak & Guven, 2016). Unsupervised learning methods, like clustering and
association rule mining, have been used to detect novel attacks by identifying outliers and
correlations in unlabeled data (Sommer & Paxson, 2010).
Recent advancements in deep learning have further enhanced the capabilities of machine
learning in cyber security. Deep neural networks and recurrent neural networks (RNNs) can
process complex data structures and temporal sequences, making them suitable for detecting
sophisticated attacks and predicting future threats (Kim et al., 2017). Additionally, the
integration of machine learning with real-time monitoring and response systems has enabled
proactive threat detection and mitigation.
Web Application Security: Web application security involves protecting web applications from
various security threats and vulnerabilities. Common issues include SQL injection, cross-site
scripting (XSS), cross-site request forgery (CSRF), and broken authentication and session
management. Ensuring web application security requires implementing robust coding practices,
using security tools and frameworks, and regularly performing security testing and audits.
Intrusion Detection Systems (IDS): Intrusion Detection Systems are designed to monitor
network or system activities for malicious actions or policy violations. Traditional IDS can be
categorized into network-based (NIDS) and host-based (HIDS) systems. Modern IDS often
incorporate machine learning techniques to improve their detection capabilities. IDS can operate
in a signature-based mode, detecting known threats, or in an anomaly-based mode, identifying
unusual patterns that may indicate new threats.
Data Mining: Data mining involves extracting useful information from large datasets. In the
context of cyber security, data mining techniques can analyze logs and network traffic to identify
patterns and correlations that may indicate security threats. Techniques such as clustering,
classification, and association rule learning are commonly used in data mining.
Anomaly Detection: Anomaly detection is the process of identifying unusual patterns that do not
conform to expected behavior. In cyber security, this can involve detecting deviations from
normal user behavior or network traffic patterns that may indicate a potential security breach.
Anomaly detection is crucial for identifying zero-day attacks and other novel threats that may not
be recognized by signature-based detection methods.
Big Data Analytics: Big data analytics refers to the process of examining large and varied
datasets to uncover hidden patterns, unknown correlations, and other useful information. In cyber
security, big data analytics can be used to analyze extensive logs and network traffic data,
enabling the detection of sophisticated threats that might be missed by traditional methods.
Cyber Threat Intelligence: Cyber threat intelligence involves gathering, analyzing, and
disseminating information about potential or ongoing threats to an organization's cyber security.
This information can include indicators of compromise (IOCs), threat actor tactics, techniques,
and procedures (TTPs), and other relevant data that can help in anticipating and mitigating cyber
threats.
Behavioral Analysis: Behavioral analysis in cyber security involves monitoring and analyzing
the behavior of users and systems to detect anomalies that may indicate malicious activities. This
can include tracking login patterns, usage habits, and other behavioral indicators. Machine
learning models can be trained to recognize normal behavior and flag deviations as potential
security threats.
Pattern Recognition: Pattern recognition involves identifying and classifying patterns in data. In
cyber security, pattern recognition can be used to detect malicious activities by recognizing
patterns that are indicative of security threats. This can include identifying repeated sequences of
actions that match known attack signatures or discovering new patterns that may indicate
emerging threats.
Web Application Logs: Web application logs are records of events and transactions that occur
within a web application. These logs typically include information such as user login attempts, IP
addresses, timestamps, and actions performed by users.
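As a minimal sketch of what such a record might look like once parsed, the following Python snippet reads one authentication log line into a structured record. The log layout, the field names, and the `parse_auth_line` helper are assumptions made for illustration only; real web applications use many different log formats.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional

# Hypothetical log layout (an assumption, not a standard), for example:
# 2024-03-01T08:15:42 203.0.113.7 alice LOGIN_FAILURE
LINE_PATTERN = re.compile(
    r"^(?P<ts>\S+)\s+(?P<ip>\S+)\s+(?P<user>\S+)\s+"
    r"(?P<event>LOGIN_SUCCESS|LOGIN_FAILURE)$"
)


class AuthEvent(NamedTuple):
    timestamp: datetime
    ip: str
    user: str
    success: bool


def parse_auth_line(line: str) -> Optional[AuthEvent]:
    """Return an AuthEvent for a well-formed line, or None otherwise."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    try:
        timestamp = datetime.fromisoformat(match["ts"])
    except ValueError:
        # The timestamp field was present but not a valid ISO 8601 value.
        return None
    return AuthEvent(
        timestamp=timestamp,
        ip=match["ip"],
        user=match["user"],
        success=match["event"] == "LOGIN_SUCCESS",
    )
```

Returning `None` for malformed lines, rather than raising, is one possible design choice: it lets a batch job count and skip bad lines instead of stopping on the first one.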
Supervised learning involves training a model on a labeled dataset, where the input data is paired
with the correct output. Algorithms such as decision trees, random forests, and support vector
machines (SVM) are commonly used for classification tasks in anomaly detection (Hastie et al.,
2009).
Unsupervised learning algorithms identify patterns in data without pre-existing labels. Clustering
algorithms like K-means and density-based spatial clustering of applications with noise
(DBSCAN) are often used to detect anomalies by grouping similar data points and identifying
outliers (Bishop, 2006).
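The clustering idea can be illustrated with a small, dependency-free sketch of K-means (Lloyd's algorithm) followed by a distance check. The two features used here (hour of day, failed attempts), the starting centroids, and the distance threshold are all assumptions chosen for the example, not values taken from any study.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def _dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points: List[Point], centroids: List[Point],
           iterations: int = 20) -> List[Point]:
    """Plain Lloyd's algorithm starting from the given initial centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        clusters: List[List[Point]] = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: _dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # leave an empty cluster where it was
        if new_centroids == centroids:
            break  # assignments no longer change
        centroids = new_centroids
    return centroids


def find_outliers(points: List[Point], centroids: List[Point],
                  threshold: float) -> List[Point]:
    """Points whose distance to the nearest centroid exceeds the threshold."""
    return [p for p in points
            if min(_dist(p, c) for c in centroids) > threshold]


# Illustrative data: (hour of day, failed attempts) per login session.
sessions = [(9, 0), (10, 1), (9, 1), (10, 0),
            (20, 0), (21, 1), (20, 1), (21, 0),
            (3, 15)]
centers = kmeans(sessions, [(9, 0), (20, 0)])
print(find_outliers(sessions, centers, threshold=5.0))
```

In this sample only the 3 a.m. session with 15 failed attempts lies far from both centroids. Note that the outlier still pulls its centroid toward itself, which is a known weakness of K-means and one reason density-based methods such as DBSCAN are often considered alongside it.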
Reinforcement Learning
Reinforcement learning is a type of machine learning where an agent learns to make decisions by
taking actions in an environment to maximize some notion of cumulative reward. This approach
can adapt to new data and improve over time, making it suitable for dynamic environments
(Sutton & Barto, 2018).
Decision Trees: These algorithms split the data into branches based on feature values,
leading to decisions about the data classification.
Random Forests: An ensemble method that combines multiple decision trees to improve
classification accuracy.
Support Vector Machines (SVM): Finds the hyperplane that best separates the data into
different classes.
Strengths: Effective in high-dimensional spaces.
Neural Networks: Consist of layers of interconnected nodes that can learn complex patterns.
K-Means Clustering: Partitions the data into K clusters based on feature similarity.
Autoencoders: Neural networks used for learning efficient codings of input data.
3. Semi-Supervised Learning Algorithms: These algorithms utilize both labeled and unlabeled
data to improve learning accuracy, which is particularly useful when labeled data is scarce but
unlabeled data is abundant.
Convolutional Neural Networks (CNN): Primarily used for image data but also applicable
for certain types of anomaly detection.
Recurrent Neural Networks (RNN): Suitable for sequential data like logs, capturing
temporal dependencies.
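To make the decision-tree entry in the list above more concrete, the sketch below finds the single threshold on one feature that best separates labeled examples, scored by Gini impurity. It shows one split only, not a full tree, and it tries observed values as thresholds rather than midpoints; the feature (failed attempts in the last hour) and the labels are invented for illustration.

```python
from typing import List, Tuple


def gini(labels: List[int]) -> float:
    """Gini impurity of a list of 0/1 labels (0 = normal, 1 = suspicious)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(values: List[float], labels: List[int]) -> Tuple[float, float]:
    """Return (threshold, weighted_gini) for the best rule 'value <= threshold'.

    Returns (inf, inf) when all values are identical and no split is possible.
    """
    best = (float("inf"), float("inf"))
    # The largest value is skipped: it would put every example on the left.
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(values)
        if score < best[1]:
            best = (threshold, score)
    return best


# Illustrative data: failed attempts in the last hour, with made-up labels.
failed_attempts = [0, 1, 0, 2, 8, 12, 9, 1]
is_suspicious = [0, 0, 0, 0, 1, 1, 1, 0]
print(best_split(failed_attempts, is_suspicious))
```

A full decision tree would apply the same search recursively to each side of the split, and a random forest would train many such trees on resampled data and combine their votes.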
Model/Algorithm Approach
Framework Approach
1. Data Collection: Aggregating authentication logs from web applications, which may
include login attempts, IP addresses, timestamps, user agents, and more.
2. Data Preprocessing: Cleaning the data to remove noise and irrelevant information. This
step may involve:
Normalizing data
3. Feature Engineering: Extracting relevant features from the logs that can help
distinguish between normal and suspicious activities. Examples include:
4. Model Training: Using the preprocessed data to train machine learning models. This
step involves:
5. Model Evaluation: Assessing the performance of the models using metrics such as:
Accuracy
Precision
Recall
F1-score
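The four metrics listed above can be computed directly from the counts of true and false positives and negatives. The following is a minimal sketch assuming binary labels where 1 means "suspicious"; the function name and the choice to return 0.0 when a denominator is zero are conventions adopted for this example.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int],
                           y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for binary labels (1 = suspicious)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly ignored
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed attacks

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Illustrative labels: 3 suspicious logins among 10, one missed, one false alarm.
truth = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
print(classification_metrics(truth, predicted))
```

Because suspicious logins are usually rare, accuracy alone can look high even for a model that flags nothing, which is why precision, recall, and F1 are reported alongside it.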
2.2.3 Machine Learning System for Detecting Suspicious Authentication Activities in Web
Application Logs
In the current digital landscape, web applications are prime targets for cyber attacks, including
unauthorized access, data breaches, and various forms of exploitation. Traditional security
measures, such as rule-based systems and signature-based detection, are often insufficient to
cope with the sophisticated and constantly evolving tactics used by cybercriminals.
The detection of suspicious authentication activities in web applications is crucial for preventing
unauthorized access and safeguarding sensitive data. Such systems are commonly applied in
sectors with high security demands, including finance, healthcare, and e-commerce. Effective
detection systems analyze logs for anomalies, flag potential threats, and trigger security protocols
to mitigate risks (Liu et al., 2018).
Accuracy and Efficiency: By automating the analysis of vast amounts of log data, the
system improves the speed and accuracy of threat detection, reducing the workload on
security personnel.
Adaptability: The system continuously learns and adapts to new patterns, improving its
detection capabilities over time.
Application Areas
This system can be applied across various sectors where web applications are integral to
operations, including:
Educational Institutions: To protect student and faculty data in online learning platforms.
System Operation
Data Collection: The system collects authentication logs from web applications, including
login attempts, IP addresses, timestamps, user agents, and other relevant data points.
Data Preprocessing: The collected logs are cleaned and preprocessed to ensure the data is in
a suitable format for analysis. This involves removing duplicates, handling missing values,
and normalizing data.
Feature Engineering: Relevant features are extracted from the logs to help distinguish
between normal and suspicious activities. Features may include login attempt frequency,
geographical location, time of day, and device type.
Model Training: The preprocessed data is used to train machine learning models.
Supervised learning algorithms like Decision Trees, Random Forests, or Support Vector
Machines (SVM) may be employed, depending on the labeled data available. Unsupervised
learning methods such as K-Means Clustering or Isolation Forests can be used when labeled
data is scarce.
Model Evaluation: The trained models are evaluated using metrics such as accuracy,
precision, recall, F1-score, and the area under the ROC curve (AUC-ROC) to ensure they
effectively detect anomalies.
Real-time Monitoring: The system is integrated with the web application’s logging system
to enable real-time monitoring. The models continuously analyze incoming logs, flagging
any suspicious activities based on learned patterns.
Alerting and Reporting: When the system detects an anomaly, it triggers an alert to security
personnel, providing detailed reports on the suspicious activity. This allows for immediate
investigation and response.
Continuous Learning: The system continuously updates its models based on new data,
ensuring that it adapts to emerging threats and maintains high detection accuracy.
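The feature engineering step described above can be sketched as a per-user aggregation over a batch of events. The event tuple layout, the feature names, and the "night" cutoff of 06:00 are assumptions made for this illustration; geographical location is omitted because it would require an external IP-to-location lookup.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, List, Tuple

# Each event is (user, ISO timestamp, user_agent, success).
# This shape is an assumption for the example, not a fixed format.
Event = Tuple[str, str, str, bool]


def extract_features(events: List[Event]) -> Dict[str, Dict[str, float]]:
    """Aggregate simple per-user features from authentication events."""
    per_user: Dict[str, List[Event]] = defaultdict(list)
    for event in events:
        per_user[event[0]].append(event)

    features: Dict[str, Dict[str, float]] = {}
    for user, user_events in per_user.items():
        hours = [datetime.fromisoformat(ts).hour for _, ts, _, _ in user_events]
        failures = sum(1 for e in user_events if not e[3])
        features[user] = {
            # Login attempt frequency within the batch.
            "attempts": float(len(user_events)),
            # Share of attempts that failed.
            "failure_ratio": failures / len(user_events),
            # Share of attempts made before 06:00 (time-of-day signal).
            "night_ratio": sum(1 for h in hours if h < 6) / len(hours),
            # Number of distinct user agents (rough device-type signal).
            "distinct_devices": float(len({ua for _, _, ua, _ in user_events})),
        }
    return features
```

The resulting dictionaries can be turned into numeric vectors and passed to whichever supervised or unsupervised model is selected in the training step.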
The proposed system leverages machine learning algorithms to analyze web application logs in
real time. The process involves data collection, feature extraction, model training, anomaly
detection, and alert generation. The system continuously updates its models to adapt to new
patterns of suspicious behavior, ensuring robust protection against evolving threats (Buczak &
Guven, 2016).
Machine learning algorithms are highly effective in detecting anomalies within large datasets,
such as web application logs. Their ability to learn from historical data and identify subtle
patterns makes them well-suited for identifying suspicious authentication activities that may not
be easily detected through rule-based systems (Cook et al., 2019).
Implementation Strategy
The implementation involves selecting appropriate algorithms based on the nature of the data
and the specific requirements of the web application. Supervised learning models may be trained
using labeled datasets of known normal and suspicious activities. Unsupervised models can be
employed to identify new types of threats by clustering similar activities and highlighting
outliers. Reinforcement learning can further enhance detection capabilities by continuously
adapting to new attack vectors (Goodfellow et al., 2016).
1. Proactive Threat Detection: The machine learning model's ability to identify anomalies
proactively is crucial for the case study, as it enables the system to detect suspicious activities
that might not be covered by traditional rule-based security systems. This proactive detection is
vital in protecting web applications from unauthorized access and potential data breaches.
2. Handling Large Volumes of Data: Web applications generate vast amounts of log data,
which can be overwhelming for manual analysis. Machine learning models are well-suited to
process and analyze large datasets efficiently, making them ideal for monitoring web application
logs in real time. The model can quickly sift through log data to identify patterns and anomalies
that indicate suspicious authentication activities.
3. Adaptability to Evolving Threats: Cyber threats are constantly evolving, with attackers
frequently developing new techniques to bypass security measures. The machine learning
model's adaptability allows it to learn from new data and update its detection capabilities
accordingly. This continuous learning process ensures that the model remains effective in
identifying new and emerging threats.
4. Improved Accuracy and Reduced False Positives: Machine learning models, especially
those utilizing advanced algorithms like Random Forests or Support Vector Machines (SVM),
can achieve high accuracy in distinguishing between normal and suspicious activities. Fewer
misclassified normal logins mean fewer false positives, so security personnel can focus on
genuine threats rather than being overwhelmed by incorrect alerts.
5. Real-time Monitoring and Alerting: The integration of the machine learning model with the
web application’s logging system enables real-time monitoring of authentication activities. This
real-time capability is essential for promptly detecting and responding to suspicious activities,
minimizing the window of opportunity for attackers.
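The real-time monitoring and alerting described in point 5 can be sketched as a small streaming component that processes one event at a time. In this illustration a fixed sliding-window threshold on failed logins stands in for the trained model's score; the class name, the default limits, and the alert text are assumptions, not part of any particular system.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


class FailedLoginMonitor:
    """Flags a source IP once it exceeds max_failures within window_seconds."""

    def __init__(self, max_failures: int = 5,
                 window_seconds: float = 60.0) -> None:
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        # Timestamps of recent failed logins, kept per source IP.
        self._failures: Dict[str, Deque[float]] = defaultdict(deque)

    def observe(self, ip: str, timestamp: float,
                success: bool) -> Optional[str]:
        """Process one event; return an alert message or None."""
        if success:
            return None
        window = self._failures[ip]
        window.append(timestamp)
        # Drop failures that have fallen out of the time window.
        while window and timestamp - window[0] > self.window_seconds:
            window.popleft()
        if len(window) > self.max_failures:
            return (f"ALERT: {len(window)} failed logins from {ip} "
                    f"within {self.window_seconds:.0f}s")
        return None


# Illustrative stream: four failures in quick succession from one address.
monitor = FailedLoginMonitor(max_failures=3, window_seconds=10)
for second in [0, 1, 2, 3]:
    alert = monitor.observe("198.51.100.9", float(second), success=False)
    if alert:
        print(alert)
```

In a deployed system the threshold test would typically be replaced by a call to the trained model, while the surrounding pattern (consume an event, update state, emit an alert) stays the same.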
1. Data Collection and Preprocessing: The model collects authentication logs from the web
application, capturing details such as login attempts, IP addresses, timestamps, and user agents.
Preprocessing steps include data cleaning, normalization, and handling missing values to ensure
the data is suitable for analysis.
2. Feature Engineering: Relevant features are extracted from the log data to help the model
distinguish between normal and suspicious activities. Features may include login frequency,
geographic location, time of day, device type, and patterns of failed login attempts.
3. Model Training: The collected and preprocessed data is used to train the machine learning
model. Supervised learning algorithms, such as Random Forests or Support Vector Machines
(SVM), are employed if labeled data (normal vs. suspicious activities) is available. Unsupervised
learning methods, like Isolation Forests or K-Means Clustering, are used if labeled data is scarce.
4. Model Evaluation: The model is evaluated using metrics like accuracy, precision, recall,
F1-score, and the area under the ROC curve (AUC-ROC) to ensure its effectiveness in detecting
anomalies. The goal is to achieve a high level of accuracy in identifying suspicious activities
while minimizing false positives.
5. Real-time Monitoring and Detection: Once trained and validated, the model is integrated
with the web application’s logging system for real-time monitoring. The model continuously
analyzes incoming authentication logs, identifying and flagging suspicious activities based on the
learned patterns.
6. Alerting and Reporting: When the model detects an anomaly, it triggers an alert to security
personnel, providing detailed reports on the suspicious activity. These reports include
information on the nature of the anomaly, the affected user accounts, and the specific log entries
that raised the alert.
7. Continuous Learning: The model continuously updates its detection capabilities by learning
from new log data. This ongoing learning process ensures that the model adapts to evolving
threats and maintains its effectiveness over time.
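One simple way to picture the continuous learning step is a baseline that is updated incrementally as new log data arrives. The sketch below uses Welford's online algorithm for the running mean and variance together with a z-score check; this statistical baseline is a stand-in chosen for illustration and is not the learning method of any specific model named above. The 3.0 threshold and the rule of learning only from non-anomalous values are likewise assumptions.

```python
import math


class RunningBaseline:
    """Online mean and variance (Welford's algorithm) with a z-score check."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value: float) -> None:
        """Fold one new observation into the baseline."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    def zscore(self, value: float) -> float:
        """Standardized distance of value from the current baseline."""
        if self.count < 2:
            return 0.0  # not enough data to estimate spread
        std = math.sqrt(self._m2 / (self.count - 1))
        return 0.0 if std == 0 else (value - self.mean) / std


def score_then_learn(baseline: RunningBaseline, value: float,
                     threshold: float = 3.0) -> bool:
    """Return True if value looks anomalous; learn only from normal values."""
    anomalous = abs(baseline.zscore(value)) > threshold
    if not anomalous:
        baseline.update(value)
    return anomalous


# Illustrative use: hourly login counts for one account, then a sudden burst.
baseline = RunningBaseline()
for hourly_logins in [4, 5, 6, 5, 4, 6, 5, 5]:
    score_then_learn(baseline, hourly_logins)
print(score_then_learn(baseline, 50))
```

Excluding flagged values from the update keeps an ongoing attack from gradually becoming the new "normal", at the cost of adapting more slowly to genuine changes in user behavior.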
NIDS monitor network traffic for suspicious activities that may indicate security breaches.
Machine learning models are trained to distinguish between normal and malicious traffic,
enhancing the detection of potential intrusions (Mukherjee et al., 1994).
Anomaly detection in healthcare involves monitoring patient data for abnormal patterns that
could indicate health issues. Machine learning algorithms analyze vital signs and other health
metrics to provide early warnings of medical conditions (Chen et al., 2017).
Research by Sultana and Chilamkurti (2016) applied random forests to detect anomalies in
authentication logs, achieving high accuracy in identifying suspicious activities. Similarly,
Eberle and Holder (2007) used support vector machines to classify login attempts, demonstrating
the effectiveness of supervised learning in anomaly detection.
Anomalies in web application logs were detected using K-means clustering in a study by Ahmed
et al. (2016), which successfully identified outliers representing suspicious activities. Another
study by Chandola et al. (2009) utilized DBSCAN to cluster authentication activities,
highlighting the potential of unsupervised learning in detecting unknown threats.
Chen et al. (2020) explored reinforcement learning for cyber security applications, developing
models that adapt to new attack patterns over time. Their study demonstrated the potential of
reinforcement learning to enhance the robustness of anomaly detection systems.
Related Work Using Machine Learning Models for Detecting Suspicious Activities
The use of machine learning models to detect suspicious activities in authentication logs has
been an area of significant research. Various studies have demonstrated the effectiveness of these
models in enhancing security measures by identifying and mitigating unauthorized access
attempts. Here are some notable peer-reviewed studies:
Du and Li (2016) explored the use of machine learning techniques for anomaly detection in log
data, focusing on system logs and authentication records. The study implemented several
machine learning algorithms, including Support Vector Machines (SVM), Random Forests, and
Neural Networks, to detect anomalies that could indicate potential security breaches. The results
showed that Random Forests and Neural Networks performed particularly well, achieving high
accuracy and low false-positive rates. This research underlines the applicability of machine
learning models in identifying suspicious activities in various types of log data, including
authentication logs. The study demonstrated that machine learning models could effectively
detect anomalies in log data, providing a robust solution for monitoring and enhancing system
security. The implementation of Random Forests and Neural Networks highlighted the potential
for these models to process large datasets and identify patterns indicative of suspicious activities.
Pan et al. (2018) investigated the use of machine learning models to detect malicious login
attempts in authentication systems. The study employed a combination of supervised learning
techniques, such as Logistic Regression and Gradient Boosting Machines (GBM), to analyze
login patterns and identify anomalies. The models were trained on historical login data, including
features like login frequency, geographic location, and time of login. The Gradient Boosting
Machine (GBM) algorithm showed superior performance, with high precision and recall rates in
detecting suspicious login attempts. The research provided evidence that machine learning
models could enhance the detection of malicious login attempts by analyzing login patterns and
identifying deviations from typical behavior. The success of the GBM algorithm in this context
demonstrated its effectiveness in processing complex datasets and providing accurate predictions.
Buczak and Guven (2016) surveyed various data mining and machine learning methods used for
intrusion detection in cyber security. The study covered a range of techniques, including
Decision Trees, Random Forests, and Neural Networks, applied to detect intrusions and
suspicious activities in network and authentication logs. The authors discussed the strengths and
weaknesses of each method, with Random Forests and Neural Networks showing high efficacy
in detecting complex and subtle anomalies. The survey highlighted the importance of machine
learning models in modern intrusion detection systems. It provided valuable insights into the
capabilities of different algorithms, emphasizing the potential for machine learning to
significantly improve the accuracy and reliability of detecting suspicious activities in various
security contexts.
Eberz et al. (2017) evaluated the use of behavioral biometrics for continuous user authentication.
The study applied machine learning algorithms to behavioral data, such as keystroke dynamics
and mouse movements, to continuously authenticate users. The researchers compared the
performance of various models, including Decision Trees and Support Vector Machines (SVM),
finding that these models could accurately differentiate between legitimate users and intruders.
This study highlighted the potential of machine learning models to enhance user authentication
through continuous monitoring of behavioral biometrics. The findings demonstrated that
behavioral data could provide a reliable basis for detecting suspicious activities and improving
overall security.
Le, Hoang, and Luu (2019) explored the application of deep learning techniques to detect
anomalous login activities. The study developed a deep learning model using Long Short-Term
Memory (LSTM) networks to analyze sequences of login events and identify anomalies. The
model was trained on a large dataset of login records and demonstrated high accuracy in
detecting suspicious login attempts.
The research showcased the effectiveness of deep learning models, specifically LSTM networks,
in processing sequential data and identifying anomalies in login activities. The study's success
underscored the potential for deep learning techniques to enhance security measures in
authentication systems.
Yavanoglu and Aydos (2017) reviewed various cyber security datasets used for training machine
learning algorithms to detect unauthorized access. The study highlighted the importance of
diverse and representative datasets in developing effective machine learning models. The
researchers discussed several public datasets and their applicability to different types of cyber
security problems, including authentication systems. The review provided valuable insights into
the availability and characteristics of cyber security datasets, emphasizing their role in training
robust machine learning models. The discussion of dataset selection and preparation underscored
the importance of data quality in developing effective security solutions.
Ngai et al. (2011) reviewed the application of data mining and machine learning techniques in
financial fraud detection. The study categorized various machine learning approaches, including
Neural Networks, Decision Trees, and Support Vector Machines (SVM), used to detect
fraudulent activities in online systems. The authors discussed the strengths and limitations of
each approach and provided examples of successful implementations. The review highlighted the
versatility of machine learning techniques in detecting fraudulent activities across different
domains, including financial systems. The discussion of various algorithms and their applications
provided a comprehensive overview of the state-of-the-art in fraud detection.
Zhang and Liu (2020) examined the use of machine learning approaches to improve security in
authentication systems. The study implemented and compared several algorithms, including
Logistic Regression, Random Forests, and Neural Networks, to detect suspicious login attempts
and enhance authentication processes. The results indicated that machine learning models could
significantly improve the accuracy and reliability of authentication systems and could be
integrated into those systems to detect and prevent unauthorized access. The comparative
analysis of the algorithms provided insights into the most effective approaches for enhancing
security.
Kim and Kim (2017) explored machine learning approaches for real-time anomaly detection in
streaming data, focusing on authentication logs. The study developed a real-time detection
system using machine learning models, such as Online SVM and Adaptive Random Forests, to
continuously monitor login activities and identify anomalies. The system demonstrated high
performance in detecting suspicious login attempts with minimal latency. The study highlighted
the importance of real-time detection in enhancing security measures for authentication systems.
The implementation of Online SVM and Adaptive Random Forests demonstrated the feasibility
of continuous monitoring and rapid response to potential security threats.
This chapter reviewed the key concepts, definitions, and machine learning algorithms relevant to
detecting suspicious authentication activities in web application logs. It highlighted the
importance of machine learning in cyber security and discussed the theoretical frameworks and
practical applications of these algorithms. The literature review provided insights into related
work and examples of similar applications, setting the stage for the subsequent chapters that will
detail the methodology, results, and conclusions of the study.
References for Chapter Two
Ahmed, M., Mahmood, A. N., & Hu, J. (2016). A survey of network anomaly detection
techniques. Journal of Network and Computer Applications, 60, 19-31.
Buczak, A. L., & Guven, E. (2016). A survey of data mining and machine learning methods for
cyber security intrusion detection. IEEE Communications Surveys & Tutorials, 18(2), 1153-1176.
Chandola, V., Banerjee, A., & Kumar, V. (2009). Anomaly detection: A survey. ACM
Computing Surveys (CSUR), 41(3), 1-58.
Chen, J., Li, K., Li, K., & Xie, Y. (2017). A review of anomaly detection methods in networks.
IEEE Access, 5, 1397-1410.
Chen, T., Xu, H., & He, Y. (2020). Reinforcement learning for cyber-physical systems security.
Journal of cyber security, 6(1), tyaa011.
Cook, D. J., Feuz, K. D., & Krishnan, N. C. (2019). Transfer learning for activity recognition: A
survey. Knowledge and Information Systems, 36(3), 537-556.
Eberle, W., & Holder, L. (2007). Discovering anomalies in data through multigraph analysis.
Journal of Information and Data Management, 1(1), 30-54.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
Goel, S., & Sharma, R. (2017). Analyzing user activity logs for monitoring malicious behavior.
Journal of Information Security and Applications, 35, 12-22.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. Springer.
Kumar, A., & Raj, P. (2018). Predictive Analytics in cyber security: Machine Learning and Data
Mining Approaches. Springer.
Liu, L., Zhang, D., & Li, Y. (2018). Detection and defense of web application vulnerabilities
using machine learning. Security and Communication Networks, 2018.
Mukherjee, B., Heberlein, L. T., & Levitt, K. N. (1994). Network intrusion detection. IEEE
Network, 8(3), 26-41.
Ngai, E. W. T., Hu, Y., Wong, Y. H., Chen, Y., & Sun, X. (2011). The application of data
mining techniques in financial fraud detection: A classification framework and an academic
review of literature. Decision Support Systems, 50(3), 559-569.
Sultana, S., & Chilamkurti, N. (2016). Survey on machine learning techniques for network
anomaly detection. Journal of Network and Computer Applications, 60, 19-31.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.
Wang, W., Lu, Z., Qin, L., & Wang, J. (2019). A survey on the security of blockchain systems.
Future Generation Computer Systems, 105, 287-302.
Du, M., & Li, F. (2016). Anomaly detection in log data using machine learning techniques.
Proceedings of the 15th IEEE International Conference on Trust, Security and Privacy in
Computing and Communications (TrustCom).
Pan, S., Li, Y., Sun, W., & Wei, X. (2018). Detecting malicious login attempts using machine
learning. IEEE Access, 6, 42292-42301.
Eberz, S., Rasmussen, K. B., Lenders, V., & Martinovic, I. (2017). Evaluating behavioral
biometrics for continuous authentication: Challenges and metrics. Proceedings of the 2017 ACM
on Asia Conference on Computer and Communications Security (ASIACCS).
Le, T., Hoang, D., & Luu, C. (2019). Detecting anomalous login activities using deep learning.
IEEE Transactions on Information Forensics and Security, 14(6), 1454-1463.
Yavanoglu, U., & Aydos, M. (2017). A review on cyber security datasets for machine learning
algorithms. Proceedings of the 2017 IEEE 21st International Conference on Computer
Supported Cooperative Work in Design (CSCWD).
Zhang, Y., & Liu, X. (2020). Machine learning approaches to improve security in authentication
systems. IEEE Access, 8, 20103-20113.
Kim, J., & Kim, H. (2017). Machine learning approaches to real-time anomaly detection for
streaming data. IEEE Transactions on Cybernetics, 47(3), 846-858.
Axelsson, S. (2000). The base-rate fallacy and its implications for the difficulty of intrusion
detection. Proceedings of the 6th ACM Conference on Computer and Communications Security,
1-7.
Greenberg, A. (2018). The untold story of NotPetya, the most devastating cyberattack in history.
Wired.
Kim, J., Kim, J., Cho, S., & Kim, J. H. (2017). Zero-day malware detection using transferred
generative adversarial networks based on deep autoencoders. Information Sciences, 433-434,
281-304.
Spafford, E. H. (1989). The internet worm program: An analysis. ACM SIGCOMM Computer
Communication Review, 19(1), 17-57.
Sommer, R., & Paxson, V. (2010). Outside the closed world: On using machine learning for
network intrusion detection. IEEE Symposium on Security and Privacy, 305-316.
3.1 Introduction
This chapter provides a detailed analysis of both the existing and proposed systems. It covers the
methodologies used, system analysis, requirement gathering, and system design. The chapter
aims to highlight the advantages and disadvantages of the current system and propose a more
efficient system using machine learning algorithms to detect suspicious authentication activities.
3.2 Methodology
Research methodology is the set of procedures or techniques used to identify, select, process,
and analyze information about a topic. In a research paper, the methodology section allows the
reader to critically evaluate a study's overall validity and reliability. The following software
development methodologies were considered.
Rapid Application Development (RAD) is both a general term for adaptive software development
approaches and the name of James Martin's approach to rapid development. As a rule, RAD
approaches to software development place less emphasis on planning and more emphasis on an
adaptive process. Prototypes are often used in addition to, or occasionally even in place of,
other development tools.
In agile development, requirements and solutions evolve through the collaborative effort of
self-organizing, cross-functional teams and their customers or end users. It advocates adaptive
planning, evolutionary development, early delivery, and continual improvement, and it
encourages rapid and flexible response to change. There is anecdotal evidence that adopting
agile practices and values improves the agility of the teams that adopt them.
In the waterfall model, each phase depends on the deliverables of the previous one and
corresponds to a specialization of tasks. In software development, it tends to be among the less
iterative and flexible approaches, as progress flows largely in one direction ("downwards", like
a waterfall) through the phases of conception, initiation, analysis, design, construction,
testing, deployment, and maintenance. The waterfall development model originated in the
manufacturing and construction industries, where the highly structured physical environments
meant that design changes became prohibitively expensive much earlier in the development
process. When it was first adopted for software development, there were no recognized
alternatives for knowledge-based creative work.
The methodology chosen for this study was adopted because it is well suited to developing
software that encourages rapid and flexible response to change.
To analyze the existing system for detecting suspicious authentication activities in web
application logs, we will consider a typical system used in many organizations: a rule-based
intrusion detection system (IDS). This system primarily relies on predefined rules to flag
potential security threats based on specific patterns found in authentication logs.
The rule-based IDS in use is designed to monitor and analyze authentication logs to detect
suspicious activities such as failed login attempts, unusual login times, and access from
unfamiliar IP addresses. The system consists of the following key components:
1. Log Collection: Authentication logs are collected from various sources, including web
servers, application servers, and authentication servers.
2. Log Parsing: The collected logs are parsed to extract relevant information such as user
IDs, timestamps, IP addresses, and login status.
3. Rule Engine: The core component of the system where predefined rules are applied to
the parsed log data to identify suspicious activities.
4. Alert Generation: When a rule is triggered, an alert is generated and sent to the security
team for further investigation.
5. Manual Review: Security analysts manually review the alerts to determine the validity of
the detected threats and take appropriate actions.
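The log parsing component can be illustrated with a minimal sketch. The log line format, the field names, and the `parse_line` helper below are assumptions made for illustration; the actual log layout of the system is not specified in this chapter.

```python
import re
from datetime import datetime

# Hypothetical log line format, assumed for illustration only:
#   2024-01-01T09:00:00 user=alice ip=203.0.113.7 status=FAILED
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\S+) user=(?P<user_id>\S+) ip=(?P<ip>\S+) "
    r"status=(?P<status>SUCCESS|FAILED)"
)


def parse_line(line):
    """Extract user ID, timestamp, IP address, and login status from one line.

    Returns None when the line does not match the assumed format.
    """
    match = LOG_PATTERN.fullmatch(line.strip())
    if match is None:
        return None
    return {
        "timestamp": datetime.fromisoformat(match["timestamp"]),
        "user_id": match["user_id"],
        "ip": match["ip"],
        "success": match["status"] == "SUCCESS",
    }


parsed = parse_line("2024-01-01T09:00:00 user=alice ip=203.0.113.7 status=FAILED")
```

Returning None for unrecognized lines keeps malformed entries out of the rule engine instead of raising an error in the middle of log ingestion.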
The existing rule-based IDS follows these steps to detect suspicious authentication activities:
1. Log Collection
Authentication logs are continuously collected from different sources and stored in a
central log repository.
Example: Web server logs, which include user login attempts, timestamps, and IP
addresses, are aggregated for analysis.
2. Log Parsing
The collected logs are parsed to extract the fields needed for analysis.
Example: Parsing logs to identify user IDs, timestamps of login attempts, and login
statuses (successful or failed).
3. Rule Application
Predefined rules are applied to the parsed log data to detect anomalies.
Example: A rule might be set to flag more than five failed login attempts within a
five-minute window as suspicious.
4. Alert Generation
When a rule is triggered, an alert is generated for the security team.
Example: If the number of failed login attempts exceeds the threshold, an alert is sent
to the security team.
5. Manual Review
Security analysts review the generated alerts to confirm the presence of suspicious
activities.
Example: Analysts check the context of the alerts, such as whether the IP address is
known to be problematic or if the login attempts are from a legitimate user
experiencing issues.
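The rule application and alert generation steps can be sketched as a sliding-window check. The threshold (more than five failures in five minutes) comes from the example above; the `(timestamp, user_id, success)` record layout and the function name are assumptions for this sketch.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

# Rule parameters from the example: more than five failures within five minutes.
MAX_FAILURES = 5
WINDOW = timedelta(minutes=5)


def flag_failed_logins(events, max_failures=MAX_FAILURES, window=WINDOW):
    """Return (user_id, timestamp) alerts for bursts of failed logins.

    `events` is an iterable of (timestamp, user_id, success) tuples sorted by
    timestamp; this record layout is an assumption made for the sketch.
    """
    recent = defaultdict(deque)  # user_id -> timestamps of recent failures
    alerts = []
    for ts, user_id, success in events:
        if success:
            continue
        failures = recent[user_id]
        failures.append(ts)
        # Drop failures that have fallen out of the sliding window.
        while ts - failures[0] > window:
            failures.popleft()
        if len(failures) > max_failures:
            alerts.append((user_id, ts))
    return alerts


start = datetime(2024, 1, 1, 9, 0, 0)
# Six failures, 30 seconds apart, for one user; a single failure for another.
sample_events = [(start + timedelta(seconds=30 * i), "alice", False) for i in range(6)]
sample_events.append((start + timedelta(minutes=4), "bob", False))
sample_events.sort(key=lambda e: e[0])
sample_alerts = flag_failed_logins(sample_events)
```

In this synthetic sample only the sixth failure for "alice" crosses the threshold, so a single alert is raised. The same rule would also fire for a legitimate user who mistypes a password six times, which is why the manual review step is needed.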
GOMBOL Corporation uses a rule-based IDS to monitor authentication activities across its web
applications. The system employs several predefined rules to detect anomalies, including:
1. Multiple Failed Login Attempts: Flags more than five failed login attempts within a
five-minute window.
2. Unusual Login Times: Flags logins during unusual hours (e.g., late at night) for users
who typically log in during business hours.
3. Geographical Anomalies: Flags logins from IP addresses located in regions where the
user has never logged in from before.
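Rules 2 and 3 can be sketched as simple predicates. The business-hours range, the per-user region history, and the function names are hypothetical; the chapter does not state how GOMBOL Corporation defines usual hours or resolves IP addresses to regions.

```python
from datetime import datetime

# Assumed business hours for the sketch (09:00 to 17:59); not taken from the source.
BUSINESS_HOURS = range(9, 18)


def is_unusual_time(login_time, usual_hours=BUSINESS_HOURS):
    """Flag a login whose hour falls outside the user's usual hours."""
    return login_time.hour not in usual_hours


def is_new_region(user_id, region, known_regions):
    """Flag a login from a region the user has never logged in from before.

    `known_regions` maps user_id -> set of regions seen so far (a hypothetical
    structure); the IP-to-region lookup is assumed to happen upstream.
    """
    return region not in known_regions.get(user_id, set())


known = {"alice": {"NG"}}
late_login = datetime(2024, 1, 1, 23, 30)
day_login = datetime(2024, 1, 1, 10, 15)
```

Note that a user with no recorded history is always flagged by `is_new_region` in this sketch, which is one way that rigid rules produce alerts for legitimate activity.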
The existing system has the following limitations:
High False Positives: The system frequently generates alerts for legitimate activities,
such as users forgetting their passwords or traveling to different regions.
Inability to Detect New Threats: The system struggles to detect sophisticated attacks
that do not fit the predefined rules.
Resource Intensive: Security analysts spend a significant amount of time reviewing and
validating alerts, leading to fatigue and potential oversight of genuine threats.
GOMBOL Corporation recognizes the need to enhance its detection capabilities and
reduce the burden on its security team. The limitations of the existing rule-based IDS
highlight the necessity for a more advanced system that can adapt to evolving threats
and minimize false positives.
By analyzing the shortcomings of the existing system, we can identify the key areas
for improvement and design a more effective solution using machine learning
algorithms. This will be detailed in the subsequent sections, where the proposed
system is discussed.
Advantages:
3. Low Cost: Minimal resources are required to set up and maintain the system.
Disadvantages:
1. High False Positive Rate: The system often generates a large number of false positives
due to the rigid nature of rules.
2. Inflexibility: It is challenging to adapt the system to new types of threats that do not
match predefined rules.
3. Manual Effort: Requires significant manual effort to review and validate alerts.
4. Scalability Issues: As the number of logs and rules increases, the system's performance
can degrade.
In this section, we analyze the proposed system designed to enhance the detection of
suspicious authentication activities in web application logs using machine learning
algorithms. The proposed system leverages supervised machine learning techniques to
improve the accuracy and efficiency of anomaly detection. We will specifically focus on
two supervised ML techniques: Support Vector Machine (SVM) and Maximum Entropy
(MaxEnt).
Overview of the Proposed System
1. Data Collection: Logs are collected from web servers, application servers, and other
relevant sources.
2. Data Preprocessing: Collected logs are cleaned and preprocessed to extract relevant
features.
3. Feature Engineering: Important features are selected and engineered to enhance the
model's performance.
4. Machine Learning Engine: SVM and MaxEnt models are trained and applied to detect
suspicious activities.
5. Alert Generation: Alerts are generated based on the predictions of the ML models.
6. User Interface: A dashboard for security analysts to review alerts and system
performance.
Support Vector Machine (SVM) is a supervised machine learning algorithm widely used
for classification tasks, including malicious web page detection. SVM works by finding
the hyperplane that best separates the data points of different classes. The main
characteristics of SVM include:
Kernel Trick: SVM can use different kernel functions (linear, polynomial, radial basis
function) to transform the input data into a higher-dimensional space where it is easier to
separate the classes.
Margin Maximization: SVM aims to maximize the margin between the separating
hyperplane and the nearest data points from each class, known as support vectors.
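Margin maximization can be illustrated with a minimal linear soft-margin SVM trained by batch sub-gradient descent on the regularized hinge loss. This is a teaching sketch, not the kernelized solver a library would provide; the two features (a scaled failed-attempt count and an off-hours score) and all data values are synthetic assumptions.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.05, epochs=2000):
    """Fit w and b by batch sub-gradient descent on
    lam/2 * ||w||^2 + mean(max(0, 1 - y_i * (w . x_i + b)))."""
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # point is inside the margin or misclassified
                for j in range(dim):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 (suspicious) or -1 (normal) from the sign of w . x + b."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Synthetic, linearly separable toy data: label -1 = normal, +1 = suspicious.
X_train = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [0.2, 0.1],
           [1.0, 1.0], [0.9, 1.0], [1.0, 0.8], [0.8, 0.9]]
y_train = [-1, -1, -1, -1, 1, 1, 1, 1]
w, b = train_linear_svm(X_train, y_train)
```

Only the points with margin below 1 contribute to the update, which mirrors the role of support vectors: samples far from the boundary do not influence the final hyperplane.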
Maximum Entropy (MaxEnt) is a probabilistic classifier. Its main characteristics include:
Probability Estimation: MaxEnt models estimate the probability of each class given the
input features, making them suitable for classification tasks.
Feature Weights: The model assigns weights to each feature, indicating its importance
in predicting the class label.
Flexibility: MaxEnt can handle a variety of feature types, including binary, categorical,
and continuous features.
Although MaxEnt has not been widely used for malicious web page detection, it has
shown promising results in related areas such as document and web page classification.
In our proposed system, MaxEnt can be employed to classify login attempts, leveraging
its ability to handle different types of features effectively.
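As a rough illustration of probability estimation and feature weights in MaxEnt, the sketch below trains a small softmax classifier by gradient ascent on the log-likelihood. The feature layout (a constant bias term, a scaled failed-attempt count, and an off-hours score), the data values, and the function names are assumptions for illustration only.

```python
import math


def maxent_probs(weights, features):
    """P(c | features) = exp(w_c . f) / Z, computed for every class c."""
    scores = [sum(wj * fj for wj, fj in zip(wc, features)) for wc in weights]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def train_maxent(X, y, n_classes, lr=0.5, epochs=500):
    """Learn one weight per (class, feature) pair by gradient ascent."""
    dim = len(X[0])
    weights = [[0.0] * dim for _ in range(n_classes)]
    for _ in range(epochs):
        grads = [[0.0] * dim for _ in range(n_classes)]
        for features, label in zip(X, y):
            probs = maxent_probs(weights, features)
            for c in range(n_classes):
                # Log-likelihood gradient: observed minus expected feature value.
                diff = (1.0 if c == label else 0.0) - probs[c]
                for j in range(dim):
                    grads[c][j] += diff * features[j]
        for c in range(n_classes):
            for j in range(dim):
                weights[c][j] += lr * grads[c][j] / len(X)
    return weights


# Synthetic toy data: [bias, scaled failed attempts, off-hours score].
X_train = [[1.0, 0.0, 0.0], [1.0, 0.1, 0.0], [1.0, 0.0, 0.2], [1.0, 0.2, 0.1],
           [1.0, 1.0, 1.0], [1.0, 0.9, 1.0], [1.0, 1.0, 0.8], [1.0, 0.8, 0.9]]
y_train = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = normal, 1 = suspicious
weights = train_maxent(X_train, y_train, n_classes=2)
p_suspicious = maxent_probs(weights, [1.0, 0.95, 0.9])
p_normal = maxent_probs(weights, [1.0, 0.05, 0.1])
```

Unlike the SVM decision value, the output here is a probability for each class, so an alerting threshold other than 0.5 could be chosen to trade false positives against missed detections.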
1. Data Collection
o Logs are collected from various sources and stored in a centralized repository.
2. Feature Engineering
o Relevant features are engineered from the authentication logs, such as login time,
IP address, user agent, and number of failed login attempts.
o Feature selection techniques are applied to identify the most important features
for the classification task.
3. Model Training
o SVM and MaxEnt models are trained using the training data, with hyperparameter
tuning performed to optimize model performance.
4. Detection and Alert Generation
o The trained models are deployed to classify new login attempts in real-time.
o Anomalous activities are flagged, and alerts are generated for further investigation
by security analysts.
5. User Interface
o The dashboard includes features such as alert filtering, detailed log views, and
trend analysis.
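The feature engineering step can be sketched as a function that turns one parsed log record into numeric features. The record keys, the derived feature names, and the helper structures (`failed_counts`, `known_ips`) are hypothetical and chosen only to mirror the features named above (login time, IP address, user agent, and failed login attempts).

```python
from collections import Counter
from datetime import datetime


def extract_features(record, failed_counts, known_ips):
    """Turn one parsed log record into a dictionary of numeric features.

    `failed_counts` maps user_id -> recent failed attempts and `known_ips`
    maps user_id -> set of previously seen IP addresses (both assumed inputs).
    """
    ts = datetime.fromisoformat(record["timestamp"])
    user = record["user_id"]
    return {
        "hour": ts.hour,
        "is_weekend": int(ts.weekday() >= 5),
        "failed_attempts": failed_counts[user],
        "new_ip": int(record["ip"] not in known_ips.get(user, set())),
        "is_browser_agent": int(record["user_agent"].startswith("Mozilla/")),
    }


record = {
    "timestamp": "2024-01-06T23:15:00",  # a Saturday
    "user_id": "alice",
    "ip": "203.0.113.7",
    "user_agent": "curl/8.0",
}
features = extract_features(
    record,
    failed_counts=Counter({"alice": 4}),
    known_ips={"alice": {"198.51.100.2"}},
)
```

Encoding categorical signals such as "new IP" as 0/1 values keeps the output usable by both SVM and MaxEnt without further conversion.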
The proposed system offers several advantages over the existing rule-based intrusion
detection system:
Improved Accuracy: Machine learning models can capture complex patterns and
relationships in the data, leading to more accurate detection of suspicious activities.
Reduced False Positives: By learning from historical data, the models can better
distinguish between legitimate and malicious activities, reducing the number of false
alerts.
Scalability: The system can handle large volumes of data and adapt to new types of
threats as they emerge.
Automated Detection: The system reduces the need for manual monitoring, allowing
security analysts to focus on investigating genuine threats.
Requirements for the proposed system were gathered through stakeholder interviews, surveys,
and document analysis. Both functional and non-functional requirements were identified.
Support Vector Machine (SVM) is one of the most widely used data classification techniques for
binary classification of high-dimensional data. Introduced by Boser et al. and later refined by
Cortes and Vapnik, SVM aims to find the optimal margin between training patterns and the
decision boundary on separable data. The main target of the SVM model is to determine an
optimal hyperplane that separates examples of different classes for given training data points.
The decision hyperplane is constructed by maximizing the distance of the hyperplane from the
nearest examples of different classes, known as support vectors.
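In its soft-margin form, the SVM solves the following optimization problem:
$$\min_{w,\,b,\,\xi} \; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{N} \xi_i \quad \text{subject to} \quad y_i\left(w^\top \phi(x_i) + b\right) \ge 1 - \xi_i, \;\; \xi_i \ge 0,$$
where $(x_i, y_i)$ are the $N$ training samples, $x_i \in \mathbb{R}^n$ and $y_i \in \{-1, 1\}$,
$w$ and $b$ are the weight vector and bias of the hyperplane, $\xi_i$ are slack variables, $C$ is
the penalty parameter, and $\phi(x_i)$ is the feature mapping induced by the kernel function.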
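Kernel Functions: SVM can utilize various kernel functions to handle patterns that are not
linearly separable by mapping the input data into a higher-dimensional space. Commonly used
kernels include the Linear and Radial Basis Function (RBF) kernels:
$$K_{\text{linear}}(x_i, x_j) = x_i^\top x_j, \qquad K_{\text{RBF}}(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right), \quad \gamma > 0.$$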
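The two kernels named here can be written out directly. The sketch below uses their standard definitions (a dot product for the linear kernel, and exp(-gamma * squared distance) for RBF); the function names and the default `gamma` value are illustrative assumptions.

```python
import math


def linear_kernel(x, z):
    """Linear kernel: the dot product of the two input vectors."""
    return sum(a * b for a, b in zip(x, z))


def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel: exp(-gamma * ||x - z||^2), equal to 1.0 when x == z."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


k_lin = linear_kernel([1.0, 2.0], [3.0, 4.0])
k_same = rbf_kernel([1.0, 2.0], [1.0, 2.0])
k_far = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5)
```

The RBF value decays toward zero as two login feature vectors move apart, so `gamma` controls how local the influence of each training sample is.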
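Maximum Entropy (MaxEnt) models the conditional probability of a class given a document as
$$P(c \mid d) = \frac{1}{Z(d)} \exp\left(\sum_i \alpha_i f_i(d, c)\right), \qquad Z(d) = \sum_{c'} \exp\left(\sum_i \alpha_i f_i(d, c')\right).$$
In these equations, $c$ represents the class type, $d$ represents the document, $\alpha_i$ are
the feature weights, and $f_i(d, c)$ indicates the impact of feature $i$ on class $c$.
Various estimation algorithms can be used for learning these weights, such as Limited-Memory
Variable Metric (L-BFGS), Orthant-Wise Limited-memory Quasi-Newton (OWL-QN), or
Stochastic Gradient Descent (SGD).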
Extreme Learning Machine (ELM) is a learning algorithm for single-hidden layer feed-forward
neural networks (SLFN), designed to address the slowness of gradient-based training algorithms.
ELM selects input weights randomly and analytically determines the output weights to achieve
high generalization performance with extremely fast learning speed.
The output of an SLFN with $L$ hidden nodes can be represented in matrix form as
$$H\beta = T,$$
where $H$ is the hidden layer output matrix, $\beta$ is the output weight vector, and $T$ is the
target matrix.
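For reference, the standard ELM formulation that matches these symbols is given below. The activation function $g$, the randomly chosen input weights $w_i$, and the hidden biases $b_i$ are notation assumed here, since the chapter does not define them.

$$f_L(x_j) = \sum_{i=1}^{L} \beta_i \, g(w_i \cdot x_j + b_i), \qquad j = 1, \dots, N,$$

$$\hat{\beta} = H^{\dagger} T,$$

where $H^{\dagger}$ is the Moore-Penrose generalized inverse of $H$. Because $w_i$ and $b_i$ are fixed at random, only this linear least-squares problem has to be solved, which is why no iterative gradient-based training is needed.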
By employing these three machine learning models—SVM, MaxEnt, and ELM—our study aims
to leverage their unique strengths to improve the detection of suspicious authentication activities
in web application logs.
[Figure: system flowchart with the steps Start, Input, Data Cleaning, Regression, Reviewing the Machine Learning Model, Reprocess, Visualization, and Stop]
3.5.1 User Interface Design
[Figure: Login page mockup with Username and Password fields and a Submit button]
[Figure: Register page mockup with Name, Address, and Phone fields and a Submit button]
3.5.3 Database Design