Method and System For Detecting Anomalous User Behaviors
email:{xixy,zhangtong17,dudong,gaoqing,zhaowen,zhangsk}@pku.edu.cn
This work was partially supported by the National Key Research and Development Program of China (No. 2017YFB0802900).
DOI reference number: 10.18293/SEKE2018-036

Abstract—Malicious user behavior that triggers neither an access violation nor a data leak alert is difficult to detect. Using stolen login credentials, an intruder conducting espionage will first try to stay undetected and silently collect the data he is authorized to access from the company network. This paper presents an overview of a User Behavior Analytics Platform built to collect logs, extract features and detect anomalous users who may pose potential insider threats. In addition, a multi-algorithm ensemble combining OCSVM, RNN and Isolation Forest is introduced. The experiments show that a system with an ensemble of unsupervised anomaly detection algorithms can detect abnormal user behavior patterns. The results indicate that OCSVM and RNN suffer from anomalies in the training set, and iForest gives more false positives and false negatives, while the ensemble of the three algorithms performs well, achieving 96.55% recall and 91.24% accuracy on average.

Index Terms—anomaly detection, insider threat, user behavior, unsupervised learning, ensemble

I. INTRODUCTION

Insider threat has emerged in enterprise security and has received increasing attention over the last several years. A survey [1] by Haystax shows that 56% of respondents feel insider attacks have become more frequent. Privileged IT users, such as administrators with access to sensitive information, pose the biggest insider threat. IT assets such as databases, file servers and mobile devices are the assets most at risk.

Insider threat is defined as any activity by military, government, or private company employees whose actions or inactions, by intent or negligence, result (or could result) in the loss of critical information or valued assets [2]. Two types of insider threats are distinguished: malicious insider threats and unintentional insider threats [3]. The first is a current or former employee, contractor, or business partner who has or had authorized access to an organization's network, system, or data and intentionally exceeded or misused that access in a manner that negatively affected the confidentiality, integrity, or availability of the organization's information or information systems. The attempted attack by a Fannie Mae employee after being dismissed is a typical example of an insider threat likely motivated by revenge [4]. The second form comes from insiders without malicious intent [5], such as human mistakes and errors.

A key problem discussed frequently is how to detect compromised user accounts and insiders within the company whose activity induces neither enormous data flow nor any access violation [6]. For example, an attacker may steal user credentials using social engineering and then access sensitive information or copy it to untrusted storage. In this scenario, security systems such as firewalls, IDS [7], Security Information and Event Management (SIEM) systems, and Data Leak Prevention (DLP) systems [8] cannot detect the attack effectively. Relying on analysts to investigate attacks is costly and time-consuming, as they have to deal with millions of logs.

User Behavior Analytics (UBA), which has been used in online social media analysis [9] and in improving web search ranking [10], is emerging in the security area. User behavior analytics is a cyber security process for detecting insider threats, targeted attacks, and financial fraud. It looks at patterns of human behavior and then applies algorithms and statistical analysis to detect meaningful anomalies in those patterns [11]. UBA collects various types of data such as organization structure, user roles and job responsibilities, user activity traces and geographical location. The analysis algorithms consider factors including contextual information, continuous activities, duration of sessions, and peer group activity to identify anomalous behavior. UBA determines the baseline of normal behavior of an individual user or peer group according to historical data. The deviation of ongoing user activities from past normal behavior is significant if the user acts abnormally [12].

This paper introduces a User Behavior Analytics Platform built to detect potential insider threats. Specifically, the platform can 1) collect and preprocess logs from systems and applications; 2) extract each user's activity records from the logs; 3) aggregate activity records and generate a feature vector for each user; and 4) detect anomalous user access. In addition, an ensemble of multiple unsupervised anomaly detection algorithms is proposed and shows good performance in detecting users' anomalous access and operations within the enterprise.

This paper is organized as follows. Section 2 introduces related work. The user behavior analytics architecture and platform, which contains four components, is presented in Section 3. Section 4 introduces the experiment scenario, data characteristics and feature selection. In Section 5, the anomaly detection algorithms for user behavior analytics are demonstrated. Section 6 gives the dataset, experiments and results, and discusses and compares the algorithms. Finally, Section 7 concludes the paper and provides future work.
II. RELATED WORK

Anomaly detection is an important problem that has been researched within diverse research areas and application domains, including information security [13]. Applying anomaly detection is popular in intrusion detection [7], fraud detection [14] [15], medical and public health anomaly detection, and industrial damage detection. Many anomaly detection techniques have been developed specifically for certain application domains, while others are more generic.

Applying anomaly detection techniques to user behavior analytics is increasingly popular. Veeramachaneni et al. [16] put forward AI^2, which combines analyst intelligence with an ensemble of three outlier detection methods to detect account takeover, new account fraud and service abuse.

Madhu Shashanka et al. [17] presented the User and Entity Behavior Analytics (UEBA) module of the Niara Security Analytics Platform, which uses an SVD-based algorithm to detect anomalies in users accessing servers within an enterprise. Both the user's historical baseline and the peer baseline are applied with the same algorithm.

Sapegin et al. [18] proposed a Poisson-based two-step algorithm to identify anomalous user access to workstations within a Windows domain. However, the dataset comes from a simulation scenario and has limited features, so the algorithms are not persuasive enough and are of limited extensibility.

Wei Ma et al. [19] defined a user behavior pattern and proposed a knowledge-driven user pattern discovery approach that can extract users' behavior patterns from the audit logs of distributed medical imaging systems. The work focuses on extracting user behavior patterns, and there is a long way to go before administrators can use it directly.

Li et al. [20] proposed a security audit technology based on a one-class support vector machine to detect abnormal database operation behavior.
III. USER BEHAVIOR ANALYTICS ARCHITECTURE AND PLATFORM

In this section, an architecture for user behavior analytics is presented. Based on it, the implementation of our UBA platform is described.

Relying on analysts to investigate attacks is costly and time-consuming, as they have to deal with millions of logs and alerts. Our UBA platform collects logs about user-related events and user session activity in real time or near real time, and compares each and every action to the corresponding user baseline to spot anomalies in behavior. Based on the detection results, a risk label or score that reflects human risk is assigned to every user, which is helpful and meaningful for security analysts, especially when they investigate or monitor employees for suspicious behaviors or attacks. Fig. 1 shows the architecture, which is composed of four components. Each of them is described in the following.

Fig. 1. Architecture of User Behavior Analytics Platform: the Real-Time Data Collection Component gathers system logs, user directory logs, application logs and 3rd-party APIs into centralized raw logs; the Activity Record Generation Component filters, normalizes and enriches them into activity records; the Feature Extraction Component produces feature vectors; and the Anomaly Detection Component runs One-Class SVM, Replicator Neural Network and Isolation Forest to produce the anomaly detection results.

A. Data Collection Component

The data collection component stores raw logs generated by systems and applications for further extraction and analysis. The collected raw logs are stored in ElasticSearch [21], a distributed, JSON-based search and analytics engine designed for horizontal scalability, maximum reliability, and easy management. Logs of users accessing the ftp server within the enterprise and performing operations such as downloading, uploading, and deleting files or directories are collected. The user information can be gathered from the active directory of the enterprise.

The data sources our platform can process include:
1) system logs,
2) application logs such as web access logs and DLP logs,
3) user directory logs, etc.
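As a concrete illustration of this collection step, the following is a minimal sketch that indexes one raw ftp log line into Elasticsearch with the official Python client. The node address, index name, field names and the log line format are assumptions made for this sketch, not details taken from the platform.

```python
from datetime import datetime, timezone

from elasticsearch import Elasticsearch  # official Elasticsearch Python client

# Address and index name are assumptions of this sketch, not the platform's configuration.
es = Elasticsearch("http://localhost:9200")

def store_raw_log(line: str, source: str = "ftp-server") -> None:
    """Store one raw log line unchanged, for later parsing and feature extraction."""
    doc = {
        "message": line,                                     # the raw, unparsed log line
        "source": source,                                    # which system produced it
        "collected_at": datetime.now(timezone.utc).isoformat(),
    }
    es.index(index="uba-raw-logs", document=doc)             # elasticsearch-py 8.x style call

store_raw_log("2017-11-02 09:13:05 alice DOWNLOAD /projects/spec.docx SUCCESS 10.0.3.7")
```

Keeping the raw line intact lets the activity record generation component re-parse historical data whenever the schema or the regular expressions change.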
B. Activity Record Generation Component

Centralized raw logs come in their own distinct formats, from which features cannot be extracted directly. For example, Apache server logs and Windows security logs consist of different items. Due to this lack of a normalized format, the activity record generation component
1) builds a general schema for activity records,
2) generates regular expressions as filters for each type of log to extract useful information,
3) fills the schema with the extracted information.
The activity records, with user information attached, are then generated and passed on to the Feature Extraction Component.
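To illustrate steps 1) to 3), the sketch below applies a regular-expression filter to one hypothetical DOWNLOAD LOG line and fills a small general schema; the log layout, field names and schema are assumptions of this sketch, not the platform's actual definitions.

```python
import re
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass
class ActivityRecord:
    """General schema shared by all log types (hypothetical fields)."""
    timestamp: str
    user: str
    event: str
    target: str
    status: str
    client_ip: str

# Regular-expression filter for one log type (assumed DOWNLOAD LOG layout).
DOWNLOAD_LOG_RE = re.compile(
    r"(?P<timestamp>\S+ \S+) (?P<user>\S+) DOWNLOAD "
    r"(?P<target>\S+) (?P<status>SUCCESS|FAIL) (?P<client_ip>\S+)"
)

def parse_download_log(line: str) -> Optional[ActivityRecord]:
    """Fill the general schema from one DOWNLOAD LOG line; return None if it does not match."""
    m = DOWNLOAD_LOG_RE.match(line)
    if not m:
        return None
    return ActivityRecord(event="Download Event", **m.groupdict())

record = parse_download_log(
    "2017-11-02 09:13:05 alice DOWNLOAD /projects/spec.docx SUCCESS 10.0.3.7"
)
print(asdict(record))
```

In practice, each of the 8 kinds of logs listed in Table I would get its own filter, all of them writing into the same schema.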
C. Feature Extraction Component

After generating normalized activity records with user information attached, we compute user behavioral features over an interval of time, such as 24 hours. For performance reasons, the strategy from [16] is applied: each hour we retrieve the activity records of the last hour and compute the features labeled with that hour. At midnight, we then only need to retrieve the 23 hourly feature sets plus the activity records of the last hour, rather than the activity records of the whole 24 hours.
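A minimal sketch of this rolling aggregation, under the assumption that most features are counters or byte totals that can simply be summed into the daily feature set (the two timestamp features of Table III would be merged with min/max instead):

```python
from collections import Counter
from typing import Dict, List

HourlyFeatures = Dict[str, float]  # features computed for one user over one hour

def merge_day(hourly_sets: List[HourlyFeatures]) -> HourlyFeatures:
    """Merge a user's hourly feature sets into the daily feature set.

    Count- and byte-type features are additive; timestamp-type features
    (first/last login attempt) would need min/max rather than a sum.
    """
    day: Counter = Counter()
    for feats in hourly_sets:
        day.update(feats)
    return dict(day)

# At midnight, reuse the 23 hourly feature sets already computed during the day
# and compute only the last hour from raw activity records, as described above:
# daily = merge_day(stored_hourly_sets + [features_of_last_hour])  # both names hypothetical
```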
D. Anomaly Detection Component

With the features extracted for each user, the anomaly detection component detects anomalous users on a daily basis. The component is designed to be loosely coupled, flexible and independent from the other components. The details of the algorithms are demonstrated in Section V.

IV. DATA CHARACTERISTICS

In this section, a scenario within a typical software company is introduced. The behavior of employees accessing ftp files and data within work groups is monitored, and audit logs are generated and collected by the UBA platform presented above. The dataset and feature selection are then introduced.

A. Experiment Scenario

Consider a file server within an enterprise: authorized employees can access the server for files and data with different authorizations, which are normally configured by the administrator. For example, one can read, write, upload, download or delete files or directories. As documents and data are important information, access and operations need to be monitored for possible actions from compromised accounts or rogue users. The UBA platform monitors the access patterns and operation patterns of each user while they access the server and files.

B. Dataset

The ftp server logs are collected by the data collection component presented above. In total, 8 kinds of logs are collected, and each corresponds to one or more types of events. For example, a DOWNLOAD LOG only represents a Download Event, and the log carries information including timestamp, user name, SUCCESS/FAIL flag and client IP. An UPLOAD LOG may represent an upload file/directory event, a create file event or a remote copy event, and these cannot be distinguished by content, as the different events share the same format. Table I shows the mapping between logs and events.

TABLE I
LOGS AND CORRESPONDING EVENTS

LOG            EVENT
CONNECT LOG    Connect Event
LOGIN LOG      Login Event
DOWNLOAD LOG   Download Event
UPLOAD LOG     Upload Event, Create File Event, Remote Copy Event
DELETE LOG     Delete File Event, Delete Directory Event
MKDIR LOG      Make Directory Event
RMDIR LOG      Remove Directory Event
RENAME LOG     Rename Event, Remotely Move Files Event

We collected operation logs within a software company for 3 months and selected four employees with clearly different behavior patterns. The generated activity records were checked, and all can be considered normal behaviors, so we simulated several abnormal operations for each user as the testing dataset. Based on an investigation in the enterprise, abnormal behaviors mainly fall into the four categories shown in Table II.

TABLE II
CATEGORIES OF ABNORMAL BEHAVIOR

anomaly ID   Description
anomaly 1    multiple login attempts and failures
anomaly 2    anomalous download operations
anomaly 3    anomalous delete operations
anomaly 4    operations at non-working hours

C. Feature Selection

With the activity records generated by the activity record generation component, the feature extraction component produces a feature vector for each user daily, which characterizes the pattern of the user's access to the ftp server and operations on it. The features are shown in Table III. The daily features of a user are denoted by the 21-dimension vector x = (x1, x2, ..., x21).

TABLE III
THE LIST OF FEATURES. 21 FEATURES ARE EXTRACTED AND USED IN TOTAL.

Feature ID   Description
1            number of total connections of the day
2            timestamp of first login attempt of the day
3            timestamp of last login attempt of the day
4            number of login successes of the day
5            number of login failures of the day
6            total download bytes
7            number of download successes of the day
8            number of download failures of the day
9            largest download bytes of the day
10           total upload bytes
11           number of upload successes of the day
12           number of upload failures of the day
13           largest upload bytes of the day
14           number of delete successes of the day
15           number of delete failures of the day
16           number of mkdir successes of the day
17           number of mkdir failures of the day
18           number of rmdir successes of the day
19           number of rmdir failures of the day
20           number of rename successes of the day
21           number of rename failures of the day
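For illustration, here is a sketch of how one day's activity records could be turned into the 21-dimensional vector, filling in a few representative entries of Table III; the record fields (event, status, bytes, hour) and the hour-of-day encoding used for features 2 and 3 are assumptions of this sketch, not the paper's exact encoding.

```python
from typing import List

import numpy as np

def daily_feature_vector(records: List[dict]) -> np.ndarray:
    """Build the per-user daily feature vector x = (x1, ..., x21) from activity records.

    Only a few of the 21 features of Table III are filled in here; the remaining
    entries follow the same counting pattern and are left at zero in this sketch.
    """
    x = np.zeros(21)
    logins = [r for r in records if r["event"] == "Login Event"]
    downloads = [r for r in records if r["event"] == "Download Event"]

    x[0] = sum(r["event"] == "Connect Event" for r in records)      # 1: total connections
    if logins:
        hours = [r["hour"] for r in logins]                         # assumed hour-of-day field
        x[1], x[2] = min(hours), max(hours)                         # 2-3: first/last login attempt
    x[3] = sum(r["status"] == "SUCCESS" for r in logins)            # 4: login successes
    x[4] = sum(r["status"] == "FAIL" for r in logins)               # 5: login failures
    x[5] = sum(r.get("bytes", 0) for r in downloads)                # 6: total download bytes
    x[6] = sum(r["status"] == "SUCCESS" for r in downloads)         # 7: download successes
    x[7] = sum(r["status"] == "FAIL" for r in downloads)            # 8: download failures
    x[8] = max((r.get("bytes", 0) for r in downloads), default=0)   # 9: largest download
    # ... features 10-21 (upload, delete, mkdir, rmdir, rename) follow the same pattern ...
    return x
```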
(Figure: network structure with inputs x1, x2, ..., xn mapped to outputs y1, y2, ..., yn.)

s(x, n) = 2^(-E(h(x))/c(n))    (5)
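For reference, in the Isolation Forest formulation of [23], E(h(x)) is the average path length of x over the isolation trees and c(n) normalizes it, with c(n) = 2H(n - 1) - 2(n - 1)/n, where H(i) is the harmonic number, estimated as ln(i) + 0.5772156649 (Euler's constant); scores close to 1 indicate anomalies, while scores well below 0.5 indicate normal points.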
f_iForest(x; X) = { 0 if s(x, n) > ε1; 1 if s(x, n) ≤ ε1 }    (7)

f_RNN(x; X) = { 0 if err(x) > ε2; 1 if err(x) ≤ ε2 }    (8)

where "; X" indicates that the model is trained with X as the training set. As recall is an important metric in security, we apply a strict filtering strategy and regard a data point as abnormal as long as any of the three algorithms outputs 0, as shown by formulas 9 and 10.
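Below is a minimal sketch of this strict filtering (logical OR) ensemble. It assumes scikit-learn's OneClassSVM and IsolationForest as the OCSVM and iForest detectors; since the paper's replicator neural network is not reproduced here, a small MLPRegressor trained to reproduce its input stands in for it, with its per-sample reconstruction error playing the role of err(x) in Eq. (8). All hyperparameters, thresholds and the random data are illustrative only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 21))          # stand-in for the daily 21-dim feature vectors
X_test = rng.normal(size=(50, 21))

scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# The three unsupervised detectors (hyperparameters are illustrative only).
ocsvm = OneClassSVM(kernel="rbf", nu=0.05).fit(X_train_s)
iforest = IsolationForest(n_estimators=100, random_state=0).fit(X_train_s)
replicator = MLPRegressor(hidden_layer_sizes=(16, 8, 16), max_iter=2000,
                          random_state=0).fit(X_train_s, X_train_s)  # autoencoder-style stand-in

def reconstruction_error(model, X):
    """err(x): mean squared reconstruction error per sample, cf. Eq. (8)."""
    return np.mean((model.predict(X) - X) ** 2, axis=1)

eps2 = np.percentile(reconstruction_error(replicator, X_train_s), 95)  # illustrative threshold

# Strict filtering: a sample is flagged abnormal if ANY of the three detectors flags it.
flag_ocsvm = ocsvm.predict(X_test_s) == -1        # library decision stands in for the paper's thresholds
flag_iforest = iforest.predict(X_test_s) == -1
flag_rnn = reconstruction_error(replicator, X_test_s) > eps2
is_abnormal = flag_ocsvm | flag_iforest | flag_rnn
print(f"{is_abnormal.sum()} of {len(X_test_s)} test samples flagged abnormal")
```

Compared with majority voting, this OR rule trades precision for recall, which matches the emphasis on recall for security analysts described above.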
TABLE IV
COMPOSITION OF THE TRAINING AND TESTING SETS

dataset          normal   anomaly 1   anomaly 2   anomaly 3   anomaly 4   total   anomalies proportion
training set 1   1000     0           0           0           0           1000    0.00%
training set 2   1000     6           6           3           6           1021    2.06%
training set 3   1000     12          12          6           12          1042    4.03%
testing set      100      45          47          32          50          274     63.50%

TABLE V
DETECTION RATE OF OCSVM, RNN AND iForest WITH DIFFERENT TRAINING SETS

TABLE VI
ACCURACY, PRECISION AND RECALL OF OCSVM, RNN AND iForest WITH DIFFERENT TRAINING SETS
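The accuracy, precision and recall reported in Table VI can be computed from the predicted and true labels in the standard way; a minimal sketch with scikit-learn, using a label convention (1 = abnormal, 0 = normal) and placeholder values chosen for this sketch, not the paper's data:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# y_true: ground-truth labels of the testing set; y_pred: ensemble output.
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
```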
The mean reconstruction error of the normal data is 0.539, much lower than that of the abnormal data in the testing dataset (4.947, 3.887, 8.627, 2.409). However, the training time cost is much higher than that of the other algorithms.

Fig. 5. Mean reconstruction error during training with the Replicator Neural Network (horizontal axis: training step ×10^4).

Isolation Forest has the worst performance of the three algorithms, as Table V and Table VI show. With training set 2, the anomaly scores of the test data from Isolation Forest are presented in Fig. 6. The normal data has a lower anomaly score (0.390) than the anomaly data (0.490, 0.487, 0.475, 0.442). However, the scores of some data points are quite close, as Fig. 6 shows, especially for the operations at non-working hours. In each category of anomaly data we simulated, the data is anomalous in only a few dimensions. At the training stage, the attribute and split point are randomly selected, so statistically it is hard to split the data on the anomalous attributes before the tree grows deep. However, if the training set contains more complicated and real anomalies, iForest can perform better. Besides, the threshold ε1 can be adjusted flexibly, which is a valuable characteristic. In addition, iForest did not suffer an obvious reduction when more anomalies were mixed into the training set.

Table VI shows that the ensemble and the strict filtering strategy improve robustness and performance, especially recall. When the anomalies in the training set increase, the algorithms alone are less reliable. With 2.06% anomalies in the training set, the ensemble gives recall = 96.55% and accuracy = 90.88%. With 4.03% anomalies in the training set, RNN has recall = 72.41% and accuracy = 79.56%, while the ensemble still performs well with recall = 93.10% and accuracy = 91.24%. It can be a good optional strategy, especially when security analysts focus on recall.
Fig. 6. Anomaly Score of Isolation Forest for test data of different categories: (a) testing set - normal, (b) testing set - abnormal 1, (c) testing set - abnormal 2, (d) testing set - abnormal 3, (e) testing set - abnormal 4.
VII. CONCLUSION AND FUTURE WORK

This paper presents an overview of a UBA architecture and platform for detecting anomalous user behaviors within the enterprise. The platform, composed of four components working independently, is suitable for running on distributed platforms. The anomaly detection component contains an ensemble of OCSVM, RNN and Isolation Forest. A strict filtering strategy is applied and can improve performance and robustness regardless of whether anomalies exist in the training set.

Sequences of events contain valuable information about users, and we will focus on anomaly detection for sequence data. Besides, peer group analysis, which may play an important role in practice, can be introduced into the UBA platform in the future.

REFERENCES
[1] New Haystax Technology survey shows most organizations ill-prepared for insider threats. https://haystax.com/blog/2017/03/29/new-haystax-technology-survey-shows-most-organizations-ill-prepared-for-insider-threats/ (accessed December, 2017).
[2] A. P. Moore, K. A. Kennedy, and T. J. Dover, "Introduction to the special issue on insider threat modeling and simulation," Computational and Mathematical Organization Theory, vol. 22, no. 3, pp. 261–272, 2016.
[3] D. M. Cappelli, A. P. Moore, and R. F. Trzeciak, The CERT Guide to Insider Threats: How to Prevent, Detect, and Respond to Information Technology Crimes (Theft, Sabotage, Fraud). Addison-Wesley, 2012.
[4] U. A. Office. Fannie Mae corporate intruder sentenced to over three years in prison for attempting to wipe out Fannie Mae financial data. https://archives.fbi.gov/archives/baltimore/press-releases/2010/ba121710.htm/ (accessed December, 2017).
[5] F. I. P. Bureau, "Unintentional insider threats: A foundational study," 2013.
[6] M. Uma and G. Padmavathi, "A survey on various cyber attacks and their classification," IJ Network Security, vol. 15, no. 5, pp. 390–396, 2013.
[7] A. L. Buczak and E. Guven, "A survey of data mining and machine learning methods for cyber security intrusion detection," IEEE Communications Surveys & Tutorials, vol. 18, no. 2, pp. 1153–1176, 2016.
[8] A. Shabtai, Y. Elovici, and L. Rokach, A Survey of Data Leakage Detection and Prevention Solutions. Springer Science & Business Media, 2012.
[9] Y. Amichai-Hamburger and G. Vinitzky, "Social network use and personality," Computers in Human Behavior, vol. 26, no. 6, pp. 1289–1295, 2010.
[10] E. Agichtein, E. Brill, and S. Dumais, "Improving web search ranking by incorporating user behavior information," in Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2006, pp. 19–26.
[11] T. Bussa, A. Litan, and T. Phillips, "Market guide for user and entity behavior analytics," URL: https://www.gartner.com/doc/3538217/market-guide-user-entity-behavior (29.07.2017), 2016.
[12] W. Ma, "User behavior pattern based security provisioning for distributed systems," Ph.D. dissertation, 2016.
[13] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection: A survey," ACM Computing Surveys (CSUR), vol. 41, no. 3, p. 15, 2009.
[14] D. J. Weston, D. J. Hand, N. M. Adams, C. Whitrow, and P. Juszczak, "Plastic card fraud detection using peer group analysis," Advances in Data Analysis and Classification, vol. 2, no. 1, pp. 45–62, 2008.
[15] M. Ahmed, A. N. Mahmood, and M. R. Islam, "A survey of anomaly detection techniques in financial domain," Future Generation Computer Systems, vol. 55, pp. 278–288, 2016.
[16] K. Veeramachaneni, I. Arnaldo, V. Korrapati, C. Bassias, and K. Li, "AI^2: Training a big data machine to defend," in Big Data Security on Cloud (BigDataSecurity), IEEE International Conference on High Performance and Smart Computing (HPSC), and IEEE International Conference on Intelligent Data and Security (IDS), 2016 IEEE 2nd International Conference on. IEEE, 2016, pp. 49–54.
[17] M. Shashanka, M.-Y. Shen, and J. Wang, "User and entity behavior analytics for enterprise security," in Big Data (Big Data), 2016 IEEE International Conference on. IEEE, 2016, pp. 1867–1874.
[18] A. Sapegin, A. Amirkhanyan, M. Gawron, F. Cheng, and C. Meinel, "Poisson-based anomaly detection for identifying malicious user behaviour," in International Conference on Mobile, Secure and Programmable Networking. Springer, 2015, pp. 134–150.
[19] W. Ma, K. Sartipi, and D. Bender, "Knowledge-driven user behavior pattern discovery for system security enhancement," International Journal of Software Engineering and Knowledge Engineering, vol. 26, no. 03, pp. 379–404, 2016.
[20] Y. Li, T. Zhang, Y. Y. Ma, and C. Zhou, "Anomaly detection of user behavior for database security audit based on OCSVM," in Information Science and Control Engineering (ICISCE), 2016 3rd International Conference on. IEEE, 2016, pp. 214–219.
[21] Elasticsearch, https://www.elastic.co/products/elasticsearch (accessed December, 2017).
[22] B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt, "Support vector method for novelty detection," in Advances in Neural Information Processing Systems, 2000, pp. 582–588.
[23] F. T. Liu, K. M. Ting, and Z.-H. Zhou, "Isolation forest," in Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on. IEEE, 2008, pp. 413–422.
[24] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27:1–27:27, 2011, software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.