
Method and System for Detecting Anomalous User Behaviors: An Ensemble Approach


Xiangyu Xi∗† , Tong Zhang∗† , Guoliang Zhao§ , Dongdong Du†‡ , Qing Gao‡ , Wen Zhao † and Shikun Zhang†
∗ School of Software and Microelectronics, Peking University
† National Engineering Research Center for Software Engineering, Peking University
‡ School of Electronics Engineering and Computer Science, Peking University
§ CASIC-CQC Software Testing and Assessment Technology (Beijing) Corporation

email:{xixy,zhangtong17,dudong,gaoqing,zhaowen,zhangsk}@pku.edu.cn

Abstract—Malicious user behavior that does not trigger an access violation or data leak alert is difficult to detect. Using stolen login credentials, an intruder doing espionage will first try to stay undetected, silently collecting data that he is authorized to access from the company network. This paper presents an overview of a User Behavior Analytics Platform built to collect logs, extract features and detect anomalous users which may pose potential insider threats. Besides, a multi-algorithm ensemble, combining OCSVM, RNN and Isolation Forest, is introduced. The experiment showed that the system with an ensemble of unsupervised anomaly detection algorithms can detect abnormal user behavior patterns. The experiment results indicate that OCSVM and RNN suffer from anomalies in the training set, and iForest gives more false positives and false negatives, while the ensemble of the three algorithms has great performance and achieves recall 96.55% and accuracy 91.24% on average.

Index Terms—anomaly detection, insider threat, user behavior, unsupervised learning, ensemble

I. INTRODUCTION

Insider threat has emerged in enterprise security and received increasing attention over the last several years. A survey [1] by Haystax shows 56% of respondents feel that insider attacks have become more frequent. Privileged IT users, such as administrators with access to sensitive information, pose the biggest insider threat. IT assets such as databases, file servers and mobile devices are the top assets at risk.

Insider threat is defined as any activity by military, government, or private company employees whose actions or inactions, by intent or negligence, result (or could result) in the loss of critical information or valued assets [2]. Two types of insider threats are distinguished: malicious insider threats and unintentional insider threats [3]. The first is a current or former employee, contractor, or business partner who has or had authorized access to an organization's network, system, or data and intentionally exceeded or misused that access in a manner that negatively affected the confidentiality, integrity, or availability of the organization's information or information systems. The attempted attack by a Fannie Mae employee after being dismissed is a typical example of an insider threat likely motivated by revenge [4]. The second form is from insiders without malicious intent [5], such as human mistakes and errors.

A key problem discussed frequently is to detect compromised user accounts and insiders within the company, which do not induce enormous data flow and/or any access violation [6]. For example, an attacker may steal user credentials using social engineering and access sensitive information or copy it to untrusted storage. In this scenario, security systems such as firewalls, IDS [7], Security Information and Event Management (SIEM), and Data Leak Prevention (DLP) systems [8] cannot detect the attack effectively. Relying on analysts to investigate attacks is costly and time-consuming, as they have to deal with millions of logs.

User Behavior Analytics (UBA), which has been used in online social media analysis [9] and improving web search ranking [10], is emerging in the security area. User behavior analytics is a cyber security process for detecting insider threats, targeted attacks, and financial fraud. UBA systems look at patterns of human behavior, and then apply algorithms and statistical analysis to detect meaningful anomalies in those patterns [11]. UBA collects various types of data such as organization structure, user roles and job responsibilities, user activity traces and geographical location. The analysis algorithms consider factors including contextual information, continuous activities, duration of sessions, and peer group activity to compare anomalous behavior. UBA determines the baseline of normal behavior of an individual user or peer group according to history data. The deviation of ongoing user activities compared with past normal behavior is significant if the user acts abnormally [12].

This paper introduces a User Behavior Analytics Platform built to detect potential insider threats. Specifically, the platform can 1) collect and preprocess logs from systems and applications; 2) extract each user's activity records from logs; 3) aggregate activity records and generate a feature vector for each user; and 4) detect anomalous user access. Besides, an ensemble of multiple unsupervised anomaly detection algorithms is proposed and shows great performance in detecting users' anomalous access and operations within the enterprise.

This paper is organized as follows. Section 2 introduces related work. The user behavior analytics architecture and platform, which contains four components, is presented in Section 3. Section 4 introduces the experiment scenario, data characteristics and feature selection. In Section 5, anomaly detection algorithms for user behavior analytics are demonstrated. Section 6 gives the dataset, experiment and results, together with a discussion and comparison of the algorithms. Finally, Section 7 concludes the paper and provides future work.

This work was partially supported by the National Key Research and Development Program of China (No. 2017YFB0802900). DOI reference number: 10.18293/SEKE2018-036.
II. RELATED WORK

Anomaly detection is an important problem that has been researched within diverse research areas and application domains, including information security [13]. Research applying anomaly detection is popular in intrusion detection [7], fraud detection [14] [15], medical and public health anomaly detection, and industrial damage detection. Many anomaly detection techniques have been specifically developed for certain application domains, while others are more generic.

Applying anomaly detection techniques to user behavior analytics is increasingly popular. Veeramachaneni et al. [16] put forward AI², which combines analyst intelligence with an ensemble of three outlier detection methods to detect account takeover, new account fraud and service abuse.

Madhu Shashanka et al. [17] presented the User and Entity Behavior Analytics (UEBA) module of the Niara Security Analytics Platform, which uses an SVD-based algorithm to detect anomalies in users accessing servers within an enterprise. Both the user's historical baseline and the peer baseline are applied with the same algorithm.

Sapegin et al. [18] proposed a Poisson-based two-step algorithm to identify anomalous user access to workstations within a Windows domain. However, the dataset is from a simulation scenario and of limited features. The algorithms are not persuasive enough and of limited extensibility.

Fig. 1. Architecture of User Behavior Analytics Platform (four components: Real-Time Data Collection over system logs, application logs, user directory logs and 3rd-party APIs; Activity Record Generation with Filter, Normalize and Enrich steps; Feature Extraction; and Anomaly Detection with One Class SVM, Replicator Neural Network and Isolation Forest)

Wei Ma et al. [19] defined a user behavior pattern and proposed a knowledge-driven user pattern discovery approach
which can extract users' behavior patterns from the audit logs of distributed medical imaging systems. The work is focused on extracting user behavior patterns, and there is a long way to go before administrators can use it directly.

Li et al. [20] proposed a security audit technology based on a one-class support vector machine to detect abnormal behavior in database operations.

III. USER BEHAVIOR ANALYTICS ARCHITECTURE AND PLATFORM

In this section, an architecture for user behavior analytics is presented. Based on that, the implementation of our UBA platform is described.

Relying on analysts to investigate attacks is costly and time-consuming, as they have to deal with millions of logs and alerts. Our UBA platform collects logs about user-related events and user session activity in real time or near real time, and compares each and every action to the corresponding baseline of users to spot anomalies in their behavior. Based on detection results, a risk label or score that reveals human risk will be assigned to every user, which is helpful and meaningful for security analysts, especially when they investigate or monitor employees for suspicious behaviors or attacks. Fig. 1 shows the architecture, composed of four components. Each of them is described in the following.

A. Data Collection Component

The data collection component stores raw logs generated by systems and applications for further extraction and analysis. The collected raw logs are stored in ElasticSearch [21], which is a distributed, JSON-based search and analytics engine designed for horizontal scalability, maximum reliability, and easy management. Logs of users accessing the ftp server within the enterprise, covering operations such as downloading, uploading, and deleting files or directories, are collected. The user information can be gathered from the Active Directory of the enterprise. Data sources our platform can process include:
1) system logs,
2) application logs such as web access logs and DLP logs,
3) user directory logs, etc.

B. Activity Record Generation Component

Centralized raw logs come in their own unique formats, from which features cannot be extracted directly. For example, Apache server logs and Windows security logs consist of different items. Due to the lack of normalized formats, the activity record generation component
1) builds a general schema for activity records,
2) generates regular expressions as filters for each type of log to extract useful information,
3) fills the schema with the extracted information.
Then the activity records with user information attached are generated, which will be processed by the following Feature Extraction Component.

C. Feature Extraction Component

After generating normalized activity records with user information attached, we compute user behavioral features over an interval of time such as 24 hours. For performance reasons, the strategy from [16] is applied. Each hour we retrieve the activity records within the last hour and compute the features labeled with that hour. At midnight, we only need to retrieve the 23 stored hourly feature sets and the activity records within the last hour, rather than the activity records of the entire last 24 hours.

D. Anomaly Detection Component

With features extracted for each user, the anomaly detection component detects anomalous users on a daily basis. The component is designed loosely coupled, flexible and independent from the other components. The details of the algorithms are demonstrated in Section V.

IV. DATA CHARACTERISTICS

In this section, a scenario within a typical software company is introduced. The behavior of employees accessing ftp files and data within work groups is monitored, and audit logs are generated and collected by the UBA platform presented before. Then the dataset and feature selection are introduced.

A. Experiment Scenario

Consider a file server within an enterprise: authorized employees can access the server for files and data with different authorizations, which normally are configured by the administrator. For example, one can read, write, upload, download or delete files or directories. As documents and data are important information, access and operations need to be monitored for possible actions from compromised accounts or rogue users. The UBA platform monitors the access and operation patterns of each user while accessing the server and files.

B. Dataset

The ftp server logs are collected by the data collection component presented before. In total, 8 kinds of logs are collected, and each corresponds to 1 or more types of events. For example, a DOWNLOAD LOG only represents a Download Event, and the log carries information including timestamp, user name, SUCCESS/FAIL flag and client IP. An UPLOAD LOG may represent an upload file/directory event, a create file event or a remote copy event; these cannot be distinguished by content, as the different events share the same format. Table I shows the mapping between logs and events.

TABLE I
LOGS AND CORRESPONDING EVENTS

LOG             EVENT
CONNECT LOG     Connect Event
LOGIN LOG       Login Event
DOWNLOAD LOG    Download Event
UPLOAD LOG      Upload Event; Create File Event; Remote Copy Event
DELETE LOG      Delete File Event; Delete Directory Event
MKDIR LOG       Make Directory Event
RMDIR LOG       Remove Directory Event
RENAME LOG      Rename Event; Remotely Move Files Event

We collected operation logs within a software company for 3 months and selected four employees with explicitly different behavior patterns. The activity records generated were checked, and all can be considered normal behaviors, so we simulated several abnormal operations for each user as the testing dataset. Based on investigation in the enterprise, abnormal behaviors mainly fall into the four categories shown in Table II.

TABLE II
CATEGORIES OF ABNORMAL BEHAVIOR

anomaly ID    Description
anomaly 1     multiple login attempts and failures
anomaly 2     anomalous download operations
anomaly 3     anomalous delete operations
anomaly 4     operations at non-working hours

C. Feature Selection

With the activity records generated by the activity record generation component, the feature extraction component produces a feature vector for each user daily, which characterizes the pattern of the user's access to the ftp server and operations. The features are shown in Table III.

TABLE III
THE LIST OF FEATURES. 21 FEATURES ARE EXTRACTED AND USED IN TOTAL.

Feature ID    Description
1     number of total connections of the day
2     timestamp of first login attempt of the day
3     timestamp of last login attempt of the day
4     number of login successes of the day
5     number of login failures of the day
6     total download bytes
7     number of download successes of the day
8     number of download failures of the day
9     largest download bytes of the day
10    total upload bytes
11    number of upload successes of the day
12    number of upload failures of the day
13    largest upload bytes of the day
14    number of delete successes of the day
15    number of delete failures of the day
16    number of mkdir successes of the day
17    number of mkdir failures of the day
18    number of rmdir successes of the day
19    number of rmdir failures of the day
20    number of rename successes of the day
21    number of rename failures of the day

The features for a user for one day are denoted by a 21-dimension vector x = (x1, x2, ..., x21).
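As an illustration of the daily feature computation described above, the sketch below aggregates one day's activity records into a few of the 21 features. This is a hypothetical reconstruction, not the authors' code: the record fields, event names and the single-pass aggregation are all assumptions.

```python
from collections import defaultdict

def daily_features(records):
    """Aggregate one day's activity records into per-user feature dicts.

    Each record is a tuple (user, event, status, nbytes); only a subset of
    the paper's 21 features is computed here, for illustration.
    """
    feats = defaultdict(lambda: {
        "connections": 0, "login_success": 0, "login_fail": 0,
        "download_bytes": 0, "largest_download": 0,
    })
    for user, event, status, nbytes in records:
        f = feats[user]
        if event == "CONNECT":
            f["connections"] += 1
        elif event == "LOGIN":
            f["login_success" if status == "SUCCESS" else "login_fail"] += 1
        elif event == "DOWNLOAD" and status == "SUCCESS":
            f["download_bytes"] += nbytes
            f["largest_download"] = max(f["largest_download"], nbytes)
    return dict(feats)

records = [
    ("alice", "CONNECT", "SUCCESS", 0),
    ("alice", "LOGIN", "SUCCESS", 0),
    ("alice", "DOWNLOAD", "SUCCESS", 1024),
    ("alice", "DOWNLOAD", "SUCCESS", 4096),
    ("alice", "LOGIN", "FAIL", 0),
]
f = daily_features(records)["alice"]
```

In the platform itself, per the paper's strategy, such aggregation would run on each hourly batch, with the 23 stored hourly feature sets merged with the final hour at midnight.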
Fig. 2. Diagram of the OCSVM hyperplane

Fig. 3. Replicator Neural Network with three hidden layers

V. ALGORITHM

In practice, the UBA platform is fed with unlabeled data, which motivates us to use unsupervised anomaly detection techniques. It is unknown whether the training set contains abnormal data points and in what proportion, and different algorithms perform better under different conditions. For example, when the training set contains normal instances only, Replicator Neural Network and OCSVM work better, while Isolation Forest might suffer a small reduction. As a result, an ensemble of three unsupervised anomaly detection algorithms is used to improve robustness and performance.

A. One Class SVM

OCSVM, proposed by Schölkopf [22], has been applied to anomaly detection. As Fig. 2 shows, the OCSVM algorithm maps input data into a high-dimensional feature space via a kernel and iteratively finds the maximal margin hyperplane which best separates the training data from the origin.

    min_{w,ζ_i,ρ}  (1/2)||w||² + (1/(νn)) Σ_{i=1}^{n} ζ_i − ρ
    s.t.  w^T φ(x_i) ≥ ρ − ζ_i,  i = 1, ..., n                    (1)
          ζ_i ≥ 0,  i = 1, ..., n

The decision function is f(x) = sgn(w^T φ(x) − ρ). The dual problem is:

    min_α  (1/2) Σ_{i,j} α_i α_j k(x_i, x_j)
    s.t.  0 ≤ α_i ≤ 1/(νn),  i = 1, ..., n                        (2)
          Σ_{i=1}^{n} α_i = 1

After solving the dual problem, the decision function is given by:

    f(x) = sgn( Σ_{i=1}^{n} α_i K(x_i, x) − ρ )                   (3)

B. Replicator Neural Network

A Replicator Neural Network (RNN) is an artificial feed-forward multi-layer neural network with an output layer having the same number of nodes as the input layer. The purpose of a Replicator Neural Network is to produce output data as similar as possible to the input data. Fig. 3 presents the structure of a fully connected RNN with three hidden layers.

Replicator Neural Network is effective in anomaly detection as an unsupervised machine learning algorithm because anomalies are few and there exist common patterns in normal data. With a trained RNN, the common patterns representing the bulk of the data can be well reproduced, while anomalies will have a much higher reconstruction error. The reconstruction error for a d-dimensional instance x = (x1, x2, ..., xd) is computed as follows:

    e = Σ_{i=1}^{d} (x_i − y_i)²                                  (4)

in which d is the dimension of the input vector x and y = (y1, y2, ..., yd) is the reconstructed output.

C. Isolation Forest

Anomalies are few and different, and therefore they are more susceptible to isolation. Based on the concept of isolation, Isolation Forest [23] builds a set of iTrees for a given data set; anomalies are then those instances which have short average path lengths on the iTrees. For example, in Fig. 4, the red outlier (8.7, 9.2) is isolated at the first split, while the normal points marked in blue need more than 3 splits in the isolation tree.

To be specific, for a given dataset, iTrees are constructed by recursively partitioning the given training set until instances are isolated or a specific tree height is reached. There are only two variables in this method:
1) the number of trees to build, t, and
2) the sub-sampling size, ψ.
The path length h(x) of a point x is measured by the number of edges x traverses in an iTree from the root node until the traversal terminates at an external node. The anomaly score s of an instance x is defined as:

    s(x, n) = 2^(−E(h(x))/c(n))                                   (5)
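For reference, the score in Eq. (5) matches what scikit-learn's IsolationForest exposes; the sketch below assumes scikit-learn as a stand-in implementation (its `score_samples` returns the negative of the paper's s(x, n)), and the toy data is invented for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(25.0, 3.0, size=(200, 2))    # one cluster of normal points
X_test = np.vstack([X_train[:5], [[8.7, 9.2]]])   # last row: the outlier from Fig. 4

# t = 200 trees and sub-sampling size psi = 200, as in the experiment setup
forest = IsolationForest(n_estimators=200, max_samples=200, random_state=0)
forest.fit(X_train)

s = -forest.score_samples(X_test)  # s(x, n) = 2^(-E(h(x))/c(n)), in (0, 1)
flagged = s > 0.45                 # threshold eps1 = 0.45 from Section VI
```

Points far from the training mass are isolated near the root, so their score is close to 1, while deeply embedded points score nearer 0.5 or below.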

where E(h(x)) is the average of h(x) over the trained collection of isolation trees.

Fig. 4. Isolation tree generated with the data set: (a) training data; (b) generated isolation tree, in which the outlier (8.7, 9.2) is isolated at the first split

D. Ensemble and Strict Filtering

We combine the predictions of the three algorithms introduced before and apply a strict filtering strategy to predict whether a user is anomalous or not. OCSVM directly produces a label y ∈ {0, 1}. The output of RNN is a reconstruction error err ∈ ℝ, while iForest generates an anomaly score s(x, n) = 2^(−E(h(x))/c(n)) ∈ (0, 1). Labels are given by comparing the output with the corresponding threshold. Output 0 represents abnormal, while 1 represents normal.

    f_OCSVM(x; X) = { 0,  sgn(Σ_i α_i K(x_i, x) − ρ) = −1
                      1,  otherwise                               (6)

    f_iForest(x; X) = { 0,  s(x, n) > ε1
                        1,  s(x, n) ≤ ε1                          (7)

    f_RNN(x; X) = { 0,  err(x) > ε2
                    1,  err(x) ≤ ε2                               (8)

where "; X" indicates the model is trained with X as the training set. As recall is an important metric in security, we apply the strict filtering strategy and regard a data point as abnormal as long as any of the three algorithms outputs 0, as shown by formulas (9) and (10).

    f(x; X) = f_OCSVM(x; X) + f_iForest(x; X) + f_RNN(x; X)       (9)

    s(x; X) = { 0,  f(x; X) < 3
                1,  f(x; X) = 3                                   (10)

Given the kth user's historical behavior X_k = [x_k^1, x_k^2, ..., x_k^i, ..., x_k^m], in which i ∈ {1, 2, ..., m} denotes the index of days and x_k^i = (x_k^{i,1}, x_k^{i,2}, ..., x_k^{i,21})^T is the feature vector of the kth user on the ith day, and given the kth user's feature vector to be detected, denoted by x̂_k = (x̂_k^1, x̂_k^2, ..., x̂_k^{21})^T, the prediction is:

    score_k = s(x̂_k; X_k)                                        (11)

VI. EXPERIMENT AND RESULTS

A. Experiment Setup

The data preserved on the ftp server is not enough, so we perform a simulation after fitting the collected data with a polynomial distribution. Besides, small proportions (2.06% and 4.03%) of anomalous user behaviors are mixed into the training set to find out the performance when training sets are mixed with different purities. Hence three training sets are used, as Table IV shows, and the data sets are composed of five categories.

For OCSVM, we exploit LIBSVM [24] and the RBF kernel K(x_i, x_j) = exp(−γ||x_i − x_j||²), γ > 0, where γ = 0.01 and ν = 0.05 are selected as parameters.

For iForest, the number of trees is t = 200 and the sub-sampling size is ψ = 200. The threshold is ε1 = 0.45.

An RNN with 3 hidden layers is applied, and the number of neurons in each layer is [20, 8, 4, 8, 20]. The activation function tanh(z) = (e^z − e^{−z})/(e^z + e^{−z}) is selected. The threshold selected is ε2 = 1.20. Based on stochastic gradient descent, the training contains 60,000 epochs and the batch size is set to 20.

B. Results and Discussion

The detection rate of each single algorithm on the different categories of testing data is shown in Table V. Regarding abnormal data points as our focus and marking them as positive, the overall accuracy, precision and recall are shown in Table VI.

With the anomaly-free training set, OCSVM has the best performance, with recall = 100% and accuracy = 96.72%; all of the anomalies can be detected. With more anomaly points in the training set, the performance of OCSVM gets worse. With 4.03% anomaly points, the recall is 75.86% and the accuracy is 81.39%.
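The strict filtering of Eqs. (6)-(10) reduces to a unanimous "normal" vote: a user is reported abnormal unless all three detectors agree the behavior is normal. A minimal sketch follows; the detector outputs are stubbed, since only the combination rule is being shown, and the thresholds follow the experiment setup in Section VI.

```python
EPS1 = 0.45   # iForest anomaly-score threshold (epsilon_1)
EPS2 = 1.20   # RNN reconstruction-error threshold (epsilon_2)

def f_ocsvm(label):        # Eq. (6): OCSVM already outputs 0/1
    return label

def f_iforest(s):          # Eq. (7): 0 = abnormal, 1 = normal
    return 0 if s > EPS1 else 1

def f_rnn(err):            # Eq. (8)
    return 0 if err > EPS2 else 1

def ensemble(label, s, err):
    """Strict filtering, Eqs. (9)-(10): abnormal unless all three say normal."""
    votes = f_ocsvm(label) + f_iforest(s) + f_rnn(err)
    return 1 if votes == 3 else 0   # 1 = normal, 0 = abnormal

verdict = ensemble(label=1, s=0.49, err=0.5)  # -> 0: iForest alone fires
```

This bias toward flagging is deliberate: a single dissenting detector is enough to surface a user for review, which trades precision for the recall that security analysts prioritize.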
TABLE IV
COMPOSITION OF TRAINING SETS AND TESTING SET. TRAINING SETS WITH 0%, 2.06% AND 4.03% OF ANOMALIES MIXED IN ARE USED.

dataset          normal   anomaly 1   anomaly 2   anomaly 3   anomaly 4   total   anomalies proportion
training set 1   1000     0           0           0           0           1000    0.00%
training set 2   1000     6           6           3           6           1021    2.06%
training set 3   1000     12          12          6           12          1042    4.03%
testing set      100      45          47          32          50          274     63.50%

TABLE V
DETECTION RATE OF OCSVM, RNN AND iFOREST WITH DIFFERENT TRAINING SETS

             training set 1 (0.00%)           training set 2 (2.06%)           training set 3 (4.03%)
category     OCSVM     RNN       iForest      OCSVM     RNN       iForest      OCSVM     RNN       iForest
normal       91.00%    92.00%    51.00%       91.00%    88.00%    87.00%       91.00%    92.00%    87.00%
anomaly 1    100.00%   100.00%   100.00%      93.33%    100.00%   91.11%       82.22%    62.22%    77.78%
anomaly 2    100.00%   100.00%   100.00%      79.59%    97.87%    68.09%       89.36%    55.32%    65.96%
anomaly 3    100.00%   100.00%   100.00%      100.00%   100.00%   68.75%       100.00%   81.25%    100.00%
anomaly 4    100.00%   92.00%    78.00%       54.00%    90.00%    38.00%       42.00%    92.00%    36.00%

TABLE VI
ACCURACY, PRECISION AND RECALL OF OCSVM, RNN AND iFOREST WITH DIFFERENT TRAINING SETS

             training set 1 (0.00%)               training set 2 (2.06%)               training set 3 (4.03%)
algorithm    accuracy   precision   recall       accuracy   precision   recall       accuracy   precision   recall
OCSVM        96.72%     95.08%      100.00%      83.58%     93.88%      79.31%       81.39%     94.29%      75.86%
RNN          95.62%     95.51%      97.70%       93.43%     93.33%      96.55%       79.56%     94.03%      72.41%
iForest      78.10%     76.89%      93.68%       73.58%     89.76%      65.52%       74.09%     89.92%      66.67%
ensemble     91.60%     88.32%      100.00%      90.88%     89.84%      96.55%       91.24%     93.10%      93.10%
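For clarity on the metric convention used above (abnormal, label 0, is the positive class), the small helper below computes the same three measures; the toy labels are invented for illustration and are not the paper's data.

```python
def metrics(y_true, y_pred):
    """Accuracy, precision and recall with abnormal (0) as the positive class.

    Labels follow the paper's convention: 0 = abnormal, 1 = normal.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Toy check: 4 abnormal, 2 normal; one abnormal missed, one normal flagged.
acc, prec, rec = metrics([0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 0, 1])
# accuracy 4/6, precision 3/4, recall 3/4
```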

RNN has a similar performance and trend to OCSVM. Fig. 5 shows the mean reconstruction error e = Σ_{i=1}^{d} (x_i − y_i)² of the 5 categories of test data during RNN training with 2.06% anomalies in the training set. The training process converged, and abnormal data has a higher reconstruction error, which makes separation possible. The mean reconstruction error of the normal data is 0.539, much lower than that of the abnormal data in the testing dataset (4.947, 3.887, 8.627, 2.409). However, the time cost of training is much higher than for the other algorithms.

Fig. 5. Mean reconstruction error during training with Replicator Neural Network (curves for the training set and the normal and abnormal 1-4 testing sets over 60,000 training steps)

Isolation Forest has the worst performance of the three algorithms, as Table V and Table VI show. With training set 2, the anomaly scores of the test data from Isolation Forest are presented in Fig. 6. The normal data has a lower mean anomaly score (0.390) than the anomaly data (0.490, 0.487, 0.475, 0.442). However, the scores of some data are pretty close, as Fig. 6 shows, especially the data of operations at non-working hours. In each category of anomaly data we simulated, the data is anomalous in only a few dimensions. At the training stage, the attribute and split point are randomly selected, so statistically it is hard to split data on the anomalous attributes before the tree goes deep. However, if the training set has more complicated and real anomalies, iForest can perform better. Besides, the threshold ε1 can be adjusted flexibly, which is a valuable characteristic. In addition, iForest did not suffer an obvious reduction with more anomalies mixed into the training set.

Table VI shows that the ensemble and strict filtering strategy improve robustness and performance, especially recall. When anomalies in the training set increase, the algorithms alone are less reliable. With 2.06% anomalies in the training set, the ensemble gives recall = 96.55% and accuracy = 90.88%. With 4.03% anomalies in the training set, RNN has recall = 72.41% and accuracy = 79.56%, while the ensemble has great performance with recall = 93.10% and accuracy = 91.24%. It can be a good and
optional strategy, especially when security analysts focus on recall.

Fig. 6. Anomaly scores of Isolation Forest for test data of different categories: (a) testing set - normal; (b)-(e) testing set - abnormal 1-4

VII. CONCLUSION AND FUTURE WORK

This paper presents an overview of a UBA architecture and platform for detecting anomalous user behaviors within an enterprise. The platform, composed of four components working independently, is suitable for running on distributed platforms. The anomaly detection component contains an ensemble of OCSVM, RNN and Isolation Forest. A strict filtering strategy is applied and can improve the performance and robustness no matter whether there exist anomalies in the training set.

The sequences of events contain valuable information about users, and we will focus on anomaly detection for sequence data. Besides, peer group analysis, which may play an important role in practice, can be introduced into the UBA platform in the future.

REFERENCES

[1] New Haystax Technology survey shows most organizations ill-prepared for insider threats. https://haystax.com/blog/2017/03/29/new-haystax-technology-survey-shows-most-organizations-ill-prepared-for-insider-threats/ (accessed December, 2017).
[2] A. P. Moore, K. A. Kennedy, and T. J. Dover, "Introduction to the special issue on insider threat modeling and simulation," Computational and Mathematical Organization Theory, vol. 22, no. 3, pp. 261-272, 2016.
[3] D. M. Cappelli, A. P. Moore, and R. F. Trzeciak, The CERT Guide to Insider Threats: How to Prevent, Detect, and Respond to Information Technology Crimes (Theft, Sabotage, Fraud). Addison-Wesley, 2012.
[4] U.S. Attorney's Office. Fannie Mae corporate intruder sentenced to over three years in prison for attempting to wipe out Fannie Mae financial data. https://archives.fbi.gov/archives/baltimore/press-releases/2010/ba121710.htm/ (accessed December, 2017).
[5] F. I. P. Bureau, "Unintentional insider threats: A foundational study," 2013.
[6] M. Uma and G. Padmavathi, "A survey on various cyber attacks and their classification," IJ Network Security, vol. 15, no. 5, pp. 390-396, 2013.
[7] A. L. Buczak and E. Guven, "A survey of data mining and machine learning methods for cyber security intrusion detection," IEEE Communications Surveys & Tutorials, vol. 18, no. 2, pp. 1153-1176, 2016.
[8] A. Shabtai, Y. Elovici, and L. Rokach, A Survey of Data Leakage Detection and Prevention Solutions. Springer Science & Business Media, 2012.
[9] Y. Amichai-Hamburger and G. Vinitzky, "Social network use and personality," Computers in Human Behavior, vol. 26, no. 6, pp. 1289-1295, 2010.
[10] E. Agichtein, E. Brill, and S. Dumais, "Improving web search ranking by incorporating user behavior information," in Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2006, pp. 19-26.
[11] T. Bussa, A. Litan, and T. Phillips, "Market guide for user and entity behavior analytics," https://www.gartner.com/doc/3538217/market-guide-user-entity-behavior (accessed 29.07.2017), 2016.
[12] W. Ma, "User behavior pattern based security provisioning for distributed systems," Ph.D. dissertation, 2016.
[13] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection: A survey," ACM Computing Surveys (CSUR), vol. 41, no. 3, p. 15, 2009.
[14] D. J. Weston, D. J. Hand, N. M. Adams, C. Whitrow, and P. Juszczak, "Plastic card fraud detection using peer group analysis," Advances in Data Analysis and Classification, vol. 2, no. 1, pp. 45-62, 2008.
[15] M. Ahmed, A. N. Mahmood, and M. R. Islam, "A survey of anomaly detection techniques in financial domain," Future Generation Computer Systems, vol. 55, pp. 278-288, 2016.
[16] K. Veeramachaneni, I. Arnaldo, V. Korrapati, C. Bassias, and K. Li, "AI²: Training a big data machine to defend," in 2016 IEEE 2nd International Conference on Big Data Security on Cloud (BigDataSecurity), IEEE International Conference on High Performance and Smart Computing (HPSC), and IEEE International Conference on Intelligent Data and Security (IDS). IEEE, 2016, pp. 49-54.
[17] M. Shashanka, M.-Y. Shen, and J. Wang, "User and entity behavior analytics for enterprise security," in 2016 IEEE International Conference on Big Data (Big Data). IEEE, 2016, pp. 1867-1874.
[18] A. Sapegin, A. Amirkhanyan, M. Gawron, F. Cheng, and C. Meinel, "Poisson-based anomaly detection for identifying malicious user behaviour," in International Conference on Mobile, Secure and Programmable Networking. Springer, 2015, pp. 134-150.
[19] W. Ma, K. Sartipi, and D. Bender, "Knowledge-driven user behavior pattern discovery for system security enhancement," International Journal of Software Engineering and Knowledge Engineering, vol. 26, no. 03, pp. 379-404, 2016.
[20] Y. Li, T. Zhang, Y. Y. Ma, and C. Zhou, "Anomaly detection of user behavior for database security audit based on OCSVM," in 2016 3rd International Conference on Information Science and Control Engineering (ICISCE). IEEE, 2016, pp. 214-219.
[21] Elasticsearch. https://www.elastic.co/products/elasticsearch (accessed December, 2017).
[22] B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt, "Support vector method for novelty detection," in Advances in Neural Information Processing Systems, 2000, pp. 582-588.
[23] F. T. Liu, K. M. Ting, and Z.-H. Zhou, "Isolation forest," in 2008 Eighth IEEE International Conference on Data Mining (ICDM'08). IEEE, 2008, pp. 413-422.
[24] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27:1-27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.