ID 429 Anodot Ultimate Guide To Building A Machine Learning Outlier Detection System Part II
ULTIMATE GUIDE TO BUILDING A MACHINE LEARNING OUTLIER DETECTION SYSTEM
Part II: Learning Normal Time Series Behavior
INTRODUCTION
Outlier detection helps companies determine when something changes in their normal business patterns. When done well, it can give a company the insight it needs to investigate the root cause of the change, make decisions, and take actions that can save money (or prevent losing it) and potentially create new business opportunities.

High-velocity online businesses need real-time outlier detection; waiting for days or weeks after the outlier occurs is simply too late to have a material impact on a fast-paced business. This puts constraints on the system to learn to identify outliers quickly, even if there are a million or more relevant metrics and the underlying data patterns are complicated.

Outlier detection is an imperative for online businesses today; however, building an effective system in-house is a complex task. It is a particular challenge to [...]
2 Ultimate guide to building a machine learning outlier detection system. Part II.
The techniques described within this paper are well grounded in data science principles and have been adapted or utilized extensively by the mathematicians and data scientists at Anodot. The veracity of these techniques has been proven in practice across hundreds of millions of metrics from Anodot’s large customer base. A company that wants to create its own automated outlier detection system would encounter challenges like those described within this document.
A GENERAL FRAMEWORK FOR LEARNING NORMAL BEHAVIOR
The general process of any outlier detection method
is to take data, learn what is normal, and then apply a
statistical test to determine whether any data point for
the same time series in the future is normal or abnormal.
GENERAL SCHEME
1. Model the normal behavior of the metric(s) using a statistical model.
2. Devise a statistical test to determine if samples are explained by the model.
3. Apply the test for each sample. Flag as anomaly if it does not pass the test.
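The three steps of the general scheme can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the function names, the sample data, and the choice of a mean/standard-deviation model with a three-sigma test are all assumptions made for the example.

```python
from statistics import mean, stdev

def fit_normal_model(history):
    """Step 1: model normal behavior (here, only a mean and a standard deviation)."""
    return {"mean": mean(history), "std": stdev(history)}

def passes_test(model, sample, k=3.0):
    """Step 2: statistical test. Is the sample within k standard deviations?"""
    return abs(sample - model["mean"]) <= k * model["std"]

def flag_anomalies(model, samples, k=3.0):
    """Step 3: apply the test to each sample and return those that fail."""
    return [x for x in samples if not passes_test(model, x, k)]

# Made-up history of a metric and three new samples to check.
history = [100, 102, 98, 101, 99, 103, 97, 100]
model = fit_normal_model(history)
anomalies = flag_anomalies(model, [101, 96, 130])
```

Any other model of normal behavior could be substituted in step 1, as long as step 2 provides a matching test.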
The graph below is a normal distribution represented by an average and a standard deviation. Given a large number of data points, 99.7% of the data points submitted should fall within the average, plus or minus three times the standard deviation. This model is illustrated with the formula in Figure 2.

Making this assumption means that if the data comes from this known distribution, then 99.7% of the data points should fall within these bounds. If a data point is outside these bounds, it can be called an outlier because the probability of it happening normally is very small.

This is a very simple model to use and to estimate. It is well known and taught in basic statistics classes, requiring only computation of the average and the standard deviation. However, assuming any type of data will behave like the normal distribution is naïve; most data does not behave this way. This model is, therefore, simple to apply, but usually much less accurate than other models.

There are many different distributions that can be assumed on data; however, given a very large dataset, there are most likely many different types of behavior in the data. This very fact has been thoroughly researched for hundreds of years, and even more so in the last 50 years as data science has become important in the computing world. But the question is, given a huge amount of literature, techniques and models to choose from, how can someone choose only one model?

The answer is that it is not possible to choose just one.

At Anodot, we look at a vast number of time series data and see a wide variety of data behaviors, many kinds of patterns, and diverse distributions that are inherent to all possible metrics. There has to be some way to classify each signal to decide which should be modeled with a normal distribution, and which should be modeled with a different type of distribution and technique.
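For reference, the three-sigma rule described above can be written out as follows. The symbols (x_i for the observed data points, N for their count, mu and sigma for the estimates) are notation chosen for this sketch; the original figure is not reproduced here.

```latex
% Estimates of the average and the standard deviation from N data points
% (dividing by N - 1 instead of N gives the unbiased sample estimate).
\[
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i,
\qquad
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2}
\]

% For normally distributed data, about 99.7% of points fall inside the bounds.
\[
P\left(\mu - 3\sigma \le x \le \mu + 3\sigma\right) \approx 0.997
\]

% A new data point x is therefore called an outlier when
\[
\lvert x - \mu \rvert > 3\sigma
\]
```

The 0.3% probability of falling outside the bounds holds only if the data really is normally distributed, which is the assumption the text warns against.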
Choosing just one model does not work, and we
have seen it even within a single company when they
measure many different metrics. Each metric behaves
differently. In Part I of this document series, we used
the example of a person’s vital signs operating as
a complete system. Continuing with that example,
the technique for modeling the normal behavior of
a person’s heart rate may be very different from that
which models his or her temperature reading.
A SINGLE MODEL DOES NOT FIT ALL METRICS
In the Anodot system, every dataset that comes in goes [...] are applied on metrics that are not smooth, the result will either be a lot of false-positives or there will be many outliers that are not detected (i.e., false-negatives), [...]
Figure 4. Panels: Smooth (stationary), Irregular sampling, Discrete, “Step”.
Knowing what a data pattern looks like in order to apply an appropriate model is a very complex task.

If a company has 10 metrics, it is possible to graph the data points with a statistician. With only 10 metrics, this is feasible to do manually; however, with many thousands or millions of metrics, there is no practical way to do this manually. The company would have to design an algorithm that would determine the proper data model to use for each metric.

There is another aspect we have observed quite often with the data we see from our customers: the model that is right today may not be right tomorrow. In Figure 5, we see how a metric’s behavior can change overnight.

Let us consider how this affects the company building its own detection system. The company’s data scientist will spend several weeks classifying the data for the company’s 1,000 metric measurements and make a determination for a metric model. It could be that a week from now, what the data scientist did in classifying the model is irrelevant for some of them, but it may not be clear for which ones.

What is needed, then, is an automated process that constantly looks at the changing nature of data signals and decides what the right model is for the moment. It is not static.
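To make the idea of automatically choosing a model per metric more concrete, here is a deliberately simple, hypothetical classifier. It is not Anodot's classification algorithm; the function name, the labels, the thresholds (`max_levels`, `jitter_tol`) and the sample data are assumptions chosen only to illustrate the kind of decision such a process has to make.

```python
from statistics import mean, pstdev

def classify_metric(timestamps, values, max_levels=5, jitter_tol=0.25):
    """Hypothetical heuristic: label a metric so a suitable model can be chosen."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    avg_gap = mean(gaps)
    # Irregular sampling: the spacing between samples varies a lot.
    if avg_gap > 0 and pstdev(gaps) / avg_gap > jitter_tol:
        return "irregular_sampling"
    # Discrete: the metric only ever takes a handful of distinct values.
    if len(set(values)) <= max_levels:
        return "discrete"
    # Otherwise treat the metric as smooth.
    return "smooth"

# Made-up examples: samples every 60 seconds, then unevenly spaced samples.
regular_ts = list(range(0, 1200, 60))
smooth_label = classify_metric(regular_ts, [50 + 3 * ((t * 7) % 11) for t in range(20)])
discrete_label = classify_metric(regular_ts, [t % 2 for t in range(20)])
irregular_label = classify_metric([0, 1, 2, 10, 11, 30, 31, 32, 60], list(range(9)))
```

Because the right model can change over time, a classifier like this would have to be re-run on recent data continuously rather than once.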
THE IMPORTANCE OF MODELING SEASONALITY
Another important aspect that should be included in the algorithms and the model is whether the data has seasonal patterns and what the seasonal periods are. A [...]
Often we see not just a single seasonal pattern, but multiple seasonal patterns and even different types of multiple seasonal patterns, like the two examples shown in Figure 7.

Figure 7 shows an example of a real metric with two seasonal patterns working together at the same time. In this case, they are weekly and daily seasonal patterns. The image shows that Fridays and weekends tend to be lower, while the other days of the week are higher. There is a pattern that repeats itself week after week, so this is the weekly seasonal pattern. There is also a daily seasonal pattern that illustrates the daytime hours and nighttime hours; the pattern tends to be higher during the day and lower during the night.

These two patterns are intertwined in a complicated way. There is almost a sine wave for the weekly pattern, and another faster wave for the daily pattern. In signal processing, this is called amplitude modulation, and it is normal for this metric. If we do not account for the fact that these patterns co-exist, then we do not know what normal is. If we know how to detect it and take it [...] ones shown in orange in Figure 7 above. The values in orange indicate a drop in activity which may be normal on a weekend but not on a weekday. If we do not know how to distinguish between these patterns, we will not understand the outlier, so we either miss it or we create false-positives.
Figure 8 below shows an example of another type of
multiple seasonal patterns—one with additive signals.
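The difference between the two kinds of multiple seasonality can be illustrated with synthetic signals. The sketch below assumes hourly samples and uses made-up amplitudes and a made-up base level; it does not reproduce the metrics in the figures.

```python
import math

HOURS_PER_DAY = 24
HOURS_PER_WEEK = 168

def daily(t):
    """Daily cycle, t in hours."""
    return math.sin(2 * math.pi * t / HOURS_PER_DAY)

def weekly(t):
    """Weekly cycle, t in hours."""
    return math.sin(2 * math.pi * t / HOURS_PER_WEEK)

def modulated(t, base=100.0):
    # The weekly pattern scales the size of the daily swing (amplitude modulation).
    return base + (10.0 + 5.0 * weekly(t)) * daily(t)

def additive(t, base=100.0):
    # The weekly and daily patterns are simply summed.
    return base + 10.0 * daily(t) + 5.0 * weekly(t)

# Two daily peaks on different days of the week have different heights
# in the modulated signal, yet the whole signal repeats every 168 hours.
peak_early_week = modulated(6)
peak_late_week = modulated(102)
```

In both cases a model that only knows about the daily cycle would misjudge what is normal on a given day of the week.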
CAN A SEASONAL PATTERN BE ASSUMED?
At Anodot, we have observed millions of metrics and built algorithms that detect the seasonal patterns, if any, that exist in them. Some of them (in fact, most of them) do not have a seasonal pattern. Out of millions of metrics that Anodot has seen, about 14% have a season to them, meaning 86% of the metrics have no season at all. Out of the metrics with a seasonal pattern, we have observed that 70% had a 24-hour pattern, and 26% had weekly patterns. The remainder of the metrics with a seasonal pattern had other types of patterns: four hours, six hours, and so on.

Two problems with assuming a seasonal pattern:
• May require too many data points to obtain a reasonable baseline
• May produce a poor normal model
Second, if the wrong seasonal pattern is assumed, the resulting normal model may be completely off. For example, if the data is assumed to have a daily seasonal pattern, but it actually has a 7-hour pattern, then comparing 8 AM one day to 8 AM another day is not relevant. We would need to compare 8 AM one [...]
Figure 9. Comparing a 7-hour seasonal pattern with an assumed 24-hour seasonal pattern.
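A small experiment shows why the assumed period matters. The sketch below builds a synthetic metric with a 7-hour cycle (the pattern values are made up) and measures how well each point is predicted by the point one assumed period earlier.

```python
def seasonal_naive_error(series, period):
    """Mean absolute difference between each point and the point one `period` earlier."""
    diffs = [abs(series[t] - series[t - period]) for t in range(period, len(series))]
    return sum(diffs) / len(diffs)

pattern = [10, 12, 15, 20, 15, 12, 10]               # one 7-hour cycle (made-up values)
series = [pattern[t % 7] for t in range(24 * 14)]    # two weeks of hourly samples

error_assuming_24h = seasonal_naive_error(series, 24)  # compares mismatched phases
error_assuming_7h = seasonal_naive_error(series, 7)    # compares matching phases
```

With the correct 7-hour period the comparison is exact, while the 24-hour assumption compares points that sit at different phases of the cycle (24 is not a multiple of 7), so perfectly normal values look like deviations.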
EXAMPLE METHODS TO DETECT SEASONALITY
Now that we have established the importance of determining if seasonality is present in the data, we will [...]

One method uses the Fourier transform of signals, a technique in mathematics that takes a signal, [...] frequencies that are local maximums (peaks) in the power of the Fourier transform. Those peaks tend to occur where there are seasonal patterns. This technique [...] there are multiple seasonal patterns. Additionally, it is [...] weekly, monthly or yearly, and this technique is very sensitive to any missing data. Also, issues like aliasing in the Fourier transform can cause multiple peaks to be present, some of which are not the actual seasonal frequency, but rather artifacts of the Fourier transform computation.

Another technique is the autocorrelation function (ACF) of a signal, also known as serial correlation: the correlation of a signal with itself at different points in time. Informally, it is the similarity between observations as a function of the time lag between [...] patterns. Compared to the Fourier transform method, it is more accurate and less sensitive to missing data, but it is computationally expensive.

Anodot developed a proprietary algorithm which we call Vivaldi (patent pending). At a high level, Vivaldi implements detection using the ACF method, but overcomes its shortcomings by applying a smart subsampling technique, computing only a small subset [...] complexity. In addition, to accurately identify multiple seasonal patterns, the method is applied on multiple [...] been proven to be accurate both theoretically and empirically, while very fast to compute.
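The two textbook approaches can be sketched with the Python standard library. This is a naive illustration on a clean synthetic signal, not the Vivaldi algorithm (which is proprietary and not described in enough detail here to reproduce). The function names, the peak-picking rule, and the test signal are assumptions.

```python
import cmath
import math

def autocorrelation(series, lag):
    """Correlation of the series with itself shifted by `lag` samples."""
    n = len(series)
    mu = sum(series) / n
    var = sum((x - mu) ** 2 for x in series)
    cov = sum((series[t] - mu) * (series[t + lag] - mu) for t in range(n - lag))
    return cov / var

def period_by_acf(series, max_lag):
    """Return the lag of the strongest local peak in the ACF, or None if there is none."""
    acf = [autocorrelation(series, lag) for lag in range(max_lag + 1)]
    peaks = [lag for lag in range(1, max_lag) if acf[lag - 1] < acf[lag] >= acf[lag + 1]]
    return max(peaks, key=lambda lag: acf[lag]) if peaks else None

def period_by_fourier(series):
    """Return n / k for the frequency bin k with the most power (naive O(n^2) DFT)."""
    n = len(series)
    mu = sum(series) / n
    centered = [x - mu for x in series]

    def power(k):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(centered))
        return abs(coeff) ** 2

    best_k = max(range(1, n // 2 + 1), key=power)
    return n / best_k

# Synthetic hourly metric with a clean 24-hour cycle, ten days long.
series = [math.sin(2 * math.pi * t / 24) for t in range(240)]
acf_period = period_by_acf(series, max_lag=60)
fourier_period = period_by_fourier(series)
```

Both functions recover the 24-sample period on this idealized input. Real metrics with noise, gaps, or several seasonal patterns are where the weaknesses described above show up, and neither sketch attempts to handle them.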
DETECTION AT SCALE REQUIRES ADAPTIVE LEARNING ALGORITHMS
Companies that want immediate insight to changes in
We can contrast an online learning model to a model that uses data in batch mode. For example, a video surveillance system that needs to recognize human images will learn to recognize faces by starting with a dataset of a million pictures that includes faces and non-faces. It learns what a face is and what a non-face is in batch mode before it starts receiving any real data points.

In the online learning paradigm, the machine never iterates over the data. It gets a single data point, learns what it can from it, and then throws it away. It gets another data point, learns what it can from it, throws it away, and so on. The machine never goes back to previously used data to relearn things; this is similar to how our brains learn. When we encounter something, we learn what we can from it and move on, rather than storing it for later use.

An online adaptive learning algorithm works by initializing a model of what is normal. It takes a new data point in the next second, minute, hour or whatever timeframe is appropriate. First, the machine tests if the current data point is an outlier or not, based on what it already knows. If it marks the data point as not being an outlier, then it updates the current model about what is normal based on that data point. And then it repeats the process as individual new data points come in sequentially.

The machine never goes back to previously viewed data points to put them into a current context. The machine cannot say, “Based on what I see now, I know [...]

There are various examples of online adaptive learning models that learn the normal behavior of time series data that can be found in data science, statistics and signal processing literature. Among them are Simple Moving Average, Double/Triple Exponential Smoothing (Holt-Winters) and Kalman Filters + ARIMA and variations.

The following is an example of how a simple moving average is calculated and how it is applied to outlier detection. We want to compute the average over a time series, but we do not want the average from the beginning of time until present. Instead, we want the average during a window of time because we know we need to be adaptive and things could change over time. In this case, we have a moving average with a window size of seven days, and we measure the metric every day. For example, we look at the stock price at the end of every trading day. The simple moving average would compute the average of the stock price over the last seven days. Then we compare tomorrow’s value to [...] outlier and if not, then it is not an outlier. Using a simple moving average is a straightforward way of considering whether we have an outlier or not.

The other models listed above are (much) more complex versions of that but, if one can understand a simple moving average, then the other models can be understood as well.
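The test-then-update loop and the seven-day moving average can be combined into a small sketch. The class name, the prices, and the decision rule (a point is an outlier when it is more than `k` standard deviations of the window away from the window average) are assumptions for illustration; the text does not specify the exact comparison.

```python
from collections import deque
from statistics import mean, pstdev

class MovingAverageDetector:
    """Online detector: keeps only the last `window` normal points."""

    def __init__(self, window=7, k=3.0):
        self.window = deque(maxlen=window)
        self.k = k

    def observe(self, value):
        """Test the new point against the current model, then update the model."""
        is_outlier = False
        if len(self.window) == self.window.maxlen:
            baseline = mean(self.window)      # the simple moving average
            spread = pstdev(self.window)      # assumed measure of normal variation
            is_outlier = abs(value - baseline) > self.k * spread
        if not is_outlier:
            self.window.append(value)         # only normal points update the model
        return is_outlier

# Made-up daily closing prices; the ninth value is a jump.
prices = [100, 101, 99, 100, 102, 98, 100, 101, 130, 99]
detector = MovingAverageDetector(window=7)
flags = [detector.observe(p) for p in prices]
```

Each data point is seen once and the detector holds at most seven values, which keeps the memory and computation per metric constant regardless of how long the series runs.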
THE IMPACT OF LEARNING PITFALLS
All of these adaptive online algorithms have some notion of learning rate. In the stock price example, we looked at the average value over the last seven days of the stock price and then compared the next day to that value. In this example, the seven-day period is a parameter known as the “learning rate.” Why not 30 days? Why not 180 days? The shorter we make the window, the more of an effect each daily data point has on the moving average. If we make it a moving average of the last three days, it will learn any changes that happen faster. If we make it 365 days, then it will learn very slowly because every day will have a very small effect on that average.

If our learning rate is too slow, meaning our moving average window is very large, then we would adapt very slowly to any changes in that stock price. If there are big changes in the stock price, then the baseline, the [...] be very large, and we will be very insensitive to changes.

If we make the rate too fast (i.e., the window is very small), then we will adapt too quickly and we might miss things. We might think that outliers are not outliers because we are adapting to them too quickly.

These scenarios are depicted in Figure 10 below.
Figure 10. Panels: Too slow, Too fast.
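The effect of the window size can be seen on a synthetic series with a sudden level shift. The data and the two window sizes below are made up for illustration.

```python
def moving_average(series, window):
    """Average of the last `window` points available at each step."""
    out = []
    for t in range(len(series)):
        recent = series[max(0, t - window + 1): t + 1]
        out.append(sum(recent) / len(recent))
    return out

# Made-up metric: flat at 100, then a jump to 150 at t = 40.
series = [100.0] * 40 + [150.0] * 10
fast = moving_average(series, 3)    # small window, fast learning rate
slow = moving_average(series, 30)   # large window, slow learning rate
```

Three samples after the jump the fast baseline has already reached 150, so the rest of the elevated period no longer looks unusual. Ten samples after the jump the slow baseline has only moved to about 116.7, so it still treats the new level as far from normal.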
How do we know what the learning rate should be? If we have a small number of time series (100 or fewer), we could inspect and change the parameters as needed. However, a manual method will not work when we have a large number of time series, so the algorithms need to automatically tune themselves.

There are many different metrics and each one has its own behavior. The rate at which these metrics change could be fast or slow depending on what they are; there is no one set of parameters that [...] necessary to provide an accurate baseline for millions of metrics. This is something that is often overlooked by companies building outlier detection systems (incidentally, auto-tuning is built into the Anodot system). Auto-tuning is not an easy task, but it is an important one for achieving more accurate results.

There is another pitfall to be aware of. If we have a [...]

Continuing the example of learning the stock price model using the moving average method, if we include an anomalous data point in the learning process, the stock price now becomes anomalous as well. If we use it the next day to compute the next moving average, then we completely shift the average toward that outlier. Is that okay or not okay? Good or bad? What happens in reality is, if we allow it to shift the average or shift the parameters of the model as usual, then if the outlier persists beyond that single data point, we will start shifting the normal behavior towards that outlier. If the outlier lasts for a while, then at some point we will say this is the new normal, and we might even miss other outliers that come along later. Or, whenever it goes back to normal, we will say that is an outlier as well.

Updating the model with every data point (including abnormal ones) is one strategy, but it is not a very good one.
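A short worked example shows how much a single anomalous value moves a seven-day average when it is included in the model. The prices are made up.

```python
from statistics import mean

normal_week = [100, 101, 99, 100, 102, 98, 100]   # made-up closing prices
baseline_before = mean(normal_week)

# An anomalous close of 170 enters the window and the oldest point drops out.
contaminated_week = normal_week[1:] + [170]
baseline_after = mean(contaminated_week)
```

The baseline moves from 100 to 110 because of one data point, and it stays inflated until that point leaves the window seven days later.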
ADAPTING THE LEARNING RATE
A better strategy is to adapt the learning rate by
assigning weight to the validity of the data points, with
an outlier carrying a lower weight than a normal value.
This is a tactic that Anodot uses. Whenever Anodot
sees that a data point is an outlier, the system assigns
that value a very low weight when adapting the model
parameters.
Another example would be a merger. One company acquires another company, the stock price goes up or down and it may stay at that new value for a long time. The valuation of the company has changed quite suddenly, and the system eventually needs to adapt to that new valuation.

In the online world, these types of changes happen a lot. For example, a company has a Web application and after a large marketing campaign, the number of users quickly increases 25 percent. If the campaign was good, the number of users may stay elevated for the long term. When a SaaS company adds a new customer, its application metrics will jump, and that is normal. They might want to know about that outlier in the beginning, but then they will want the outlier detection system to learn the new normal.

These kinds of events happen frequently; we must not ignore them by not allowing those data points to affect anything from now until eternity. On the other hand, we do not want the system to learn too quickly, otherwise all outliers will be very short, and if it goes back to the previous normal state, measurements will be off. There [...] adaptive we are.

In the Anodot system, when we see outliers, we adapt the learning rate in the model by giving the anomalous data points a lower weight. If the outlier persists for a long enough time, we begin to apply higher and higher weights until the anomalous data points have a normal weight like any other data point, and then we model to that new state. If it goes back to normal, then nothing happens; it just goes back to the previous state and everything is okay.

These two approaches to updating the learning rate are shown below. In Figure 11, the model is updated without weighting the outliers. In this instance, most of the outlier is actually missed by the model being created.

In Figure 12, outliers are weighted differently to minimize their impact on normal, unless it becomes apparent that the outliers are the new normal. This method allows for the outlier to be fully captured.
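The weighting idea can be sketched with an exponentially weighted baseline. This is an illustrative sketch only, not Anodot's implementation. The class name, the fixed tolerance used as the outlier test, the starting weight of 0.01 and the doubling rule are all assumptions.

```python
class WeightedBaseline:
    """Baseline that down-weights outliers, raising the weight if they persist."""

    def __init__(self, initial, alpha=0.2, tolerance=10.0, min_weight=0.01, growth=2.0):
        self.baseline = initial
        self.alpha = alpha              # learning rate for normal points
        self.tolerance = tolerance      # assumed outlier test: distance from baseline
        self.min_weight = min_weight    # weight given to the first outlier in a run
        self.growth = growth            # weight multiplier while the outlier persists
        self.streak = 0                 # number of consecutive outliers seen so far

    def update(self, value):
        is_outlier = abs(value - self.baseline) > self.tolerance
        if is_outlier:
            self.streak += 1
            weight = min(1.0, self.min_weight * self.growth ** (self.streak - 1))
        else:
            self.streak = 0
            weight = 1.0
        self.baseline += self.alpha * weight * (value - self.baseline)
        return is_outlier

def run(values, start=100.0):
    model = WeightedBaseline(start)
    flags = [model.update(v) for v in values]
    return model.baseline, flags

# A two-point spike barely moves the baseline; a persistent shift is eventually learned.
spike_baseline, _ = run([150.0, 150.0, 100.0])
shift_baseline, shift_flags = run([150.0] * 40)
```

With these made-up parameters, the short spike leaves the baseline within a fraction of a unit of 100, while the persistent shift is flagged at first and then gradually absorbed as the new normal.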
OTHER METHODS FOR LEARNING NORMAL BEHAVIORAL PATTERNS
This paper covers online adaptive learning methods [...] Part I of this document series. These are the methods that Anodot has selected for its solution; however, there are other methods for learning normal behaviors in data patterns. We summarize them in the table below according to [...] Part I.
GMM No No No Both
DBScan No No No Multivariate
K-Means No No No Multivariate
SUMMARY
This document outlines a general framework for
learning normal behavior in a time series of data.
This is important because any outlier detection needs
a model of normal behavior to determine whether a
new data point is normal or abnormal.