Introduction to Machine Learning Research on Time Series

Umaa Rebbapragada
Tufts University
Advisor: Carla Brodley
1/29/07

Machine Learning (ML)

Originally a subfield of AI
Extraction of rules and patterns from data sets
Focused on:
  Computational complexity
  Memory

Machine Learning Tasks for Time Series

Classification
Clustering
Semi-supervised learning
Anomaly detection

Univariate time series
Time series databases

Single Time Series

A single long time series can be converted into a set of smaller time series by sliding a window incrementally across it.

Window length is usually a user-specified parameter.
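
The windowing step above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function name `sliding_windows` and the `step` parameter (how far the window moves each time) are assumptions introduced here.

```python
def sliding_windows(series, window, step=1):
    """Return every length-`window` subsequence, advancing `step` points at a time.

    `window` is the user-specified window length; `step` is a hypothetical
    extra parameter (step=1 gives the fully overlapping case).
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    last_start = len(series) - window  # negative when the series is too short
    return [series[i:i + window] for i in range(0, last_start + 1, step)]


series = [1, 2, 3, 4, 5, 6]
print(sliding_windows(series, window=3))
# [[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]
```

A series of length n yields n - window + 1 subsequences when the step is 1, so a single long series quickly becomes a sizable database.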

Challenges of Time Series Data

High dimensional
Voluminous
Requires fast techniques

Brute Force Similarity Search

Given query time series Q, the best match by sequential scanning is found by:

A sequential scan costs O(nd) for n time series of length d.
Finding the nearest neighbor for each time series in the database is prohibitive.
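
A sequential scan can be sketched as follows. This is an illustrative sketch that assumes Euclidean distance as the similarity measure and equal-length series; the names `euclidean` and `best_match` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_match(query, database):
    """Scan every series once and return (index, distance) of the closest one."""
    best_index, best_dist = -1, math.inf
    for i, candidate in enumerate(database):
        dist = euclidean(query, candidate)  # one O(d) distance per candidate
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index, best_dist


database = [[0.0, 0.0, 0.0], [1.0, 2.0, 3.0], [1.0, 2.0, 4.0]]
index, dist = best_match([1.0, 2.0, 3.2], database)
print(index, round(dist, 3))
```

Repeating this scan once per series, to find every series' nearest neighbor, multiplies the cost by another factor of n.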

Similarity Search
Clustering and classification methods perform many similarity calculations
Some require storage of the k nearest neighbors of each data instance
Critical that these calculations be fast

Speeding up Similarity Search

Alternate time series representations
Search databases faster
New similarity metrics

Data Mining Time Series Toolbox

Indexing
Dimensionality reduction
Segmentation
Discretization
Similarity metric

Faster than a sequential scan
Insertions and deletions do not require rebuilding the entire index
Partition the data into regions
Search regions that contain a likely match
Requires a similarity metric that obeys the triangle inequality
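
To see why the triangle inequality matters, consider a deliberately simplified pruning scheme. It is not one of the index structures named in this talk; it only illustrates the principle. For a metric D and any reference series p, |D(q, p) - D(p, c)| <= D(q, c), so precomputed distances to p give a cheap lower bound that can rule out candidates without computing their true distance. All names below (`pivot`, `pruned_search`) are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pruned_search(query, database, pivot, pivot_dists):
    """Nearest-neighbor search that skips candidates using the triangle inequality.

    pivot_dists[i] must hold the precomputed distance from `pivot` to database[i].
    Returns (best index, best distance, number of full distance computations).
    """
    d_qp = euclidean(query, pivot)
    best_index, best_dist, computed = -1, math.inf, 0
    for i, candidate in enumerate(database):
        lower_bound = abs(d_qp - pivot_dists[i])  # never exceeds the true distance
        if lower_bound >= best_dist:
            continue  # cannot beat the best match found so far
        computed += 1
        dist = euclidean(query, candidate)
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index, best_dist, computed


database = [[0.0, 0.0], [10.0, 0.0], [0.5, 0.0], [20.0, 0.0]]
pivot = [0.0, 0.0]
pivot_dists = [euclidean(pivot, c) for c in database]  # done once, offline
print(pruned_search([0.4, 0.0], database, pivot, pivot_dists))
```

If the distance measure violated the triangle inequality, the lower bound could exceed the true distance and the search might discard the actual best match.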

R-trees
kd-trees
linear quad-trees
grid-files

Indexing on Time Series Data

High dimensionality slows down computation
Curse of dimensionality inhibits efficiency of indexing

Dimensionality Reduction
Reduces the size of the time series
Distance on transformed data should lower bound the original distance

This guarantees no false dismissals (false negatives)

Dimensionality Reduction: DFT, DWT, SVD

Represent time series using subsets of:
  Fourier coefficients
  Wavelet coefficients
  eigenvalues/eigenvectors

Euclidean distance is lower-bounded on DFT [1], DWT [2], SVD [3]
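
The lower-bounding property can be checked numerically for the DFT case. The sketch below is an illustration under stated assumptions: it uses a naive O(n^2) orthonormal DFT written with the standard library (scaled by 1/sqrt(n) so that Parseval's theorem preserves Euclidean distance) and keeps the first k coefficients. The function names and the choice of k are hypothetical.

```python
import cmath
import math


def dft(series):
    """Naive orthonormal DFT: coefficients scaled by 1/sqrt(n)."""
    n = len(series)
    scale = 1.0 / math.sqrt(n)
    return [
        scale * sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(series))
        for k in range(n)
    ]


def reduce_dft(series, k):
    """Keep only the first k Fourier coefficients as the reduced representation."""
    return dft(series)[:k]


def distance(a, b):
    """Euclidean distance; works for real or complex coordinates."""
    return math.sqrt(sum(abs(x - y) ** 2 for x, y in zip(a, b)))


q = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0]
c = [0.0, 1.0, 3.0, 5.0, 3.0, 1.0, 0.0, 0.0]

true_dist = distance(q, c)
reduced_dist = distance(reduce_dft(q, 3), reduce_dft(c, 3))
print(reduced_dist <= true_dist)  # the reduced distance never overestimates
```

Because dropping coefficients only removes non-negative terms from the sum, the reduced distance can never exceed the true one, which is what rules out false dismissals.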

Gemini Framework
Faloutsos et al., 1994
Map each time series to a lower dimension
Store in multi-dimensional indexing structure

Piecewise Aggregate Approximation (PAA)
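
As an illustration, PAA divides a series into equal-length frames and replaces each frame with its mean. The sketch below assumes the series length is divisible by the number of segments; `paa` and `paa_distance` are hypothetical names. The scaled distance on the reduced series is a lower bound on the Euclidean distance between the originals.

```python
import math


def paa(series, segments):
    """Replace each of `segments` equal-length frames with its mean value."""
    n = len(series)
    if segments <= 0 or n % segments != 0:
        raise ValueError("series length must be a positive multiple of segments")
    size = n // segments
    return [sum(series[i * size:(i + 1) * size]) / size for i in range(segments)]


def paa_distance(paa_a, paa_b, original_length):
    """Distance on PAA representations, scaled by sqrt(n / w) so that it
    lower-bounds the Euclidean distance between the original series."""
    frame = original_length / len(paa_a)
    return math.sqrt(frame * sum((x - y) ** 2 for x, y in zip(paa_a, paa_b)))


q = [1.0, 3.0, 5.0, 7.0, 2.0, 4.0]
c = [2.0, 2.0, 6.0, 6.0, 3.0, 3.0]
print(paa(q, 3))  # [2.0, 6.0, 3.0]

true_dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(q, c)))
print(paa_distance(paa(q, 3), paa(c, 3), len(q)) <= true_dist)  # True
```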

Represent the time series in smaller, less complex segments.
Piecewise Linear Approximation (PLA)
Minimum Bounding Rectangles (MBR)

Piecewise Linear Approximation (PLA)

Minimum-Bounding Rectangles (MBR)

Transforms a real-valued time series into a sequence of characters from a discrete alphabet
Dimensionality reduction implicit
Allows use of string functions on time series
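
One way to discretize a series, sketched below, is to z-normalize it, average it over equal-length frames, and map each frame mean to a letter using breakpoints that split a standard normal distribution into equally likely regions. This resembles the SAX scheme, which the slides do not name, so treat the specific recipe, the function name `discretize`, and the default alphabet as assumptions.

```python
import bisect
import statistics


def discretize(series, segments, alphabet="abcd"):
    """Map a real-valued series to a string of len(`segments`) symbols."""
    n = len(series)
    if segments <= 0 or n % segments != 0:
        raise ValueError("series length must be a positive multiple of segments")

    # Z-normalize so the breakpoints below are meaningful.
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    normalized = [(x - mean) / std if std > 0 else 0.0 for x in series]

    # Frame means: this is where the implicit dimensionality reduction happens.
    size = n // segments
    frame_means = [sum(normalized[i * size:(i + 1) * size]) / size
                   for i in range(segments)]

    # Breakpoints that cut the standard normal into equiprobable regions.
    normal = statistics.NormalDist()
    breakpoints = [normal.inv_cdf(i / len(alphabet))
                   for i in range(1, len(alphabet))]

    return "".join(alphabet[bisect.bisect_right(breakpoints, m)]
                   for m in frame_means)


print(discretize([1, 2, 3, 4, 5, 6, 7, 8], segments=4))  # abcd
```

Once the series is a string, standard string tools such as hashing, suffix structures, or edit distance become available.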


Is Euclidean Distance the Best Metric?

Everything discussed so far used Euclidean distance (ED) as the similarity metric
Is it the best similarity metric for time series?

Drawbacks of Euclidean Distance

Requires two time series to have the same dimensionality
Requires 1-to-1 alignment of the time axis

Cross Correlation
Cross correlation with convolution can find the optimal phase shift to maximize similarity
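
A minimal sketch of the idea follows. It assumes equal-length series and circular (wrap-around) shifts, and it tries every shift directly in O(n^2) time; the convolution-based approach mentioned above computes the same scores faster via the FFT. The function name `best_circular_shift` is hypothetical.

```python
import math


def best_circular_shift(reference, series):
    """Return (shift, score): the left shift of `series`, with wrap-around,
    that maximizes its cross correlation with `reference`."""
    n = len(reference)
    if len(series) != n:
        raise ValueError("series must have the same length")
    best_shift, best_score = 0, -math.inf
    for shift in range(n):
        # Correlation between reference and series shifted left by `shift`.
        score = sum(reference[t] * series[(t + shift) % n] for t in range(n))
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift, best_score


reference = [0, 1, 0, 0, 0]   # peak at position 1
series = [0, 0, 0, 1, 0]      # same shape, peak at position 3
print(best_circular_shift(reference, series))  # (2, 1)
```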

Cross Correlation
Optimal phase shift (to left) of solid line is 0.3

Dynamic Time Warping (DTW)

DTW allows many-to-one alignment
Time series need not be the same size


[Figure: alignment of two series; axis label: Time Axis]

DTW Algorithm
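
The standard dynamic-programming formulation of DTW can be sketched as follows. The slides' own equations are not reproduced here, so the details are assumptions: squared difference as the local cost, the three usual step directions, and a square root on the final accumulated cost. The function name `dtw` is hypothetical.

```python
import math


def dtw(a, b):
    """Dynamic time warping distance between series of possibly different lengths."""
    n, m = len(a), len(b)
    # cost[i][j] = cheapest accumulated cost of aligning a[:i] with b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] also aligns with an earlier point of b
                cost[i][j - 1],      # b[j-1] also aligns with an earlier point of a
                cost[i - 1][j - 1],  # one-to-one step
            )
    return math.sqrt(cost[n][m])


# The repeated 1 in the second series is absorbed by a many-to-one alignment.
print(dtw([0, 1, 2, 3], [0, 1, 1, 2, 3]))  # 0.0
```

Filling the full table takes O(nm) time and memory, which is the computational expense discussed next.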

Drawbacks of DTW
Computationally expensive
Does not obey the triangle inequality, so it cannot be used directly for indexing

Making DTW Faster

Global constraints:

Sakoe-Chiba Band

Itakura Parallelogram
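
As an illustration of a global constraint, the sketch below restricts DTW to a Sakoe-Chiba band: cell (i, j) of the cost table is only filled when |i - j| <= `radius`. The details are assumptions (squared local cost, square root at the end, and widening the band to at least the length difference so the end cell stays reachable), and `dtw_band` is a hypothetical name. The Itakura parallelogram, which varies the allowed width along the path, is not shown.

```python
import math


def dtw_band(a, b, radius):
    """DTW distance restricted to a Sakoe-Chiba band of half-width `radius`."""
    n, m = len(a), len(b)
    radius = max(radius, abs(n - m))  # band must reach the final cell (n, m)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        # Only visit columns within `radius` of the diagonal.
        for j in range(max(1, i - radius), min(m, i + radius) + 1):
            local = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])


a = [0, 1, 0, 0, 0]
b = [0, 0, 0, 1, 0]
print(dtw_band(a, b, radius=0))  # radius 0 forces a one-to-one alignment
print(dtw_band(a, b, radius=2))  # a wider band lets the two peaks align
```

With a fixed radius r, only about n(2r + 1) cells are filled instead of nm, at the price of disallowing warps that stray far from the diagonal.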

Other Areas of Research

Anomaly Detection
Change Point Detection

Thesis Research
Anomaly detection methods that:
  are fast
  preserve interesting features

Thank You

