
2017 25th European Signal Processing Conference (EUSIPCO)

Using Deep Learning to Detect Price Change Indications in Financial Markets
Avraam Tsantekidis∗, Nikolaos Passalis∗, Anastasios Tefas∗,
Juho Kanniainen†, Moncef Gabbouj‡ and Alexandros Iosifidis‡§
∗ Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece
{avraamt, passalis}@csd.auth.gr, [email protected]
† Laboratory of Industrial and Information Management, Tampere University of Technology, Tampere, Finland

[email protected]
‡ Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland

{moncef.gabbouj, alexandros.iosifidis}@tut.fi
§ Department of Engineering, Electrical and Computer Engineering, Aarhus University, Denmark

[email protected]

Abstract—Forecasting financial time-series has long been among the most challenging problems in financial market analysis. In order to recognize the correct circumstances to enter or exit the markets, investors usually employ statistical models (or even simple qualitative methods). However, the inherently noisy and stochastic nature of markets severely limits the forecasting accuracy of the used models. The introduction of electronic trading and the availability of large amounts of data allow for developing novel machine learning techniques that address some of the difficulties faced by the aforementioned methods. In this work we propose a deep learning methodology, based on recurrent neural networks, that can be used for predicting future price movements from large-scale high-frequency time-series data on Limit Order Books. The proposed method is evaluated using a large-scale dataset of limit order book events.

I. INTRODUCTION

Using mathematical models to gain an advantage in financial markets is the main consideration of the field of quantitative analysis. The main hypothesis of the field is that time-series of values produced by the market, such as the price and volume of financial products, can be analyzed with mathematical and statistical models to extract predictions about the current state of the market and future changes in metrics, such as the price volatility and direction of movement. However, these mathematical models rely on handcrafted features and have their parameters tuned manually by observation, which can reduce the accuracy of their predictions. Furthermore, asset price movements in the financial markets very frequently exhibit irrational behaviour, since they are largely influenced by human activity that mathematical models fail to capture.

Recently, multiple solutions to the aforementioned limitations of handcrafted systems have emerged using machine learning models. Given some input features, machine learning models can be used to predict the behaviour of various aspects of financial markets [1], [2], [3], [4]. This has led several organizations, such as hedge funds and investment firms, to create machine learning models alongside the conventional mathematical models for conducting their trading operations.

The introduction of electronic trading and the automation that followed have increased the trading volume, producing an immense amount of data representing the trades happening in exchanges. Exchanges have been gathering this trading data, creating comprehensive logs of every transaction and selling them to financial institutions that analyze them to discover signals that provide foresight for changes in the market, which can in turn be used by algorithms to profitably manage investments. However, applying machine learning techniques on such large-scale data is not a straightforward task. Being able to utilize the information at this scale can provide strategies for many different market conditions but also safeguard from volatile market movements.

The main contribution of this work is the proposal of a deep learning methodology, based on recurrent neural networks, that can be used for predicting future mid-price movements from large-scale high-frequency limit order data.

In Section 2 related work on machine learning models that were applied on financial data is briefly presented. Then, the used large-scale dataset is described in detail in Section 3. In Section 4 the proposed deep learning methodology is introduced, while in Section 5 the experimental evaluation is provided. Finally, conclusions are drawn and future work is discussed in Section 6.

II. RELATED WORK

Recent Deep Learning methods have been shown to significantly improve upon previous machine learning techniques in tasks such as speech recognition [5], image captioning [6], [7], and question answering [8]. Deep Learning models, such as Convolutional Neural Networks (CNNs) [9] and Recurrent Neural Networks (RNNs), e.g., the Long Short-Term Memory Units (LSTMs) [10], have greatly contributed to the increase in performance in these fields, with ever deeper architectures producing even better results [11].

In Deep Portfolio Theory [12], the authors use autoencoders to optimize the performance of a portfolio and beat several profit benchmarks, such as the biotechnology IBB Index. Similarly, in [2] a Restricted Boltzmann Machine (RBM) is used to encode monthly closing prices of stocks and then it is fine-tuned to predict the direction the price of each stock will move (above or below the median change). This strategy is compared to a simple momentum strategy and it is established that the proposed method achieves significant improvements in annualized returns.

The daily data of the S&P 500 market fund prices and Google domestic trends of 25 terms like "bankruptcy" and "insurance" are used as the input to a recurrent neural network that is trained to predict the volatility of the market fund's price [3]. This method greatly improves upon existing benchmarks, such as autoregressive GARCH and Lasso techniques.

An application using high frequency limit orderbook (LOB) data is [4], where the authors create a set of handcrafted features, such as price differences, bid-ask spreads, and price and volume derivatives. Then, a Support Vector Machine (SVM) is trained to predict whether the mid-price will move upwards or downwards in the near future using these features. However, only 2000 data points are used for training the SVM in each training round, limiting the prediction accuracy of the model.

To the best of our knowledge this is the first work that uses Limit Order Book data on such a large scale, with more than 4 million events, to train LSTMs for predicting the price movement of stocks. The method proposed in this paper is also combined with an intelligent normalization scheme that takes into account the differences in the price scales between different stocks and time periods, which is essential for effectively scaling to such large-scale data.

III. HIGH FREQUENCY LIMIT ORDER DATA

In financial equity markets a limit order is a type of order to buy or sell a specific number of shares at a specified price limit. For example, a sell limit order (ask) of $10 with volume of 100 indicates that the seller wishes to sell the 100 shares for no less than $10 each. Respectively, a buy limit order (bid) of $10 means that the buyer wishes to buy a specified amount of shares for no more than $10 each.

Consequently the orderbook has two sides: the bid side, containing buy orders with prices $p_b(t)$ and volumes $v_b(t)$, and the ask side, containing sell orders with prices $p_a(t)$ and volumes $v_a(t)$. The orders are sorted on both sides based on the price. On the bid side $p_b^{(1)}(t)$ is the highest available buy price and on the ask side $p_a^{(1)}(t)$ is the lowest available sell price.

Whenever a bid order price exceeds an ask order price, $p_b^{(i)}(t) > p_a^{(j)}(t)$, they "annihilate", executing the orders and exchanging the traded assets between the investors. If there are more than two orders that fulfill the price range requirement, the effect chains to them as well. Since the orders do not usually have the same requested volume, the order with the greater size remains in the orderbook with the remaining unfulfilled volume.

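To make these matching mechanics concrete, the following toy Python sketch (our own simplification for illustration; a real matching engine is considerably more involved) keeps both sides of a book sorted and executes an incoming bid against the resting asks:

    # Toy order book: bids sorted by descending price, asks by ascending price.
    bids = [(10.00, 100), (9.99, 250)]   # (price, volume), best bid first
    asks = [(10.01, 300), (10.02, 50)]   # (price, volume), best ask first

    def submit_bid(price, volume):
        """Match an incoming buy limit order against resting asks."""
        while volume > 0 and asks and price >= asks[0][0]:
            ask_price, ask_vol = asks[0]
            traded = min(volume, ask_vol)   # orders "annihilate" up to the smaller volume
            volume -= traded
            if traded == ask_vol:
                asks.pop(0)                 # resting order fully filled
            else:
                asks[0] = (ask_price, ask_vol - traded)  # unfulfilled remainder stays
        if volume > 0:                      # leftover volume rests on the bid side
            bids.append((price, volume))
            bids.sort(key=lambda o: -o[0])

    submit_bid(10.01, 320)
    print(bids[0], asks[0])   # best bid is now (10.01, 20); best ask (10.02, 50)
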
Several tasks arise from this data, ranging from the prediction of the price trend and the regression of the future value of a metric, e.g., volatility, to the detection of anomalous events that cause price jumps, either upwards or downwards. These tasks can lead to interesting applications, such as protecting investments when market conditions are unreliable, or taking advantage of such conditions to create automated trading techniques for profit.

Methods utilizing this data often use subsampling techniques, such as OHLC (Open-High-Low-Close) resampling [13], to limit the number of values that exist for each timeframe, e.g., every minute or every day. Even though the OHLC method preserves the trend features of the market movements, it removes all the microstructure information of the markets. Note that it is difficult to preserve all the information contained in the LOB data, since orders arrive inconsistently and most methods require a specific number of features for each time step. This is one of the problems RNNs can solve, taking full advantage of the information contained in the data, since they can natively handle this inconsistent stream of incoming orders.

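For illustration, OHLC resampling of a mid-price series takes one line with pandas (a sketch on synthetic data; the series and the one-minute frequency here are our own choices):

    import numpy as np
    import pandas as pd

    # Synthetic mid-price observations at irregular, second-level timestamps.
    rng = np.random.default_rng(0)
    times = pd.to_datetime("2010-06-01 10:00") + pd.to_timedelta(
        np.sort(rng.integers(0, 3600, size=500)), unit="s")
    mid = pd.Series(100 + rng.standard_normal(500).cumsum() * 0.01, index=times)

    # One-minute OHLC bars: the trend survives, but every intra-minute
    # microstructure detail (order arrivals, depth changes) is lost.
    bars = mid.resample("1min").ohlc()
    print(bars.head())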


IV. LSTMS FOR FINANCIAL DATA

The input data consists of 10 orders for each side of the LOB (bid and ask). Each order is described by 2 values, the price and the volume. In total we have 40 values for each timestep. The stock data, provided by Nasdaq Nordic, come from the Finnish companies Kesko Oyj, Outokumpu Oyj, Sampo, Rautaruukki and Wartsila Oyj. The time period used for collecting that data ranges from the 1st to the 14th of June 2010 (only business days are included), while the data are provided by the Nasdaq Nordic data feeds [14].

The dataset is made up of 10 days for 5 different stocks and the total number of messages is 4.5 million, with equally many separate depths. Since the price and volume range is much greater than the range of the values of the activation function of our neural network, we need to normalize the data before feeding them to the network. To this end, standardization (z-score) is employed to normalize the data:

$$x_{\text{norm}} = \frac{x - \bar{x}}{\sigma_{\bar{x}}} \qquad (1)$$

where $x$ is the vector of values we want to normalize, $\bar{x}$ is the mean value of the data and $\sigma_{\bar{x}}$ is the standard deviation of the data. Instead of simply normalizing all the values together, we take into account the scale differences between order prices and order volumes and we use a separate normalizer, with different mean and standard deviation, for each of them. Also, since different stocks have different price ranges and drastic distribution shifts might occur in individual stocks on different days, the normalization of the current day's values uses the mean and standard deviation calculated using the previous day's data.

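A minimal sketch of this scheme (our own illustration; the assumption that prices and volumes alternate along the feature axis is hypothetical and depends on how the depth vectors are laid out):

    import numpy as np

    def normalize_day(today, yesterday):
        """z-score one day of LOB depth vectors, shape (T, 40), using
        statistics fitted on the previous day's data. Prices and volumes
        are assumed to alternate along the columns and get separate
        normalizers, as described above."""
        out = today.astype(float).copy()
        for cols in (slice(0, 40, 2), slice(1, 40, 2)):    # prices, volumes
            mean = yesterday[:, cols].mean()
            std = yesterday[:, cols].std()
            out[:, cols] = (today[:, cols] - mean) / std   # Eq. (1)
        return out
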
We want to predict the direction towards which the price will change. In this work the term price is used to refer to the mid-price of a stock, which is defined as the mean between the best bid price and best ask price at time $t$:

$$p_t = \frac{p_a^{(1)}(t) + p_b^{(1)}(t)}{2} \qquad (2)$$

This is a virtual value for the price, since no order can happen at that exact price, but predicting its upward or downward movement provides a good estimate of the price of the future orders. A set of discrete choices must be constructed from our data to use as targets for our classification model. Simply using $p_t > p_{t+k}$ to determine the direction of the mid-price would introduce an unmanageable amount of noise, since the smallest change would be registered as an upward or downward movement.

Note that each consecutive depth sample is only slightly different from the previous one. Thus the short-term changes between prices are very small and noisy. In order to filter such noise from the extracted labels we use the following smoothed approach. First, the mean of the previous $k$ mid-prices, denoted by $m_b$, and the mean of the next $k$ mid-prices, denoted by $m_a$, are defined as:

$$m_b(t) = \frac{1}{k} \sum_{i=0}^{k} p_{t-i} \qquad (3)$$

$$m_a(t) = \frac{1}{k} \sum_{i=1}^{k} p_{t+i} \qquad (4)$$

where $p_t$ is the mid-price as described in Equation (2). Then, a label $l_t$ that expresses the direction of price movement at time $t$ is extracted by comparing the previously defined quantities ($m_b$ and $m_a$):

$$l_t = \begin{cases} 1, & \text{if } m_b(t) > m_a(t) \cdot (1 + \alpha) \\ -1, & \text{if } m_b(t) < m_a(t) \cdot (1 - \alpha) \\ 0, & \text{otherwise} \end{cases} \qquad (5)$$

where the threshold $\alpha$ is set as the least amount of change in price that must occur for it to be considered upward or downward. If the price does not exceed this limit, the sample will be considered to belong to the stationary class. Therefore, the resulting label expresses the current trend we wish to predict. Note that this process is applied for every time step in our data.

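Putting Equations (2)-(5) together, the label extraction can be sketched as follows (an illustration; the handling of the sequence boundaries is our own choice):

    import numpy as np

    def extract_labels(mid, k, alpha):
        """mid: 1-D array of mid-prices p_t (Eq. 2). Returns labels in
        {-1, 0, 1} for every t where both k-windows fit (Eqs. 3-5)."""
        labels = np.zeros(len(mid) - 2 * k, dtype=int)
        for j, t in enumerate(range(k, len(mid) - k)):
            m_b = mid[t - k:t + 1].mean()      # mean of previous mid-prices (Eq. 3)
            m_a = mid[t + 1:t + k + 1].mean()  # mean of next k mid-prices (Eq. 4)
            if m_b > m_a * (1 + alpha):
                labels[j] = 1
            elif m_b < m_a * (1 - alpha):
                labels[j] = -1
            # otherwise stays 0: the stationary class
        return labels
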
An improved version of RNNs, namely the LSTM [10], is employed to classify our data. The LSTM solves the problem of vanishing gradients, which makes it virtually impossible for an RNN to learn to correlate temporally distant events. This is achieved by protecting its hidden activation using gates between each of its transaction points with the rest of its layer. The hidden activation that is protected is called the cell state. The following equations describe the behavior of the LSTM model [10]:

$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f) \qquad (6)$$

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \qquad (7)$$

$$c'_t = \tanh(W_{hc} h_{t-1} + W_{xc} x_t + b_c) \qquad (8)$$

$$c_t = f_t \circ c_{t-1} + i_t \circ c'_t \qquad (9)$$

$$o_t = \sigma(W_{oc} c_t + W_{oh} h_{t-1} + b_o) \qquad (10)$$

$$h_t = o_t \circ \sigma(c_t) \qquad (11)$$

where $f_t$, $i_t$ and $o_t$ are the activations of the forget, input and output gates at time-step $t$, which control how much of the input and the previous state will be considered and how much of the cell state will be included in the hidden activation of the network. The protected cell activation at time-step $t$ is denoted by $c_t$, whereas $h_t$ is the activation that will be given to other components of the model.

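For clarity, the recurrences (6)-(11) can be transcribed directly into NumPy for a single recurrent step (a sketch; the parameter dictionary and shapes are our own convention, and, following Eq. (11) as printed, the cell state passes through the logistic function rather than the more common tanh):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, W):
        """One LSTM step following Eqs. (6)-(11). W maps names such as
        'xf' / 'hf' / 'bf' to weight matrices and bias vectors."""
        f_t = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev + W["bf"])     # forget gate, Eq. (6)
        i_t = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev + W["bi"])     # input gate, Eq. (7)
        c_cand = np.tanh(W["hc"] @ h_prev + W["xc"] @ x_t + W["bc"])  # candidate, Eq. (8)
        c_t = f_t * c_prev + i_t * c_cand                             # cell state, Eq. (9)
        o_t = sigmoid(W["oc"] @ c_t + W["oh"] @ h_prev + W["bo"])     # output gate, Eq. (10)
        h_t = o_t * sigmoid(c_t)                                      # hidden activation, Eq. (11)
        return h_t, c_t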


The parameters of the model are learned by minimizing the categorical cross entropy loss defined as:

$$\mathcal{L}(W) = -\sum_{i=1}^{L} y_i \cdot \log \hat{y}_i \qquad (12)$$

where $L$ is the number of different labels and the notation $W$ is used to refer to the parameters of the LSTM, i.e., $W_{xf}$, $W_{hf}$, $W_{xi}$, $W_{hi}$, $W_{hc}$, $W_{xc}$, $W_{oc}$, $W_{oh}$, $b_f$, $b_i$, $b_c$, and $b_o$. The ground truth vector is denoted by $y$, while $\hat{y}$ is the predicted label distribution. The loss is summed over all samples in each batch. The most commonly used method to minimize the loss function defined in Equation (12) and learn the parameters $W$ of the model is gradient descent [15]:

$$W' = W - \eta \cdot \frac{\partial \mathcal{L}}{\partial W} \qquad (13)$$

where $W'$ are the parameters of the model after each gradient descent step and $\eta$ is the learning rate. In this work we utilize the Adaptive Moment Estimation algorithm, known as ADAM [16], which ensures that the learning steps are scale invariant with respect to the parameter gradients.

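As a worked illustration of Equations (12) and (13) (a sketch only; the actual updates in this work are computed by ADAM, which additionally rescales each step using running moment estimates):

    import numpy as np

    def categorical_cross_entropy(y_true, y_pred):
        """Eq. (12) for a batch: y_true holds one-hot rows over the L=3
        labels, y_pred the predicted distributions; summed over the batch."""
        return -np.sum(y_true * np.log(y_pred))

    def gd_step(W, grad_W, eta=0.01):
        """One plain gradient descent update, Eq. (13)."""
        return W - eta * grad_W

    y_true = np.array([[0.0, 1.0, 0.0]])   # one sample, true class "stationary"
    y_pred = np.array([[0.2, 0.7, 0.1]])
    print(categorical_cross_entropy(y_true, y_pred))   # -log(0.7), about 0.357
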
The input to the LSTM is a sequence of vectors $X = \{x_0, x_1, \ldots, x_n\}$ that represent the LOB depth at each time step $t$. Each $x_t$ is fed sequentially to the LSTM and its output $y_t$ expresses the categorical distribution for the three direction labels (upward, downward and stationary), as described in Equation (5), for each time-step $t$.

V. EXPERIMENTAL EVALUATION

In our first attempt to train an LSTM network to predict the mid-price trend direction we noticed a very interesting pattern in the mean cost per recurrent step, as shown in Figure 1. The cost is significantly higher on the initial steps before it eventually settles. This happens because it is not possible for the network to build a correct internal representation having seen only a few samples of the depth. To avoid this unnecessary source of error and noise in the training gradients, we do not propagate the error for the first 100 recurrent steps. These steps are treated as a "burn-in" sequence, allowing the network to observe a portion of the LOB depth timeline before making an accountable prediction.

[Fig. 1: Mean cost per recurrent step of the LSTM network. Train and test cost (y-axis, roughly 0.20 to 0.28) plotted against the recurrent step (x-axis, 0 to 300); the cost is highest over the earliest steps.]

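One simple way to implement such a burn-in (a sketch of the idea, not necessarily the exact mechanism used here) is to zero the per-step loss weights for the first 100 recurrent steps, so no gradient flows from predictions made before the state has warmed up:

    import numpy as np

    seq_len, burn_in = 300, 100
    step_weights = np.ones(seq_len)
    step_weights[:burn_in] = 0.0   # first 100 steps contribute no loss

    # The per-step loss is then multiplied by these weights before the
    # backward pass; most frameworks accept them as temporal sample weights.
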
Experimentally we found that to avoid over-fitting the hidden layer of the LSTM should contain 32 to 64 hidden neurons. If more hidden neurons are used, then the network can easily overfit the data, while if fewer hidden neurons are used the network under-fits the data, reducing the accuracy of the predictions.

We use an LSTM with 40 hidden neurons followed by a feed-forward layer with Leaky Rectifying Linear Units as activation function [17]. We split our dataset as follows. The first 7 days are used to train the network, while the next 3 days are used as test data. We train the same model for 3 different prediction horizons $k$, as defined in Equations (3) and (4).

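A compact sketch of such an architecture in Keras (illustrative only; the 40 LSTM units and the Leaky ReLU feed-forward layer follow the text, while the width of that layer and the per-step softmax output are our own assumptions):

    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Input(shape=(None, 40)),          # sequences of 40-value depth vectors
        layers.LSTM(40, return_sequences=True),  # 40 hidden neurons, one output per step
        layers.Dense(32),
        layers.LeakyReLU(),                      # Leaky ReLU feed-forward layer [17]
        layers.Dense(3, activation="softmax"),   # upward / stationary / downward
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy")
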
To measure the performance of our model we use Cohen's kappa [18], which is used to measure the concordance between sets of given answers, taking into consideration the possibility of random agreements happening. We also report the mean recall, precision and F1 score between all 3 classes. Recall is the number of true positive samples divided by the sum of true positives and false negatives, while precision is the number of true positives divided by the sum of true positives and false positives. F1 score is the harmonic mean of the precision and recall metrics.

TABLE I: Experimental results for different prediction horizons k

    Model   Mean Recall   Mean Prec.   Mean F1   Cohen's κ
    ----- Prediction Horizon k = 10 -----
    SVM        39.62%       44.92%      35.88%      0.068
    MLP        47.81%       60.78%      48.27%      0.226
    LSTM       60.77%       75.92%      66.33%      0.500
    ----- Prediction Horizon k = 20 -----
    SVM        45.08%       47.77%      43.20%      0.139
    MLP        51.33%       65.20%      51.12%      0.255
    LSTM       59.60%       70.52%      62.37%      0.430
    ----- Prediction Horizon k = 50 -----
    SVM        46.05%       60.30%      49.42%      0.243
    MLP        55.21%       67.14%      55.95%      0.324
    LSTM       60.03%       68.50%      61.43%      0.411

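These metrics are standard; with scikit-learn, for instance, they can be computed as follows (a sketch on toy label arrays; macro averaging is assumed here to match the "mean" per-class metrics of Table I):

    import numpy as np
    from sklearn.metrics import cohen_kappa_score, precision_recall_fscore_support

    y_true = np.array([1, 0, -1, 1, 0, 0, -1, 1])   # toy ground-truth trend labels
    y_pred = np.array([1, 0, -1, 0, 0, 1, -1, 1])   # toy predictions

    kappa = cohen_kappa_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred,
                                                       average="macro")
    print(kappa, prec, rec, f1)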


The results of our experiments are shown in Table I. We compare our results with those of a Linear SVM model and an MLP model with Leaky Rectifiers as activation function. The SVM model is trained using stochastic gradient descent, since the dataset is too large to use a closed-form solution. The MLP model uses a single hidden layer with 128 neurons with Leaky ReLU activations. The regularization parameter of the SVM was chosen using cross validation on a split from the training set. Since neither of these models is sequential, we feed the concatenation of the previous 100 depth samples as input and we use as prediction target the price movement associated with the last depth sample. The proposed method significantly outperforms all the other evaluated models, especially for short term prediction horizons (k = 10 and k = 20).

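The window construction for these baselines can be sketched as follows (illustrative; the array names are ours, with the window length of 100 as stated above):

    import numpy as np

    def baseline_features(depth, t, window=100):
        """depth: array of shape (T, 40). Returns the previous `window`
        depth samples ending at time t, flattened into a single vector
        (length window * 40) for the SVM / MLP baselines; the target is
        the movement label of the last sample in the window."""
        return depth[t - window + 1:t + 1].ravel()
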
VI. CONCLUSION

In this work we trained an LSTM network on high frequency LOB data, applying a temporally aware normalization scheme on the volumes and prices of the LOB depth. The proposed approach was evaluated using different prediction horizons and it was demonstrated that it performs significantly better than other techniques, such as Linear SVMs and MLPs, when trying to predict short term price movements.

There are several interesting future research directions. First, more data can be used to train the proposed model, scaling up to a billion training samples, to determine if using more data leads to better classification performance. With more data, the "burn-in" phase can also be increased along with the prediction horizon, to gauge the model's ability to predict the trend further into the future. Also, an attention mechanism [6], [19] can be introduced to allow the network to capture only the relevant information and avoid noise. Finally, more advanced trainable normalization techniques can be used, as it was established that normalization is essential to ensure that the learned model will generalize well on unseen data.

ACKNOWLEDGMENT

The research leading to these results has received funding from the H2020 Project BigDataFinance MSCA-ITN-ETN 675044 (http://bigdatafinance.eu), Training for Big Data in Financial Research and Risk Management. Alexandros Iosifidis was supported by the Academy of Finland Postdoctoral Research Fellowship (No. 295854). He joined Aarhus University in August 2017.

REFERENCES

[1] M. F. Dixon, D. Klabjan, and J. H. Bang, "Classification-based financial markets prediction using deep neural networks," 2016.
[2] L. Takeuchi and Y.-Y. A. Lee, "Applying deep learning to enhance momentum trading strategies in stocks," 2013.
[3] R. Xiong, E. P. Nichols, and Y. Shen, "Deep learning stock volatility with Google domestic trends," arXiv preprint arXiv:1512.04916, 2015.
[4] A. N. Kercheval and Y. Zhang, "Modelling high-frequency limit order book dynamics with support vector machines," Quantitative Finance, vol. 15, no. 8, pp. 1315-1329, 2015.
[5] A. Graves, A.-r. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 6645-6649.
[6] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in Proceedings of the International Conference on Machine Learning, vol. 14, 2015, pp. 77-81.
[7] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille, "Deep captioning with multimodal recurrent neural networks (m-RNN)," arXiv preprint arXiv:1412.6632, 2014.
[8] Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei, "Visual7W: Grounded question answering in images," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4995-5004.
[9] Y. LeCun, Y. Bengio et al., "Convolutional networks for images, speech, and time series," The Handbook of Brain Theory and Neural Networks, vol. 3361, no. 10, 1995.
[10] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[11] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.
[12] J. Heaton, N. Polson, and J. Witte, "Deep portfolio theory," arXiv preprint arXiv:1605.07230, 2016.
[13] D. Yang and Q. Zhang, "Drift-independent volatility estimation based on high, low, open, and close prices," The Journal of Business, vol. 73, no. 3, pp. 477-492, 2000.
[14] M. Siikanen, J. Kanniainen, and J. Valli, "Limit order books and liquidity around scheduled and non-scheduled announcements: Empirical evidence from NASDAQ Nordic," Finance Research Letters, to appear, 2016.
[15] P. J. Werbos, "Backpropagation through time: what it does and how to do it," Proceedings of the IEEE, vol. 78, no. 10, pp. 1550-1560, 1990.
[16] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[17] A. L. Maas, A. Y. Hannun, and A. Y. Ng, "Rectifier nonlinearities improve neural network acoustic models," in Proceedings of the International Conference on Machine Learning, vol. 30, no. 1, 2013.
[18] J. Cohen, "A coefficient of agreement for nominal scales," Educational and Psychological Measurement, vol. 20, no. 1, pp. 37-46, 1960.
[19] K. Cho, A. Courville, and Y. Bengio, "Describing multimedia content using attention-based encoder-decoder networks," IEEE Transactions on Multimedia, vol. 17, no. 11, pp. 1875-1886, 2015.
