Deep Policy Gradient Methods in Commodity Markets
UNIVERSITY OF OSLO
Spring 2023
http://www.duo.uio.no/
Abstract
The energy transition has increased the reliance on intermittent energy sources,
destabilizing energy markets and causing unprecedented volatility, culminating
in the global energy crisis of 2021. In addition to harming producers and con-
sumers, volatile energy markets may jeopardize vital decarbonization efforts.
Traders play an important role in stabilizing markets by providing liquidity and
reducing volatility. Forecasting future returns is an integral part of any financial
trading operation, and several mathematical and statistical models have been
proposed for this purpose. However, developing such models is non-trivial due
to financial markets’ low signal-to-noise ratios and nonstationary dynamics.
This thesis investigates the effectiveness of deep reinforcement learning meth-
ods in commodities trading. It presents related work and relevant research in
algorithmic trading, deep learning, and reinforcement learning. The thesis for-
malizes the commodities trading problem as a continuing discrete-time stochas-
tic dynamical system. This system employs a novel time-discretization scheme
that is reactive and adaptive to market volatility, providing better statistical
properties for the sub-sampled financial time series. Two policy gradient algo-
rithms, an actor-based and an actor-critic-based, are proposed for optimizing
a transaction-cost- and risk-sensitive trading agent. The agent maps historical
price observations to market positions through parametric function approxima-
tors utilizing deep neural network architectures, specifically CNNs and LSTMs.
On average, the deep reinforcement learning models produce an 83 percent
higher Sharpe ratio than the buy-and-hold baseline when backtested on front-
month natural gas futures from 2017 to 2022. The backtests demonstrate that
the risk tolerance of the deep reinforcement learning agents can be adjusted us-
ing a risk-sensitivity term. The actor-based policy gradient algorithm performs
significantly better than the actor-critic-based algorithm, and the CNN-based
models perform slightly better than those based on the LSTM. The backtest
results indicate the viability of deep reinforcement learning-based algorithmic
trading in volatile commodity markets.
List of Figures
2.1 Time series cross-validation (backtesting) compared to standard cross-validation from [HA18].
3.1 An illustration of the relationship between the capacity of a function approximator and the generalization error from [GBC16].
3.2 Feedforward neural network from [FFLb].
3.3 ReLU activation function from [FFLb].
3.4 Leaky ReLU activation function from [FFLb].
3.5 An example of the effect of weight decay with parameter λ on a high-dimensional polynomial regression model from [GBC16].
3.6 3D convolutional layer from [FFLc].
3.7 Recurrent neural network from [FFLa].
3.8 LSTM cell from [FFLa].
4.1 Agent-environment interaction from [SB18].
7.1 Policy network architecture
7.2 Q-network architecture
7.3 Convolutional sequential information layer architecture
7.4 LSTM sequential information layer architecture
8.1 The training-validation-test split
8.2 Cumulative logarithmic trade returns for λσ = 0
8.3 Boxplot of monthly logarithmic trade returns for λσ = 0
8.4 Cumulative logarithmic trade returns for λσ = 0.01
8.5 Boxplot of monthly logarithmic trade returns for λσ = 0.01
8.6 Cumulative logarithmic trade returns for λσ = 0.1
8.7 Boxplot of monthly logarithmic trade returns for λσ = 0.1
8.8 Cumulative logarithmic trade returns for λσ = 0.2
8.9 Boxplot of monthly logarithmic trade returns for λσ = 0.2
List of Tables
1 Hyperparameters
2 Backtest results
Contents
1 Introduction
  1.1 Motivation
  1.2 Problem description
  1.3 Thesis Organization
I Background
2 Algorithmic trading
  2.1 Commodity markets
  2.2 Financial trading
  2.3 Modern portfolio theory
  2.4 Efficient market hypothesis
  2.5 Forecasting
  2.6 Mapping forecasts to market positions
  2.7 Feature engineering
  2.8 Sub-sampling schemes
  2.9 Backtesting
3 Deep learning
  3.1 Machine learning
    3.1.1 No-free-lunch theorem
    3.1.2 The curse of dimensionality
  3.2 Supervised learning
    3.2.1 Function approximation
  3.3 Artificial neural networks
    3.3.1 Feedforward neural networks
    3.3.2 Parameter initialization
    3.3.3 Gradient-based learning
    3.3.4 Backpropagation
    3.3.5 Activation function
    3.3.6 Regularization
    3.3.7 Batch normalization
    3.3.8 Universal approximation theorem
    3.3.9 Deep neural networks
    3.3.10 Convolutional neural networks
    3.3.11 Recurrent neural networks
4 Reinforcement learning
  4.1 Introduction
  4.2 Markov decision process
    4.2.1 Infinite Markov decision process
    4.2.2 Partially observable Markov decision process
  4.3 Rewards
  4.4 Value function and policy
  4.5 Function approximation
  4.6 Policy gradient methods
    4.6.1 REINFORCE
    4.6.2 Actor-critic
II Methodology
5 Problem Setting
  5.1 Assumptions
  5.2 Time Discretization
  5.3 State Space
  5.4 Action Space
  5.5 Reward Function
7 Network topology
  7.1 Network input
  7.2 Policy network
  7.3 Q-network
  7.4 Sequential information layer
    7.4.1 Convolutional neural network
    7.4.2 Long Short-Term Memory
  7.5 Decision-making layer
  7.6 Network optimization
    7.6.1 Regularization
III Experiments
8 Experiment and Results
  8.1 Materials and Methods
    8.1.1 Baselines
    8.1.2 Hyperparameters
    8.1.3 Training scheme
    8.1.4 Performance metrics
    8.1.5 Dataset
  8.2 Results
  8.3 Discussion of results
    8.3.1 Risk/reward
    8.3.2 RL models
    8.3.3 Networks
  8.4 Discussion of model
    8.4.1 Environment
    8.4.2 Optimization
    8.4.3 Interpretability and trust
9 Future work
10 Conclusion
List of Abbreviations
1 Introduction
1.1 Motivation
The transition to sustainable energy sources is one of the most critical challenges
facing the world today. By 2050, the European Union aims to become carbon
neutral [eur]. However, rising volatility in energy markets, culminating in the
2021 global energy crisis, complicates this objective. Supply and demand forces
determine price dynamics, where an ever-increasing share of supply stems from
intermittent renewable energy sources such as wind and solar power. Increasing
reliance on intermittent energy sources leads to unpredictable energy supply,
contributing to volatile energy markets [SRH20]. Already volatile markets are
further destabilized by evolutionary traits such as fear and greed, causing human
commodity traders to overreact [Lo04]. Volatile markets are problematic for
producers and consumers, and failure to mitigate these concerns may jeopardize
decarbonization targets.
Algorithmic trading agents can stabilize commodity markets by systemati-
cally providing liquidity and aiding price discovery [Isi21, Nar13]. Developing
these methods is non-trivial as financial markets are non-stationary with com-
plicated dynamics [Tal97]. Machine learning (ML) has emerged as the preferred
method in algorithmic trading due to its ability to learn to solve complicated
tasks by leveraging data [Isi21]. The majority of research on ML-based algo-
rithmic trading has focused on forecast-based supervised learning (SL) meth-
ods, which tend to ignore non-trivial factors such as transaction costs, risk,
and the additional logic associated with mapping forecasts to market positions
[Fis18]. Reinforcement learning (RL) presents a suitable alternative to account
for these factors. In reinforcement learning, autonomous agents learn to per-
form tasks in a time-series environment through trial and error without human
supervision. Around the turn of the millennium, Moody and his collaborators
[MW97, MWLS98, MS01] made several significant contributions to this field,
empirically demonstrating the advantages of reinforcement learning over super-
vised learning for algorithmic trading.
In the last decade, the deep learning (DL) revolution has made exceptional
progress in areas such as image classification [HZRS15] and natural language
processing [VSP+ 17], characterized by complex structures and high signal-to-
noise ratios. The strong representation ability of deep learning methods has even
translated to forecasting low signal-to-noise financial data [XNS15, HGMS18,
MRC18]. In complex, high-dimensional environments, deep reinforcement learn-
ing (deep RL), i.e., integrating deep learning techniques into reinforcement
learning, has yielded impressive results. Noteworthy contributions include achiev-
ing superhuman play in Atari games [MKS+ 13] and chess [SHS+ 17], and training
a robot arm to solve the Rubik’s cube [AAC+ 19]. A significant breakthrough
was achieved in 2016 when the deep reinforcement learning-based computer pro-
gram AlphaGo [SHM+16] beat top Go player Lee Sedol. In addition to learning
by reinforcement learning through self-play, AlphaGo uses supervised learning
techniques to learn from a database of historical games. In 2017, an improved
version called AlphaGo Zero [SSS+17], which begins with random play and relies
solely on reinforcement learning, comprehensively defeated AlphaGo. Deep RL
has thus far been primarily studied in the context of game-playing and robotics,
and its potential application to financial trading remains largely unexplored.
Combining the two seems promising, given the respective successes of reinforce-
ment learning and deep learning in algorithmic trading and forecasting.
• Chapter 6: Description of reinforcement learning algorithms.
• Chapter 7: Description of the neural network function approximators.
• Chapter 8: Detailed results from experiments.
• Chapter 9: Suggested future work.
Part I
Background
2 Algorithmic trading
A phenomenon commonly described as an arms race has resulted from fierce
competition in financial markets. In this phenomenon, market participants com-
pete to remain on the right side of information asymmetry, which further reduces
the signal-to-noise ratio and the frequency at which information arrives and is
absorbed by the market [Isi21]. An increase in volatility and the emergence of a
highly sophisticated breed of traders called high-frequency traders have further
complicated already complex market dynamics. In these developed, modern
financial markets, the dynamics are so complex and change at such a high fre-
quency that humans will have difficulty competing. Indeed, there is reason to
believe that machines already outperform humans in the world of financial trad-
ing. The algorithmic hedge fund Renaissance Technologies, founded by famed
mathematician Jim Simons, is considered the most successful hedge fund ever.
From 1988 to 2018, Renaissance Technologies’ Medallion fund generated 66 per-
cent annualized returns before fees relying exclusively on algorithmic strategies
[Zuc19]. In 2020, it was estimated that algorithmic trading accounts for around
60-73 percent of U.S. and European equity trading, up from just 15 percent in
2003 [Int20]. Thus, it is clear that algorithms already play a significant role in
financial markets. Due to the rapid progress of computing power relative to
human evolution, this importance will likely only grow.
This chapter provides an overview of this thesis’s subject matter, algorith-
mic trading on commodity markets, examines related work, and justifies the
algorithmic trading methods described in part II. Section 2.1 presents a brief
overview of commodity markets and energy futures contracts. Sections 2.2, 2.3,
and 2.4 introduce some basic concepts related to trading financial markets that
are necessary to define and justify a trading agent’s goal of maximizing risk-
adjusted returns. This goal has two sub-goals: forecasting returns and mapping
forecasts to market positions, which are discussed separately in sections 2.5 and
2.6. Additionally, these sections outline how the concepts introduced in the
following chapters 3 and 4 can be applied to algorithmic trading and review
related research. Sections 2.7 and 2.8 describe
how to represent a continuous financial market as discrete inputs to an algorith-
mic trading system. To conclude, section 2.9 introduces backtesting, a form of
cross-validation used to evaluate algorithmic trading agents.
orders in the limit order book when they arrive on the market. A trade occurs
at the price set by the first order in the event of an overlap. The exchange
makes money by charging a fee for every trade, usually a small percentage of
the total amount traded.
The basis of energy trading is the energy futures contract, a derivative with energy
products as the underlying asset [CGLL19]. Futures contracts are standardized
forward contracts listed on stock exchanges. They are interchangeable, which
improves liquidity. Futures contracts obligate a buyer and seller to transact a
given quantity of the underlying asset at a future date and price. The quantity,
quality, delivery location, and delivery date are all specified in the contract.
Futures contracts are usually identified by their expiration month. The “front-
month” is the nearest expiration date and usually represents the most liquid
market. Natural gas futures expire three business days before the first calendar
day of the delivery month. To avoid physical delivery of the underlying com-
modity, the contract holder must sell their holding to the market before expiry.
Therefore, the futures and underlying commodity prices converge as the deliv-
ery date approaches. A futures contract achieves the same outcome as buying
a commodity on the spot market on margin and storing it for a future date.
The relative prices of these alternatives are linked, as any divergence presents an arbitrage
opportunity. The difference in price between a futures contract and the spot
price of the underlying commodity will therefore depend on the financing cost,
storage cost, and convenience yield of holding the physical commodity over the
futures contract. Physical traders use futures as a hedge while transporting
commodities from producer to consumer. If a trader wishes to extend the ex-
piry of his futures contract, he can “roll” the contract by closing the contract
about to expire and entering into a contract with the same terms but a later
expiry date [CGLL19]. The “roll yield” is the difference in price for these two
contracts and might be positive or negative. The exchange clearinghouse uses a
margin system with daily settlements between parties to mitigate counterparty
risk [CGLL19].
William F. Sharpe. The Sharpe ratio compares excess return with the standard
deviation of investment returns and is defined as
\[
\text{Sharpe ratio} = \frac{\mathbb{E}[r_t - \bar{r}]}{\sqrt{\mathrm{var}[r_t - \bar{r}]}} \simeq \frac{\mathbb{E}[r_t]}{\sigma_{r_t}} \tag{2.1}
\]
where E[r_t] is the expected return over T samples, r̄ is the risk-free rate, and
σ_{r_t} > 0 is the standard deviation of the portfolio's excess return. Due to
negligibly low interest rates, the risk-free rate is commonly set to r̄ = 0. The
philosophy of MPT is that the investor should be compensated through higher
returns for taking on higher risk. The St. Petersburg paradox (see [Pet22] for an explanation of the paradox) illustrates why
maximizing expected reward in a risk-neutral manner might not be what an in-
dividual wants. Although market participants have wildly different objectives,
this thesis will adopt the MPT philosophy of assuming investors want to maxi-
mize risk-adjusted returns. Hence, the goal of the trading agent described in this
thesis will be to maximize the risk-adjusted returns represented by the Sharpe
ratio. Maximizing future risk-adjusted returns can be broken down into two
sub-goals: forecasting future returns and mapping the forecast to market posi-
tions. However, doing so in highly efficient and competitive financial markets is
non-trivial.
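To make the objective concrete, the sketch below computes the ex-post Sharpe ratio of equation (2.1) from a series of periodic returns using NumPy. The synthetic return series, the zero risk-free rate, and the annualization factor of 252 trading days are illustrative assumptions rather than choices made in this thesis.

import numpy as np

def sharpe_ratio(returns: np.ndarray, risk_free_rate: float = 0.0) -> float:
    # Ex-post Sharpe ratio of a series of periodic returns (equation 2.1).
    excess = returns - risk_free_rate
    return excess.mean() / excess.std(ddof=1)

# Synthetic example: 252 daily returns with a small positive drift.
rng = np.random.default_rng(0)
daily_returns = rng.normal(loc=0.0005, scale=0.02, size=252)

sr_daily = sharpe_ratio(daily_returns)
sr_annual = sr_daily * np.sqrt(252)   # annualized under an IID daily-return assumption
print(f"daily Sharpe: {sr_daily:.3f}, annualized: {sr_annual:.3f}")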
Practitioners and certain parts of academia heavily dispute the EMH. Be-
havioral economists reject the idea of rational markets and believe that human
evolutionary traits such as fear and greed distort market participants’ deci-
sions, creating irrational markets. The Adaptive Market Hypothesis (AMH)
[Lo04] reconciles the efficient market hypothesis with behavioral economics by
applying evolution principles (competition, adaptation, and natural selection)
to financial interactions. According to the AMH, what behavioral economists
label irrational behavior is consistent with an evolutionary model of individuals
adapting to a changing environment. Individuals within the market are contin-
ually learning the market dynamics, and as they do, they adapt their trading
strategies, which in turn changes the dynamics of the market. This loop creates
complicated price dynamics. Traders who adapt quickly to changing dynamics
can exploit potential inefficiencies. Based on the AMH philosophy, this the-
sis hypothesizes that there are inefficiencies in financial markets that can be
exploited, with the recognition that such opportunities are limited and chal-
lenging to discover.
2.5 Forecasting
Unless a person is gambling, betting on the price movements of volatile financial
assets only makes sense if the trader has a reasonable idea of where the price
is moving. Since traders face non-trivial transaction costs, the expected value
of a randomly selected trade is negative. Hence, as described by the gambler’s
ruin, a person gambling on financial markets will eventually go bankrupt due to
the law of large numbers. Forecasting price movements, i.e., making predictions
based on past and present data, is a central component of any financial trading
operation and an active field in academia and industry. Traditional approaches
include fundamental analysis, technical analysis, or a combination of the two
[GD34]. These can be further simplified into qualitative and quantitative ap-
proaches (or a combination). A qualitative approach, i.e., fundamental analysis,
entails evaluating the subjective aspects of a security [GD34], which falls outside
the scope of this thesis. Quantitative (i.e., technical) traders use past data to
make predictions [GD34]. The focus of this thesis is limited to fully quantitative
approaches.
Developing quantitative forecasts for the price series of financial assets is non-
trivial as financial markets are non-stationary with a low signal-to-noise ratio
[Tal97]. Furthermore, modern financial markets are highly competitive and
efficient. As a result, easily detectable signals are almost certainly arbitraged away.
Researchers and practitioners use several mathematical and statistical models to
identify predictive signals leading to excess returns. Classical models include the
autoregressive integrated moving average (ARIMA) and the generalized autore-
gressive conditional heteroskedasticity (GARCH). The ARIMA is a linear model
and a generalization of the autoregressive moving average (ARMA) that can be
applied to time series with nonstationary mean (but not variance) [SSS00]. The
assumption of constant variance (i.e., volatility) is not valid for financial markets
where volatility is stochastic [Tal97]. The GARCH is a non-linear model devel-
oped to handle stochastic variance by modeling the error variance as an ARMA
model [SSS00]. Although the ARIMA and GARCH have practical applications,
their performance in modeling financial time series is generally unsatisfactory
[XNS15, MRC18].
Over the past 20 years, the availability and affordability of computing power,
storage, and data have lowered the barrier of entry to more advanced algo-
rithmic methods. As a result, researchers and practitioners have turned their
attention to more complex machine learning methods because of their ability
to identify signals and capture relationships in large datasets. Initially, there
was a flawed belief that the low signal-to-noise ratio meant only simple forecasts,
such as those based on low-dimensional ordinary least squares, were viable [Isi21].
With the recent deep learning revolution, deep neural networks have demon-
strated strong representation abilities when modeling time series data [SVL14].
The Makridakis competition evaluates time series forecasting methods. In its
fifth installment held in 2020, all 50 top-performing models were based on deep
learning architectures [MSA22]. A considerable amount of recent empirical re-
search suggests that deep learning models significantly outperform traditional
models like the ARIMA and GARCH when forecasting financial time series
[XNS15, MRC18, SNN18, SGO20]. These results are somewhat puzzling. The
risk of overfitting is generally higher for noisy data like financial data. More-
over, the loss function for DNNs is non-convex, which makes finding a global
minimum intractable in general. Despite the elevated overfitting risk and the massive over-
parameterization of DNNs, they still demonstrate stellar generalization. Thus,
based on recent research, the thesis will apply deep learning techniques to model
financial time series.
A review of deep learning methods in financial time series forecasting [SGO20]
found that LSTMs were the preferred choice in sequence modeling, possibly due
to their ability to remember both long- and short-term dependencies. Convolu-
tional neural networks are another common choice. CNNs are best known for
their ability to process 2D grids such as images; however, they have shown a
solid ability to model 1D grid time series data. Using historical prices, Hiransha
et al. [HGMS18] tested FFNs, vanilla RNNs, LSTMs, and CNNs on forecast-
ing next-day stock market returns on the National Stock Exchange (NSE) of
India and the New York Stock Exchange (NYSE). In the experiment, CNNs
outperformed other models, including the LSTM. These deep learning models
can extract generalizable patterns from the price series alone [SGO20].
Moody and Wu [MW97] and Moody et al. [MWLS98] empirically demon-
strated the advantages of reinforcement learning relative to supervised learning.
In particular, they demonstrated the difficulty of accounting for transaction
costs using a supervised learning framework. A significant contribution is their
model-free, policy-based RL algorithm for trading financial instruments, recurrent
reinforcement learning (RRL). The name refers to the recursive mechanism
that stores the past action as an internal state of the environment, allowing the
agent to consider transaction costs. The agent outputs market positions and is
limited to a discrete action space at ∈ {−1, 0, 1}, corresponding to maximally
short, no position, and maximally long. At time t, the previous action at−1 is
fed into the policy network fθ along with the external state of the environment
st in order to make the trade decision, i.e.,
\[
a_t = f_\theta(s_t, a_{t-1})
\]
where fθ is a linear function, and the external state is constructed using the
past 8 returns. The return rt is realized at the end of the period (t − 1, t] and
includes the returns resulting from the position at−1 held through this period
minus transaction costs incurred at time t due to a difference in the new position
at from the old at−1 . Thus, the agent learns the relationship between actions
and the external state of the environment and the internal state.
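As a minimal illustration of this recurrent mechanism, the following sketch maps the past eight returns and the previous position to a new position and computes a transaction-cost-adjusted reward. The tanh-squashed linear policy, the cost rate, and the randomly drawn parameters are hypothetical stand-ins, not the exact specification used by Moody et al.

import numpy as np

def rrl_position(theta, returns_window, prev_position):
    # Map the external state (past returns) and the internal state (previous
    # position) to a new position in [-1, 1], i.e. a_t = f_theta(s_t, a_{t-1}).
    features = np.append(returns_window, prev_position)
    return np.tanh(theta @ features)

def trade_reward(prev_position, new_position, period_return, cost_rate=0.0005):
    # Return earned by holding prev_position over the period, minus the
    # transaction cost of moving from prev_position to new_position.
    return prev_position * period_return - cost_rate * abs(new_position - prev_position)

rng = np.random.default_rng(1)
returns = rng.normal(0.0, 0.01, size=100)
theta = rng.normal(0.0, 0.1, size=9)      # weights for 8 lagged returns + previous action
position, total_reward = 0.0, 0.0
for t in range(8, len(returns)):
    new_position = rrl_position(theta, returns[t - 8:t], position)  # uses r_{t-8}, ..., r_{t-1}
    total_reward += trade_reward(position, new_position, returns[t])
    position = new_position
print(f"cumulative reward: {total_reward:.4f}")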
Moody and Saffell [MS01] compared their actor-based RRL algorithm to
the value-based Q-learning algorithm when applied to financial trading. The
algorithms are tested on two real financial time series: the U.S. dollar/British
pound foreign exchange pair and the S&P 500 stock index. While both perform
better than a buy-and-hold baseline, the RRL algorithm outperforms Q-learning
on all tests. The authors argue that actor-based algorithms are better suited
to immediate reward environments and may be better able to deal with noisy
data and quickly adapt to non-stationary environments. They point out that
critic-based RL suffers from the curse of dimensionality and that when extended
to function approximation, it sometimes fails to converge even in simple MDPs.
Deng et al. [DBK+ 16] combine Moody’s direct reinforcement learning frame-
work with a recurrent neural network to introduce feature learning through deep
learning. Another addition is the use of continuous action space. To constrain
actions to the interval [−1, 1], the RNN output is passed through a tanh function.
Jiang et al. [JXL17] present a deterministic policy gradient algorithm that
trades a portfolio of multiple financial instruments. The policy network is mod-
eled using CNNs and LSTMs, taking each period’s closing, lowest, and highest
prices as input. The DNNs are trained on randomly sampled mini-batches of
experience. These methods account for transaction costs but not risk. Zhang
et al. [ZZW+ 20] present a deep RL framework for a risk-averse agent trading
a portfolio of instruments using both CNNs and LSTMs. Jin and El-Saawy
[JES16] suggest that adding a risk-term to the reward function that penalizes
the agent for volatility produces a higher Sharpe ratio than optimizing for the
Sharpe ratio directly. Zhang et al. [ZZW+ 20] apply a similar risk-term penalty
to the reward function.
2.7 Feature engineering
Any forecast needs some predictor data, or features, to make predictions. While
ML forecasting is a science, feature engineering is an art and arguably the most
crucial part of the ML process. Feature engineering and selection for financial
forecasting are only limited by imagination. Features range from traditional
technical indicators (e.g., Moving Average Convergence Divergence, Relative
Strength Index) [ZZR20] to more modern deep learning-based techniques like
analyzing social-media sentiment of companies using Natural Language Process-
ing [ZS10] or using CNNs on satellite images along with weather data to predict
cotton yields [TOdSMJZ20]. Research in feature engineering and selection is
exciting and potentially fruitful but beyond this thesis’s scope. The most re-
liable predictor of future prices of a financial instrument tends to be its past
price, at least in the short term [Isi21]. Therefore, in this thesis, higher-order
features are not manually extracted. Instead, only the price series are analyzed.
infinite variance can approximate returns over fixed periods. In 1967, Mandel-
brot and Taylor [MT67] argued that returns over a fixed number of transactions
may be close to Independent and Identically Distributed (IID) Gaussian. Sev-
eral empirical studies have since confirmed this [Cla73, AG00]. Clark [Cla73]
discovered that sampling by volume instead of transactions exhibits better sta-
tistical properties, i.e., closer to IID Gaussian distribution. Sampling by volume
instead of ticks has intuitive appeal. While tick bars count one transaction of n
contracts as one bar, n transactions of one contract count as n bars. Sampling
according to transaction volume might lead to significant sampling frequency
variations for volatile securities. When the price is high, the volume will be
lower, and therefore the number of observations will be lower, and vice versa,
even though the same value might be transacted. Therefore, sampling by the
monetary value transacted, also called dollar bars, may exhibit even better sta-
tistical properties [DP18]. Furthermore, for equities, sampling by monetary
value exchanged makes an algorithm more robust against corporate actions like
stock splits, reverse splits, stock offerings, and buybacks. To maintain a suitable
sampling frequency, the sampling threshold may need to be adjusted if the total
market size changes significantly.
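A possible implementation of such dollar bars is sketched below: ticks are accumulated until the traded monetary value crosses a threshold, at which point a bar is emitted. The tick data and the threshold value are synthetic and purely illustrative.

import numpy as np

def dollar_bars(prices, volumes, dollar_threshold):
    # Group ticks into bars that each contain roughly the same traded value.
    bars, bar_ticks, value_in_bar = [], [], 0.0
    for price, volume in zip(prices, volumes):
        bar_ticks.append(price)
        value_in_bar += price * volume
        if value_in_bar >= dollar_threshold:
            bars.append({
                "open": bar_ticks[0],
                "high": max(bar_ticks),
                "low": min(bar_ticks),
                "close": bar_ticks[-1],
                "value": value_in_bar,
            })
            bar_ticks, value_in_bar = [], 0.0
    return bars

rng = np.random.default_rng(2)
prices = 10.0 + np.cumsum(rng.normal(0, 0.01, size=10_000))   # synthetic tick prices
volumes = rng.integers(1, 100, size=10_000)                    # synthetic tick sizes
bars = dollar_bars(prices, volumes, dollar_threshold=50_000)
print(f"{len(bars)} dollar bars built from {len(prices)} ticks")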
Although periodic feature extraction reduces the number of observations that
must be processed, it scales linearly in computation and memory requirements
per observation. A history cut-off is often employed to represent the state by
only the n most recent observations to tackle this problem. Representing the
state of a partially observable MDP by the n most recent observations is a
common technique used in many reinforcement learning applications. Mnih et
al. [MKS+ 13] used 4 stacked observations as input to the DQN agent that
achieved superhuman performance on Atari games to capture the trajectory of
moving objects on the screen. The state of financial markets is also usually
approximated by stacking past observations [JXL17, ZZR20, ZZW+ 20].
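A minimal sketch of this stacking scheme follows, in which the state at time t is approximated by the n most recent observations; the window length and the feature layout are arbitrary choices.

import numpy as np

def stacked_state(observations: np.ndarray, t: int, n: int) -> np.ndarray:
    # Approximate the state at time t by the n most recent observations.
    # observations has shape (T, num_features); the returned window has
    # shape (n, num_features) and contains rows t-n+1, ..., t.
    assert t >= n - 1, "not enough history to build the state"
    return observations[t - n + 1 : t + 1]

features = np.random.default_rng(3).normal(size=(500, 3))  # e.g. close/low/high per period
state = stacked_state(features, t=120, n=50)
print(state.shape)  # (50, 3)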
2.9 Backtesting
Assessing a machine learning model involves estimating its generalization error
on new data. The most widely used method for estimating generalization error
is Cross-Validation (CV), which assumes that observations are IID and drawn
from a shared underlying data-generating distribution. However, the price of
a financial instrument is a nonstationary time series with an apparent tempo-
ral correlation. Conventional cross-validation ignores this temporal component
and is thus unsuitable for assessing a time series forecasting model. Instead,
backtesting, a form of cross-validation for time series, is used. Backtesting is
a historical simulation of how the model would have performed should it have
been run over a past period. The purpose of backtesting is the same as for
cross-validation: to determine the generalization error of an ML algorithm.
To better understand backtesting, it is helpful to consider an algorithmic
trading agent’s objective and how it operates to achieve it. The algorithmic
trading process involves the agent receiving information and, based on that
information, executing trades at discrete time steps. These trades are intended
to achieve a specific objective set by the stakeholder, which, in the philosophy
of modern portfolio theory, is maximizing risk-adjusted returns. Thus, assessing
an algorithmic trading agent under the philosophy of modern portfolio theory
entails estimating the risk-adjusted returns resulting from the agent’s actions.
However, when testing an algorithmic trading agent, it cannot access data ahead
of the forecasting period, as that would constitute information leakage to the
agent. For this reason, conventional cross-validation fails in algorithmic trading.
The most precise way to assess the performance of an algorithmic trading
agent is to deploy it to the market, let it trade with the intended amount of
capital, and observe its performance. However, this task would require consid-
erable time since low-frequency trading algorithms are typically assessed over
long periods, often several years5 . Additionally, any small error would likely
result in devastating losses, making algorithmic trading projects economically
unfeasible.
5 First, low-frequency trading algorithms typically make fewer than ten trades per day; in order to obtain sufficient test samples, the agent must trade for several years. Second, testing the algorithmic trading agent under various market conditions is crucial. A successful model in particular market conditions may be biased towards those conditions and fail to generalize to other market conditions.
Backtesting involves a series of tests that progress sequentially
through time, where every test set consists of a single observation. At test n,
the model trains on the training set consisting of observations prior to the ob-
servation in the test set (i < n). The forecast is, therefore, not based on future
observations, and there is no leakage from the test set to the training set. Then
the backtest progresses to the subsequent observation n + 1, where the training
set increases to include observations i < n + 1. The backtest progresses until
there are no test sets left. The backtesting process is formalized in algorithm 1.
Algorithm 1 Backtesting
  Train the model on the first k observations (of T total observations).
  for i = 0, 1, ..., T − k do
    Select observation k + i from the test set.
    Register trade a_{k+i}.
    Train the model using observations at times t ≤ k + i.
  end for
  Measure performance using the registered trades a_k, a_{k+1}, ..., a_T and the corresponding prices at times k, k + 1, ..., T.
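The same procedure can be written as a walk-forward loop in Python. The sketch below is schematic: the model interface (fit/trade), the buy-and-hold placeholder, and the log-return bookkeeping are assumptions made purely for illustration.

import numpy as np

class BuyAndHold:
    # Trivial placeholder model: always fully long.
    def fit(self, history):
        pass
    def trade(self, history):
        return 1.0

def walk_forward_backtest(model, prices, k):
    # Walk-forward backtest in the spirit of Algorithm 1: train on the first k
    # observations, then alternate between registering a trade and re-training
    # on all data observed so far.
    T = len(prices)
    model.fit(prices[:k])
    trades = []
    for i in range(T - k):
        history = prices[: k + i + 1]          # observations at times t <= k + i
        trades.append(model.trade(history))    # register trade a_{k+i}
        model.fit(history)                     # re-train on the expanded training set
    # Trade a_{k+i} earns the log return realized over (k+i, k+i+1].
    log_returns = np.diff(np.log(prices[k:])) * np.array(trades[:-1])
    return trades, log_returns

prices = np.exp(np.cumsum(np.random.default_rng(4).normal(0, 0.01, 1000)))
trades, log_returns = walk_forward_backtest(BuyAndHold(), prices, k=500)
print(f"registered {len(trades)} trades, mean log return {log_returns.mean():.5f}")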
as expected [DP18]. Random historical patterns might exhibit excellent per-
formance in a backtest. However, it should be viewed cautiously if no ex-ante
logical foundation exists to explain the performance [AHM19].
3 Deep learning
An intelligent agent requires some means of modeling the dynamics of the sys-
tem in which it operates. Modeling financial markets is complicated due to low
signal-to-noise ratios and non-stationary dynamics. The dynamics are highly
nonlinear; thus, several traditional statistical modeling approaches cannot cap-
ture the system’s complexity. Moreover, reinforcement learning requires param-
eterized function approximators, rendering nonparametric learners, e.g., support
vector machines and random forests, unsuitable. Additionally, parametric learn-
ers are generally preferred when the predictor data is well-defined [GBC16], such
as when forecasting security prices using historical data. In terms of nonlinear
parametric learners, artificial neural networks comprise the most widely used
class. Research suggests that deep neural networks, such as LSTMs and CNNs,
effectively model financial data [XNS15, HGMS18, MRC18]. Therefore, the al-
gorithmic trading agent proposed in this thesis will utilize deep neural networks
to model commodity markets.
This chapter introduces the fundamental concepts of deep learning relevant
to this thesis, starting with some foundational concepts related to machine learn-
ing 3.1 and supervised learning 3.2. Next, section 3.3 covers artificial neural
networks, central network topologies, and how they are initialized and opti-
mized to achieve satisfactory results. The concepts presented in this chapter
are presented in the context of supervised learning but will be extended to the
reinforcement learning context in the next chapter (4).
3.1.1 No-free-lunch theorem
The no-free-lunch theorem states that no single machine learning algorithm is
universally superior across all possible datasets. Fortunately,
this does not mean ML research is futile. Instead, domain-specific knowledge
is required to design successful ML models. The no-free-lunch theorem results
only hold when averaged over all possible data-generating distributions. If the
types of data-generating distributions are restricted to classes with certain sim-
ilarities, some ML algorithms perform better on average than others. Instead
of trying to develop a universally superior machine learning algorithm, the fo-
cus of ML research should be on what ML algorithms perform well on specific
data-generating distributions.
the approximate relationship between (x, y). This relationship can be defined
as
\[
y = f(x) + \epsilon \tag{3.1}
\]
where ϵ is some irreducible error, independent of x, with E[ϵ] = 0 and
Var(ϵ) = σ_ϵ². All departures from a deterministic relationship between (x, y)
are captured via the error ϵ. The objective of function approximation is to
approximate the function f with a model fˆ. In reality, this means finding the
optimal model parameters θ. Ordinary least squares can estimate linear models’
parameters θ.
Bias-variance tradeoff Bias and variance are two sources of error in model
estimates. Bias measures the in-sample expected deviation between the model
estimate and the target, and is defined as
\[
\mathrm{Bias}(\hat{f}(x)) = \mathbb{E}[\hat{f}(x)] - f(x)
\]
while variance measures the expected squared deviation of the estimate around its own mean, Var(f̂(x)) = E[(f̂(x) − E[f̂(x)])²]. The expected prediction error then decomposes as
\[
\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \mathrm{Bias}(\hat{f}(x))^2 + \mathrm{Var}(\hat{f}(x)) + \mathrm{Var}(\epsilon)
\]
where the last term Var(ϵ) is the irreducible noise error due to a target com-
ponent ϵ not predictable by x. The bias can be made arbitrarily small using a
more complex model; however, this will increase the model variance, or gener-
alization error, when switching from in-sample to out-of-sample. This is known
as the bias-variance tradeoff.
Overfitting A good ML model minimizes the model error, i.e., the training
error (bias) and the generalization error (variance). This is achieved at some
optimal complexity level, dependent on the data and the model. Increasing
model complexity, or capacity, can minimize training error. However, such a
model is unlikely to generalize well. Therefore, the difference between training
error and generalization error will be high. This is known as overfitting and
happens when the complexity of the ML model exceeds that of the underlying
problem. Conversely, underfitting is when the training error is high because the
model’s complexity is lower than that of the underlying problem.
To minimize model error, the complexity of the ML model, or its induc-
tive bias, must align with that of the underlying problem. The principle of
parsimony, known as Occam’s razor, states that among competing hypotheses
explaining observations equally well, one should pick the simplest one. This
heuristic is typically applied to ML model selection by selecting the simplest
model from models with comparable performance.
Figure 3.1: An illustration of the relationship between the capacity of a function
approximator and the generalization error from [GBC16].
Recent empirical evidence has raised questions about the mathematical foun-
dations of machine learning. Complex models such as deep neural networks
have been shown to decrease both training error and generalization error with
growing complexity [ZBH+ 21]. Furthermore, the generalization error keeps de-
creasing past the interpolation limit. These surprising results contradict the
bias-variance tradeoff that implies that a machine learning model should bal-
ance over- and underfitting. Belkin et al. [BHMM19] reconciled these conflict-
ing ideas by introducing a “double descent” curve to the bias-variance tradeoff
curve. Under double descent, test performance can improve again as model
capacity increases beyond the interpolation point.
A feedforward neural network is a mapping h_θ that is a composition of multivariate
functions f_1, f_2, ..., f_k, g, where k is the number of layers in the neural network. It is defined as
\[
h_\theta(x) = g(f_k(f_{k-1}(\cdots f_2(f_1(x)) \cdots)))
\]
The functions f_j, j = 1, 2, ..., k represent the network's hidden layers and are
composed of multivariate functions. The function at layer j is defined as
\[
f_j(x) = \phi_j(W_j x + b_j)
\]
where ϕ_j is the activation function, W_j is the weight matrix, and b_j is the bias at layer j. The activation
function is used to add nonlinearity to the network. The network’s final output
layer function g can be tailored to suit the problem the network is solving, e.g.,
linear for Gaussian output distribution or Softmax distribution for categorical
output distribution.
backpropagation. However, too large values can result in exploding values, a
problem particularly prevalent in recurrent neural networks. The initial scale
of the weights is usually set to something like 1/√m, where m is the number of
inputs to the network layer.
Kaiming initialization [HZRS15] is a parameter initialization method that
takes the type of activation function (e.g., Leaky-ReLU) used to add nonlinearity
to the neural network into account. The key idea is that the initialization
method should not exponentially reduce or magnify the magnitude of input
signals. Therefore, each layer is initialized at separate scales depending on their
size. Let ml ∈ N+ be the size of the inputs into the layer l ∈ N+ . Kaiming He
et al. recommend initializing weights such that
\[
\frac{1}{2}\, m_l\, \mathrm{Var}[\theta_l] = 1
\]
which corresponds to an initialization scheme of
\[
w_l \sim \mathcal{N}(0,\ 2/m_l)
\]
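A minimal NumPy sketch of this scheme follows, drawing each layer's weights from a zero-mean Gaussian whose variance is 2 divided by the layer's fan-in; the layer sizes are arbitrary.

import numpy as np

def kaiming_normal(fan_in, fan_out, rng=np.random.default_rng(0)):
    # Draw a (fan_out, fan_in) weight matrix with Var[w] = 2 / fan_in, which
    # keeps the variance of ReLU activations roughly constant across layers.
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_out, fan_in))

layer_sizes = [64, 128, 128, 1]            # input, two hidden layers, output
weights = [kaiming_normal(m, n) for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
for W in weights:
    print(W.shape, f"empirical variance: {W.var():.4f}")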
Due to neural nets’ nonlinearity, most loss functions are non-convex, meaning
it is impossible to find an analytical solution to ∇J(θ) = 0. Instead, iterative,
gradient-based optimization algorithms are used. There are no convergence
guarantees, but it often finds a satisfactorily low value of the loss function
relatively quickly. Gradient descent-based algorithms adjust the weights θ in the
direction that minimizes the MSE loss function. The update rule for parameter
weights in gradient descent is defined as
\[
\theta_{t+1} = \theta_t - \alpha \nabla J(\theta_t)
\]
where α > 0 is the learning rate and the gradient ∇J(θ_t) is the vector of partial derivatives
of the objective function with respect to each weight. The learning rate defines
the rate at which the weights move in the direction suggested by the gradient of
the objective function. Gradient-based optimization algorithms, also called first-
order optimization algorithms, are the most common optimization algorithms
for neural networks [GBC16].
Stochastic gradient descent Optimization algorithms that process the en-
tire training set simultaneously are known as batch learning algorithms. Using
the average of the entire training set allows for calculating a more accurate
gradient estimate. Batch learning will converge to a local minimum faster than
online learning. However, batch learning is not suitable
for all problems, e.g., problems with massive datasets due to the high computa-
tional costs of calculating the full gradient or problems with dynamic probability
distributions.
Instead, Stochastic Gradient Descent (SGD) is often used when optimizing
neural networks. SGD replaces the gradient in conventional gradient descent
with a stochastic approximation. Furthermore, the stochastic approximation is
only calculated on a subset of the data. This reduces the computational costs
of high-dimensional optimization problems. However, the loss is not guaranteed
to decrease when using a stochastic gradient estimate. SGD is often used for
problems with continuous streams of new observations rather than a fixed-size
training set. The update rule for SGD is similar to the one for GD but replaces
the true gradient with a stochastic estimate
\[
\theta_{t+1} = \theta_t - \alpha_t \nabla_\theta J^{(j)}(\theta_t)
\]
where ∇_θ J^{(j)}(θ) is the stochastic estimate of the gradient computed from
observation j. The total loss is defined as J(θ) = ∑_{j=1}^{N} J^{(j)}(θ), where N ∈ ℕ is
the total number of observations. The learning rate at time t is α_t > 0. Due
to the noise introduced by the SGD gradient estimate, gradually decreasing the
learning rate over time is crucial to ensure convergence. Stochastic approximation
theory guarantees convergence to a local optimum if the learning rate satisfies the conditions
∑_t α_t = ∞ and ∑_t α_t² < ∞. It is common to adjust the learning rate using the
update rule α_t = (1 − β)α_0 + βα_τ, where β = t/τ, and the learning rate
is kept constant after τ iterations, i.e., ∀t ≥ τ, α_t = α_τ [GBC16].
Due to hardware parallelization, simultaneously computing the gradient of
N observations will usually be faster than computing each gradient separately
[GBC16]. Neural networks are, therefore, often trained on mini-batches, i.e.,
sets of more than one but less than all observations. Mini-batch learning is an
intermediate approach to fully online learning and batch learning where weights
are updated simultaneously after accumulating gradient information over a sub-
set of the total observations. In addition to providing better estimates of the
gradient, mini-batches are more computationally efficient than online learning
while still allowing training weights to be adjusted periodically during train-
ing. Therefore, minibatch learning can be used to learn systems with dynamic
probability distributions. Samples of the mini-batches should be independent
and drawn randomly. Drawing ordered batches will result in biased estimates,
especially for data with high temporal correlation.
Due to noisy gradient estimates, stochastic gradient descent and mini-batches
of small size will exhibit higher variance than conventional gradient descent dur-
ing training. The higher variance can be helpful to escape local minima and
find new, better local minima. However, high variance can also lead to prob-
lems such as overshooting and oscillation that can cause the model to fail to
converge. Several extensions have been made to stochastic gradient descent to
circumvent these problems.
a moving average of gradients, E[g]_t = β_1 E[g]_{t−1} + (1 − β_1) g_t, with decay rate
β_1 > 0. Like RMSProp, Adam also stores a moving average of squared gradients,
E[g²]_t, with decay rate β_2 > 0. The Adam update rule is given as
\[
\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{E}[g^2]_t + \epsilon}}\, \hat{E}[g]_t \tag{3.12}
\]
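The sketch below implements these moment estimates and the update of equation (3.12) for a single parameter vector, including the bias correction implied by the hat notation. The hyperparameter values follow common defaults and the quadratic toy objective is purely illustrative.

import numpy as np

def adam_step(theta, grad, state, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam update; `state` holds the moving averages and the step count.
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad       # E[g]_t
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad**2    # E[g^2]_t
    m_hat = state["m"] / (1 - beta1 ** state["t"])             # bias-corrected estimates
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return theta - alpha * m_hat / np.sqrt(v_hat + eps)

# Minimize J(theta) = ||theta||^2 / 2, whose gradient is simply theta.
theta = np.array([5.0, -3.0])
state = {"m": np.zeros_like(theta), "v": np.zeros_like(theta), "t": 0}
for _ in range(10_000):
    theta = adam_step(theta, grad=theta, state=state)
print(theta)  # approaches the minimum at [0, 0]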
3.3.4 Backpropagation
Gradient-based optimization requires a method for computing a function’s gra-
dient. For neural nets, the gradient of the loss function with respect to the
weights of the network ∇θ J(θ) is usually computed using the backpropagation
algorithm (backprop) introduced in 1986 [RHW85]. Backprop calculates the
gradient of the loss function with respect to each weight in the network. This is
done by iterating backward through the network layers and repeatedly applying
the chain rule. The chain rule of calculus is used when calculating derivatives of
functions that are compositions of other functions with known derivatives. Let
y, z : R → R be functions defined as y = g(x) and z = f (g(x)) = f (y). By the
chain rule
\[
\frac{dz}{dx} = \frac{dz}{dy}\frac{dy}{dx} \tag{3.13}
\]
Generalizing further, let x ∈ Rm , y ∈ Rn , and define mappings g : Rm → Rn
and f : Rn → R. If y = g(x) and z = f (y), then the chain rule is
\[
\frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial y_j}\frac{\partial y_j}{\partial x_i} \tag{3.14}
\]
By recursively applying the chain rule, the gradient of a scalar loss can be expressed
with respect to any node in the network that contributed to producing it. This is done recursively, starting
from the output layer and going back through the layers of the network to avoid
storing subexpressions of the gradient or recomputing them several times.
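As an illustration, the sketch below backpropagates the gradient of an MSE loss through a tiny two-layer ReLU network by applying the chain rule layer by layer, and checks one weight gradient against a finite-difference approximation. The network sizes and data are arbitrary.

import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=(4, 3))            # batch of 4 inputs with 3 features
y = rng.normal(size=(4, 1))            # regression targets
W1, b1 = rng.normal(size=(3, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)

def forward(x):
    z1 = x @ W1 + b1
    a1 = np.maximum(z1, 0.0)           # ReLU hidden layer
    y_hat = a1 @ W2 + b2               # linear output layer
    return z1, a1, y_hat

# Forward pass, then backward pass applying the chain rule layer by layer.
z1, a1, y_hat = forward(x)
loss = np.mean((y_hat - y) ** 2)
d_yhat = 2 * (y_hat - y) / y.shape[0]  # dL/dy_hat
dW2 = a1.T @ d_yhat                    # dL/dW2
d_a1 = d_yhat @ W2.T                   # propagate back through the output layer
d_z1 = d_a1 * (z1 > 0)                 # through the ReLU nonlinearity
dW1 = x.T @ d_z1                       # dL/dW1

# Finite-difference check of one entry of dW1.
eps = 1e-6
W1[0, 0] += eps
numerical = (np.mean((forward(x)[2] - y) ** 2) - loss) / eps
W1[0, 0] -= eps
print(f"backprop: {dW1[0, 0]:.6f}  numerical: {numerical:.6f}")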
The derivative is undefined for x = 0, but it has subdifferential [0, 1], and
it conventionally takes the value ReLU ′ (0) = 0 in practical implementations.
Since ReLU is a piecewise linear function, it optimizes well with gradient-based
methods.
ReLU suffers from what is known as the dying ReLU problem, where a
large gradient could cause a node’s weights to update such that the node will
never output anything but 0. Such nodes will not discriminate against any
input and are effectively “dead”. This problem can be caused by unfortunate
weight initialization or a too-high learning rate. Generalizations of the ReLU
function, like the Leaky ReLU (LReLU) activation function, have been proposed
to combat the dying ReLU problem [GBC16]. Leaky ReLU allows a small
“leak” for negative values proportional to some slope coefficient α, e.g., α =
0.01, determined before training. This allows small gradients to travel through
inactive nodes. Leaky ReLU will slowly converge even on randomly initialized
weights but can also reduce performance in some applications [GBC16].
3.3.6 Regularization
Minimization of generalization error is a central objective in machine learning.
The representation capacity of large neural networks, expressed by the universal
approximation theorem (3.3.8), comes at the cost of increased overfitting risk.
Consequently, a critical question in ML is how to design and train neural net-
works to achieve the lowest generalization error. Regularization addresses this
question. Regularization is a set of techniques designed to reduce generalization
error, possibly at the expense of training error.
Regularization of estimators trades increased bias for reduced variance. If
effective, it reduces model variance more than it increases bias. Weight decay
is used to regularize ML loss functions by adding the squared L2 norm of the
parameter weights, Ω(θ) = ½‖θ‖₂², as a regularization term to the loss function
\[
\tilde{J}(\theta) = J(\theta) + \lambda \Omega(\theta) \tag{3.18}
\]
Figure 3.5: An example of the effect of weight decay with parameter λ on a
high-dimensional polynomial regression model from [GBC16].
Dropout is a regularization technique that randomly drops units (along with their connections) from the network during training, preventing units from co-adapting too much. Dropout can be
considered an ensemble method, where an ensemble of “thinned” sub-networks
trains the same underlying base network. It is computationally inexpensive and
only requires setting one parameter α ∈ [0, 1), which is the rate at which nodes
are eliminated.
Early stopping is a common and effective implicit regularization technique
that addresses how many epochs a model should be trained to achieve the low-
est generalization error. The training data is split into training and validation
subsets. The model is iteratively trained on the training set, and at predefined
intervals in the training cycle, the model is tested on the validation set. The
error on the validation set is used as a proxy for the generalization error. If the
performance on the validation set improves, a copy of the model parameters is
stored. If performance worsens, the learning terminates, and the model param-
eters are reset to the previous point with the lowest validation set error. Testing
too frequently on the validation set risks premature termination. Temporary
dips in performance are prevalent for nonlinear models, especially when trained
with reinforcement learning algorithms when the agent explores the state and
action space. Additionally, frequent testing is computationally expensive. On
the other hand, infrequent testing increases the risk of not registering the model
parameters near their performance peak. Early stopping is relatively simple but
comes at the cost of sacrificing parts of the training set to the validation set.
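A schematic early-stopping loop is sketched below. The model interface (state/load_state) and the train_one_epoch and validation_loss callables are hypothetical placeholders, and the check interval and patience are arbitrary choices.

import copy

def fit_with_early_stopping(model, train_one_epoch, validation_loss,
                            max_epochs=200, check_every=5, patience=4):
    # Train until the validation loss stops improving, then restore the
    # parameters saved at the best validation score.
    best_loss, best_params, checks_without_improvement = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        if (epoch + 1) % check_every:
            continue                                  # only validate at fixed intervals
        loss = validation_loss(model)
        if loss < best_loss:
            best_loss = loss
            best_params = copy.deepcopy(model.state())   # hypothetical parameter accessor
            checks_without_improvement = 0
        else:
            checks_without_improvement += 1
            if checks_without_improvement >= patience:
                break                                 # validation performance stopped improving
    if best_params is not None:
        model.load_state(best_params)                 # reset to the best parameters seen
    return best_loss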
As the parameters of the preceding layers change during training, each layer is constantly chasing a moving target. The distribution of inputs during training
is forever changing. This is known as internal covariate shift, making the net-
work sensitive to initial weights and slowing down training by requiring lower
learning rates.
Batch normalization (batch norm) is a method of adaptive reparametriza-
tion used to train deep networks. It was introduced in 2015 [IS15] to help stabi-
lize and speed up training deep neural networks by reducing internal covariate
shift. Batch norm normalizes the output distribution to be more uniform across
dimensions by standardizing the activations of each input variable for each mini-
batch. Standardization rescales the data to standard Gaussian, i.e., zero-mean
unit variance. The following transformation is applied to a mini-batch of acti-
vations to standardize it
\[
\hat{x}^{(k)}_{\text{norm}} = \frac{x^{(k)} - \mathbb{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}] + \epsilon}} \tag{3.19}
\]
where ϵ > 0 is a small number, such as 10^{-8}, added to the denominator for
numerical stability. Normalizing the mean and standard deviation can, however,
reduce the expressiveness of the network [GBC16]. Applying a second transfor-
mation step to the mini-batch of normalized activations restores the expressive
power of the network
\[
\tilde{x}^{(k)} = \gamma\, \hat{x}^{(k)}_{\text{norm}} + \beta \tag{3.20}
\]
where β and γ are learned parameters that adjust the mean and standard devi-
ation, respectively. This new parameterization is easier to learn with gradient-
based methods. Batch normalization is usually inserted after fully connected
or convolutional layers. It is conventionally inserted into the layer before acti-
vation functions but may also be inserted after. Batchnorm speeds up learning
and reduces the strong dependence on initial parameters. Additionally, it can
have a regularizing effect and sometimes eliminate the need for dropout.
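A NumPy sketch of the two transformation steps (3.19) and (3.20) applied to a mini-batch of activations is given below; here γ and β are fixed for illustration, whereas in practice they are learned.

import numpy as np

def batch_norm(activations, gamma, beta, eps=1e-8):
    # Standardize each feature over the mini-batch (3.19), then rescale and
    # shift with the parameters gamma and beta (3.20).
    mean = activations.mean(axis=0)
    var = activations.var(axis=0)
    normalized = (activations - mean) / np.sqrt(var + eps)
    return gamma * normalized + beta

batch = np.random.default_rng(6).normal(loc=3.0, scale=5.0, size=(32, 4))
out = batch_norm(batch, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(6), out.std(axis=0).round(3))  # ~0 mean, ~1 std per feature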
theorem establishes that there are no theoretical constraints on the expressivity
of neural networks. However, it does not guarantee that the training algorithm
will be able to learn that function, only that it can be represented by a sufficiently
large network.
y-axis and shifted
\[
s(t) = (x * w)(t) = \int x(a)\, w(t - a)\, da \tag{3.24}
\]
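For the discrete, finite sequences used in practice, the integral is replaced by a sum; the short sketch below applies a small averaging kernel to a synthetic price series with NumPy's discrete convolution. The kernel and series are illustrative.

import numpy as np

prices = np.exp(np.cumsum(np.random.default_rng(7).normal(0, 0.01, 250)))
kernel = np.ones(5) / 5                                  # simple 5-period averaging kernel w
smoothed = np.convolve(prices, kernel, mode="valid")     # s(t) = sum_a x(a) w(t - a)
print(prices.shape, smoothed.shape)                      # (250,) (246,)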
inputs to another. The recurrent structure enables networks to exhibit temporal
dynamic behavior. RNNs scale far better than feedforward networks for longer
sequences and are well-suited to processing sequential data. However, they can
be cumbersome to train as their recurrent structure precludes parallelization.
Furthermore, conventional batch norm is incompatible with RNNs, as the re-
current part of the network is not considered when computing the normalization
statistic.
If the largest singular value of the weight matrix is > 1, the gradient will exponentially increase as it
backpropagates through the RNN cells. Conversely, if the largest singular value
is < 1, the opposite happens, where the gradient will shrink exponentially. For
the gradient of RNNs, this will result in either exploding or vanishing gradi-
ents. This is why vanilla RNNs trained with gradient-based methods do not
perform well, especially when dealing with long-term dependencies. Bengio et
al. [BSF94] present theoretical and experimental evidence supporting this con-
clusion. Exploding gradients lead to large updates that can have a detrimental
effect on model performance. The standard solution is to clip the parameter
gradients above a certain threshold. Gradient clipping can be done element-wise
or by the norm over all parameter gradients. Clipping the gradient norm has
an intuitive appeal over elementwise clipping. Since all gradients are normal-
ized jointly with the same scaling factor, the gradient still points in the same
direction, which is not necessarily the case for element-wise gradient clipping
[GBC16]. Let ∥g∥ be the norm of the gradient g and v > 0 be the norm thresh-
old. If the norm crosses over the threshold ∥g∥ > v, the gradient is clipped to
\[
g \leftarrow \frac{g\, v}{\lVert g \rVert} \tag{3.26}
\]
Gradient clipping solves the exploding gradient problem and can improve perfor-
mance for reinforcement learning with nonlinear function approximation [ARS+ 20].
For vanishing gradients, however, the whole architecture of the recurrent net-
work needs to be changed. This is currently a hot topic of research [GBC16].
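As a brief illustration of equation (3.26), the sketch below applies PyTorch's built-in norm clipping to the gradients of a placeholder model; the threshold v = 1 is an example value.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                              # placeholder model
loss = model(torch.randn(4, 10)).pow(2).mean()        # placeholder loss
loss.backward()

# Joint norm clipping (3.26): if ||g|| > v, rescale the gradient to g * v / ||g||.
v = 1.0                                               # example norm threshold
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=v)

# The element-wise variant discussed above would instead be:
# torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=v)
```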
Figure 3.8: LSTM cell from [FFLa].
\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} W \begin{pmatrix} h_{t-1} \\ x_t \end{pmatrix}    (3.27)
where σ is the sigmoid activation function. The cell state ct and hidden state
ht are updated according to the following rules
c_t = f \odot c_{t-1} + i \odot g    (3.28)
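A minimal NumPy sketch of a single LSTM cell step, implementing the gate computation (3.27) and the cell update (3.28), is given below. The hidden-state update h_t = o ⊙ tanh(c_t), which the excerpt cuts off, is included for completeness; bias terms are omitted for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(W, x_t, h_prev, c_prev):
    """One LSTM cell step (eqs. 3.27-3.28); biases are omitted for brevity."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t])       # W [h_{t-1}; x_t]
    i = sigmoid(z[0:H])                         # input gate
    f = sigmoid(z[H:2 * H])                     # forget gate
    o = sigmoid(z[2 * H:3 * H])                 # output gate
    g = np.tanh(z[3 * H:4 * H])                 # candidate cell state
    c_t = f * c_prev + i * g                    # cell state update (3.28)
    h_t = o * np.tanh(c_t)                      # hidden state update
    return h_t, c_t

H, D = 4, 3                                     # example hidden and input sizes
W = np.random.randn(4 * H, H + D)
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(W, np.random.randn(D), h, c)
```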
4 Reinforcement learning
An algorithmic trading agent maps observations of some predictor data to mar-
ket positions. This mapping is non-trivial, and as noted by Moody et al.
[MWLS98], accounting for factors such as risk and transaction costs is diffi-
cult in a supervised learning setting. Fortunately, reinforcement learning pro-
vides a convenient framework for optimizing risk- and transaction-cost-sensitive
algorithmic trading agents.
The purpose of this chapter is to introduce the fundamental concepts of
reinforcement learning relevant to this thesis. A more general and compre-
hensive introduction to reinforcement learning can be found in “Reinforcement
Learning: An Introduction” by Richard Sutton and Andrew Barto [SB18]. An
overview of deep reinforcement learning may be found in “Deep Reinforcement
Learning” by Aske Plaat [Pla22]. This chapter begins by introducing reinforcement
learning (section 4.1), the Markov decision process framework (section 4.2), and some
foundational reinforcement learning concepts (sections 4.3 and 4.4). Section 4.5 discusses how
the concepts introduced in chapter 3 can be combined with re-
inforcement learning to generalize over high-dimensional state spaces. Finally,
section 4.6 introduces policy gradient methods, which allow an agent to optimize
a parameterized policy directly.
4.1 Introduction
Reinforcement Learning (RL) is the machine learning paradigm that studies
how an intelligent agent can learn to make optimal sequential decisions in a
time series environment under stochastic or delayed feedback. It is based on
the concept of learning optimal behavior to solve complex problems by training
in an environment that incorporates the structure of the problem. The agent
optimizes a policy that maps states to actions through reinforcement signals
from the environment in the form of numerical rewards. The goal of using
RL to adjust the parameters of an agent is to maximize the expected reward
generated due to the agent’s actions. This goal is accomplished through trial
and error exploration of the environment. A key challenge of RL is balancing
exploration of uncharted territory with exploitation of current knowledge, known as the
exploration-exploitation tradeoff. Although it has been studied for many years,
the exploration-exploitation tradeoff remains unsolved. Each action must be
tried multiple times in stochastic environments to get a reliable estimate of its
expected reward. For environments with non-stationary dynamics, the agent
must continuously explore to learn how the distribution changes over time. The
agent-environment interaction in RL is often modeled as a Markov decision
process.
where
• S is a countable non-empty set of states (state space).
• A is a countable non-empty set of actions (action space).
• p(s′|s, a) = Pr(st+1 = s′|st = s, at = a) is the transition probability
matrix.
• R ⊂ R is the set of all possible rewards.
• γ ∈ [0, 1] is the discount rate.
The agent interacts with the environment at discrete time steps t = 0, 1, 2, 3, ...,
which are not necessarily fixed intervals of real-time. At each step t, the agent
receives a representation of the state of the environment st ∈ S, where s0 ∈ S
is the initial state drawn from some initial state distribution p0 ∈ ∆(S). Based
on the state st = s, the agent chooses one of the available actions in the current
state at ∈ A(s). After performing the action at , the agent receives an immediate
numerical reward rt+1 ∈ R and the subsequent state representation st+1 ∈ S.
This interaction with a Markov decision process produces a sequence known as
a trajectory: s0 , a0 , r1 , s1 , a1 , r2 , s2 , a2 , r3 , .... This sequence is finite for episodic
tasks (with the termination time usually labeled T ); for continuing tasks, it is
infinite.
for all s ∈ S, and a ∈ A(s). Note that the one-step transition function depends
only on the current state s and not previous states, i.e., the state has the Markov
property. Essentially, MDPs are Markov chains with actions and rewards. The
transition probabilities p : S × S × A → [0, 1] are defined through the dynamics
function p(s′ , r|s, a), as
p(s' \mid s, a) = \Pr(s_t = s' \mid s_{t-1} = s, a_{t-1} = a) = \sum_{r \in R} p(s', r \mid s, a)    (4.3)
state approximates the environment state s^a_t ≈ s_t. However, a single observation
o is not a Markovian state signal. Direct mapping between observation and
action is insufficient for optimal behavior, and a memory of past observations
is required. The history of a POMDP is a sequence of actions and observations
h_t = {o_1, a_1, ..., o_t, a_t}. The agent state can be defined as the history s^a_t = h_t.
However, storing and processing the complete history of every action scales linearly
with time, both in memory and computation. A more scalable alternative
is a stateful sequential model like a recurrent neural network (RNN). In this
model, the agent state is represented by the network s^a_t = f_θ(s^a_{t−1}, o_t).
A state can be split into an agent’s internal state and the environment’s
external state. Anything that cannot be changed arbitrarily by the agent is
considered external and, thus, part of the external environment. On the other
hand, the internal data structures of the agent that the agent can change are
part of the internal environment.
4.3 Rewards
The goal of a reinforcement learning agent is to maximize the expected return
E[G_t], where the return G_t is defined as the discounted sum of future rewards

G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}

where γ ∈ [0, 1] is the discount rate used to scale future rewards. Setting γ = 0
suggests that the agent is myopic, i.e., only cares about immediate rewards. As
long as γ < 1 and the reward sequence is bounded, the discounted return Gt
is finite. Discounting allows reinforcement learning to be used in continuing
problems.
Reinforcement signals rt+1 from the environment can be immediate or de-
layed. Games and robot control are typical examples of delayed reward environ-
ments, where an action affects not only the immediate reward but also the next
state and, through that, all subsequent rewards. An example of delayed reward
is when chess players occasionally sacrifice a piece to gain a positional advantage
later in the game. Although sacrificing a piece in isolation is poor, it can still be
optimal long-term. Consequently, temporal credit assignment is a fundamen-
tal challenge in delayed reward environments. AlphaZero [SHS+ 17] surpassed
human-level play in chess in just 24 hours, starting from random play, using
reinforcement learning. Interestingly, AlphaZero seems unusually (by human
standards) open to material sacrifices for long-term positional compensation,
suggesting that the RL algorithm estimates delayed reward better than human
players. Throughout this thesis, financial trading is modeled as a stochastic
immediate reward environment. This choice is justified in chapter 5. Therefore,
the problem reduces to an associative reinforcement learning problem, a specific
instance of the full reinforcement learning problem. It requires generalization
and trial-and-error exploration but not temporal credit assignment. The meth-
ods presented in this chapter will only be those relevant in an immediate reward
environment. Unless otherwise stated, the discount rate γ, a tradeoff between
immediate and delayed rewards, is assumed to be zero, making the agent my-
opic. As a result, the return Gt in an immediate reward environment is defined
as the immediate reward
G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} = r_{t+1}    (4.7)
π : S → ∆(A) (4.8)
The action-value function Qπ (s, a) is the expected return when performing ac-
tion a in state s and then following the policy π. It is defined ∀s ∈ S, a ∈ A(s)
as
Qπ (s, a) = Eπ [Gt |st = s, at = a] (4.11)
An example of a value-based policy is the ϵ-greedy policy, defined as
\pi(a \mid s, \epsilon) = \begin{cases} \arg\max_a Q^\pi(s, a) & \text{with probability } 1 - \epsilon \\ \text{sample random action } a \sim A(s) & \text{with probability } \epsilon \end{cases}    (4.12)
where ϵ ∈ [0, 1] is the exploration rate.
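A small Python sketch of the ϵ-greedy rule in equation (4.12) for a finite action set is given below; the Q-values are placeholders.

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Return argmax_a Q(s, a) with probability 1 - epsilon, else a random action (4.12)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])      # exploit

# Example with placeholder Q-values for three discrete actions.
action = epsilon_greedy([0.1, 0.5, -0.2], epsilon=0.1)
```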
Reinforcement learning algorithms are divided into on-policy and off-policy
algorithms. On-policy algorithms optimize the same policy that generates the
trajectories. In contrast, off-policy algorithms optimize a policy that differs
from the one generating the trajectories. For off-policy learning,
the exploration can be delegated to an explorative behavioral policy while the
agent optimizes a greedy target policy.
optimize it directly. The policy's parameter vector is θ ∈ R^{d'}, with the policy
defined as

\pi_\theta(a \mid s) = \Pr\{a_t = a \mid s_t = s, \theta_t = \theta\}    (4.13)
Continuous action space is modeled by learning the statistics of the prob-
ability of the action space. A natural policy parameterization in continuous
action spaces is the Gaussian distribution a ∼ N (µθ (s), σθ (s)2 ) defined as
\pi_\theta(a \mid s) = \frac{1}{\sigma_\theta(s)\sqrt{2\pi}} \exp\!\left( -\frac{(a - \mu_\theta(s))^2}{2\sigma_\theta(s)^2} \right)    (4.14)
where µθ (s) ∈ R and σθ (s) ∈ R+ are parametric function approximations of the
mean and standard deviation, respectively. The mean determines the region of the
action space the agent favors, while the standard deviation determines the degree of
exploration. It is important to note that this defines a probability density, not a
probability mass function like the softmax distribution.
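The sketch below samples an action from such a Gaussian policy and evaluates its log-density, corresponding to equation (4.14). The mean network and the parameterization of the standard deviation are illustrative placeholders, not the parameterization used later in this thesis.

```python
import torch
import torch.nn as nn

mu_net = nn.Linear(8, 1)                       # placeholder mean network mu_theta(s)
log_sigma = nn.Parameter(torch.zeros(1))       # sigma_theta = exp(log_sigma) > 0

state = torch.randn(8)
dist = torch.distributions.Normal(mu_net(state), log_sigma.exp())
action = dist.sample()                         # a ~ N(mu_theta(s), sigma_theta(s)^2)
log_prob = dist.log_prob(action)               # log of the density in (4.14)
```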
For policy gradient methods in the continuing setting, the goal of optimizing
the policy πθ is to find the parameters θ that maximize the average rate of
reward per time step [SB18]. The performance measure J for the policy πθ
in the continuing setting is defined in terms of the average rate of reward per
time step as
J(\pi_\theta) = \int_S d^\pi(s) \int_A r(s, a)\, \pi_\theta(a \mid s)\, da\, ds = \mathbb{E}_{s \sim d^\pi,\, a \sim \pi_\theta}[r(s, a)]    (4.15)
allowing the agent to simulate paths and update the policy parameter at every
step [SB18].
4.6.1 REINFORCE
REINFORCE is an on-policy direct policy optimization algorithm derived using
the policy gradient theorem [SB18]. The algorithm is on-policy. Consequently,
the agent will encounter the states in the proportions specified by the steady-
state distribution. Using the policy gradient theorem, the calculation of the
policy gradient reduces to a simple expectation. The only problem is estimating
the action-value function Qπ (s, a). REINFORCE solves this problem by using
the sampled return Gt as an unbiased estimate of the action-value function
Qπ(st, at). Observing that the action-value is equal to the expectation of the
sampled return, i.e., Eπ[Gt|st, at] = Qπ(st, at), the following expression for the
policy gradient can be defined

\nabla_\theta J(\pi_\theta) = \mathbb{E}_\pi\!\left[ G_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]

This expression can be sampled on each time step t, and its expectation equals
the gradient. The gradient ascent policy parameter update for REINFORCE is
defined as
θt+1 = θt + αGt ∇θ log πθt (at |st ) (4.19)
where α is the step size. The gradient points in the direction in parameter space
that most increases the probability of repeating action at on future visits to
st [SB18]. The higher the return, the more the agent wants to repeat that
action. The update is inversely proportional to the action probability to adjust
for different frequencies of visits to states, i.e., some states might be visited
often and have an advantage over less visited states.
While REINFORCE is unbiased and only requires estimating the policy, it
might exhibit high variance due to the high variability of sampled returns (if
the trajectory space is large). High variance leads to unstable learning updates
and slower convergence. Furthermore, the stochastic policy used to estimate
the gradient can be disadvantageous in critical domains such as health care or
finance. Thankfully, both these problems can be solved by a class of policy
gradient methods called actor-critic methods.
4.6.2 Actor-critic
Policy-based reinforcement learning is effective in high-dimensional and con-
tinuous action space, while value-based RL is more sample efficient and more
convenient for online learning. Actor-Critic (AC) methods seek to combine the
best of both worlds where a policy-based actor chooses actions, and the value-
based critic critique those actions. The actor optimizes the policy parameters
using stochastic gradient ascent in the direction suggested by the critic. The
critic’s value function is optimized using stochastic gradient descent to minimize
the loss to the target. This use of a critic introduces bias since the critique is an
approximation of the return rather than the actual observed returns used in actor-based
algorithms such as REINFORCE. There are numerous actor-critic algorithms, such as
advantage actor-critic (A2C) [SB18], asynchronous advantage actor-critic (A3C)
[MBM+ 16], and proximal policy optimization (PPO) [SWD+ 17], that have ex-
hibited impressive performance in a variety of applications. These methods rely
on stochastic policies and computing the advantage function. For critical do-
mains such as finance, a deterministic policy directly optimized by a learned
action-value function might be more appropriate. Fortunately, the policy gra-
dient framework can be extended to deterministic policies [SLH+ 14, LHP+ 15].
The idea behind deterministic actor-critic algorithms is based on Q-learning,
where a network Q(s, a) approximates the return. Q-learning can be extended
to high-dimensional state spaces by defining the Q-network as a function
approximator Qϕ(s, a) : S × A → R, parameterized by ϕ ∈ R^{b'}. If the Q-network
is optimal (Q∗ϕ ), finding the optimal action (a∗ ) in a small discrete action space
is trivial; a∗ (s) = arg maxa Q∗ϕ (s, a). However, the exhaustive computations re-
quired for this process are not feasible in high-dimensional or continuous action
spaces due to the curse of dimensionality. This problem can be bypassed by
learning a deterministic policy µθ(s) : S → A, parameterized by θ ∈ R^{d'}, as an
approximator to a(s), such that maxa Qϕ (s, a) ≈ Qϕ (s, µ(s)).
Initially, it was believed that the deterministic policy gradient did not exist;
however, its existence was proven by Silver et al. [SLH+14], who provide the following
expression for the gradient

\nabla_\theta J(\mu_\theta) = \int_S d^\mu(s)\, \nabla_\theta \mu_\theta(s)\, \nabla_a Q^\mu(s, a)\big|_{a=\mu_\theta(s)}\, ds = \mathbb{E}_{s \sim d^\mu}\!\left[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q^\mu(s, a)\big|_{a=\mu_\theta(s)} \right]    (4.21)
The deterministic policy gradient theorem holds for both on-policy and off-
policy methods. Deterministic policies only require integrating over the state
space and not both the state and action space like stochastic policies. The true
action-value can be approximated by a parameterized critic, i.e., Qϕ (s, a) ≈
Qµ (s, a).
to sub-optimal solutions. To solve this problem, the deterministic actor-critic
algorithm learns off-policy by introducing an exploration policy µ′θ defined as
µ′θ (s) = µθ (s) + W (4.22)
where W is sampled noise from a noise-generating function. The exploration
policy µ′θ explores the environment and generates trajectories that optimize the
target policy µθ and Q-network Qϕ .
Replay memory Learning policies and Q-networks with large nonlinear function
approximators is generally considered difficult and unstable and does not
come with convergence guarantees. Another challenge of combining deep neural
networks with reinforcement learning is that most ML optimization algorithms
assume that samples are independent and identically distributed (IID). The IID
assumption is rarely valid for RL agents sequentially exploring the state space.
Furthermore, minibatch learning is advantageous as it efficiently utilizes hard-
ware optimization. The introduction of replay memory [MKS+ 13] addresses
these problems and trains large nonlinear function approximators stably and
robustly. A replay memory D = {τt−k+1, τt−k+2, ..., τt} is a finite cache storing
the past k transitions τt = (st, at, rt). A mini-batch B ⊆ D of |B| > 0 transitions
is randomly sampled from the replay memory and used to update both the
policy and Q-network.
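A minimal Python sketch of such a replay memory, storing (s, a, r) transitions in a bounded buffer and drawing random mini-batches, is shown below; the capacity is an arbitrary example value.

```python
import random
from collections import deque

class ReplayMemory:
    """Finite cache of the most recent transitions (s, a, r) with random mini-batch sampling."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)    # oldest transitions are evicted automatically

    def store(self, state, action, reward):
        self.buffer.append((state, action, reward))

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)

memory = ReplayMemory(capacity=10_000)          # example capacity
```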
Randomly sampled batches are ineffective for training recurrent neural net-
works, which carry forward hidden states through the mini-batch. Deep Re-
current Q-Network (DRQN) [HS15] is an extension of DQN for recurrent neu-
ral networks. DRQN uses experience replay like DQN; however, the sampled
batches are in sequential order. The randomly sampled batch B ⊆ D consists
of the transitions B = {τi , τi+1 , ..., τi+|B|−2 , τi+|B|−1 }, where i is some random
starting point for the batch. The RNN's initial hidden state is zeroed at the
start of the mini-batch update but then carried forward through the mini-batch.
Part II
Methodology
5 Problem Setting
In reinforcement learning, the agent learns through interaction with the envi-
ronment. Thus, developing a model of the environment, in this case, the com-
modity market, is necessary to optimize an algorithmic trading agent through
reinforcement signals. Commodities trading involves sequential decision-making
in a stochastic and nonstationary environment to achieve some objective out-
lined by the stakeholder. This chapter describes a discrete-time Markov decision
process that models this environment.9 Neither the strong assumption of countable
state and action spaces nor the assumption of full environment observability can be
satisfied. Thus, based on previously proposed financial market dynamical
systems [ZZR20, Hua18, LXZ+18], this chapter presents an infinite, partially
observable MDP for commodities trading.
5.1 Assumptions
Since the model will be tested ex-post by backtesting, described in section 2.9,
it is necessary to make a couple of simplifying assumptions about the markets
the agent operates in:
1. No slippage, i.e., there is sufficient liquidity in the market to fill any orders
placed by the agent, regardless of size, at the quoted price. In other
words, someone is always willing to take the opposite side of the agent’s
trade. This assumption relates to external factors that may affect the
price between the time the agent is quoted the price and the time the
order is filled.10
2. No market impact, i.e., the money invested by the agent is not signifi-
cant enough to move the market. This assumption relates to the agent’s
own trades’ impact on the market. The reasonability of this assumption
depends on the depth of the market.
10 In reality, prices may significantly change between receiving a quote and placing an order.
ideas from Mandelbrot and Taylor [Man97, MT67], and Clark [Cla73] presented
in section 2.8. Dollar volume-based sampling provides better statistical prop-
erties for the agent and can, without human supervision, adapt to changes in
market activity.
In practice, observations are sampled by sequentially summing the product
of the volume vi and price pi of every trade in the market and then sampling
a new observation once this sum breaches a predefined threshold δ > 0 before
starting the process over again. Define the sum of the total transacted dollar
volume from the past sampled point k to point i as
\chi_i = \sum_{j=k+1}^{i} v_j \cdot p_j    (5.1)
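The sampling rule can be sketched as follows: accumulate the transacted dollar volume of equation (5.1) trade by trade and emit a new observation index whenever the running sum breaches the threshold δ. The trade stream in the usage example is hypothetical.

```python
def dollar_bar_indices(prices, volumes, delta):
    """Return the trade indices at which the cumulative dollar volume (5.1) breaches delta."""
    sample_points = []
    chi = 0.0
    for i, (p, v) in enumerate(zip(prices, volumes)):
        chi += p * v                  # accumulate transacted dollar volume
        if chi >= delta:              # threshold breached: sample a new observation
            sample_points.append(i)
            chi = 0.0                 # restart the accumulation
    return sample_points

# Hypothetical stream of four trades with threshold delta = 300 dollars.
idx = dollar_bar_indices(prices=[10.0, 10.1, 9.9, 10.2], volumes=[5, 20, 50, 5], delta=300.0)
```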
toward the financial instrument. Thus, the state space S is continuous and
partially observable. Representing the environment state st to an algorithmic
trading agent is impossible, so it needs to be approximated by an agent state,
i.e., sat ≈ st . This thesis adopts the philosophy of technical traders described
in section 2.5. It uses past trades, specifically their price and volume, as ob-
servations ot of the environment. Let k ∈ N+ be the number of trades for the
instrument during the period (t − 1, t]. An observation ot at time t is defined
as
ot = [pt , vt ] (5.3)
where
• pt ∈ Rk are the prices of all k trades during the period (t − 1, t]. The
opening price is denoted pt.
• vt ∈ Rk are the volumes of all k trades during the period (t − 1, t].
A single observation ot is not a Markovian state signal, and the agent state
can be defined by the entire history ht . However, this alternative is not scal-
able. Section 5.2 introduced the time discretization scheme for this environment,
which is a form of sub-sampling. However, the computational and memory re-
quirements still grow linearly with the number of samples, so a history cut-off
is also employed. In other words, the agent will only have access to the past
n ∈ N+ observations ot−n+1:t . In addition, the recursive mechanism of consider-
ing the past action as a part of the internal state of the environment introduced
by Moody et al. [MWLS98] is adopted to consider transaction costs. The agent
state is formed by concatenating the external state consisting of stacking the
n most recent observations with the internal state consisting of the past action
at−1 , i.e.,
sat = {ot−n+1:t , at−1 } (5.4)
The dimension of the agent state vector is dim (sat ) = 2kn + 1.
complicates the learning process. Instead, a more straightforward approach is
to have the agent output its desired position weight. In this case, a trade is not
directly outputted but inferred from the difference between the agent’s current
position and its chosen next position.
At every step t, the agent performs an action at ∈ [−1, 1], representing the
agent’s position weight during the period (t, t + 1]. The weight represents the
type and size of the position the agent has selected, where
• at > 0 indicates a long position, where the agent bets the price will rise
from time t to time t + 1. The position is proportional to the size of the
weight, where at = 1 indicates that the agent is maximally long.
• at = 0 indicates no position.
• at < 0 indicates a short position, where the agent bets the price will fall.
at = −1 indicates that the agent is maximally short. This thesis assumes
that there is no additional cost or restriction on short-selling.
The trading episode starts and ends (if it ends) with no position, i.e., a0 = aT =
0.
The weight at represents a fraction of the total capital available to the agent
at any time. For this problem formulation, it is irrelevant if at = 1 represents
$1 or $100 million. However, this requires that any fraction of the financial
instrument can be bought and sold. E.g., if the agent has $100 to trade and
wants to take the position at = 0.5, i.e., a long position worth $50, the price
might not divide $50 evenly, meaning that the agent would not get the exact
position it selected. The fractional trading assumption is less reasonable the
smaller the amount of capital available to the agent. On the other hand, the
assumptions made in section 5.1 are less reasonable the higher the amount of
capital.
return of a financial instrument at time t is defined as the relative change in
price from time t − 1 to t
y_t = \frac{p_t}{p_{t-1}} - 1    (5.5)
Multiplicative returns, unlike additive returns, have the advantage that they
are insensitive to the size of the capital traded. Logarithmic returns log (yt + 1)
are typically used in algorithmic trading for their symmetric properties [JXL17,
Hua18, ZZW+20]. The gross log return realized at time t is

\log(y_t + 1) = \log \frac{p_t}{p_{t-1}}
At the end of the period (t − 1, t], due to price movements yt in the market,
the weight at−1 evolve into
a'_t = \frac{a_{t-1}\, \frac{p_t}{p_{t-1}}}{a_{t-1}\, y_t + 1}    (5.7)
where a′t ∈ R. At the start of the next period t, the agent must rebalance
the portfolio from its current weight a′t to its chosen weight at . As noted in
section 2.1, the subsequent trades resulting from this rebalancing are subject to
transaction costs. The size of the required rebalancing at time t is represented
by ||at − a′t ||. The log-return net of transaction costs at time t is defined as
\sigma^2(r^{net}_i \mid i = t - L + 1, ..., t) = \sigma^2_L(r^{net}_t)    (5.9)

r_t = r^{net}_t - \lambda_\sigma\, \sigma^2_L(r^{net}_t)    (5.10)
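A minimal sketch of the variance-penalized reward in equations (5.9)–(5.10) is given below. It assumes a window of the past L net log returns is already available and does not reproduce the computation of the net return itself.

```python
import numpy as np

def risk_adjusted_reward(net_returns, lambda_sigma):
    """Penalize the latest net log return by the sample variance over the lookback window:
    r_t = r_t^net - lambda_sigma * sigma_L^2(r^net), cf. eqs. (5.9)-(5.10)."""
    window = np.asarray(net_returns, dtype=float)   # the past L net log returns, oldest first
    return window[-1] - lambda_sigma * window.var()

# Example with a hypothetical window of L = 5 net log returns.
r_t = risk_adjusted_reward([0.002, -0.001, 0.004, 0.000, 0.003], lambda_sigma=0.01)
```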
6 Reinforcement learning algorithm
This chapter presents two model-free reinforcement learning algorithms that
solve the trading MDP defined in chapter 5. There are three types of rein-
forcement learning algorithms: critic-based, actor-based, and actor-critic-based.
Despite the popularity of critic-based algorithms, such as Q-learning, they are
unsuitable in this context due to their inability to handle high-dimensional or
continuous action spaces. Actor-based and actor-critic-based methods, known
as policy gradient methods (4.6), are appropriate since they can handle continu-
ous action and state spaces. Furthermore, policy gradient methods are suitable
for continuing tasks like trading. As both actor-based and actor-critic-based
methods have advantages and disadvantages, it remains to be determined which
methodology is most appropriate for this problem. Actor-based RL methods like
REINFORCE are generally successful in stochastic continuous action spaces and
have been applied to both single instrument trading and portfolio optimization
[JXL17, ZZR20]. However, actor-based RL suffers high variance in learning and
tends to be unstable and inconvenient in online learning. Actor-critic methods
like Deep Deterministic Policy Gradient (DDPG) [LHP+ 15] have become popu-
lar lately and have been applied to several RL trading and portfolio optimization
problems [LXZ+ 18, YLZW20]. Deterministic policies can be appropriate for fi-
nancial trading, and off-policy learning combined with replay memory can be
practical for online learning. However, simultaneously training two interacting neural networks is generally
deemed unstable. Thus, the selection of a reinforcement learning algorithm
is non-trivial. This chapter presents an actor-based algorithm (section 6.2) and
an actor-critic-based algorithm (section 6.3) for solving the trading MDP.
reward rt+1 , i.e.,
Q(st , at ) = rt+1 (6.1)
∀st ∈ S, at ∈ A(st ). As an immediate reward process, the reward function can
be directly optimized by the policy gradient from rewards.
The actor-based direct policy gradient method introduced in section 6.2
optimizes the policy by using the immediate reward directly. In contrast, the
actor-critic method introduced in section 6.3 optimizes the policy using critique
from a Q-network optimized to minimize the loss to the immediate reward.
At every step t, the agent samples an action at ∼ πθ from the policy and clips
the action to the interval [−1, 1].
The novel idea of using the exploration rate ϵ as a controlled, decaying standard
deviation of the stochastic policy represents progress in the research area.
As ϵ approaches 0, the policy becomes effectively deterministic, collapsing to the mean
given by the parametric function approximation µθ, which is advantageous in
critical domains such as financial trading. At the same time, because the policy remains
stochastic, the sampling required for the REINFORCE update is still available,
blending the best of both worlds for an algorithmic trading agent in an immediate
reward environment.
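The action-selection rule described above can be sketched as follows: sample from a Gaussian centered on µθ(s) with standard deviation ϵ, clip the sample to [−1, 1], and decay ϵ toward a floor. The mean network and the decay constants are placeholders.

```python
import torch
import torch.nn as nn

mu_net = nn.Linear(8, 1)                            # placeholder mean network mu_theta

def select_action(state, epsilon):
    """Sample a ~ N(mu_theta(s), epsilon^2) and clip to the action space [-1, 1]."""
    dist = torch.distributions.Normal(mu_net(state), max(epsilon, 1e-8))
    return dist.sample().clamp(-1.0, 1.0)

epsilon, eps_min, eps_decay = 1.0, 0.1, 0.995       # placeholder exploration schedule
action = select_action(torch.randn(8), epsilon)
epsilon = max(epsilon * eps_decay, eps_min)         # decay exploration toward the minimum
```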
scheme. Trajectories are divided into mini-batches [ts , te ], where ts < te . The
policy's performance measure on a mini-batch is defined as

J(\pi_{\theta,\epsilon}) = \mathbb{E}_{\pi_{\theta,\epsilon}}\!\left[ \sum_{t=t_s}^{t_e} r_t \right]
i.e., the expected sum of immediate rewards during the mini-batch [ts , te ] when
following the policy πθ,ϵ . Using the policy gradient theorem, the gradient of the
performance measure J with respect to the parameter weights θ is defined as

\nabla_\theta J(\pi_{\theta,\epsilon}) = \mathbb{E}_{\pi_{\theta,\epsilon}}\!\left[ \sum_{t=t_s}^{t_e} r_t\, \nabla_\theta \ln \pi_{\theta,\epsilon}(a_t \mid s_t) \right]
This expectation is empirically estimated from rollouts under πθ,ϵ . The param-
eter weights are updated using a stochastic gradient ascent pass
Algorithm 2 Actor-Based Algorithm for Trading
Input: a differentiable stochastic policy parameterization πθ,ϵ (a|s)
Algorithm parameters: learning rate αθ > 0, mini-batch size b > 0, initial
exploration rate ϵ ≥ 0, exploration decay rate λϵ ∈ [0, 1], exploration minimum
ϵmin ≥ 0
Initialize: empty list B of size b
repeat
Receive initial state of the environment s0 ∈ S
repeat
for t = 0,1,...,T-1 do
Sample action at ∼ πθ,ϵ (·|st )
Execute action at in the environment and observe rt and st+1
Store the pair of reward rt and log-probability gradient ∇θ ln πθ,ϵ(at|st) in B
if |B| == b or st is terminal then
Update the policy πθ,ϵ by one step of gradient ascent using:
\nabla_\theta J(\pi_{\theta,\epsilon}) \approx \sum_{B} r_t\, \nabla_\theta \ln \pi_{\theta,\epsilon}(a_t \mid s_t)
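A condensed Python sketch of one policy update in the spirit of Algorithm 2 is given below. Rather than storing the gradients ∇θ ln πθ,ϵ(at|st) directly, the sketch stores the log-probabilities and lets automatic differentiation compute the mini-batch gradient; the optimizer is assumed to be any standard stochastic gradient method.

```python
import torch

def update_policy(optimizer, batch):
    """One gradient ascent step on sum_B r_t * ln pi(a_t | s_t), cf. Algorithm 2.
    `batch` is a list of (reward, log_prob) pairs collected under the current policy;
    ascent is performed by descending the negated objective."""
    objective = torch.stack([r * log_p for r, log_p in batch]).sum()
    loss = -objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

After the update, the buffer B would be emptied before the next mini-batch is collected.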
where W ∼ U[−1,1) is noise sampled from a uniform distribution. The exploration
parameters ϵ, ϵmin, λϵ are defined analogously to those of the direct policy gradient
algorithm in section 6.2. Clipping the agent's actions to the interval [−1, 1] prevents it
from taking larger positions than its available capital.
Optimization Both the actor and critic networks are updated using ran-
domly sampled mini-batches B from a replay memory D. The replay memory
provides random batches in sequential order for stateful RNNs, and random
batches not in sequential order that minimize correlation between samples for
non-stateful DNNs. The exploration policy µ′θ explores the environment and
generates transitions τ stored in the replay memory D.
The objective function J for the deterministic policy µθ is defined as
Algorithm 3 Actor-Critic Algorithm for Trading
Input: a differentiable deterministic policy parameterization µθ (s)
Input: a differentiable state-action value function parameterization Qϕ (s, a)
Algorithm parameters: learning rates αθ > 0, αϕ > 0, mini-batch size b > 0,
replay memory size d ≥ b, initial exploration rate ϵ ≥ 0, exploration decay rate
λϵ ∈ [0, 1], exploration minimum ϵmin ≥ 0
Initialize empty replay memory cache D
repeat
Receive initial state of the environment s0 ∈ S
for t = 1,...,T do
Select action at = µθ (st ) + ϵW from the exploration policy
Execute at in the environment and observe rt and st+1
Store transition τt = (st , at , rt ) in the replay memory D
Sample a random mini-batch B of |B| transitions τ from D
Update the Q-network by one step of gradient descent using
\nabla_\phi \frac{1}{|B|} \sum_{(s,a,r) \in B} \left( Q_\phi(s, a) - r \right)^2
end for
Update the exploration rate ϵ = max (ϵ · λϵ , ϵmin )
until convergence
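The sketch below illustrates one critic step and one actor step on a sampled mini-batch, in the spirit of Algorithm 3. The actor step does not appear in the truncated listing above; it is included here following the deterministic policy gradient of equation (4.21), and all networks and optimizers are placeholders.

```python
import torch

def actor_critic_update(mu_net, q_net, actor_opt, critic_opt, batch):
    """One critic and one actor update on a mini-batch of (s, a, r) transitions."""
    states = torch.stack([s for s, _, _ in batch])
    actions = torch.stack([a for _, a, _ in batch])
    rewards = torch.tensor([r for _, _, r in batch]).unsqueeze(1)

    # Critic: minimize the mean squared error between Q_phi(s, a) and the immediate reward.
    critic_loss = (q_net(states, actions) - rewards).pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: ascend Q_phi(s, mu_theta(s)) by descending its negation (cf. eq. 4.21).
    actor_loss = -q_net(states, mu_net(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```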
7 Network topology
The reinforcement learning algorithms introduced in chapter 6 utilize function
approximation to generalize over a continuous state and action space. Section
2.5 introduced function approximators for extracting predictive patterns from
financial data, where empirical research suggested the superiority of deep learn-
ing methods. Thus, the function approximators introduced in this chapter rely
on deep learning techniques introduced in chapter 3. In the research presented in section
2.5, the function approximators based on convolutional neural networks (3.3.10)
and those based on the LSTM (3.3.11) consistently performed well. Thus, this
section introduces two function approximators based on CNNs and LSTMs, re-
spectively. The sequential information layer, presented in section 7.4, leverages
these techniques to extract predictive patterns from the data. Furthermore,
the decision-making layer that maps forecasts to market positions, presented
in section 7.5, employs the recursive mechanism introduced by Moody et al.
[MWLS98], enabling the agent to consider transaction costs.
The direct policy gradient algorithm presented in section 6.2 is an actor-based
RL algorithm that only uses a parameterized policy network. The deterministic
actor-critic algorithm presented in section 6.3 uses a parameterized
policy network and a parameterized critic network. This chapter outlines these
function approximators, which fortunately consist of many of the same compo-
nents. Section 7.2 describes the policy network, while section 7.3 describes the
Q-network. The last section 7.6 describes the optimization and regularization
of the networks.
the external agent state at time t is defined as
\hat{p}_t = \left[ p_t,\; p^{high}_t,\; p^{low}_t \right]    (7.1)
Normalizing input data for neural networks speeds up learning [GBC16] and
is beneficial for reinforcement learning as well [ARS+ 20]. However, normalizing
the whole time series ex-ante is a form of lookahead. The normalization scheme
can only use data up to time ≤ t for the observation pt ∀t. The choice of instru-
ment weights depends on relative log returns rather than absolute price changes.
The price tensor p̂t is normalized using the log-returns from the previous clos-
ing price pt−1 . Additionally, adopting the ideas from Zhang et al. [ZZR20], the
input is further normalized by dividing by a volatility term defined as
\sigma^2\!\left( \log \frac{p_i}{p_{i-1}} \;\middle|\; i = t - L + 1, ..., t \right) \sqrt{L} = \sigma^2_{L,t} \sqrt{L}    (7.2)
where L ∈ N+ is the lookback window used to calculate the volatility of the closing
price, which is set to L = 60, similarly to [ZZR20]. The normalized price tensor
at time t is thus defined as

\bar{p}_t = \log\left( \hat{p}_t \oslash p_{t-1} \right) \oslash \left( \sigma^2_{L,t} \sqrt{L} \right)    (7.3)
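A minimal NumPy sketch of this normalization is given below. The array names are assumptions, and only data up to and including time t is used, so no lookahead is introduced.

```python
import numpy as np

def normalize_observation(close, high, low, L=60):
    """Normalize the latest price observation per eqs. (7.2)-(7.3)."""
    close = np.asarray(close, dtype=float)
    log_ret = np.diff(np.log(close[-(L + 1):]))        # last L close-to-close log returns
    vol_term = log_ret.var() * np.sqrt(L)              # sigma^2_{L,t} * sqrt(L), eq. (7.2)
    p_hat = np.array([close[-1], high[-1], low[-1]])   # price tensor, eq. (7.1)
    return np.log(p_hat / close[-2]) / vol_term        # eq. (7.3)
```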
information layer, a decision-making layer, and a tanh function. The input to
the policy network is the modified agent state s^{a'}_t. The external part of the agent
state, x^S_t, i.e., the price tensor of stacked observations, is input into the sequential
information layer. The sequential information layer output is concatenated with
the previous action at−1 to produce the input to the decision-making layer. The
output from the decision-making layer maps to a tanh function that produces
the action, constrained to the action space [−1, 1].
7.3 Q-network
The Q-network Q_ϕ : R^{|S|} × R^{|A|} → R is a function approximator parameterized
by ϕ ∈ R^{b'}. It is an action-value function that assigns the value of performing
a specific action in a specific state and thus takes two arguments, the modified
agent state s^{a'}_t and the action a_t. Other than that, there are two differences
between the critic and policy networks. Firstly, the Q-network has an additional
layer before the sequential information net that concatenates the agent state sat
and the current action at and maps it through a fully-connected layer into
a leaky-ReLU activation function with negative slope 0.01 and dropout with
probability 0.2. The second difference is that the output after the decision-
making layer does not map to a tanh function since the Q-network outputs
action-values, which are not constrained to any specific interval.
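The two differences can be sketched in PyTorch as follows; the hidden size is arbitrary, and the remaining layers are placeholders for the sequential information and decision-making components described in the following sections.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Sketch of the two critic-specific differences described above."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        # Difference 1: an extra input layer that merges agent state and action
        # via a fully connected layer, leaky ReLU (slope 0.01), and dropout (p = 0.2).
        self.merge = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.LeakyReLU(negative_slope=0.01),
            nn.Dropout(p=0.2),
        )
        self.body = nn.Linear(hidden, 1)   # placeholder for the remaining layers

    def forward(self, state, action):
        h = self.merge(torch.cat([state, action], dim=-1))
        # Difference 2: no tanh at the output, since action-values are unbounded.
        return self.body(h)
```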
7.4 Sequential information layer
In essence, an algorithmic trading agent places bets on the relative price change,
or returns, of financial instruments. The agent’s success ultimately depends on
its ability to predict the future. However, doing so in highly competitive and
efficient markets is non-trivial. To remain competitive in continuously evolving
markets, the agent must learn to recognize patterns and generate rules based
on past experiences. The sequential information layer extracts the sequential
features from the input data and is arguably the most integral part of the
model. Let x^I_t be the input into the sequential information net (for the policy
network, x^I_t = x^S_t). The sequential information layer is a parametric function
approximator that takes the input x^I_t and outputs a feature vector g_t, defined
as

f^S(x^I_t) = g_t    (7.5)
The choice of the appropriate function approximator for this task is non-
trivial. The inductive bias of the model must align with that of the problem
for the model to generalize effectively. Therefore, selecting a model that cap-
tures the problem’s underlying structure while also being efficient and scalable
is imperative. Research on financial time series forecasting found that deep
learning models, specifically those based on the CNN and LSTM architecture,
consistently outperformed traditional time series forecasting methods such as
the ARIMA and GARCH [XNS15, MRC18, SNN18, SGO20]. The universal ap-
proximation theorem (3.3.8) establishes that there are no theoretical constraints
on feedforward networks’11 expressivity. However, feedforward networks are not
as naturally well-suited to processing sequential data as CNNs and LSTMs.
Therefore, they may not achieve the same level of performance, even though it
is theoretically possible. Additionally, feedforward networks may require signif-
icantly more computing power and memory to achieve the same performance as
CNNs or LSTMs on sequential data. Transformers were also considered due to
their effectiveness in forecasting time series [MSA22]. Transformers employ an
encoder-decoder architecture and rely on attention mechanisms to capture long-
term dependencies. Thus, they do not require a hidden state, like RNNs, and
are relatively easy to parallelize, enabling efficient training on large datasets. A
variant called decision transformers [CLR+ 21] has been applied to offline rein-
forcement learning. However, it is unclear how to apply the transformer in its
conventional encoder-decoder topology to online reinforcement learning. There-
fore, the transformer is, for the moment, unsuitable for this problem. The gated
recurrent unit (GRU) is a newer version of the recurrent neural network that is
less computationally complex than the LSTM. However, LSTMs are generally
considered superior for forecasting financial data [SGO20].
This section defines two distinct DNN topologies for the sequential information
layer: the first is based on convolutional neural networks, while the second is
based on recurrent neural networks, specifically the LSTM. The two sequential
information topologies both consist of two hidden layers, which is enough for
11 Of arbitrary width or depth.
the vast majority of problems. Performance is usually not improved by adding
additional layers.
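Minimal PyTorch sketches of the two backbones are given below. The channel counts, kernel sizes, and hidden sizes are illustrative assumptions rather than the exact values used in the experiments; each variant has two hidden layers, as stated above.

```python
import torch.nn as nn

# CNN-based sequential information layer: two 1-D convolutional hidden layers
# applied over the lookback window of stacked observations.
cnn_layer = nn.Sequential(
    nn.Conv1d(in_channels=3, out_channels=16, kernel_size=3),
    nn.LeakyReLU(0.01),
    nn.Conv1d(in_channels=16, out_channels=16, kernel_size=3),
    nn.LeakyReLU(0.01),
    nn.Flatten(),   # flatten the feature maps into the feature vector g_t
)

# LSTM-based sequential information layer: two stacked LSTM hidden layers;
# the final hidden state serves as the feature vector g_t.
lstm_layer = nn.LSTM(input_size=3, hidden_size=16, num_layers=2, batch_first=True)
```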
recurrent part of the network is not considered when computing the normaliza-
tion statistic and is, therefore, not used.
x^D_t = (g_t, a_{t-1})    (7.6)

The decision-making layer is a dot product between a weight vector w^D ∈ R^{|x^D_t|}
and the input x^D_t

f_D(x^D_t) = (w^D)^\top x^D_t    (7.7)
is clipped to 1 to prevent exploding gradients. There are many potential ac-
tivation functions for neural networks, including the default recommendation,
the ReLU. To combat the “dying ReLU problem”, the leaky-ReLU activation
function is used in the networks. The negative slope, or the “leak”, is set to the
standard value of 0.01.
7.6.1 Regularization
Machine learning research generally focuses on problems with complex struc-
tures and high signal-to-noise ratios, such as image classification. For these
problems, complicated non-linear models like neural nets have demonstrated
their effectiveness. However, in a high-noise environment such as financial forecasting,
where the R-squared is often of order 10^{-4} [Isi21], anything beyond
linear regression poses a significant overfitting risk. An overfitted network will
likely perform well on the training set but generalize poorly out-of-sample. An
algorithmic trading agent that performs well on the training set is of little use,
and it is imperative to reduce the generalization error. Therefore, regulariza-
tion is needed to mitigate the risk of overfitting and reduce the generalization
error. A description of the regularization techniques used for these networks is
provided in section 3.3.6.
For ML models to be generalizable, they must learn data patterns rather
than individual data points to identify a bigger picture agnostic of noisy details.
Regularization techniques such as weight decay limit the capacity of the net-
works by adding a parameter norm penalty to the objective function. Weight
decay uses the L2 norm; other norms, such as the L1 norm, can also be used.
The L2 norm is appropriate since it penalizes large weights more harshly and is easier to
optimize with gradient-based methods. The parameter λwd controls the degree of
penalization, balancing the tradeoff between increased bias and decreased vari-
ance. The network optimizer introduced in this section uses weight decay with
the constant parameter λwd = 0.001 to mitigate the network’s overfitting risk.
Experimentally, this value delivered the optimal balance for the bias-variance
tradeoff. Although increasing the weight decay penalty could further reduce
overfitting risk, this was too restrictive for the networks.
It is important to note that weight decay reduces, but does not eliminate, the
risk of overfitting. Dropout is another explicit regularization technique almost
universally used in deep neural networks. Dropout forces the network to learn
multiple independent data representations, resulting in a more robust model.
When training networks on noisy financial data, dropout effectively ensures
the network ignores the noise. Similarly to weight decay, the dropout rate is a
tradeoff. There is no established heuristic for choosing the dropout rate; instead,
it is usually chosen through experimentation. In this case, a dropout rate of
0.2 provided a suitable regularizing effect where the model generalized well and
produced accurate predictions. Dropout is used between all hidden layers in the
networks.
Although explicit regularizers such as weight decay and dropout reduce over-
fitting risk, it remains tricky to determine the optimal training duration. This
challenge is addressed with early stopping, which functions as an implicit regu-
larizer. The networks are trained in an early stopping scheme, with testing on
the validation set every 10th epoch. As reinforcement learning involves random
exploration, the models are tested slightly less frequently than is conventional to
prevent premature stopping.
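The regularization settings described in this section can be tied together in a short sketch: weight decay with λwd = 0.001, dropout with rate 0.2 between hidden layers, gradient norm clipping to 1, and a validation check every 10th epoch for early stopping. The network architecture, the optimizer (Adam is used purely for illustration), the learning rate, and the validation metric are placeholders.

```python
import torch
import torch.nn as nn

# Placeholder network with dropout (p = 0.2) between hidden layers.
net = nn.Sequential(
    nn.Linear(32, 64), nn.LeakyReLU(0.01), nn.Dropout(0.2),
    nn.Linear(64, 64), nn.LeakyReLU(0.01), nn.Dropout(0.2),
    nn.Linear(64, 1),
)

# Weight decay (L2 penalty) with lambda_wd = 0.001; Adam and the learning rate are illustrative.
optimizer = torch.optim.Adam(net.parameters(), lr=1e-4, weight_decay=0.001)

best_score, patience_left = float("-inf"), 5
for epoch in range(1, 1001):
    # ... one training epoch, with gradients clipped to norm 1 via clip_grad_norm_ ...
    if epoch % 10 == 0:                       # validate every 10th epoch
        val_score = 0.0                       # placeholder validation metric
        if val_score > best_score:
            best_score, patience_left = val_score, 5
        else:
            patience_left -= 1
            if patience_left == 0:            # early stopping
                break
```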
Part III
Experiments
8 Experiment and Results
Experiments play a vital role in science and provide the basis for scientific knowl-
edge. This chapter presents the experiments and results where the methods
presented in part II are tested on historical market data using the backtesting
framework described in section 2.9. The backtest requires simplifying market
assumptions, specified in chapter 5. Section 8.1 details the experiment setting.
The results of the experiment are presented and discussed in sections 8.2 and
8.3. Finally, the overall approach is discussed in section 8.4. The experiment
aims to answer the research questions posed at the start of this thesis.
1. Can the risk of algorithmic trading agents operating in volatile markets
be controlled?
8.1.1 Baselines
Defining a baseline can be helpful when evaluating the performance of the meth-
ods presented in part II. A challenge with testing algorithmic trading agents is
the lack of established baselines. However, by far the most common alternative
is the buy-and-hold baseline [MWLS98, ZZR20, ZZW+20]. The buy-and-hold
baseline consists of buying and holding an instrument throughout the experi-
ment, i.e., at = 1, ∀t. Compared to a naive buy-and-hold baseline, an intelligent
agent actively trading a market should be able to extract excess value and reduce
risk.
8.1.2 Hyperparameters
Table 1 shows the hyperparameters used in this experiment. The learning rates
for the policy network and Q-network are denoted αactor and αcritic , respec-
tively, and were tuned experimentally. |B| is the batch size, and |D| is the
replay memory size. Large batch sizes are necessary to obtain reliable gradient
estimates. However, large batch sizes also result in less frequent updates to
the agent and updates that may contain outdated information. As a result of
this tradeoff, the batch and replay memory sizes used in this experiment were
selected as appropriate values. The transaction cost fraction λc is set to a rea-
sonable value that reflects actual market conditions. The initial exploration rate
is denoted ϵ, with decay rate λϵ and minimum ϵmin . The number of stacked
past observations is given by n, considered a reasonable value for the agent to
use for short-term market prediction.
Table 1: Hyperparameters
their respective pseudocodes 2 and 3, the RL agents continuously refit them-
selves as they observe transitions. The results section (8.2) presents results from
these backtests.
8.1.5 Dataset
The dataset consists of the front-month TTF Natural Gas Futures contracts
from 2011 to 2022. The observations are sampled according to the transacted
euro volume on the exchange, defined in section 5.2. Larger sample sizes are de-
sirable to ensure statistical significance, especially for highly overparameterized
approximators such as neural networks. In addition, predictability is generally
higher over shorter horizons [Isi21]. However, as sampling frequency (and there-
fore trading frequency) increases, simplifying assumptions, such as no impact
and perfect liquidity, become increasingly consequential. Thus, an appropriate
target number of samples per day is tgt = 5, which provides a little over 20 000
total samples. The data processing is limited to what is described in section
7.1.
The first quarter of the dataset, consisting of trades from 01/01/2011 to
01/01/2014, makes up the training set. The validation set is the second quarter
of the dataset, from 01/01/2014 to 01/01/2017. Finally, the test set is the second
half of the dataset from 01/01/2017 to 01/01/2023. Figure 8.1 illustrates the
training-validation-test split.
8.2 Results
This section presents the results of the experiments described in the previous
section. The models are tested using four different values of the risk-sensitivity
term λσ (0, 0.01, 0.1, and 0.2), and the results for all four values are presented.
The results are visualized using three tools: a table and two types of plots,
briefly described below.
• The table consists of the performance metrics (described in section 8.1.4)
of each model (described in section 8.1) from the backtests.
• A standard line plot illustrates the performance of the models against the
baseline indexed in time of the cumulative product of logarithmic trade
returns, where the trade returns are defined in equation 8.1.
• A boxplot illustrates the distribution of the monthly logarithmic returns12
of each model and the baseline. Boxplots summarize a distribution by
its sampled median, the first quartile (Q1), and the third quartile (Q3),
represented by the box. The upper whisker extends to the largest observed
value within Q3 + (3/2) IQR, and the lower whisker extends to the smallest
observed value within Q1 − (3/2) IQR, where the interquartile range (IQR) is
Q3 − Q1 . Dots represent all values outside of the whiskers (outliers).
The plots display the performance of all models and the baseline and are grouped
by risk-sensitivity terms.
12 Again, trade returns are defined in equation 8.1 and resampled to produce monthly values.
The logarithmic monthly returns are then calculated based on these values.
Table 2 below shows the results of the backtests averaged over 10 runs. The
variation between runs was small enough to warrant the level of precision of the
results given in the table.
A pair of plots (line plot and boxplot) are grouped by risk-term values (0,
0.01, 0.1, and 0.2, respectively).
Figure 8.2: Cumulative logarithmic trade returns for λσ = 0
Figure 8.4: Cumulative logarithmic trade returns for λσ = 0.01
Figure 8.6: Cumulative logarithmic trade returns for λσ = 0.1
Figure 8.8: Cumulative logarithmic trade returns for λσ = 0.2
8.3 Discussion of results
This section will discuss the experiment results and how they relate to the three
research questions posed at the start of this thesis.
8.3.1 Risk/reward
The first research question posed at the start of this thesis was:
risk-neutral agents (i.e., those where λσ = 0), on average, increase the returns
by 13% compared to the baseline. Although they have no risk punishment,
they also decrease the standard deviation of returns by 18%. This last point is
surprising but could be a byproduct of an intelligent agent trying to maximize
returns. For λσ = 0.01, the deep RL agents, on average, produce 16% increased
returns and 47% reduced standard deviation of returns compared to the baseline.
This combination results in a 118% higher Sharpe ratio. For λσ = 0.1, the
agents on average produce 25% lower returns; however, the standard deviation
of returns is reduced by a larger 64%. Thus, the Sharpe ratio is increased by 92%
compared to the baseline. The most risk-averse agents (i.e., those where λσ =
0.2) on average produce 38% lower returns with 66% lower standard deviation
of returns, yielding an 83% increase in Sharpe compared to the baseline. The
risk-sensitivity term λσ = 0.01 produces the highest Sharpe ratio on average.
Thus, the backtests suggest that of the four risk-sensitivity options tested in
this thesis, λσ = 0.01 strikes the best risk/reward balance.
8.3.2 RL models
The second research question posed at the start of this thesis was:
What reinforcement learning algorithms are suitable for optimizing
an algorithmic trading agent in an online, continuous time setting?
A curious result from the experiment is that, for three out of four risk-sensitivity
terms13 , the model with the highest hit-rate has the lowest Sharpe. In other
words, the model making the highest rate of profitable trades also produces the
lowest risk-adjusted returns. This result illustrates the complexity of trading
financial markets and justifies the methods chosen in this thesis. Firstly, there
is no guarantee that a higher percentage of correct directional calls will result in
higher returns. Therefore, a forecast-based supervised learning approach opti-
mized for making correct directional calls may not align with the stakeholder’s
ultimate goal of maximizing risk-adjusted returns. For an algorithmic trad-
ing agent to achieve the desired results, e.g., making trades that maximize the
Sharpe ratio, it should be optimized directly for that objective. However, doing
so in a supervised learning setting is not straightforward. Reinforcement learn-
ing, on the other hand, provides a convenient framework for learning optimal
sequential behavior under uncertainty. Furthermore, discrete position sizing, a
drawback of value-based reinforcement learning, can expose the agent to high
risks. However, the agent can size positions based on confidence through the
continuous action space offered by policy gradient methods, allowing for more
effective risk management. Section 2.6 presented research arguing that, for al-
gorithmic trading, reinforcement learning is superior to supervised learning and
policy gradient methods are superior to value-based methods, and this result
supports those arguments.
Although previous research supported policy gradient methods, there was no
consensus on which one was superior in this context. Chapter 6 presented two
13 λσ = 0, 0.01, and 0.1
policy gradient methods: one based on an actor-only framework and the other
based on an actor-critic framework, and discussed their respective advantages
and disadvantages. The previous section (8.2) presented results from the back-
tests where both algorithms were tested out-of-sample. For all risk-sensitivity
terms, the direct policy gradient algorithm, on average, produces a 49% higher
Sharpe ratio than the deterministic actor-critic algorithm. Comparing the two
algorithms using the same network architecture and risk sensitivity term reveals
that the actor-based algorithm outperforms the actor-critic-based algorithm in
7 out of 8 combinations. The only case where the actor-critic-based algorithm
performs better14 is the case with the smallest performance gap. Furthermore,
the actor-only direct policy gradient method strictly increases the Sharpe ratio
for both network architectures as the risk-sensitivity parameter λσ increases to
a maximum at λσ = 0.1. The actor-critic method does not follow this pattern,
suggesting it fails to achieve its optimization objective.
The performance gap between the actor-based and actor-critic-based algo-
rithms is significant enough to warrant a discussion. An explanation for the
performance gap could be that the actor-critic-based algorithm optimizes the
policy using a biased Q-network reward estimate instead of the observed unbi-
ased reward. As a data-generating process, the commodity market is complex
and non-stationary. If the Q-network closely models the data-generating distri-
bution, using reward estimates from sampled experience from a replay memory
is an efficient method for optimizing the policy. On the other hand, it is also
clear that a policy that is optimized using Q-network reward estimates that are
inaccurate will adversely affect performance. The direct policy gradient algo-
rithm optimizes the policy using the observed unbiased reward and avoids this
problem altogether. Given that the reward function is exactly expressed, opti-
mizing it directly, as the direct policy gradient method does, is the most efficient
approach. Many typical RL tasks work well with the actor-critic framework, but
the backtests indicate that financial trading is not one of them.
8.3.3 Networks
The third and final research question posed at the start of this thesis was:
What deep learning architectures are suitable for modeling noisy,
non-stationary financial data?
In the research presented in section 2.5, two types of deep learning architectures
stood out; the long short-term memory and the convolutional neural network.
Chapter 7 presented two types of parametric function approximators based on
the CNN- and LSTM-architecture, respectively. The previous section (8.2) pre-
sented results from the backtests where both these function approximators are
tested out-of-sample. On average, the CNN-based models produce over 5%
higher Sharpe than those based on the LSTM, which is surprising, as LSTMs
are generally viewed as superior in sequence modeling and, due to their memory,
14 PG-LSTM vs. AC-LSTM for λσ = 0.01
are the preferred option when modeling POMDPs. In contrast to the CNN, the
LSTM can handle long-term dependencies, but it seems the lookback window
provides enough historical information for the CNN to make trade decisions.
However, the performance gap is not big enough to say anything conclusive,
and the LSTM outperforms the CNN in some tests, so it is unclear which is
most suitable.
One interesting observation is that the CNN-based models produce higher
returns and standard deviation of returns compared to the LSTM. On average,
the CNN-based models produce 37% higher returns and 15% higher standard
deviation of returns. The line plots in figures 8.2, 8.4, 8.6, and 8.8 suggest
that a possible explanation is that the LSTM-based models prefer
smaller position sizes than the CNN-based models. One potential reason
for this phenomenon involves the difference in how the CNN and LSTM are
optimized. Generally speaking, the CNN-based model is far easier and quicker
to optimize than the LSTM-based model, partly due to batch norm, which in
its conventional form is incompatible with RNNs. Another reason is that when
the LSTM is trained for long sequences, the problem of vanishing gradients
makes back-propagating the error difficult and slow. Increasing the learning
rate leads to exploding gradients. The CNN-based model with batch norm
quickly and effectively adjusts its parameters to take full advantage of newly
observed information during out-of-sample tests. The LSTM-based model, on
the other hand, adjusts its parameters much more slowly. As a result, the actions it
selects often end up somewhere in the middle of the action space, causing smaller
position sizes, lower returns, and lower standard deviation of returns. For that
reason, the author of this thesis theorizes that the performance gap between the
CNN-based and LSTM-based models would increase with time.
8.4.1 Environment
Solving complex real-world problems with reinforcement learning generally re-
quires creating a simplified version of the problem that lends itself to analytical
tractability. Usually, this involves removing some of the frictions and constraints
of the real-world problem. In the context of financial trading, the environment
described in chapter 5 makes several simplifying assumptions about the envi-
ronment, including no market impact, no slippage, the ability to purchase or
sell any number of contracts at the exact quoted price, no additional costs or
restrictions on short-selling, and fractional trading. It is imperative to note that
these assumptions do not necessarily reflect real-world conditions. As such, it is
crucial to know the problem formulation’s limitations and how it will negatively
affect the model’s generalizability to the real-world problem. Poorly designed
environments, where agents learn to exploit design flaws rather than the ac-
tual problem, are a frequent problem in reinforcement learning [SB18]. At the
same time, these simplifying assumptions allow for a clean theoretical analysis
of the problem. Furthermore, the environment introduces some friction through
transaction costs, an improvement over many existing models.
Lookahead bias in the input data is avoided by using the price series alone
as input, as described in section 7.1. The price series of a financial instrument
is generally the most reliable predictor of future prices. However, price series
only provide a limited view of the market and do not consider the broader
economic context and the potential impact of external factors. As a result,
an agent relying solely on price series may miss out on meaningful predictive
signals. Furthermore, since the model learns online, an effective data gover-
nance strategy is required to ensure the quality and integrity of the real-time
input data stream, as data quality issues can harm the model’s performance.
The dollar bar sampling scheme described in section 5.2 has a solid theoretical foundation for improving the statistical properties of the sub-sampled price series compared to traditional time-based sampling. When using this sampling scheme, however, the agent cannot be certain of the prediction horizon, which complicates forecasting.
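For concreteness, the sketch below shows one common way of constructing dollar bars: trades are accumulated until their total dollar value crosses a threshold, at which point a bar is emitted (in the spirit of [DP18]). The threshold and the flat price/volume representation are illustrative assumptions; the exact rule used in section 5.2 may differ.

```python
def dollar_bars(prices, volumes, threshold=1_000_000.0):
    """Sub-sample a trade series into bars that each contain roughly the same
    traded dollar value, instead of the same amount of clock time."""
    bars, bucket, dollars = [], [], 0.0
    for price, volume in zip(prices, volumes):
        bucket.append(price)
        dollars += price * volume
        if dollars >= threshold:           # enough value traded: close the bar
            bars.append({"open": bucket[0], "high": max(bucket),
                         "low": min(bucket), "close": bucket[-1]})
            bucket, dollars = [], 0.0      # start accumulating the next bar
    return bars
```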
Commodity trading firms often conduct asset-backed trading in addition to paper trading; asset-backed trading incurs additional costs for booking pipeline capacity, storage capacity, or LNG tankers. The model does not currently include these costs, but the environment could be adjusted to include them.
8.4.2 Optimization
Statistical learning relies on an underlying joint feature-target distribution F(x, y) with non-vanishing mutual information. The algorithmic trading agent approximates this relationship by learning the distribution from historical data. As financial markets are nonstationary, statistical distributions constantly change over time, partly because market participants learn the market dynamics and adjust their trading accordingly. In order to remain relevant for the near future, the model must be continuously refitted using only data from the immediate past, at the expense of statistical significance. On the other hand, training a complex model on a relatively small set of recent data is challenging in a high-noise setting such as financial forecasting and often results in poor predictive power. This tradeoff between timeliness and statistical significance is known as the timeliness-significance tradeoff [Isi21] and highlights a central challenge in optimizing algorithmic trading models.
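To make this tradeoff concrete, the sketch below shows a walk-forward loop in which the model is repeatedly refit on a short window of recent observations and evaluated on the period that immediately follows. The window lengths and the `fit`/`evaluate` callables are illustrative placeholders, not the thesis' actual training schedule.

```python
def walk_forward(series, fit, evaluate, train_len=500, test_len=50):
    """Refit on the most recent `train_len` observations, evaluate on the next
    `test_len`, then roll forward. A short training window keeps the model
    timely but weakens statistical significance; a long window does the
    opposite (the timeliness-significance tradeoff)."""
    results, start = [], 0
    while start + train_len + test_len <= len(series):
        train = series[start:start + train_len]
        test = series[start + train_len:start + train_len + test_len]
        model = fit(train)                  # e.g., one or more gradient steps
        results.append(evaluate(model, test))
        start += test_len                   # slide the window forward
    return results
```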
This thesis investigates the use of reinforcement learning in algorithmic trading, a field traditionally dominated by supervised learning-based approaches. Supervised learning is a straightforward method for easily labeled tasks, such as forecasting financial prices. Reinforcement learning, on the other hand, is better suited to complex problems, such as sizing positions and managing risk and transaction costs. In fact, with only minor modifications, the model outlined in this thesis can optimize an agent trading a portfolio of arbitrary size. For this reason, reinforcement learning was chosen as the algorithmic trading agents' optimization framework. Temporal credit assignment is one of the main strengths of reinforcement learning, making it ideal for domains such as game playing and robotics, which involve long-term planning and delayed rewards. In this problem, however, temporal credit assignment does not apply, since trading is an immediate reward process. Furthermore, the complexity of reinforcement learning compared to supervised learning comes at a price. In addition to requiring an environment in which the agent can interact and learn to associate context with reward-maximizing actions, reinforcement learning adds complexity through the exploration-exploitation tradeoff. With more moving parts, reinforcement learning can be significantly more challenging to implement and tune and is generally less sample efficient than supervised learning. The learning process is also more convoluted, as the agent learns from reinforcement signals generated through stochastic interaction with the environment. Consequently, the model can display unstable behavior, where the policy diverges or the agent overfits to noise. For example, if the market experiences a sustained downward trend, the agent can be deceived into believing that the market will continue to decline indefinitely. As a result, the agent may adjust its policy to always short the market, which will have disastrous effects once the market reverses. This phenomenon is caused by the temporal correlation of the agent's sequential interactions with the market and by the sample inefficiency of reinforcement learning, which together make it difficult to obtain good gradient estimates. Replay memory can ensure that gradient estimates are derived from a wide variety of market conditions. However, replay memory introduces biased gradient estimates, which, according to the backtests, is a poor tradeoff. The timeliness-significance tradeoff further complicates the problem of obtaining suitable gradient estimates. A supervised learning framework is more straightforward and avoids much of the complexity associated with reinforcement learning. Thus, it is unclear whether reinforcement learning or supervised learning is the more appropriate optimization framework for algorithmic trading.
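For reference, a minimal replay memory of the kind discussed above is sketched below. Uniform sampling decorrelates consecutive market states, but the sampled transitions were generated under earlier policies, which is the source of the bias noted above. The capacity and batch size are arbitrary choices.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity buffer of (state, action, reward, next_state) transitions."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # old transitions are evicted first

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size=64):
        # Uniform sampling breaks the temporal correlation of consecutive bars,
        # at the cost of reusing transitions generated by older policies.
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```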
A further question is whether the widespread aversion to adopting black-box deep learning models in algorithmic trading is reasonable. Does understanding the model even matter as long as it delivers satisfactory backtest performance?
These questions can be answered by reviewing statistical learning theory. Generally, machine learning models are evaluated under the assumption that observations are drawn independently and identically distributed (IID) from the same underlying data-generating distribution. In this setting, the test error serves as a proxy for the generalization error, i.e., the expected error on new observations.
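In the notation used above, with hypothesis f and loss function L, this assumption licenses the standard approximation below (a textbook statistical-learning statement rather than an equation reproduced from earlier chapters):

```latex
\mathcal{E}_{\mathrm{gen}}(f)
  = \mathbb{E}_{(x,y)\sim F}\bigl[L\bigl(f(x),\,y\bigr)\bigr]
  \approx \frac{1}{n}\sum_{i=1}^{n} L\bigl(f(x_i),\,y_i\bigr),
  \qquad (x_i, y_i) \stackrel{\mathrm{iid}}{\sim} F .
```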
However, the dynamics of financial markets are constantly changing. Fierce competition creates a cycle in which market participants attempt to understand the underlying price dynamics; as their understanding improves, they adjust their trading strategies to exploit that knowledge, which in turn changes the market dynamics further. As a consequence, models that worked in the past may no longer work in the future as inefficiencies are arbitraged away.15 It is therefore important to be cautious when interpreting backtest errors as generalization errors, as it is unlikely that observations sampled at different points in time are drawn from the same probability distribution. Even if one disregards all the flaws of a backtest,16 the backtest, at best, only reflects performance on historical data. This is in no way intended to discourage backtesting, but naively interpreting backtest performance as an assurance of future performance is dangerous. Referring back to section 2.9, a backtest is a historical simulation of how the model would have performed had it been run over a past period. Even exceptional results from the most flawlessly executed backtest can never guarantee that the model generalizes to the current market. Furthermore, the results should be interpreted cautiously if no ex-ante logical foundation exists to explain them. Deep neural networks are highly susceptible to overfitting to random noise when trained on noisy financial time series, and it is difficult to determine whether the agent has detected a legitimate signal if the model is not interpretable. Even if the model has detected a legitimate signal in the backtests, other market participants may discover the same signal and render the model obsolete. Again, determining this is difficult without knowing which inefficiencies the model exploits, and continuing to deploy the model until it displays a sustained period of poor performance could prove costly.
In response to the question of whether understanding a model matters if it performs well on a backtest, the answer is an emphatic yes. Blindly taking the backtest performance of a black box as an assurance of future performance in a noisy and constantly changing environment can prove costly. Thus, the aversion to adopting deep learning in algorithmic trading is reasonable. Ensuring that trading decisions are explainable and models are interpretable is essential for commercial and regulatory acceptance. To address this challenge, models should be created with a certain degree of interpretability. This way, stakeholders can gain insight into which inefficiencies the model exploits, evaluate its generalizability, and identify its obsolescence before incurring significant losses. The use of deep learning in algorithmic trading can still be viable with techniques such as explainable AI and model monitoring.

15 Volatility, for example, used to be a reliable indicator of future returns [Isi21].
16 E.g., not accounting for market impact and lookahead bias.
9 Future work
The methods presented in this thesis leave room for improvement in further work. Evaluating the effectiveness of the existing methods in different contexts would enable a deeper understanding of the model and its generalizability and provide an opportunity to identify potential areas for improvement. Considering the lack of real-world market data, one option is to use generative adversarial networks (GANs) to generate synthetic markets [YJVdS19]. GANs can generate unlimited data, which can be used to train the model and test its generalizability. Additionally, the lack of established baselines could be addressed. While the buy-and-hold baseline is well understood and trusted, it is unrealistic in this context, as futures contracts expire. Although it presents its own challenges, developing a baseline more appropriate for futures trading would improve the evaluation of the current model. Furthermore, a greater level of interpretability is required to achieve real-world adoption. Therefore, combining algorithmic trading research with explainable AI is imperative to improve the interpretability of existing methods.
Incorporating non-traditional data sources, such as social media sentiment or
satellite images, may prove beneficial when forecasting market returns. Alterna-
tive data can provide a more comprehensive and holistic view of market trends
and dynamics, allowing for more accurate predictions. By leveraging alternative
data, algorithmic trading agents can gain an edge over their competitors and
make better-informed decisions. Using deep learning techniques such as natural
language processing and computer vision to analyze text or image data in an al-
gorithmic trading context is promising. Neural networks are generally effective
in problems with complex structures and high signal-to-noise ratios. Thus, it
may be more appropriate to use deep learning to extract features from images
or text rather than analyzing price series.
Lastly, the methods presented in this thesis are limited to trading a single
instrument. They are, however, compatible with portfolio optimization with
minimal modifications. Further research in this area would be interesting, as
it better utilizes the potential of the reinforcement learning framework and the
scalability of data-driven decision-making.
10 Conclusion
This thesis investigates the effectiveness of deep reinforcement learning methods
in commodities trading. Previous research in algorithmic trading, state-of-the-
art reinforcement learning, and deep learning algorithms was examined, and the
most promising methods were implemented and tested. This chapter summa-
rizes the thesis’ most important contributions, results, and conclusions.
This thesis formalizes the commodities trading problem as a continuing
discrete-time stochastic dynamical system. The system employs a novel time-
discretization scheme that is reactive and adaptive to market volatility, provid-
ing better statistical properties of the sub-sampled financial time series. Two
policy gradient algorithms, an actor-based and an actor-critic-based, are pro-
posed to optimize a transaction-cost- and risk-sensitive agent. Reinforcement
learning agents parameterized using deep neural networks, specifically CNNs
and LSTMs, are used to map observations of historical prices to market posi-
tions.
The models are backtested on front-month TTF Natural Gas futures contracts from 01-01-2017 to 01-01-2023. The backtest results indicate the viability of deep reinforcement learning agents in commodities trading. On average, the deep reinforcement learning agents produce an 83% higher Sharpe
ratio out-of-sample than the buy-and-hold baseline. The backtests suggest that
deep RL models can adapt to the unprecedented volatility caused by the en-
ergy crisis during 2021-2022. Introducing a risk-sensitivity term functioning as
a trade-off hyperparameter between risk and reward produces satisfactory re-
sults, where the agents reduce risk as the risk-sensitivity term increases. The
risk-sensitivity term allows the stakeholder to control the risk of an algorithmic
trading agent in volatile markets. The direct policy gradient algorithm produces a significantly higher Sharpe ratio (49% higher on average) than the deterministic actor-critic algorithm, suggesting that an actor-based policy gradient method is better suited to algorithmic trading in an online, continuous time setting. The parametric function approximator based on the CNN architecture performs slightly better (5% higher Sharpe ratio on average) than the one based on the LSTM, possibly due to the problem of vanishing gradients in the LSTM.
The algorithmic trading problem is made analytically tractable by simpli-
fying assumptions that remove market frictions. Performance may be inflated
due to these assumptions and should be viewed with a high degree of caution.
Acronyms
AC Actor-Critic. 41
AMH Adaptive Market Hypothesis. 7
ANN Artificial Neural Network. 19
DL Deep Learning. 29
DNN Deep Neural Network. 19
DQN Deep Q-Network. 39
DRQN Deep Recurrent Q-Network. 39
PG Policy Gradient. 39
POMDP Partially Observable Markov Decision Process. 36
References
[AAC+ 19] Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz
Litwin, Bob McGrew, Arthur Petron, Alex Paino, Matthias
Plappert, Glenn Powell, Raphael Ribas, et al. Solving Rubik's Cube with a robot hand. arXiv preprint arXiv:1910.07113, 2019.
[AG00] Thierry Ané and Hélyette Geman. Order flow, transaction
clock, and normality of asset returns. The Journal of Finance,
55(5):2259–2284, 2000.
[AHM19] Rob Arnott, Campbell R Harvey, and Harry Markowitz. A back-
testing protocol in the era of machine learning. The Journal of
Financial Data Science, 1(1):64–74, 2019.
[BHMM19] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal.
Reconciling modern machine-learning practice and the classical
bias–variance trade-off. Proceedings of the National Academy of
Sciences, 116(32):15849–15854, 2019.
[BSF94] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning
long-term dependencies with gradient descent is difficult. IEEE
transactions on neural networks, 5(2):157–166, 1994.
[CGLL19] Raymond H Chan, Yves ZY Guo, Spike T Lee, and Xun Li.
Financial Mathematics, Derivatives and Structured Products.
Springer, 2019.
[DBK+ 16] Yue Deng, Feng Bao, Youyong Kong, Zhiquan Ren, and Qiong-
hai Dai. Deep direct reinforcement learning for financial signal
representation and trading. IEEE transactions on neural net-
works and learning systems, 28(3):653–664, 2016.
[DHS11] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgra-
dient methods for online learning and stochastic optimization.
Journal of machine learning research, 12(7), 2011.
[DP18] Marcos Lopez De Prado. Advances in financial machine learn-
ing. John Wiley & Sons, 2018.
[HGMS18] Ma Hiransha, E Ab Gopalakrishnan, Vijay Krishna Menon, and
KP Soman. NSE stock market prediction using deep-learning models. Procedia Computer Science, 132:1351–1362, 2018.
[Hor91] Kurt Hornik. Approximation capabilities of multilayer feedfor-
ward networks. Neural networks, 4(2):251–257, 1991.
[LHP+ 15] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nico-
las Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wier-
stra. Continuous control with deep reinforcement learning. arXiv
preprint arXiv:1509.02971, 2015.
[Lo04] Andrew W Lo. The adaptive markets hypothesis. The Journal
of Portfolio Management, 30(5):15–29, 2004.
[LPW+ 17] Zhou Lu, Hongming Pu, Feicheng Wang, Zhiqiang Hu, and Liwei
Wang. The expressive power of neural networks: A view from
the width. Advances in neural information processing systems,
30, 2017.
[LXZ+ 18] Xiao-Yang Liu, Zhuoran Xiong, Shan Zhong, Hongyang Yang,
and Anwar Walid. Practical deep reinforcement learning ap-
proach for stock trading. arXiv preprint arXiv:1811.07522, 2018.
[Man97] Benoit B Mandelbrot. The variation of certain speculative prices.
In Fractals and scaling in finance, pages 371–418. Springer, 1997.
[Mar68] Harry M Markowitz. Portfolio selection. In Portfolio selection.
Yale university press, 1968.
[MBM+ 16] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza,
Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and
Koray Kavukcuoglu. Asynchronous methods for deep reinforce-
ment learning. In International conference on machine learning,
pages 1928–1937. PMLR, 2016.
[MKS+ 13] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex
Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Ried-
miller. Playing atari with deep reinforcement learning. arXiv
preprint arXiv:1312.5602, 2013.
[MM97] Tom M Mitchell. Machine learning, volume 1. McGraw-Hill, New York, 1997.
[MRC18] Sean McNally, Jason Roche, and Simon Caton. Predicting the
price of bitcoin using machine learning. In 2018 26th euromi-
cro international conference on parallel, distributed and network-
based processing (PDP), pages 339–343. IEEE, 2018.
[MS01] John Moody and Matthew Saffell. Learning to trade via di-
rect reinforcement. IEEE transactions on neural networks,
12(4):875–889, 2001.
[MSA22] S Makridakis, E Spiliotis, and V Assimakopoulos. The M5 accuracy competition: Results, findings and conclusions. 2020. URL: https://www.researchgate.net/publication/344487258_The_M5_Accuracy_competition_Results_findings_and_conclusions, 2022.
[MT67] Benoit Mandelbrot and Howard M Taylor. On the distribution
of stock price differences. Operations research, 15(6):1057–1062,
1967.
[MW97] John Moody and Lizhong Wu. Optimization of trading systems
and portfolios. In Proceedings of the IEEE/IAFE 1997 com-
putational intelligence for financial engineering (CIFEr), pages
300–307. IEEE, 1997.
[MWLS98] John Moody, Lizhong Wu, Yuansong Liao, and Matthew Saffell.
Performance functions and reinforcement learning for trading
systems and portfolios. Journal of Forecasting, 17(5-6):441–470,
1998.
[Nar13] Rishi K Narang. Inside the black box: A simple guide to quan-
titative and high frequency trading, volume 846. John Wiley &
Sons, 2013.
[Pet22] Martin Peterson. The St. Petersburg Paradox. In Edward N.
Zalta, editor, The Stanford Encyclopedia of Philosophy. Meta-
physics Research Lab, Stanford University, Summer 2022 edi-
tion, 2022.
[Pla22] Aske Plaat. Deep Reinforcement Learning. Springer, 2022.
[PMR+ 17] Tomaso Poggio, Hrushikesh Mhaskar, Lorenzo Rosasco, Brando
Miranda, and Qianli Liao. Why and when can deep-but not
shallow-networks avoid the curse of dimensionality: a review.
International Journal of Automation and Computing, 14(5):503–
519, 2017.
[RHW85] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams.
Learning internal representations by error propagation. Techni-
cal report, California Univ San Diego La Jolla Inst for Cognitive
Science, 1985.
[SB18] Richard S Sutton and Andrew G Barto. Reinforcement learning:
An introduction. MIT press, 2018.
[SGO20] Omer Berat Sezer, Mehmet Ugur Gudelek, and Ahmet Murat
Ozbayoglu. Financial time series forecasting with deep learn-
ing: A systematic literature review: 2005–2019. Applied soft
computing, 90:106181, 2020.
[Sha98] William F Sharpe. The sharpe ratio. Streetwise–the Best of the
Journal of Portfolio Management, pages 169–185, 1998.
[SHK+ 14] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya
Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way
to prevent neural networks from overfitting. The journal of ma-
chine learning research, 15(1):1929–1958, 2014.
[SHM+ 16] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Lau-
rent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioan-
nis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al.
Mastering the game of go with deep neural networks and tree
search. Nature, 529(7587):484–489, 2016.
[SHS+ 17] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis
Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Lau-
rent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering
chess and shogi by self-play with a general reinforcement learning
algorithm. arXiv preprint arXiv:1712.01815, 2017.
[SLH+ 14] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan
Wierstra, and Martin Riedmiller. Deterministic policy gradient
algorithms. In International conference on machine learning,
pages 387–395. PMLR, 2014.
[SNN18] Sima Siami-Namini and Akbar Siami Namin. Forecasting eco-
nomics and financial time series: ARIMA vs. LSTM. arXiv preprint
arXiv:1803.06386, 2018.
[SRH20] Simon R Sinsel, Rhea L Riemke, and Volker H Hoffmann. Chal-
lenges and solution technologies for the integration of variable re-
newable energy sources—a review. Renewable Energy, 145:2271–
2285, 2020.
[SSS00] Robert H Shumway and David S Stoffer. Time
series analysis and its applications, volume 3. Springer, 2000.
[SSS+ 17] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis
Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas
Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of
go without human knowledge. Nature, 550(7676):354–359, 2017.
[SVL14] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to se-
quence learning with neural networks. Advances in neural infor-
mation processing systems, 27, 2014.
[SWD+ 17] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford,
and Oleg Klimov. Proximal policy optimization algorithms.
arXiv preprint arXiv:1707.06347, 2017.
[Tal97] Nassim Nicholas Taleb. Dynamic hedging: managing vanilla and
exotic options, volume 64. John Wiley & Sons, 1997.
[TOdSMJZ20] Danilo Tedesco-Oliveira, Rouverson Pereira da Silva, Walter
Maldonado Jr, and Cristiano Zerbato. Convolutional neural
networks in predicting cotton yield from images of commercial
fields. Computers and Electronics in Agriculture, 171:105307,
2020.
[VSP+ 17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,
Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polo-
sukhin. Attention is all you need. Advances in neural informa-
tion processing systems, 30, 2017.
[XNS15] Ruoxuan Xiong, Eric P Nichols, and Yuan Shen. Deep learn-
ing stock volatility with google domestic trends. arXiv preprint
arXiv:1512.04916, 2015.
[YJVdS19] Jinsung Yoon, Daniel Jarrett, and Mihaela Van der Schaar.
Time-series generative adversarial networks. Advances in neural
information processing systems, 32, 2019.
[YLZW20] Hongyang Yang, Xiao-Yang Liu, Shan Zhong, and Anwar Walid.
Deep reinforcement learning for automated stock trading: An
ensemble strategy. In Proceedings of the first ACM international
conference on AI in finance, pages 1–8, 2020.
[ZBH+ 21] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht,
and Oriol Vinyals. Understanding deep learning (still) re-
quires rethinking generalization. Communications of the ACM,
64(3):107–115, 2021.
[ZS10] Wenbin Zhang and Steven Skiena. Trading strategies to exploit
blog and news sentiment. In Proceedings of the international
AAAI conference on web and social media, volume 4, pages 375–
378, 2010.
[Zuc19] Gregory Zuckerman. The man who solved the market: How Jim
Simons launched the quant revolution. Penguin, 2019.
[ZZR20] Zihao Zhang, Stefan Zohren, and Stephen Roberts. Deep rein-
forcement learning for trading. The Journal of Financial Data
Science, 2(2):25–40, 2020.
[ZZW+ 20] Yifan Zhang, Peilin Zhao, Qingyao Wu, Bin Li, Junzhou Huang,
and Mingkui Tan. Cost-sensitive portfolio selection via deep
reinforcement learning. IEEE Transactions on Knowledge and
Data Engineering, 34(1):236–248, 2020.