
Deep Policy Gradient Methods

in Commodity Markets

Jonas Rotschi Hanetho

Thesis submitted for the degree of


Master in Informatics: Programming and System
Architecture
60 credits

Institute for Informatics


Faculty of Mathematics and Natural Sciences

UNIVERSITY OF OSLO

Spring 2023
Deep Policy Gradient Methods in Commodity Markets

Jonas Rotschi Hanetho


© 2023 Jonas Rotschi Hanetho

Deep Policy Gradient Methods in Commodity Markets

http://www.duo.uio.no/

Printed: Reprosentralen, University of Oslo


Acknowledgements
This thesis would not have been possible without my supervisors, Dirk Hesse
and Martin Giese. My sincere thanks are extended to Dirk for his excellent
guidance and mentoring throughout this project and to Martin for his helpful
suggestions and advice. Finally, I would like to thank the Equinor data science
team for insightful discussions and for providing me with the tools needed to
complete this project.

Abstract
The energy transition has increased the reliance on intermittent energy sources,
destabilizing energy markets and causing unprecedented volatility, culminating
in the global energy crisis of 2021. In addition to harming producers and con-
sumers, volatile energy markets may jeopardize vital decarbonization efforts.
Traders play an important role in stabilizing markets by providing liquidity and
reducing volatility. Forecasting future returns is an integral part of any financial
trading operation, and several mathematical and statistical models have been
proposed for this purpose. However, developing such models is non-trivial due
to financial markets’ low signal-to-noise ratios and nonstationary dynamics.
This thesis investigates the effectiveness of deep reinforcement learning meth-
ods in commodities trading. It presents related work and relevant research in
algorithmic trading, deep learning, and reinforcement learning. The thesis for-
malizes the commodities trading problem as a continuing discrete-time stochas-
tic dynamical system. This system employs a novel time-discretization scheme
that is reactive and adaptive to market volatility, providing better statistical
properties for the sub-sampled financial time series. Two policy gradient algo-
rithms, an actor-based and an actor-critic-based, are proposed for optimizing
a transaction-cost- and risk-sensitive trading agent. The agent maps historical
price observations to market positions through parametric function approxima-
tors utilizing deep neural network architectures, specifically CNNs and LSTMs.
On average, the deep reinforcement learning models produce an 83 percent
higher Sharpe ratio than the buy-and-hold baseline when backtested on front-
month natural gas futures from 2017 to 2022. The backtests demonstrate that
the risk tolerance of the deep reinforcement learning agents can be adjusted us-
ing a risk-sensitivity term. The actor-based policy gradient algorithm performs
significantly better than the actor-critic-based algorithm, and the CNN-based
models perform slightly better than those based on the LSTM. The backtest
results indicate the viability of deep reinforcement learning-based algorithmic
trading in volatile commodity markets.

List of Figures
2.1 Time series cross-validation (backtesting) compared to standard
cross-validation from [HA18]. . . . . . . . . . . . . . . . . . . . . 13
3.1 An illustration of the relationship between the capacity of a func-
tion approximator and the generalization error from [GBC16]. . . 19
3.2 Feedforward neural network from [FFLb]. . . . . . . . . . . . . . 20
3.3 ReLU activation function from [FFLb]. . . . . . . . . . . . . . . . 25
3.4 Leaky ReLU activation function from [FFLb]. . . . . . . . . . . . 26
3.5 An example of the effect of weight decay with parameter λ on a
high-dimensional polynomial regression model from [GBC16]. . . 27
3.6 3D convolutional layer from [FFLc]. . . . . . . . . . . . . . . . . 30
3.7 Recurrent neural network from [FFLa]. . . . . . . . . . . . . . . . 31
3.8 LSTM cell from [FFLa]. . . . . . . . . . . . . . . . . . . . . . . . 33
4.1 Agent-environment interaction from [SB18]. . . . . . . . . . . . . 35
7.1 Policy network architecture . . . . . . . . . . . . . . . . . . . . . 58
7.2 Q-network architecture . . . . . . . . . . . . . . . . . . . . . . . . 58
7.3 Convolutional sequential information layer architecture . . . . . . 60
7.4 LSTM sequential information layer architecture . . . . . . . . . . 61
8.1 The training-validation-test split . . . . . . . . . . . . . . . . . . 68
8.2 Cumulative logarithmic trade returns for λσ = 0 . . . . . . . . . 70
8.3 Boxplot of monthly logarithmic trade returns for λσ = 0 . . . . . 70
8.4 Cumulative logarithmic trade returns for λσ = 0.01 . . . . . . . . 71
8.5 Boxplot of monthly logarithmic trade returns for λσ = 0.01 . . . 71
8.6 Cumulative logarithmic trade returns for λσ = 0.1 . . . . . . . . 72
8.7 Boxplot of monthly logarithmic trade returns for λσ = 0.1 . . . . 72
8.8 Cumulative logarithmic trade returns for λσ = 0.2 . . . . . . . . 73
8.9 Boxplot of monthly logarithmic trade returns for λσ = 0.2 . . . . 73

List of Tables
1 Hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
2 Backtest results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem description . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . 2

I Background 4
2 Algorithmic trading 5
2.1 Commodity markets . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Financial trading . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Modern portfolio theory . . . . . . . . . . . . . . . . . . . . . . . 6
2.4 Efficient market hypothesis . . . . . . . . . . . . . . . . . . . . . 7
2.5 Forecasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.6 Mapping forecasts to market positions . . . . . . . . . . . . . . . 9
2.7 Feature engineering . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.8 Sub-sampling schemes . . . . . . . . . . . . . . . . . . . . . . . . 11
2.9 Backtesting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

3 Deep learning 16
3.1 Machine learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1.1 No-free-lunch theorem . . . . . . . . . . . . . . . . . . . . 17
3.1.2 The curse of dimensionality . . . . . . . . . . . . . . . . . 17
3.2 Supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2.1 Function approximation . . . . . . . . . . . . . . . . . . . 17
3.3 Artificial neural networks . . . . . . . . . . . . . . . . . . . . . . 19
3.3.1 Feedforward neural networks . . . . . . . . . . . . . . . . 19
3.3.2 Parameter initialization . . . . . . . . . . . . . . . . . . . 20
3.3.3 Gradient-based learning . . . . . . . . . . . . . . . . . . . 21
3.3.4 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.5 Activation function . . . . . . . . . . . . . . . . . . . . . . 25
3.3.6 Regularization . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.7 Batch normalization . . . . . . . . . . . . . . . . . . . . . 27
3.3.8 Universal approximation theorem . . . . . . . . . . . . . . 28
3.3.9 Deep neural networks . . . . . . . . . . . . . . . . . . . . 29
3.3.10 Convolutional neural networks . . . . . . . . . . . . . . . 29
3.3.11 Recurrent neural networks . . . . . . . . . . . . . . . . . . 30

4 Reinforcement learning 34
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2 Markov decision process . . . . . . . . . . . . . . . . . . . . . . . 34
4.2.1 Infinite Markov decision process . . . . . . . . . . . . . . 36
4.2.2 Partially observable Markov decision process . . . . . . . 36
4.3 Rewards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

4.4 Value function and policy . . . . . . . . . . . . . . . . . . . . . . 38
4.5 Function approximation . . . . . . . . . . . . . . . . . . . . . . . 39
4.6 Policy gradient methods . . . . . . . . . . . . . . . . . . . . . . . 39
4.6.1 REINFORCE . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.6.2 Actor-critic . . . . . . . . . . . . . . . . . . . . . . . . . . 41

II Methodology 44
5 Problem Setting 45
5.1 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2 Time Discretization . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.3 State Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.4 Action Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.5 Reward Function . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

6 Reinforcement learning algorithm 50


6.1 Immediate reward environment . . . . . . . . . . . . . . . . . . . 50
6.2 Direct policy gradient . . . . . . . . . . . . . . . . . . . . . . . . 51
6.3 Deterministic actor-critic . . . . . . . . . . . . . . . . . . . . . . 52

7 Network topology 56
7.1 Network input . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
7.2 Policy network . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
7.3 Q-network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
7.4 Sequential information layer . . . . . . . . . . . . . . . . . . . . . 59
7.4.1 Convolutional neural network . . . . . . . . . . . . . . . . 60
7.4.2 Long Short-Term Memory . . . . . . . . . . . . . . . . . . 60
7.5 Decision-making layer . . . . . . . . . . . . . . . . . . . . . . . . 61
7.6 Network optimization . . . . . . . . . . . . . . . . . . . . . . . . 61
7.6.1 Regularization . . . . . . . . . . . . . . . . . . . . . . . . 62

III Experiments 64
8 Experiment and Results 65
8.1 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . 65
8.1.1 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
8.1.2 Hyperparameters . . . . . . . . . . . . . . . . . . . . . . . 66
8.1.3 Training scheme . . . . . . . . . . . . . . . . . . . . . . . 66
8.1.4 Performance metrics . . . . . . . . . . . . . . . . . . . . . 67
8.1.5 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
8.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
8.3 Discussion of results . . . . . . . . . . . . . . . . . . . . . . . . . 74
8.3.1 Risk/reward . . . . . . . . . . . . . . . . . . . . . . . . . . 74
8.3.2 RL models . . . . . . . . . . . . . . . . . . . . . . . . . . 75

8.3.3 Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
8.4 Discussion of model . . . . . . . . . . . . . . . . . . . . . . . . . 77
8.4.1 Environment . . . . . . . . . . . . . . . . . . . . . . . . . 77
8.4.2 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . 78
8.4.3 Interpretability and trust . . . . . . . . . . . . . . . . . . 79

9 Future work 82

10 Conclusion 83

List of Abbreviations 84

1 Introduction
1.1 Motivation
The transition to sustainable energy sources is one of the most critical challenges
facing the world today. By 2050, the European Union aims to become carbon
neutral [eur]. However, rising volatility in energy markets, culminating in the
2021 global energy crisis, complicates this objective. Supply and demand forces
determine price dynamics, where an ever-increasing share of supply stems from
intermittent renewable energy sources such as wind and solar power. Increasing
reliance on intermittent energy sources leads to unpredictable energy supply,
contributing to volatile energy markets [SRH20]. Already volatile markets are
further destabilized by evolutionary traits such as fear and greed, causing human
commodity traders to overreact [Lo04]. Volatile markets are problematic for
producers and consumers, and failure to mitigate these concerns may jeopardize
decarbonization targets.
Algorithmic trading agents can stabilize commodity markets by systemati-
cally providing liquidity and aiding price discovery [Isi21, Nar13]. Developing
these methods is non-trivial as financial markets are non-stationary with com-
plicated dynamics [Tal97]. Machine learning (ML) has emerged as the preferred
method in algorithmic trading due to its ability to learn to solve complicated
tasks by leveraging data [Isi21]. The majority of research on ML-based algo-
rithmic trading has focused on forecast-based supervised learning (SL) meth-
ods, which tend to ignore non-trivial factors such as transaction costs, risk,
and the additional logic associated with mapping forecasts to market positions
[Fis18]. Reinforcement learning (RL) presents a suitable alternative to account
for these factors. In reinforcement learning, autonomous agents learn to per-
form tasks in a time-series environment through trial and error without human
supervision. Around the turn of the millennium, Moody and his collaborators
[MW97, MWLS98, MS01] made several significant contributions to this field,
empirically demonstrating the advantages of reinforcement learning over super-
vised learning for algorithmic trading.
In the last decade, the deep learning (DL) revolution has made exceptional
progress in areas such as image classification [HZRS15] and natural language
processing [VSP+ 17], characterized by complex structures and high signal-to-
noise ratios. The strong representation ability of deep learning methods has even
translated to forecasting low signal-to-noise financial data [XNS15, HGMS18,
MRC18]. In complex, high-dimensional environments, deep reinforcement learn-
ing (deep RL), i.e., integrating deep learning techniques into reinforcement
learning, has yielded impressive results. Noteworthy contributions include achiev-
ing superhuman play in Atari games [MKS+ 13] and chess [SHS+ 17], and training
a robot arm to solve the Rubik’s cube [AAC+ 19]. A significant breakthrough
was achieved in 2016 when the deep reinforcement learning-based computer pro-
gram AlphaGo[SHM+ 16] beat top Go player Lee Sedol. In addition to learning
by reinforcement learning through self-play, AlphaGo uses supervised learning
techniques to learn from a database of historical games. In 2017, an improved

version called AlphaGo Zero[SSS+ 17], which begins with random play and relies
solely on reinforcement learning, comprehensively defeated AlphaGo. Deep RL
has thus far been primarily studied in the context of game-playing and robotics,
and its potential application to financial trading remains largely unexplored.
Combining the two seems promising, given the respective successes of reinforce-
ment learning and deep learning in algorithmic trading and forecasting.

1.2 Problem description


This thesis investigates the effectiveness of deep reinforcement learning methods
in commodities trading. It examines previous research in algorithmic trading,
state-of-the-art reinforcement learning, and deep learning algorithms. The most
promising methods are implemented, along with novel improvements, to create
a transaction-cost- and risk-sensitive parameterized agent directly outputting
market positions. The agent is optimized using reinforcement learning algo-
rithms, while deep learning methods extract predictive patterns from raw mar-
ket observations. These methods are evaluated out-of-sample by backtesting on
energy futures.
Machine learning relies on generalizability. A common criticism against algo-
rithmic trading approaches is their alleged inability to generalize to “extreme”
market conditions [Nar13]. This thesis investigates the performance of algo-
rithmic trading agents out-of-sample under unprecedented market conditions
caused by the energy crisis during 2021-2022. It will address the following research questions:
1. Can the risk of algorithmic trading agents operating in volatile markets
be controlled?
2. What reinforcement learning algorithms are suitable for optimizing an
algorithmic trading agent in an online, continuous-time setting?
3. What deep learning architectures are suitable for modeling noisy, non-
stationary financial data?

1.3 Thesis Organization


The thesis consists of three parts: the background (part I), the methodology
(part II), and the experiments (part III). The list below provides a brief outline
of the chapters in this thesis:

• Chapter 2: Overview of relevant concepts in algorithmic trading.


• Chapter 3: Overview of relevant machine learning and deep learning
concepts.
• Chapter 4: Overview of relevant concepts in reinforcement learning.

• Chapter 5: Formalization of the problem setting.

• Chapter 6: Description of reinforcement learning algorithms.
• Chapter 7: Description of the neural network function approximators.
• Chapter 8: Detailed results from experiments.
• Chapter 9: Suggested future work.

• Chapter 10: Summary of contributions, results, and main conclusions.

Part I
Background

2 Algorithmic trading
A phenomenon commonly described as an arms race has resulted from fierce
competition in financial markets. In this phenomenon, market participants com-
pete to remain on the right side of information asymmetry, which further reduces
the signal-to-noise ratio and the frequency at which information arrives and is
absorbed by the market [Isi21]. An increase in volatility and the emergence of a
highly sophisticated breed of traders called high-frequency traders have further
complicated already complex market dynamics. In these developed, modern
financial markets, the dynamics are so complex and change at such a high fre-
quency that humans will have difficulty competing. Indeed, there is reason to
believe that machines already outperform humans in the world of financial trad-
ing. The algorithmic hedge fund Renaissance Technologies, founded by famed
mathematician Jim Simons, is considered the most successful hedge fund ever.
From 1988 to 2018, Renaissance Technologies’ Medallion fund generated 66 per-
cent annualized returns before fees relying exclusively on algorithmic strategies
[Zuc19]. In 2020, it was estimated that algorithmic trading accounts for around
60-73 percent of U.S. and European equity trading, up from just 15 percent in
2003 [Int20]. Thus, it is clear that algorithms already play a significant role in
financial markets. Due to the rapid progress of computing power1 relative to
human evolution, this importance will likely only grow.
This chapter provides an overview of this thesis’s subject matter, algorith-
mic trading on commodity markets, examines related work, and justifies the
algorithmic trading methods described in part II. Section 2.1 presents a brief
overview of commodity markets and energy futures contracts. Sections 2.2, 2.3,
and 2.4 introduce some basic concepts related to trading financial markets that
are necessary to define and justify a trading agent’s goal of maximizing risk-
adjusted returns. This goal has two sub-goals: forecasting returns and mapping
forecasts to market positions, which are discussed separately in sections 2.5 and
2.6. Additionally, these sections provide an overview of how the concepts intro-
duced in the following chapters 3 and 4 can be applied to algorithmic trading
and provide an overview of related research. The sections 2.7 and 2.8 describe
how to represent a continuous financial market as discrete inputs to an algorith-
mic trading system. To conclude, section 2.9 introduces backtesting, a form of
cross-validation used to evaluate algorithmic trading agents.

2.1 Commodity markets


Energy products trade alongside other raw materials and primary products on
commodity markets. The commodity market is an exchange that matches buy-
ers and sellers of the products offered on the market. Traditionally, trading was
done in an open-outcry manner, though now an electronic limit order book is
used to maintain a continuous market. Limit orders specify the direction, quan-
tity, and acceptable price of a security. Limit orders are compared to existing
1 Moore's law states that the number of transistors in an integrated circuit doubles roughly every two years.
orders in the limit order book when they arrive on the market. A trade occurs
at the price set by the first order in the event of an overlap. The exchange
makes money by charging a fee for every trade, usually a small percentage of
the total amount traded.
The basis of energy trade is energy futures, a derivative contract with energy
products as the underlying asset [CGLL19]. Futures contracts are standardized
forward contracts listed on stock exchanges. They are interchangeable, which
improves liquidity. Futures contracts obligate a buyer and seller to transact a
given quantity of the underlying asset at a future date and price. The quantity,
quality, delivery location, and delivery date are all specified in the contract.
Futures contracts are usually identified by their expiration month. The “front-
month” is the nearest expiration date and usually represents the most liquid
market. Natural gas futures expire three business days before the first calendar
day of the delivery month. To avoid physical delivery of the underlying com-
modity, the contract holder must sell their holding to the market before expiry.
Therefore, the futures and underlying commodity prices converge as the deliv-
ery date approaches. A futures contract achieves the same outcome as buying
a commodity on the spot market on margin and storing it for a future date.
The relative price of these alternatives is connected as it presents an arbitrage
opportunity. The difference in price between a futures contract and the spot
price of the underlying commodity will therefore depend on the financing cost,
storage cost, and convenience yield of holding the physical commodity over the
futures contract. Physical traders use futures as a hedge while transporting
commodities from producer to consumer. If a trader wishes to extend the ex-
piry of his futures contract, he can “roll” the contract by closing the contract
about to expire and entering into a contract with the same terms but a later
expiry date [CGLL19]. The “roll yield” is the difference in price for these two
contracts and might be positive or negative. The exchange clearinghouse uses a
margin system with daily settlements between parties to mitigate counterparty
risk [CGLL19].

2.2 Financial trading


Financial trading is the act of buying and selling financial assets. Owning a
financial asset is called being long that asset, which will realize a profit if the
asset price increases and suffer a loss if the asset price decreases. Short-selling
refers to borrowing, selling, and then, at a later time, repurchasing a financial
asset and returning it to the lender with the hopes of profiting from a price drop
during the loan term. Short-selling allows traders to profit from falling prices.

2.3 Modern portfolio theory


Harry Markowitz laid the groundwork for what is known as Modern Portfo-
lio Theory (MPT) [Mar68]. MPT assumes that investors are risk-averse and
advocates maximizing risk-adjusted returns. The Sharpe ratio [Sha98], developed by economist William F. Sharpe, is the most widely used measure of risk-adjusted return. It compares excess return with the standard deviation of investment returns and is defined as

Sharpe ratio = E[r_t − r̄] / sqrt(Var[r_t − r̄]) ≃ E[r_t] / σ_{r_t}    (2.1)

where E[r_t] is the expected return over T samples, r̄ is the risk-free rate, and σ_{r_t} > 0 is the standard deviation of the portfolio's excess return. Due to negligibly low interest rates, the risk-free rate is commonly set to r̄ = 0. The
philosophy of MPT is that the investor should be compensated through higher
returns for taking on higher risk. The St. Petersburg paradox2 illustrates why
maximizing expected reward in a risk-neutral manner might not be what an in-
dividual wants. Although market participants have wildly different objectives,
this thesis will adopt the MPT philosophy of assuming investors want to maxi-
mize risk-adjusted returns. Hence, the goal of the trading agent described in this
thesis will be to maximize the risk-adjusted returns represented by the Sharpe
ratio. Maximizing future risk-adjusted returns can be broken down into two
sub-goals: forecasting future returns and mapping the forecast to market posi-
tions. However, doing so in highly efficient and competitive financial markets is
non-trivial.
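For illustration, a minimal Python sketch of equation (2.1) applied to a series of periodic returns follows; the annualization factor (252 trading days) and the zero default risk-free rate are assumptions for the example, not prescriptions from this section.

import numpy as np

def sharpe_ratio(returns, risk_free_rate=0.0, periods_per_year=252):
    """Annualized Sharpe ratio of a series of periodic returns (equation 2.1)."""
    excess = np.asarray(returns, dtype=float) - risk_free_rate
    std = excess.std(ddof=1)          # sample standard deviation of excess returns
    if std == 0.0:
        return 0.0
    return np.sqrt(periods_per_year) * excess.mean() / std

# Example: daily returns of a hypothetical strategy.
daily_returns = np.random.default_rng(0).normal(5e-4, 1e-2, size=252)
print(sharpe_ratio(daily_returns))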

2.4 Efficient market hypothesis


Actively trading a market suggests that the trader is dissatisfied with market
returns and believes there is potential for extracting excess returns, or alpha.
Most academic approaches to finance are based on the Efficient Market Hypoth-
esis (EMH) [Fam70], which states that all available information is fully reflected
in the prices of financial assets at any time. According to the EMH, a financial
market is a stochastic martingale process. As a consequence, searching for alpha
is a futile effort, as the expected future price of a non-dividend-paying asset is its present value, regardless of past information, i.e.,

E[R_{t+1} | I_t] = R_t    (2.2)

Practitioners and certain parts of academia heavily dispute the EMH. Be-
havioral economists reject the idea of rational markets and believe that human
evolutionary traits such as fear and greed distort market participants’ deci-
sions, creating irrational markets. The Adaptive Market Hypothesis (AMH)
[Lo04] reconciles the efficient market hypothesis with behavioral economics by
applying evolution principles (competition, adaptation, and natural selection)
to financial interactions. According to the AMH, what behavioral economists
label irrational behavior is consistent with an evolutionary model of individuals
adapting to a changing environment. Individuals within the market are contin-
ually learning the market dynamics, and as they do, they adapt their trading
strategies, which in turn changes the dynamics of the market. This loop creates complicated price dynamics. Traders who adapt quickly to changing dynamics can exploit potential inefficiencies. Based on the AMH philosophy, this thesis hypothesizes that there are inefficiencies in financial markets that can be exploited, with the recognition that such opportunities are limited and challenging to discover.
2 For an explanation of the paradox, see the article [Pet22].

2.5 Forecasting
Unless a person is gambling, betting on the price movements of volatile financial
assets only makes sense if the trader has a reasonable idea of where the price
is moving. Since traders face non-trivial transaction costs, the expected value
of a randomly selected trade is negative. Hence, as described by the gambler’s
ruin, a person gambling on financial markets will eventually go bankrupt due to
the law of large numbers. Forecasting price movements, i.e., making predictions
based on past and present data, is a central component of any financial trading
operation and an active field in academia and industry. Traditional approaches
include fundamental analysis, technical analysis, or a combination of the two
[GD34]. These can be further simplified into qualitative and quantitative ap-
proaches (or a combination). A qualitative approach, i.e., fundamental analysis,
entails evaluating the subjective aspects of a security [GD34], which falls outside
the scope of this thesis. Quantitative (i.e., technical) traders use past data to
make predictions [GD34]. The focus of this thesis is limited to fully quantitative
approaches.
Developing quantitative forecasts for the price series of financial assets is non-
trivial as financial markets are non-stationary with a low signal-to-noise ratio
[Tal97]. Furthermore, modern financial markets are highly competitive and efficient. As a result, easily detectable signals are almost certainly arbitraged out.
Researchers and practitioners use several mathematical and statistical models to
identify predictive signals leading to excess returns. Classical models include the
autoregressive integrated moving average (ARIMA) and the generalized autore-
gressive conditional heteroskedasticity (GARCH). The ARIMA is a linear model
and a generalization of the autoregressive moving average (ARMA) that can be
applied to time series with nonstationary mean (but not variance) [SSS00]. The
assumption of constant variance (i.e., volatility) is not valid for financial markets
where volatility is stochastic [Tal97]. The GARCH is a non-linear model devel-
oped to handle stochastic variance by modeling the error variance as an ARMA
model [SSS00]. Although the ARIMA and GARCH have practical applications,
their performance in modeling financial time series is generally unsatisfactory
[XNS15, MRC18].
Over the past 20 years, the availability and affordability of computing power,
storage, and data have lowered the barrier of entry to more advanced algo-
rithmic methods. As a result, researchers and practitioners have turned their
attention to more complex machine learning methods because of their ability
to identify signals and capture relationships in large datasets. Initially, there
was a flawed belief that the low signal-to-noise ratio left only simple forecasts viable, such as those based on low-dimensional ordinary least squares [Isi21].

With the recent deep learning revolution, deep neural networks have demon-
strated strong representation abilities when modeling time series data [SVL14].
The Makridakis competition evaluates time series forecasting methods. In its
fifth installment held in 2020, all 50 top-performing models were based on deep
learning architectures [MSA22]. A considerable amount of recent empirical re-
search suggests that deep learning models significantly outperform traditional
models like the ARIMA and GARCH when forecasting financial time series
[XNS15, MRC18, SNN18, SGO20]. These results are somewhat puzzling. The
risk of overfitting is generally higher for noisy data like financial data. More-
over, the loss function for DNNs is non-convex, so convergence to a global minimum cannot be guaranteed. Despite the elevated overfitting risk and the massive over-
parameterization of DNNs, they still demonstrate stellar generalization. Thus,
based on recent research, the thesis will apply deep learning techniques to model
financial time series.
A review of deep learning methods in financial time series forecasting [SGO20]
found that LSTMs were the preferred choice in sequence modeling, possibly due
to their ability to remember both long- and short-term dependencies. Convolu-
tional neural networks are another common choice. CNNs are best known for
their ability to process 2D grids such as images; however, they have shown a
solid ability to model 1D grid time series data. Using historical prices, Hiransha
et al. [HGMS18] tested FFNs, vanilla RNNs, LSTMs, and CNNs on forecast-
ing next-day stock market returns on the National Stock Exchange (NSE) of
India and the New York Stock Exchange (NYSE). In the experiment, CNNs
outperformed other models, including the LSTM. These deep learning models
can extract generalizable patterns from the price series alone [SGO20].
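As a concrete, hypothetical illustration of the sequence models discussed above, the sketch below defines a minimal LSTM forecaster in PyTorch that maps a window of past returns to a next-period return forecast; the layer sizes and window length are arbitrary choices for illustration.

import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    """Maps a window of past returns to a single next-period return forecast."""
    def __init__(self, n_features=1, hidden_size=32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):                 # x: (batch, window, n_features)
        out, _ = self.lstm(x)             # out: (batch, window, hidden_size)
        return self.head(out[:, -1, :])   # forecast from the last hidden state

model = LSTMForecaster()
window = torch.randn(64, 30, 1)           # 64 samples of 30 past returns each
forecast = model(window)                  # shape: (64, 1)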

2.6 Mapping forecasts to market positions


Most research on ML in financial markets focuses on forecast-based supervised
learning approaches [Fis18]. These methods tend to ignore how to convert fore-
casts into market positions or use some heuristics like the Kelly criterion to
determine optimal position sizing [Isi21]. The forecasts are usually optimized
by minimizing a loss function like the Mean Squared Error (MSE). An accurate
forecast (in the form of a lower MSE) may lead to a more profitable trader,
but this is not always true. Not only does the discovered signal need ade-
quate predictive power, but it must consistently produce reliable directional
calls. Moreover, the mapping from forecast to market position needs to con-
sider transaction costs and risk, which is challenging in a supervised learning
framework [MW97]. Neglecting transaction costs can lead to aggressive trading
and overestimation of returns. Neglecting risk can lead to trading strategies
that are not viable in the real world. Maximizing risk-adjusted returns is only
feasible when accounting for transaction costs and risk. These shortcomings are
addressed using reinforcement learning [MW97, MWLS98]. Using RL, deep neu-
ral networks can be trained to output market positions directly. Moreover, the
DNN can be jointly optimized for risk- and transaction-cost-sensitive returns,
thus directly optimizing for the true goal: maximizing risk-adjusted returns.

Moody and Wu [MW97] and Moody et al. [MWLS98] empirically demon-
strated the advantages of reinforcement learning relative to supervised learning.
In particular, they demonstrated the difficulty of accounting for transaction
costs using a supervised learning framework. A significant contribution is recurrent reinforcement learning (RRL), their model-free, policy-based RL algorithm for trading financial instruments. The name refers to the recursive mechanism
that stores the past action as an internal state of the environment, allowing the
agent to consider transaction costs. The agent outputs market positions and is
limited to a discrete action space a_t ∈ {−1, 0, 1}, corresponding to maximally short, no position, and maximally long. At time t, the previous action a_{t−1} is fed into the policy network f_θ along with the external state of the environment s_t in order to make the trade decision, i.e.,

a_t = f_θ(s_t, a_{t−1})

where f_θ is a linear function, and the external state is constructed using the past 8 returns. The return r_t is realized at the end of the period (t − 1, t] and consists of the return from the position a_{t−1} held through this period, minus the transaction costs incurred at time t when the new position a_t differs from the old a_{t−1}. Thus, the agent learns the relationship between its actions and both the external and the internal state of the environment.
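A minimal sketch of this per-period trade return, assuming a proportional transaction-cost rate and a flat position before the first trade (both assumptions for illustration; positions may be discrete in {−1, 0, 1} or continuous in [−1, 1]):

import numpy as np

def trade_returns(positions, asset_returns, cost=1e-3):
    """r_t from holding a_{t-1} over (t-1, t], minus costs for changing position at t."""
    a = np.asarray(positions, dtype=float)        # positions a_t
    r = np.asarray(asset_returns, dtype=float)    # asset returns over each period
    a_prev = np.concatenate(([0.0], a[:-1]))      # a_{t-1}; assume flat before the start
    return a_prev * r - cost * np.abs(a - a_prev)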
Moody and Saffell [MS01] compared their actor-based RRL algorithm to
the value-based Q-learning algorithm when applied to financial trading. The
algorithms are tested on two real financial time series: the U.S. dollar/British
pound foreign exchange pair and the S&P 500 stock index. While both perform
better than a buy-and-hold baseline, the RRL algorithm outperforms Q-learning
on all tests. The authors argue that actor-based algorithms are better suited
to immediate reward environments and may be better able to deal with noisy
data and quickly adapt to non-stationary environments. They point out that
critic-based RL suffers from the curse of dimensionality and that when extended
to function approximation, it sometimes fails to converge even in simple MDPs.
Deng et al. [DBK+ 16] combine Moody’s direct reinforcement learning frame-
work with a recurrent neural network to introduce feature learning through deep
learning. Another addition is the use of a continuous action space. To constrain actions to the interval [−1, 1], the RNN output is passed through a tanh function.
Jiang et al. [JXL17] present a deterministic policy gradient algorithm that
trades a portfolio of multiple financial instruments. The policy network is mod-
eled using CNNs and LSTMs, taking each period’s closing, lowest, and highest
prices as input. The DNNs are trained on randomly sampled mini-batches of
experience. These methods account for transaction costs but not risk. Zhang
et al. [ZZW+ 20] present a deep RL framework for a risk-averse agent trading
a portfolio of instruments using both CNNs and LSTMs. Jin and El-Saawy
[JES16] suggest that adding a risk-term to the reward function that penalizes
the agent for volatility produces a higher Sharpe ratio than optimizing for the
Sharpe ratio directly. Zhang et al. [ZZW+ 20] apply a similar risk-term penalty
to the reward function.
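One possible form of such a volatility-penalizing reward is sketched below, assuming a trailing-window volatility estimate and a risk-sensitivity term; the window length and λ value are arbitrary choices, not the ones used later in this thesis.

import numpy as np

def risk_adjusted_rewards(trade_returns, risk_lambda=0.1, window=20):
    """Reward_t = trade return minus risk_lambda times trailing return volatility."""
    r = np.asarray(trade_returns, dtype=float)
    rewards = np.empty_like(r)
    for t in range(len(r)):
        recent = r[max(0, t - window + 1): t + 1]
        rewards[t] = r[t] - risk_lambda * recent.std()
    return rewards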

2.7 Feature engineering
Any forecast needs some predictor data, or features, to make predictions. While
ML forecasting is a science, feature engineering is an art and arguably the most
crucial part of the ML process. Feature engineering and selection for financial
forecasting are only limited by imagination. Features range from traditional
technical indicators (e.g., Moving Average Convergence Divergence, Relative
Strength Index) [ZZR20] to more modern deep learning-based techniques like
analyzing social-media sentiment of companies using Natural Language Process-
ing [ZS10] or using CNNs on satellite images along with weather data to predict
cotton yields [TOdSMJZ20]. Research in feature engineering and selection is
exciting and potentially fruitful but beyond this thesis’s scope. The most re-
liable predictor of future prices of a financial instrument tends to be its past
price, at least in the short term [Isi21]. Therefore, in this thesis, higher-order
features are not manually extracted. Instead, only the price series are analyzed.

2.8 Sub-sampling schemes


Separating high- and low-frequency trading can be helpful, as they present
unique challenges. High-frequency trading (HFT) focuses on reducing software
and hardware latency, which may include building a $300 million fiber-optic ca-
ble to reduce transmission time by four milliseconds between exchanges to gain
a competitive advantage [LB14] 3 . This type of trading has little resemblance
to the low-frequency trading examined in this thesis, described in minutes or
hours rather than milliseconds.
Technical traders believe that the prices of financial instruments reflect all
relevant information [Tal97]. From this perspective, the market’s complete order
history represents the financial market’s state. This state representation would
scale poorly, with computational and memory requirements growing linearly
with time. Consequently, sub-sampling schemes for periodic feature extraction
are almost universally employed. While sampling information at fixed intervals
is straightforward, there may be more effective methods. As exchange activity
varies throughout the day, sampling at fixed intervals may lead to oversampling
during low-activity periods and undersampling during high-activity periods. In
addition, time-sampled series often exhibit poor statistical properties, such as
non-normal returns, autocorrelation, and heteroskedasticity [DP18].
The normality of returns assumption underpins several mathematical finance
models, e.g., Modern Portfolio Theory [Mar68], and the Sharpe-ratio [Sha98].
There is, however, too much peaking and fatter tails in the actual observed dis-
tribution for it to be consistent with samples from Gaussian populations [Man97]4.
3 Turns out they forgot that light travels about 30% slower in glass than in air, and they lost their competitive advantage to simple line-of-sight microwave networks [LB14].
4 Assuming that the S&P 500 index returns were normally distributed, the probability of daily returns being below −5 percent between 1962 and 2004 (10 698 observations) would be approximately 0.0005. However, it happened 8 times [Has07].
Mandelbrot showed in 1963 [Man97] that a Lévy alpha-stable distribution with
infinite variance can approximate returns over fixed periods. In 1967, Mandel-
brot and Taylor [MT67] argued that returns over a fixed number of transactions
may be close to Independent and Identically Distributed (IID) Gaussian. Sev-
eral empirical studies have since confirmed this [Cla73, AG00]. Clark [Cla73]
discovered that sampling by volume instead of transactions exhibits better sta-
tistical properties, i.e., closer to IID Gaussian distribution. Sampling by volume
instead of ticks has intuitive appeal. While tick bars count one transaction of n
contracts as one bar, n transactions of one contract count as n bars. Sampling
according to transaction volume might lead to significant sampling frequency
variations for volatile securities. When the price is high, the volume will be
lower, and therefore the number of observations will be lower, and vice versa,
even though the same value might be transacted. Therefore, sampling by the
monetary value transacted, also called dollar bars, may exhibit even better sta-
tistical properties [DP18]. Furthermore, for equities, sampling by monetary
value exchanged makes an algorithm more robust against corporate actions like
stock splits, reverse splits, stock offerings, and buybacks. To maintain a suitable
sampling frequency, the sampling threshold may need to be adjusted if the total
market size changes significantly.
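A minimal sketch of dollar-bar sampling from a tick stream follows; the threshold is a hypothetical value that, as noted above, would need adjustment if the total market size changes significantly.

import numpy as np

def dollar_bars(prices, volumes, threshold=1e6):
    """Group ticks into bars, each containing roughly `threshold` in traded monetary value."""
    bars, value, start = [], 0.0, 0
    for i, (p, v) in enumerate(zip(prices, volumes)):
        value += p * v                                  # accumulate monetary value traded
        if value >= threshold:
            chunk = np.asarray(prices[start:i + 1], dtype=float)
            bars.append({"open": chunk[0], "high": chunk.max(),
                         "low": chunk.min(), "close": chunk[-1]})
            value, start = 0.0, i + 1
    return bars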
Although periodic feature extraction reduces the number of observations that
must be processed, it scales linearly in computation and memory requirements
per observation. A history cut-off is often employed to represent the state by
only the n most recent observations to tackle this problem. Representing the
state of a partially observable MDP by the n most recent observations is a
common technique used in many reinforcement learning applications. Mnih et
al. [MKS+ 13] used 4 stacked observations as input to the DQN agent that
achieved superhuman performance on Atari games to capture the trajectory of
moving objects on the screen. The state of financial markets is also usually
approximated by stacking past observations [JXL17, ZZR20, ZZW+ 20].
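A minimal sketch of this stacking, where the state at time t is approximated by the n most recent sub-sampled observations (assumes NumPy ≥ 1.20 for sliding_window_view):

import numpy as np

def stacked_states(observations, n=4):
    """Row t holds the n most recent observations ending at time t (t >= n - 1)."""
    x = np.asarray(observations, dtype=float)
    return np.lib.stride_tricks.sliding_window_view(x, n)   # shape: (len(x) - n + 1, n)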

2.9 Backtesting
Assessing a machine learning model involves estimating its generalization error
on new data. The most widely used method for estimating generalization error
is Cross-Validation (CV), which assumes that observations are IID and drawn
from a shared underlying data-generating distribution. However, the price of
a financial instrument is a nonstationary time series with an apparent tempo-
ral correlation. Conventional cross-validation ignores this temporal component
and is thus unsuitable for assessing a time series forecasting model. Instead,
backtesting, a form of cross-validation for time series, is used. Backtesting is
a historical simulation of how the model would have performed should it have
been run over a past period. The purpose of backtesting is the same as for
cross-validation: to determine the generalization error of an ML algorithm.
To better understand backtesting, it is helpful to consider an algorithmic
trading agent’s objective and how it operates to achieve it. The algorithmic
trading process involves the agent receiving information and, based on that
information, executing trades at discrete time steps. These trades are intended

to achieve a specific objective set by the stakeholder, which, in the philosophy
of modern portfolio theory, is maximizing risk-adjusted returns. Thus, assessing
an algorithmic trading agent under the philosophy of modern portfolio theory
entails estimating the risk-adjusted returns resulting from the agent’s actions.
However, when testing an algorithmic trading agent, it cannot access data ahead
of the forecasting period, as that would constitute information leakage to the
agent. For this reason, conventional cross-validation fails in algorithmic trading.
The most precise way to assess the performance of an algorithmic trading
agent is to deploy it to the market, let it trade with the intended amount of
capital, and observe its performance. However, this task would require consid-
erable time since low-frequency trading algorithms are typically assessed over
long periods, often several years5 . Additionally, any small error would likely
result in devastating losses, making algorithmic trading projects economically
unfeasible.

Figure 2.1: Time series cross-validation (backtesting) compared to standard cross-validation from [HA18].

Backtesting is an alternative to this expensive and time-consuming process


where performance on historical simulations functions as a proxy for generalization error.
5 There are a couple of reasons for this: first, there are, on average, 252 trading days per year. Low-frequency trading algorithms typically make fewer than ten trades per day. In order to obtain sufficient test samples, the agent must trade for several years. Second, testing the algorithmic trading agent under various market conditions is crucial. A model that is successful in particular market conditions may be biased towards those conditions and fail to generalize to other market conditions.
Backtesting involves a series of tests that progress sequentially
through time, where every test set consists of a single observation. At test n,
the model trains on the training set consisting of observations prior to the ob-
servation in the test set (i < n). The forecast is, therefore, not based on future
observations, and there is no leakage from the test set to the training set. Then
the backtest progresses to the subsequent observation n + 1, where the training
set increases6 to include observations i < n + 1. The backtest progresses until
there are no test sets left. The backtesting process is formalized in algorithm 1.

Algorithm 1 Backtesting
Train the model on the first k observations (of T total observations).
for i = 0, 1, ..., T − k do
    Select observation k + i from the test set.
    Register trade a_{k+i}.
    Train the model using observations at times t ≤ k + i.
end for
Measure performance using registered trades a_k, a_{k+1}, ..., a_T and the corresponding prices at times k, k + 1, ..., T.
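A minimal Python sketch of Algorithm 1 with an expanding training window follows; the model interface (fit/act) is hypothetical and stands in for whatever learner is being backtested.

import numpy as np

def backtest(model, features, prices, k):
    """Walk-forward simulation: train on the past, trade on the next observation."""
    T = len(prices)
    model.fit(features[:k], prices[:k])                    # initial fit on the first k observations
    trades = []
    for i in range(T - k):
        trades.append(model.act(features[k + i]))          # register trade a_{k+i}
        model.fit(features[:k + i + 1], prices[:k + i + 1])  # refit on observations t <= k + i
    return np.asarray(trades)                              # evaluate against prices[k:] afterwards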

When conducting historical simulations, knowing what information was avail-


able during the studied period is critical. Agents should not have access to data
beyond the point at which they are located in the backtest, in order to avoid
lookahead bias. Lookahead bias is a research bug in which future data is inadvertently used, a “feature” that real-time production trading is free of. Data
used in forecasting must be stored point-in-time (PIT), which indicates when
the information was made available. An incorrectly labeled dataset can lead to
lookahead bias.
Backtesting is a flawed procedure as it suffers from lookahead bias by design.
Having experienced the hold-out test sample provides insight into what made
markets rise and fall, a luxury not available before the fact. Thus, only live trad-
ing can be considered genuinely out-of-sample. Another form of lookahead bias
is repeated and excessive backtest optimization leading to information leakage
from the test to the training set. A machine learning model will enthusiasti-
cally grab any lookahead but fail to generalize to live trading. Furthermore, the
backtests performed in this thesis rely on assumptions of zero market impact,
zero slippage, fractional trading, and sufficient liquidity to execute any trade
instantaneously at the quoted price. These assumptions do not reflect actual
market conditions and will lead to unrealistically high performance in the backtest.
Backtesting should emulate the scientific method, where a hypothesis is de-
veloped, and empirical research is conducted to find evidence of inconsistencies
with the hypothesis. Backtesting should not be used as a research tool for discovering predictive signals; it should only be conducted after the research has been done.
A backtest is not an experiment but a simulation to see if the model behaves
6 The training set can also be a fixed-size FIFO queue (a rolling window).

as expected [DP18]. Random historical patterns might exhibit excellent per-
formance in a backtest. However, it should be viewed cautiously if no ex-ante
logical foundation exists to explain the performance [AHM19].

3 Deep learning
An intelligent agent requires some means of modeling the dynamics of the sys-
tem in which it operates. Modeling financial markets is complicated due to low
signal-to-noise ratios and non-stationary dynamics. The dynamics are highly
nonlinear; thus, several traditional statistical modeling approaches cannot cap-
ture the system’s complexity. Moreover, reinforcement learning requires param-
eterized function approximators, rendering nonparametric learners, e.g., support
vector machines and random forests, unsuitable. Additionally, parametric learn-
ers are generally preferred when the predictor data is well-defined[GBC16], such
as when forecasting security prices using historical data. In terms of nonlinear
parametric learners, artificial neural networks comprise the most widely used
class. Research suggests that deep neural networks, such as LSTMs and CNNs,
effectively model financial data [XNS15, HGMS18, MRC18]. Therefore, the al-
gorithmic trading agent proposed in this thesis will utilize deep neural networks
to model commodity markets.
This chapter introduces the fundamental concepts of deep learning relevant
to this thesis, starting with some foundational concepts related to machine learn-
ing 3.1 and supervised learning 3.2. Next, section 3.3 covers artificial neural
networks, central network topologies, and how they are initialized and opti-
mized to achieve satisfactory results. The concepts presented in this chapter
are presented in the context of supervised learning but will be extended to the
reinforcement learning context in the next chapter (4).

3.1 Machine learning


Machine Learning (ML) studies how computers can automatically learn from
experience without being explicitly programmed. A general and comprehensive
introduction to machine learning can be found in “Elements of Statistical Learn-
ing” by Trevor Hastie, Robert Tibshirani, and Jerome Friedman [HTFF09].
Tom Mitchell defined the general learning problem as “A computer program is
said to learn from experience E with respect to some class of tasks T and per-
formance measure P, if its performance on T, as measured by P, improves with
experience E.” [MM97]. In essence, ML algorithms extract generalizable predic-
tive patterns from data from some, usually unknown, probability distribution
to build a model about the space. It is an optimization problem where perfor-
mance improves through leveraging data. Generalizability relates to a model’s
predictive performance on independent test data and is a crucial aspect of ML.
Models should be capable of transferring learned patterns to new, previously
unobserved samples while maintaining comparable performance. ML is closely
related to statistics in transforming raw data into knowledge. However, while
statistical models are designed for inference, ML models are designed for predic-
tion. There are three primary ML paradigms: supervised learning, unsupervised
learning, and reinforcement learning.

3.1.1 No-free-lunch theorem
The no-free-lunch theorem states that there exists no single universally superior machine learning algorithm that applies to all possible datasets. Fortunately,
this does not mean ML research is futile. Instead, domain-specific knowledge
is required to design successful ML models. The no-free-lunch theorem results
only hold when averaged over all possible data-generating distributions. If the
types of data-generating distributions are restricted to classes with certain sim-
ilarities, some ML algorithms perform better on average than others. Instead
of trying to develop a universally superior machine learning algorithm, the fo-
cus of ML research should be on what ML algorithms perform well on specific
data-generating distributions.

3.1.2 The curse of dimensionality


The Hughes phenomenon states that, for a fixed number of training examples,
as the number of features increases, the average predictive power for a classifier
will increase before it starts deteriorating. Bellman termed this phenomenon the
curse of dimensionality, which frequently manifests itself in machine learning
[SB18]. One manifestation of the curse is that the sampling density is propor-
tional to N 1/p , where p is the dimension of the input space and N is the sample
size. If N = 100 represents a dense sample for p = 1, then N 1/10 = 10010 is the
required sample size for the same sampling density with p = 10 inputs. There-
fore, in high dimensions, the training samples sparsely populate the input space
[HTFF09]. In Euclidean spaces, a non-negative term is added to the distance
between points with each new dimension. Thus, generalizing from training sam-
ples becomes more complex, as it is challenging to say something about a space
without relevant training examples.

3.2 Supervised learning


Supervised Learning (SL) is the machine learning paradigm where a labeled
training set of N ∈ N+ observations τ = {(x_i, y_i)}_{i=1}^N is used to learn a functional dependence y = f̂(x) that can predict y from a previously unobserved x. Supervised learning includes regression tasks with numerical targets and classification tasks with categorical targets. A supervised learning algorithm adjusts the input/output relationship of f̂ in response to the prediction error with respect to the target, y_i − f̂(x_i). The hypothesis is that if the training set is representative of
the population, the model will generalize to previously unseen examples.

3.2.1 Function approximation


Function approximation, or function estimation, is an instance of supervised
learning that concerns selecting a function among a well-defined class that un-
derlies the predictive relationship between the input vector x and output vari-
able y. In most cases, x ∈ R^d, where d ∈ N+, and y ∈ R. Function approximation relies on the assumption that there exists a function f(x) that describes the approximate relationship between (x, y). This relationship can be defined as

y = f(x) + ϵ    (3.1)

where ϵ is some irreducible error that is independent of x, with E[ϵ] = 0 and Var(ϵ) = σ_ϵ^2. All departures from a deterministic relationship between (x, y)
are captured via the error ϵ. The objective of function approximation is to
approximate the function f with a model f̂. In practice, this means finding the optimal model parameters θ. For linear models, the parameters θ can be estimated with ordinary least squares.
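A minimal example of estimating θ for a linear model with ordinary least squares on synthetic data (the design matrix, true parameters, and noise level are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])   # intercept + 3 features
theta_true = np.array([0.5, 1.0, -2.0, 0.3])
y = X @ theta_true + 0.1 * rng.normal(size=100)

theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares estimate of theta
print(theta_hat)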

Bias-variance tradeoff Bias and variance are two sources of error in model
estimates. Bias measures the in-sample expected deviation between the model
estimate and the target and is defined as

Bias^2(f̂(x)) = (E[f̂(x)] − f(x))^2    (3.2)

and is a decreasing function of complexity. Variance measures the variability in


model estimates and is defined as

Var(f̂(x)) = E[(f̂(x) − E[f̂(x)])^2]    (3.3)

and is an increasing function of complexity. The out-of-sample mean square


error for model estimates is defined as

MSE = Bias^2(f̂) + Var(f̂) + Var(ϵ)    (3.4)

where the last term V ar(ϵ) is the irreducible noise error due to a target com-
ponent ϵ not predictable by x. The bias can be made arbitrarily small using a
more complex model; however, this will increase the model variance, or gener-
alization error, when switching from in-sample to out-of-sample. This is known
as the bias-variance tradeoff.
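The decomposition in equations (3.2)-(3.4) can be illustrated with a small Monte Carlo experiment on synthetic data; the true function, noise level, and polynomial degrees below are arbitrary assumptions for the example.

import numpy as np

rng = np.random.default_rng(0)
f, x_test, n, sigma = np.sin, 1.3, 30, 0.3     # assumed true function and noise level

def fit_and_predict(degree):
    """Fit a polynomial of the given degree to one noisy sample and predict f(x_test)."""
    x = rng.uniform(0.0, 2.0 * np.pi, n)
    y = f(x) + sigma * rng.normal(size=n)
    return np.polyval(np.polyfit(x, y, degree), x_test)

for degree in (1, 3, 9):
    preds = np.array([fit_and_predict(degree) for _ in range(500)])
    bias2 = (preds.mean() - f(x_test)) ** 2    # squared bias, equation (3.2)
    var = preds.var()                          # variance, equation (3.3)
    print(f"degree={degree}: bias^2={bias2:.4f} variance={var:.4f}")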

Overfitting A good ML model minimizes the model error, i.e., the training
error (bias) and the generalization error (variance). This is achieved at some
optimal complexity level, dependent on the data and the model. Increasing
model complexity, or capacity, can minimize training error. However, such a
model is unlikely to generalize well. Therefore, the difference between training
error and generalization error will be high. This is known as overfitting and
happens when the complexity of the ML model exceeds that of the underlying
problem. Conversely, underfitting is when the training error is high because the
model’s complexity is lower than that of the underlying problem.
To minimize model error, the complexity of the ML model, or its induc-
tive bias, must align with that of the underlying problem. The principle of
parsimony, known as Occam’s razor, states that among competing hypotheses
explaining observations equally well, one should pick the simplest one. This
heuristic is typically applied to ML model selection by selecting the simplest
model from models with comparable performance.

18
Figure 3.1: An illustration of the relationship between the capacity of a function
approximator and the generalization error from [GBC16].

Recent empirical evidence has raised questions about the mathematical foun-
dations of machine learning. Complex models such as deep neural networks
have been shown to decrease both training error and generalization error with
growing complexity [ZBH+ 21]. Furthermore, the generalization error keeps de-
creasing past the interpolation limit. These surprising results contradict the
bias-variance tradeoff that implies that a machine learning model should bal-
ance over- and underfitting. Belkin et al. [BHMM19] reconciled these conflicting ideas by extending the classical bias-variance curve with a “double descent” curve, in which increasing the model capacity beyond the interpolation point improves performance again.

3.3 Artificial neural networks


An Artificial Neural Network (ANN) is a parametric learner fitting nonlinear
models. The network defines a mapping hθ : Rn → Rm where n is the input
dimension, m is the output dimension, and θ are the network weights. A neural
network has a graph-like topology. It is a collection of nodes organized in layers
like a directed and weighted graph. The nodes of an ANN are typically separated
into layers; the input layer, one or more hidden layers, and the output layer.
Their dimensions depend on the function being approximated. A multi-layer
neural network is called a Deep Neural Network (DNN).

3.3.1 Feedforward neural networks


A Feedforward Network (FFN), or fully-connected network, defines the founda-
tional class of neural networks where the connections are a directed acyclic graph
that only allows signals to travel forward in the network. A feedforward network
is a mapping hθ that is a composition of multivariate functions f1 , f2 , ..., fk , g,
where k is the number of layers in the neural network. It is defined as

hθ (x) = (g ◦ fk ◦ ... ◦ f2 ◦ f1 )(x)    (3.5)

The functions fj , j = 1, 2, ..., k represent the network’s hidden layers and are
composed of multivariate functions. The function at layer j is defined as

fj (x) = ϕj (θj x + bj ) (3.6)

where ϕj is the activation function and bj is the bias at layer j. The activation
function is used to add nonlinearity to the network. The network’s final output
layer function g can be tailored to suit the problem the network is solving, e.g.,
linear for Gaussian output distribution or Softmax distribution for categorical
output distribution.

Figure 3.2: Feedforward neural network from [FFLb].

3.3.2 Parameter initialization


Neural network learning algorithms are iterative and require some starting point
from which to begin. The initial parameters of the networks can affect the speed
and level of convergence or even whether the model converges at all. Little is
known about weight initialization, a research area that is still in its infancy. Fur-
ther complicating the issue: initial parameters favorable for optimization might
be unfavorable for generalization [GBC16]. Developing heuristics for parameter
initialization is, therefore, non-trivial. Randomness and asymmetry between the
network units are desirable properties for the initial weights [GBC16]. Weights
are usually drawn randomly from a Gaussian or uniform distribution in a neigh-
borhood around zero, while the bias is usually set to some heuristically chosen
constant. Larger initial weights will prevent signal loss during forward- and
backpropagation. However, too large values can result in exploding values, a problem particularly prevalent in recurrent neural networks. The initial scale of the weights is usually set to something like 1/√m, where m is the number of inputs to the network layer.
Kaiming initialization [HZRS15] is a parameter initialization method that
takes the type of activation function (e.g., Leaky-ReLU) used to add nonlinearity
to the neural network into account. The key idea is that the initialization
method should not exponentially reduce or magnify the magnitude of input
signals. Therefore, each layer is initialized at separate scales depending on their
size. Let ml ∈ N+ be the size of the inputs into the layer l ∈ N+ . Kaiming He
et al. recommend initializing weights such that
(1/2) ml Var[θl ] = 1

which corresponds to an initialization scheme of

wl ∼ N (0, 2/ml )

Biases are initialized at 0.
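
As a concrete illustration, a minimal NumPy sketch of this scheme for one fully-connected layer could look as follows; the layer sizes and the use of a Gaussian generator are assumptions made for the example, not part of the original method description.

import numpy as np

def kaiming_init(m_in, m_out, seed=0):
    # Draw weights from N(0, 2/m_in) so that (1/2) * m_in * Var[w] = 1
    rng = np.random.default_rng(seed)
    weights = rng.normal(0.0, np.sqrt(2.0 / m_in), size=(m_out, m_in))
    bias = np.zeros(m_out)  # biases are initialized at 0
    return weights, bias

W1, b1 = kaiming_init(m_in=64, m_out=128)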

3.3.3 Gradient-based learning


Neural nets are optimized by adjusting their weights θ with the help of objective
functions. Let J(θ) define the differentiable objective function for a neural
network, where θ are the network weights. The choice of the objective function
and whether it should be minimized or maximized depends on the problem
being solved. For regression tasks, the objective is usually to minimize some
loss function like mean-squared error (MSE)
J(θ) = (1/n) Σ_{i=1}^{n} (hθ (xi ) − yi )²    (3.7)

Due to neural nets’ nonlinearity, most loss functions are non-convex, meaning an analytical solution to ∇J(θ) = 0 is generally intractable. Instead, iterative, gradient-based optimization algorithms are used. These algorithms offer no convergence guarantees but often find a satisfactorily low value of the loss function relatively quickly. Gradient descent-based algorithms adjust the weights θ in the
direction that minimizes the MSE loss function. The update rule for parameter
weights in gradient descent is defined as

θt+1 = θt − α∇θ J(θt ) (3.8)

where α > 0 is the learning rate and the gradient ∇J(θt ) is the partial derivatives
of the objective function with respect to each weight. The learning rate defines
the rate at which the weights move in the direction suggested by the gradient of
the objective function. Gradient-based optimization algorithms, also called first-
order optimization algorithms, are the most common optimization algorithms
for neural networks [GBC16].
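
To make the update rule (3.8) concrete, the following sketch runs plain gradient descent on a linear model with the MSE loss; the synthetic data, learning rate, and iteration count are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                        # 100 observations, 3 features
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

theta, alpha = np.zeros(3), 0.1                      # initial weights and learning rate
for _ in range(500):
    grad = 2.0 / len(y) * X.T @ (X @ theta - y)      # gradient of the MSE loss
    theta = theta - alpha * grad                     # gradient descent step (3.8)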

Stochastic gradient descent Optimization algorithms that process the en-
tire training set simultaneously are known as batch learning algorithms. Using
the average of the entire training set allows for calculating a more accurate
gradient estimate. Batch learning converges to a local minimum faster than online learning does. However, batch learning is not suitable
for all problems, e.g., problems with massive datasets due to the high computa-
tional costs of calculating the full gradient or problems with dynamic probability
distributions.
Instead, Stochastic Gradient Descent (SGD) is often used when optimizing
neural networks. SGD replaces the gradient in conventional gradient descent
with a stochastic approximation. Furthermore, the stochastic approximation is
only calculated on a subset of the data. This reduces the computational costs
of high-dimensional optimization problems. However, the loss is not guaranteed
to decrease when using a stochastic gradient estimate. SGD is often used for
problems with continuous streams of new observations rather than a fixed-size
training set. The update rule for SGD is similar to the one for GD but replaces
the true gradient with a stochastic estimate

θt+1 = θt − αt ∇θ J (j) (θt ) (3.9)

where ∇θ J (j) (θ) is the stochastic estimate of the gradient computed from observation j. The total loss is defined as J(θ) = Σ_{j=1}^{N} J (j) (θ), where N ∈ N is the total number of observations. The learning rate at time t is αt > 0. Due to the noise introduced by the SGD gradient estimate, gradually decreasing the learning rate over time is crucial to ensure convergence. Stochastic approximation theory guarantees convergence to a local optimum if α satisfies the conditions Σt αt = ∞ and Σt αt² < ∞. It is common to adjust the learning rate using the following update rule αt = (1 − β)α0 + βατ , where β = t/τ , and the learning rate is kept constant after τ iterations, i.e., ∀t ≥ τ , αt = ατ [GBC16].
Due to hardware parallelization, simultaneously computing the gradient of
N observations will usually be faster than computing each gradient separately
[GBC16]. Neural networks are, therefore, often trained on mini-batches, i.e.,
sets of more than one but less than all observations. Mini-batch learning is an
intermediate approach to fully online learning and batch learning where weights
are updated simultaneously after accumulating gradient information over a sub-
set of the total observations. In addition to providing better estimates of the
gradient, mini-batches are more computationally efficient than online learning
while still allowing training weights to be adjusted periodically during train-
ing. Therefore, minibatch learning can be used to learn systems with dynamic
probability distributions. Samples of the mini-batches should be independent
and drawn randomly. Drawing ordered batches will result in biased estimates,
especially for data with high temporal correlation.
Due to noisy gradient estimates, stochastic gradient descent and mini-batches
of small size will exhibit higher variance than conventional gradient descent dur-
ing training. The higher variance can be helpful to escape local minima and
find new, better local minima. However, high variance can also lead to prob-
lems such as overshooting and oscillation that can cause the model to fail to
converge. Several extensions have been made to stochastic gradient descent to
circumvent these problems.

Adaptive gradient algorithm The Adaptive Gradient (AdaGrad) is an extension to stochastic gradient descent introduced in 2011 [DHS11]. It outlines a
strategy for adjusting the learning rate to converge quicker and improving the
capability of the optimization algorithm. A per-parameter learning rate allows
AdaGrad to improve performance on problems with sparse gradients. Learning
rates are assigned lower for parameters with frequently occurring features and
higher for parameters with less frequently occurring features. The AdaGrad
update rule is given as
θt+1 = θt − (α / √(Gt + ϵ)) gt    (3.10)

where gt = ∇θ J (j) (θt ) and Gt = Σ_{τ=1}^{t} gτ gτ⊤ is the sum of the outer products of all previous subgradients. ϵ > 0 is a smoothing term to avoid division by zero. As training
proceeds, the squared gradients in the denominator of the learning rate will
continue to grow, resulting in a strictly decreasing learning rate. As a result,
the learning rate will eventually become so small that the model cannot acquire
new information.

Root mean square propagation Root Mean Square Propagation (RMSProp) is an unpublished extension to SGD developed by Geoffrey Hinton. RMSProp was developed to resolve the problem of AdaGrad’s diminishing learning
rate. Like AdaGrad, it maintains a per-parameter learning rate. To normalize
the gradient, it keeps a moving average of squared gradients. This normalization
decreases the learning rate for more significant gradients to avoid the exploding
gradient problem and increases it for smaller gradients to avoid the vanishing
gradient problem. The RMSProp update rule is given as
θt+1 = θt − (α / √(E[g²]t + ϵ)) gt    (3.11)

where E[g²]t = βE[g²]t−1 + (1 − β)gt² is the exponentially decaying average of squared gradients and β > 0 is a second learning rate conventionally set to β = 0.9.

Adam The Adam optimization algorithm is an extension of stochastic gradient descent that has recently seen wide adoption in deep learning. It was introduced in 2015 [KB14] and derives its name from adaptive moment estimation.
It utilizes the Adaptive Gradient (AdaGrad) Algorithm and Root Mean Square
Propagation (RMSProp). Adam only requires first-order gradients and little
memory but is computationally efficient and works well with high-dimensional
parameter spaces. As with AdaGrad and RMSProp, Adam utilizes independent
per-parameter learning rates separately adapted during training. Adam stores
a moving average of gradients E[g]t = β1 E[g]t−1 + (1 − β1 )gt with learning rate β1 > 0. Like RMSProp, Adam also stores a moving average of squared gradients E[g²]t with learning rate β2 > 0. The Adam update rule is given as

θt+1 = θt − (α / √(Ê[g²]t + ϵ)) Ê[g]t    (3.12)

where Ê[g²]t = E[g²]t / (1 − β2^t ) and Ê[g]t = E[g]t / (1 − β1^t ) are the bias-corrected moving averages. The authors recommend learning rates β1 = 0.9, β2 = 0.999, as well as ϵ = 10⁻⁸. Adam has been shown to out-
perform other optimizers in a wide range of non-convex optimization problems.
Researchers at Google [ARS+ 20] recommend the Adam optimization algorithm
for SGD optimization in reinforcement learning.
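
The sketch below implements a single Adam parameter update following the moving averages and bias corrections above with the recommended defaults; the gradient passed in is assumed to come from some loss function and is not computed here.

import numpy as np

def adam_step(theta, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g          # moving average of the gradient
    v = beta2 * v + (1 - beta2) * g**2       # moving average of the squared gradient
    m_hat = m / (1 - beta1**t)               # bias-corrected estimates
    v_hat = v / (1 - beta2**t)
    theta = theta - alpha * m_hat / np.sqrt(v_hat + eps)   # update rule (3.12)
    return theta, m, v

# Example: one update of a 3-dimensional parameter vector at step t = 1
theta, m, v = adam_step(np.zeros(3), g=np.array([0.1, -0.3, 0.2]),
                        m=np.zeros(3), v=np.zeros(3), t=1)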

3.3.4 Backpropagation
Gradient-based optimization requires a method for computing a function’s gra-
dient. For neural nets, the gradient of the loss function with respect to the
weights of the network ∇θ J(θ) is usually computed using the backpropagation
algorithm (backprop) introduced in 1986 [RHW85]. Backprop calculates the
gradient of the loss function with respect to each weight in the network. This is
done by iterating backward through the network layers and repeatedly applying
the chain rule. The chain rule of calculus is used when calculating derivatives of
functions that are compositions of other functions with known derivatives. Let
y, z : R → R be functions defined as y = g(x) and z = f (g(x)) = f (y). By the
chain rule
dz/dx = (dz/dy)(dy/dx)    (3.13)
Generalizing further, let x ∈ Rm , y ∈ Rn , and define mappings g : Rm → Rn
and f : Rn → R. If y = g(x) and z = f (y), then the chain rule is
∂z/∂xi = Σj (∂z/∂yj )(∂yj /∂xi )    (3.14)

which can be written in vector notation as

∇x z = (∂y/∂x)⊤ ∇y z    (3.15)

where ∂y/∂x is the n × m Jacobian matrix of g. Backpropagation is often performed
on tensors and not vectors. However, backpropagation with tensors is performed
similarly by multiplying Jacobians by gradients. Backpropagation with tensors
can be performed by flattening a tensor into a vector, performing backprop on
the vector, and then reshaping the vector back into a tensor. Let X and Y be
tensors and Y = g(X) and z = f (Y ). The chain rule for tensors is
∇X z = Σj (∇X Yj ) ∂z/∂Yj    (3.16)
By recursively applying the chain rule, a scalar’s gradient can be expressed
for any node in the network that produced it. This is done recursively, starting
from the output layer and going back through the layers of the network to avoid
storing subexpressions of the gradient or recomputing them several times.
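
As an illustration of this recursion, the sketch below runs one forward and one backward pass through a two-layer network with a ReLU hidden layer and an MSE loss; the architecture, toy data, and absence of bias terms are simplifying assumptions for the example.

import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=(8, 4)), rng.normal(size=(8, 1))        # toy mini-batch
W1, W2 = 0.5 * rng.normal(size=(4, 16)), 0.5 * rng.normal(size=(16, 1))

# Forward pass, storing the intermediate values needed for the backward pass
z1 = x @ W1
h1 = np.maximum(0.0, z1)                 # ReLU activation
y_hat = h1 @ W2
loss = np.mean((y_hat - y) ** 2)

# Backward pass: apply the chain rule from the output layer back to the input
d_yhat = 2.0 * (y_hat - y) / y.size      # dL/dy_hat
dW2 = h1.T @ d_yhat                      # dL/dW2
d_h1 = d_yhat @ W2.T                     # dL/dh1
d_z1 = d_h1 * (z1 > 0)                   # multiply by the ReLU derivative
dW1 = x.T @ d_z1                         # dL/dW1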

3.3.5 Activation function


The activation function ϕ(ξ) adds nonlinearity to a neural net. If the activation
function in the hidden layer is linear, then the network is equivalent to a network
without hidden layers since linear functions of linear functions are themselves
linear. The activation function must be differentiable to compute the gradient.
Choosing an appropriate activation function depends on the specific problem.
Sigmoid functions σ, like the logistic function, are commonly used, as well as
other functions such as the hyperbolic tangent function tanh. The derivative
of the logistic function is close to 0 except in a small neighborhood around 0.
At each backward step, the error signal δ is multiplied by the derivative of the activation function. The gradient will therefore approach 0 and thus produce extremely
slow learning. This is known as the vanishing gradient problem. For this reason,
the Rectified Linear Unit (ReLU) is the default recommendation for activation
function in modern deep neural nets [GBC16]. ReLU is a ramp function defined
as ReLU (x) = max{0, x}. The derivative of the ReLU function is defined as
ReLU′ (x) = { 0 if x < 0
            { 1 if x > 0        (3.17)

The derivative is undefined for x = 0, but it has subdifferential [0, 1], and
it conventionally takes the value ReLU ′ (0) = 0 in practical implementations.
Since ReLU is a piecewise linear function, it optimizes well with gradient-based
methods.

Figure 3.3: ReLU activation function from [FFLb].

ReLU suffers from what is known as the dying ReLU problem, where a
large gradient could cause a node’s weights to update such that the node will
never output anything but 0. Such nodes will not discriminate against any
input and are effectively “dead”. This problem can be caused by unfortunate
weight initialization or a too-high learning rate. Generalizations of the ReLU
function, like the Leaky ReLU (LReLU) activation function, have been proposed
to combat the dying ReLU problem [GBC16]. Leaky ReLU allows a small
“leak” for negative values proportional to some slope coefficient α, e.g., α =
0.01, determined before training. This allows small gradients to travel through
inactive nodes. Leaky ReLU will slowly converge even on randomly initialized
weights but can also reduce performance in some applications [GBC16].

Figure 3.4: Leaky ReLU activation function from [FFLb].
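
Both activation functions and their (sub)gradients reduce to a few lines of NumPy; the slope coefficient below uses the example value α = 0.01 mentioned above.

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Small "leak" for negative inputs, proportional to alpha
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    # Subgradient: 1 for positive inputs, alpha for negative inputs
    return np.where(x > 0, 1.0, alpha)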

3.3.6 Regularization
Minimization of generalization error is a central objective in machine learning.
The representation capacity of large neural networks, expressed by the universal
approximation theorem (3.3.8), comes at the cost of increased overfitting risk.
Consequently, a critical question in ML is how to design and train neural net-
works to achieve the lowest generalization error. Regularization addresses this
question. Regularization is a set of techniques designed to reduce generalization
error, possibly at the expense of training error.
Regularization of estimators trades increased bias for reduced variance. If
effective, it reduces model variance more than it increases bias. Weight decay
is used to regularize ML loss functions by adding the squared L2 norm of the
parameter weights Ω(θ) = 12 ||θ||22 as a regularization term to the loss function

J̃(θ) = J(θ) + λΩ(θ)    (3.18)

where λ ≥ 0 is a constant weight decay parameter. Increasing λ punishes larger
weights harsher. Weight decay creates a tradeoff for the optimization algorithm
between minimizing the loss function J(θ) and the regularization term Ω(θ).
Figure 3.5: An example of the effect of weight decay with parameter λ on a high-dimensional polynomial regression model from [GBC16].

Dropout [SHK+ 14] is another regularization strategy that reduces the risk of overfitting by randomly eliminating non-output nodes and their connections during training, preventing units from co-adapting too much. Dropout can be
considered an ensemble method, where an ensemble of “thinned” sub-networks
trains the same underlying base network. It is computationally inexpensive and
only requires setting one parameter α ∈ [0, 1), which is the rate at which nodes
are eliminated.
Early stopping is a common and effective implicit regularization technique
that addresses how many epochs a model should be trained to achieve the low-
est generalization error. The training data is split into training and validation
subsets. The model is iteratively trained on the training set, and at predefined
intervals in the training cycle, the model is tested on the validation set. The
error on the validation set is used as a proxy for the generalization error. If the
performance on the validation set improves, a copy of the model parameters is
stored. If performance worsens, the learning terminates, and the model param-
eters are reset to the previous point with the lowest validation set error. Testing
too frequently on the validation set risks premature termination. Temporary
dips in performance are prevalent for nonlinear models, especially when trained
with reinforcement learning algorithms when the agent explores the state and
action space. Additionally, frequent testing is computationally expensive. On
the other hand, infrequent testing increases the risk of not registering the model
parameters near their performance peak. Early stopping is relatively simple but
comes at the cost of sacrificing parts of the training set to the validation set.
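
The sketch below combines weight decay (3.18) with a simple early-stopping loop around a linear model trained by gradient descent; the synthetic data split, weight decay strength, learning rate, and patience are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 2.0]) + rng.normal(size=200)
X_tr, y_tr, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

theta, lam, alpha = np.zeros(5), 0.1, 0.01
best_theta, best_val, patience, bad_epochs = theta.copy(), np.inf, 10, 0

for epoch in range(1000):
    # Gradient of the regularized loss J~(theta) = J(theta) + lam * (1/2)||theta||^2
    grad = 2.0 / len(y_tr) * X_tr.T @ (X_tr @ theta - y_tr) + lam * theta
    theta = theta - alpha * grad
    val_loss = np.mean((X_val @ theta - y_val) ** 2)    # proxy for generalization error
    if val_loss < best_val:
        best_val, best_theta, bad_epochs = val_loss, theta.copy(), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                      # early stopping
            break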

3.3.7 Batch normalization


Deep neural networks are sensitive to initial random weights and hyperparame-
ters. When updating the network, all weights are updated using a loss estimate
under the false assumption that weights in the prior layers are fixed. In prac-
tice, all layers are updated simultaneously. Therefore, the optimization step is
constantly chasing a moving target. The distribution of inputs during training
is forever changing. This is known as internal covariate shift, making the net-
work sensitive to initial weights and slowing down training by requiring lower
learning rates.
Batch normalization (batch norm) is a method of adaptive reparametriza-
tion used to train deep networks. It was introduced in 2015 [IS15] to help stabi-
lize and speed up training deep neural networks by reducing internal covariate
shift. Batch norm normalizes the output distribution to be more uniform across
dimensions by standardizing the activations of each input variable for each mini-
batch. Standardization rescales the data to standard Gaussian, i.e., zero-mean
unit variance. The following transformation is applied to a mini-batch of acti-
vations to standardize it
x̂(k)_norm = (x(k) − E[x(k) ]) / √(Var[x(k) ] + ϵ)    (3.19)
where the ϵ > 0 is a small number such as 10−8 added to the denominator for
numerical stability. Normalizing the mean and standard deviation can, however,
reduce the expressiveness of the network [GBC16]. Applying a second transfor-
mation step to the mini-batch of normalized activations restores the expressive
power of the network
x̃(k) = γ x̂(k)_norm + β    (3.20)
where β and γ are learned parameters that adjust the mean and standard devi-
ation, respectively. This new parameterization is easier to learn with gradient-
based methods. Batch normalization is usually inserted after fully connected
or convolutional layers. It is conventionally inserted into the layer before acti-
vation functions but may also be inserted after. Batchnorm speeds up learning
and reduces the strong dependence on initial parameters. Additionally, it can
have a regularizing effect and sometimes eliminate the need for dropout.
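
A forward-pass sketch of the two transformation steps (3.19) and (3.20) over a mini-batch; the learned parameters γ and β are passed in as given, and the inference-time use of running statistics is omitted.

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-8):
    # x has shape (batch_size, features); statistics are computed per feature
    mean, var = x.mean(axis=0), x.var(axis=0)
    x_norm = (x - mean) / np.sqrt(var + eps)   # standardization step (3.19)
    return gamma * x_norm + beta               # rescale and shift (3.20)

x = np.random.default_rng(0).normal(size=(32, 8))
out = batch_norm_forward(x, gamma=np.ones(8), beta=np.zeros(8))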

3.3.8 Universal approximation theorem


In 1989, Cybenko [Cyb89] proved that a feedforward network of arbitrary width
with a sigmoidal activation function and a single hidden layer can approximate
any continuous function. The theorem asserts that given any f ∈ C([0, 1]n )7 ,
ϵ > 0 and sigmoidal activation function ϕ, there is a finite sum of the form
f̂(x) = Σ_{i=1}^{N} αi ϕ(θi⊤ x + bi )    (3.21)

where αi , bi ∈ R and θi ∈ Rn , for which


|fˆ(x) − f (x)| < ϵ (3.22)
for all x ∈ [0, 1]n . Hornik [Hor91] later generalized to include all squashing acti-
vation functions in what is known as the universal approximation theorem. The
7 Continuous function on the n-dimensional unit cube.
theorem establishes that there are no theoretical constraints on the expressivity
of neural networks. However, it does not guarantee that the training algorithm
will be able to learn that function, only that it can be learned for an extensive
enough network.

3.3.9 Deep neural networks


Although a single-layer network, in theory, can represent any continuous func-
tion, it might require the network to be infeasibly large. It may be easier or
even required to approximate more complex functions using networks of deep
topology [SB18]. The class of ML algorithms that use neural nets with multiple
hidden layers is known as Deep Learning (DL). Interestingly, the universal ap-
proximation theorem also applies to networks of bounded width and arbitrary
depth. Lu et al. [LPW+ 17] showed that for any Lebesgue-integrable function
f : Rn → R and any ϵ > 0, there exists a fully-connected ReLU network A with
width (n + 4), such that the function FA represented by this network satisfies
∫_{Rn} |f (x) − FA (x)| dx < ϵ    (3.23)

i.e., any continuous multivariate function f : Rn → R can be approximated by
a deep ReLU network with width dm ≤ n + 4.
Poggio et al.[PMR+ 17] showed that a deep network could have exponentially
better approximation properties than a wide shallow network of the same total
size. Conversely, a network of deep topology can attain the same expressivity
as a larger shallow network. They also show that a deep composition of low-
dimensional functions has a theoretical guarantee, which shallow networks do
not have, that they can resist the curse of dimensionality for a large class of
functions.
Several unique network architectures have been developed for tasks like com-
puter vision, sequential data, and machine translation. As a result, they can
significantly outperform larger and more deeply layered feedforward networks.
The architecture of neural networks carries an inductive bias, i.e., an a priori
algorithmic preference. A neural network’s inductive bias must match that of
the problem it is solving to generalize well out-of-sample.

3.3.10 Convolutional neural networks


A Convolutional Neural Network (CNN) is a type of neural network specialized
in processing data with a known, grid-like topology such as time-series data
(1-dimensional) or images (2-dimensional) [GBC16]. Convolutional neural net-
works have profoundly impacted fields like computer vision [GBC16] and are
used in several successful deep RL applications [MKS+ 13, HS15, LHP+ 15]. A
CNN is a neural net that applies convolution instead of general matrix multipli-
cation in at least one layer. A convolution is a form of integral transform defined
as the integral of the product of two functions after one is reflected about the
y-axis and shifted
s(t) = (x ∗ w)(t) = ∫ x(a)w(t − a) da    (3.24)

where x(t) ∈ R and w(t) is a weighting function.


The convolutional layer takes the input x with its preserved spatial structure.
The weights w are given as filters that always extend the full depth of the input
volume but are smaller than the full input size. Convolutional neural nets utilize
weight sharing by applying the same filter across the whole input. The filter
slides across the input and convolves the filter with the image. It computes
the dot product at every spatial location, which makes up the activation map,
i.e., the output. This can be done using different filters to produce multiple
activation maps. The way the filter slides across the input can be modified.
The stride specifies how many pixels the filter moves every step. It is common
to zero pad the border if the stride is not compatible with the size of the filter
and the input.

Figure 3.6: 3D convolutional layer from [FFLc].

After the convolutional layer, a nonlinear activation function is applied to
the activation map. Convolutional networks may also include pooling layers
after the activation function that reduce the dimension of the data. Pooling can
summarize the feature maps to the subsequent layers by discarding redundant
or irrelevant information. Max pooling is a pooling operation that reports the
maximum output within a patch of the feature map. Increasing the stride of
the convolutional kernel also gives a downsampling effect similar to pooling.
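
Since this thesis deals with time series, the one-dimensional case is the relevant one; the sketch below slides a single filter over a univariate series with a given stride. The filter values and stride are arbitrary, and, as in most deep learning libraries, the filter is not flipped, so the operation is technically cross-correlation.

import numpy as np

def conv1d(x, w, stride=1):
    # Valid (no padding) 1D convolution of series x with filter w
    n, k = len(x), len(w)
    out = []
    for start in range(0, n - k + 1, stride):
        out.append(np.dot(x[start:start + k], w))   # dot product at each position
    return np.array(out)

series = np.arange(10, dtype=float)
activation_map = conv1d(series, w=np.array([0.25, 0.5, 0.25]), stride=2)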

3.3.11 Recurrent neural networks


A Recurrent Neural Network (RNN) is a type of neural network that allows
connections between nodes to create cycles so that outputs from one node affect
inputs to another. The recurrent structure enables networks to exhibit temporal
dynamic behavior. RNNs scale far better than feedforward networks for longer
sequences and are well-suited to processing sequential data. However, they can
be cumbersome to train as their recurrent structure precludes parallelization.
Furthermore, conventional batch norm is incompatible with RNNs, as the re-
current part of the network is not considered when computing the normalization
statistic.

Figure 3.7: Recurrent neural network from [FFLa].

RNNs generate a sequence of hidden states ht . The hidden states enable
weight sharing that allows the model to generalize over examples of various
lengths. Recurrent neural networks are functions of the previous hidden state
ht−1 and the input xt at time t. The hidden units in a recurrent neural network
are often defined as a dynamic system h(t) driven by an external signal x(t)

h(t) = f (h(t−1) , x(t) ; θ) (3.25)

Hidden states ht are utilized by RNNs to summarize problem-relevant as-
pects of the past sequence of inputs up to t when forecasting future states based
on previous states. Since the hidden state is a fixed-length vector, it will be
a lossy summary. The forward pass is sequential and cannot be parallelized.
Backprop uses the states computed in the forward pass to calculate the gradi-
ent. The backprop algorithm used on unrolled RNNs is called backpropagation
through time (BPTT). All nodes that contribute to an error should be adjusted.
In addition, for an unrolled RNN, nodes far back in the calculations should also
be adjusted. Truncated backpropagation through time that only backpropa-
gates for a few backward steps can be used to save computational resources at
the cost of introducing bias.
Every time the gradient backpropagates through a vanilla RNN cell, it is
multiplied by the transpose of the weights. A sequence of vanilla RNN cells will
therefore multiply the gradient with the same factor multiple times. If x > 1
then limn→∞ xn = ∞, and if x < 1 then limn→∞ xn = 0. If the largest singular
value of the weight matrix is > 1, the gradient will exponentially increase as it
backpropagates through the RNN cells. Conversely, if the largest singular value
is < 1, the opposite happens, where the gradient will shrink exponentially. For
the gradient of RNNs, this will result in either exploding or vanishing gradi-
ents. This is why vanilla RNNs trained with gradient-based methods do not
perform well, especially when dealing with long-term dependencies. Bengio et
al. [BSF94] present theoretical and experimental evidence supporting this con-
clusion. Exploding gradients lead to large updates that can have a detrimental
effect on model performance. The standard solution is to clip the parameter
gradients above a certain threshold. Gradient clipping can be done element-wise
or by the norm over all parameter gradients. Clipping the gradient norm has
an intuitive appeal over elementwise clipping. Since all gradients are normal-
ized jointly with the same scaling factor, the gradient still points in the same
direction, which is not necessarily the case for element-wise gradient clipping
[GBC16]. Let ∥g∥ be the norm of the gradient g and v > 0 be the norm thresh-
old. If the norm crosses over the threshold ∥g∥ > v, the gradient is clipped to
g ← gv / ∥g∥    (3.26)
Gradient clipping solves the exploding gradient problem and can improve perfor-
mance for reinforcement learning with nonlinear function approximation [ARS+ 20].
For vanishing gradients, however, the whole architecture of the recurrent net-
work needs to be changed. This is currently a hot topic of research [GBC16].
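
Clipping the gradient norm as in (3.26) takes only a few lines; the threshold below is an arbitrary example value.

import numpy as np

def clip_grad_norm(grad, threshold=1.0):
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * threshold / norm   # rescale so the norm equals the threshold
    return grad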

Long short-term memory Long Short-Term Memory (LSTM) is a form of
gated RNN designed to have better gradient flow properties to solve the problem
of vanishing and exploding gradients. LSTMs were introduced in 1997 [HS97]
and are traditionally used in natural language processing [GBC16]. Recently,
LSTM networks have been successfully applied to financial time series fore-
casting [SNN18]. Although new architectures like transformers have impressive
natural language processing and computer vision performance, LSTMs are still
considered state-of-the-art for time series forecasting.
The LSTM is parameterized by the weight W ∈ Rn , which is optimized
using gradient-based methods. While vanilla RNNs only have one hidden state,
LSTMs maintain two hidden states at every time step. One is ht , similar to
the hidden state of vanilla RNNs, and the second is ct , the cell state that gets
kept inside the network. The cell state runs through the LSTM cell with only
minor linear interactions. LSTMs are composed of a cell and four gates which
regulate the flow of information to and from the cell state and hidden state
• Input gate i; decides which values in the cell state to update
• Forget gate f ; decides what to erase from the cell state
• Output gate o; decides how much to output to the hidden state
• Gate gate g; decides how much to write to the cell state
Figure 3.8: LSTM cell from [FFLa].

The output from the gates is defined as

( i )   (  σ   )
( f ) = (  σ   ) W ( ht−1 )        (3.27)
( o )   (  σ   )   (  xt  )
( g )   ( tanh )

where σ is the sigmoid activation function. The cell state ct and hidden state
ht are updated according to the following rules

ct = f ⊙ ct−1 + i ⊙ g (3.28)

ht = o ⊙ tanh (ct ) (3.29)


When the gradient flows backward in the LSTM, it backpropagates from
ct to ct−1 , and there is only elementwise multiplication by the f gate and no
multiplication with the weights. Since the LSTMs backpropagate from the last
hidden state through the cell states backward, it is only exposed to one tanh
nonlinear activation function. Otherwise, the gradient is relatively unimpeded.
Therefore, LSTMs handle long-term dependencies without the problem of ex-
ploding or vanishing gradients.
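
A single forward step of the LSTM cell described by equations (3.27)-(3.29), written directly in NumPy; the assumption that W stacks the four gate blocks row-wise, as well as the sizes used in the example, are choices made purely for illustration.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W):
    # W has shape (4 * hidden, hidden + inputs); rows stacked as [i, f, o, g]
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t])      # pre-activations, eq. (3.27)
    i = sigmoid(z[:hidden])
    f = sigmoid(z[hidden:2 * hidden])
    o = sigmoid(z[2 * hidden:3 * hidden])
    g = np.tanh(z[3 * hidden:])
    c_t = f * c_prev + i * g                   # cell state update, eq. (3.28)
    h_t = o * np.tanh(c_t)                     # hidden state update, eq. (3.29)
    return h_t, c_t

rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(4 * 8, 8 + 3))
h, c = np.zeros(8), np.zeros(8)
h, c = lstm_step(rng.normal(size=3), h, c, W)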

4 Reinforcement learning
An algorithmic trading agent maps observations of some predictor data to mar-
ket positions. This mapping is non-trivial, and as noted by Moody et al.
[MWLS98], accounting for factors such as risk and transaction costs is diffi-
cult in a supervised learning setting. Fortunately, reinforcement learning pro-
vides a convenient framework for optimizing risk- and transaction-cost-sensitive
algorithmic trading agents.
The purpose of this chapter is to introduce the fundamental concepts of
reinforcement learning relevant to this thesis. A more general and compre-
hensive introduction to reinforcement learning can be found in “Reinforcement
Learning: An Introduction” by Richard Sutton and Andrew Barto [SB18]. An
overview of deep reinforcement learning may be found in “Deep Reinforcement
Learning” by Aske Plaat [Pla22]. This chapter begins by introducing reinforce-
ment learning 4.1 and the Markov decision process framework 4.2, and some
foundational reinforcement learning concepts 4.3, 4.4. Section 4.5 discusses how
the concepts introduced in the previous chapter (3) can be combined with re-
inforcement learning to generalize over high-dimensional state spaces. Finally,
section 4.6 introduces policy gradient methods, which allow an agent to optimize
a parameterized policy directly.

4.1 Introduction
Reinforcement Learning (RL) is the machine learning paradigm that studies
how an intelligent agent can learn to make optimal sequential decisions in a
time series environment under stochastic or delayed feedback. It is based on
the concept of learning optimal behavior to solve complex problems by training
in an environment that incorporates the structure of the problem. The agent
optimizes a policy that maps states to actions through reinforcement signals
from the environment in the form of numerical rewards. The goal of using
RL to adjust the parameters of an agent is to maximize the expected reward
generated due to the agent’s actions. This goal is accomplished through trial
and error exploration of the environment. A key challenge of RL is balancing
exploring uncharted territory and exploiting current knowledge, known as the
exploration-exploitation tradeoff. Although it has been studied for many years,
the exploration-exploitation tradeoff remains unsolved. Each action must be
tried multiple times in stochastic environments to get a reliable estimate of its
expected reward. For environments with non-stationary dynamics, the agent
must continuously explore to learn how the distribution changes over time. The
agent-environment interaction in RL is often modeled as a Markov decision
process.

4.2 Markov decision process


A Markov Decision Process (MDP) is a stochastic control process and a classical formalization of sequential decision-making. An MDP is a tuple (S, A, p, R, γ), where
• S is a countable non-empty set of states (state space).
• A is a countable non-empty set of actions (action space)
• p(s′ |s, a) = P r(st+1 = s′ |st = s, at = a) is the transition probability
matrix.
• R ⊂ R is the set of all possible rewards.
• γ ∈ [0, 1] is the discount rate.
The agent interacts with the environment at discrete time steps t = 0, 1, 2, 3, ...,
which are not necessarily fixed intervals of real-time. At each step t, the agent
receives a representation of the state of the environment st ∈ S, where s0 ∈ S
is the initial state drawn from some initial state distribution p0 ∈ ∆(S). Based
on the state st = s, the agent chooses one of the available actions in the current
state at ∈ A(s). After performing the action at , the agent receives an immediate
numerical reward rt+1 ∈ R and the subsequent state representation st+1 ∈ S.
This interaction with a Markov decision process produces a sequence known as
a trajectory: s0 , a0 , r1 , s1 , a1 , r2 , s2 , a2 , r3 , .... This sequence is finite for episodic
tasks (with the termination time usually labeled T ); for continuing tasks, it is
infinite.

Figure 4.1: Agent-environment interaction from [SB18].

The dynamics of the system can be completely described by the one-step
transition function p : S × R × S × A → [0, 1] that is defined as

p(s′ , r|s, a) = P r{st = s′ , rt = r|st−1 = s, at−1 = a} (4.1)

for all s, s′ ∈ S, r ∈ R, and a ∈ A(s). It defines a probability distribution such that

Σ_{s′ ∈S} Σ_{r∈R} p(s′ , r|s, a) = 1    (4.2)
for all s ∈ S, and a ∈ A(s). Note that the one-step transition function depends
only on the current state s and not previous states, i.e., the state has the Markov
property. Essentially, MDPs are Markov chains with actions and rewards. The
transition probabilities p : S × S × A → [0, 1] are defined through the dynamics
function p(s′ , r|s, a), as
p(s′ |s, a) = P r(st = s′ |st−1 = s, at−1 = a) = Σ_{r∈R} p(s′ , r|s, a)    (4.3)

The reward generating function r : S × A → R, is defined through the dynamics
function p(s′ , r|s, a), as
r(s, a) = E[rt+1 |st = s, at = a] = Σ_{r∈R} r Σ_{s′ ∈S} p(s′ , r|s, a)    (4.4)

The reward-generating function determines the expected reward from perform-
ing an action a in a state s. In practice, the dynamics of the system p(s′ , r|s, a)
are seldom known a priori but learned through interacting with the environment.

4.2.1 Infinite Markov decision process


A finite MDP is a Markov decision process with a finite state space |S| < ∞ and a finite action space |A| < ∞. Finite MDPs can be described as tabular and
solved by dynamic programming algorithms with convergence guarantees if the
state and action space dimensions |S × A| are not too large. Unfortunately, the
applicability of these methods is severely limited by the assumption that state-
action spaces are countable. These assumptions must be relaxed for MDPs
to have significant real-world applications. Fortunately, the same theory for
finite MDPs also applies to continuous and countably infinite state-action spaces
under function approximation. The system dynamics are then described by a
transition probability function P instead of a matrix.

4.2.2 Partially observable Markov decision process


Let st be the environment state and sat be the agent state. A Markov deci-
sion process assumes full observability of the environment, i.e., that st = sat .
A Partially Observable Markov Decision Process (POMDP) relaxes this as-
sumption and allows for optimal decision-making in environments that are only
partially observable to the agent. Since they are generalizations of MDPs, all
algorithms used for MDPs are compatible with POMDPs. A POMDP is a tuple
(S, A, O, P, R, Z, γ) that extends a MDP with two additional elements

• O is the observation space.


• Z is the observation function, Z = P r(ot+1 = o|st+1 = s′ , at = a).
An agent in a POMDP cannot directly observe the state; instead, it receives
an observation o ∈ O determined by the observation function Z. The agent
state approximates the environment state sat ≈ st . However, a single observa-
tion o is not a Markovian state signal. Direct mapping between observation and
action is insufficient for optimal behavior, and a memory of past observations
is required. The history of a POMDP is a sequence of actions and observations
ht = {o1 , a1 , ..., ot , at }. The agent state can be defined as the history sat = ht .
However, storing and processing the complete history of every action scales lin-
early with time, both in memory and computation. A more scalable alternative
is a stateful sequential model like a recurrent neural network (RNN). In this
model, the agent state is represented by the network sat = fθ (sat−1 , ot ).
A state can be split into an agent’s internal state and the environment’s
external state. Anything that cannot be changed arbitrarily by the agent is
considered external and, thus, part of the external environment. On the other
hand, the internal data structures of the agent that the agent can change are
part of the internal environment.

4.3 Rewards
The goal of a reinforcement learning agent is to maximize the expected return
E[Gt ], where the return Gt is defined as the sum of rewards

Gt = rt+1 + rt+2 + ... + rT (4.5)

In an episodic setting, where t = 0, ..., T , this goal is trivial to define as the
sequence of rewards is finite. However, some problems, like financial trading, do
not naturally break into subsequences and are known as continuing problems.
For continuing problems, where T = ∞ and there is no terminal state, it is clear
that the sum of rewards Gt could diverge. Discounting was introduced to solve
the problem of returns growing to infinity. Discounted returns are defined as

Gt = Σ_{k=0}^{∞} γ^k rt+k+1 = rt+1 + γGt+1    (4.6)

where γ ∈ [0, 1] is the discount rate used to scale future rewards. Setting γ = 0
suggests that the agent is myopic, i.e., only cares about immediate rewards. As
long as γ < 1 and the reward sequence is bounded, the discounted return Gt
is finite. Discounting allows reinforcement learning to be used in continuing
problems.
Reinforcement signals rt+1 from the environment can be immediate or de-
layed. Games and robot control are typical examples of delayed reward environ-
ments, where an action affects not only the immediate reward but also the next
state and, through that, all subsequent rewards. An example of delayed reward
is when chess players occasionally sacrifice a piece to gain a positional advantage
later in the game. Although sacrificing a piece in isolation is poor, it can still be
optimal long-term. Consequently, temporal credit assignment is a fundamen-
tal challenge in delayed reward environments. AlphaZero [SHS+ 17] surpassed
human-level play in chess in just 24 hours, starting from random play, using
reinforcement learning. Interestingly, AlphaZero seems unusually (by human
standards) open to material sacrifices for long-term positional compensation,
suggesting that the RL algorithm estimates delayed reward better than human
players. Throughout this thesis, financial trading is modeled as a stochastic
immediate reward environment. This choice is justified in chapter 5. Therefore,
the problem reduces to an associative reinforcement learning problem, a specific
instance of the full reinforcement learning problem. It requires generalization
and trial-and-error exploration but not temporal credit assignment. The meth-
ods presented in this chapter will only be those relevant in an immediate reward
environment. Unless otherwise stated, the discount rate γ, a tradeoff between
immediate and delayed rewards, is assumed to be zero, making the agent my-
opic. As a result, the return Gt in an immediate reward environment is defined
as the immediate reward

Gt = Σ_{k=0}^{∞} γ^k rt+k+1 = rt+1    (4.7)

4.4 Value function and policy


A stochastic policy is a mapping from states to a probability distribution over
the action space and is defined as

π : S → ∆(A) (4.8)

The stochastic policy is a probability measure π(a|s) = P r{at = a|st = s},
which is the probability that the agent performs an action a, given that the
current state is s. Stochastic policies can be advantageous in problems with
perceptual aliasing. Furthermore, it handles the exploration-exploitation trade-
off without hard coding it. A deterministic policy maps states S to actions A,
and is defined as
µ:S→A (4.9)
RL algorithms determine how policies are adjusted through experience, where
the goal is for the agent to learn an optimal or near-optimal policy that maxi-
mizes returns.
Value-based RL algorithms, like Q-learning and SARSA, estimate a state-
value function or an action-value function and extract the policy from it. The
state-value function Vπ (s) is the expected return when starting in state s and
following policy π. It is defined ∀s ∈ S as

Vπ (s) = Eπ [Gt |st = s] (4.10)

The action-value function Qπ (s, a) is the expected return when performing ac-
tion a in state s and then following the policy π. It is defined ∀s ∈ S, a ∈ A(s)
as
Qπ (s, a) = Eπ [Gt |st = s, at = a] (4.11)

An example of a value-based policy is the ϵ-greedy policy, defined as

π(a|s, ϵ) = { arg maxa Qπ (s, a)                 with probability 1 − ϵ
            { sample a random action a ∼ A(s)    with probability ϵ        (4.12)
where ϵ ∈ [0, 1] is the exploration rate.
Reinforcement learning algorithms are divided into on-policy and off -policy
algorithms. The same policy that generates the trajectories is being optimized
for on-policy algorithms. In contrast, for off-policy algorithms, the policy gener-
ating trajectories differs from the one being optimized. For off-policy learning,
the exploration can be delegated to an explorative behavioral policy while the
agent optimizes a greedy target policy.

4.5 Function approximation


Tabular reinforcement learning methods include model-based methods like dy-
namic programming as well as model-free methods like Monte-Carlo and temporal-
difference learning methods (e.g., Q-learning and SARSA). Unfortunately, tab-
ular methods require discrete state and action spaces, and due to the curse of
dimensionality, these spaces must be relatively small. Thus, their applicability
to real-world problems is limited. Complex environments like financial trad-
ing cannot be represented in discrete states. Instead, feature vectors represent
states in environments where the state space is too large to enumerate. As
most states will probably never be visited and visiting the same states twice is
unlikely, it is necessary to generalize from previous encounters to states with
similar characteristics. This is where the concepts of function approximation
discussed in the previous chapter (3.2.1) come in. Using function approxima-
tion, samples from the desired function, e.g., the value function, are generalized
to approximate the entire function.
Value-based reinforcement learning algorithms such as Deep Q-Network (DQN)
[MKS+ 13] and Deep Recurrent Q-Network (DRQN) [HS15] use deep neural net-
works as function approximators to generalize over a continuous space that is
optimized using the Q-learning algorithm. DQN and DRQN achieved superhu-
man performance in playing Atari games using raw pixels as input into a con-
volutional neural network that outputs action-value estimates of future returns.
These value-based algorithms are still limited by discrete action space and the
curse of dimensionality as it has to calculate the Q-value of every single action.
In the games DQN and DRQN are tested on, the agent is limited to a small
discrete set of actions (between 4 and 18). However, for many applications, a
discrete action space is severely limiting. Furthermore, these algorithms use the
naive exploration heuristics ϵ-greedy, which is not feasible in critical domains.
Fortunately, policy gradient methods bypass these problems entirely.

4.6 Policy gradient methods


While value-based reinforcement learning algorithms extract a policy from action-
value estimates, Policy Gradient (PG) methods learn a parameterized policy and
optimize it directly. The policy’s parameter vector is θ ∈ Rd , with the policy
defined as
πθ (a|s) = P r{at = a|st = s, θt = θ} (4.13)
Continuous action space is modeled by learning the statistics of the prob-
ability of the action space. A natural policy parameterization in continuous
action spaces is the Gaussian distribution a ∼ N (µθ (s), σθ (s)2 ) defined as
πθ (a|s) = (1 / (σθ (s)√(2π))) exp(−(a − µθ (s))² / (2σθ (s)²))    (4.14)
where µθ (s) ∈ R and σθ (s) ∈ R+ are parametric function approximations of the
mean and standard deviation, respectively. The mean decides the space where
the agent will favor actions, while the standard deviation decides the degree of
exploration. It is important to note that this gives a probability density, not a
probability distribution like the softmax distribution.
For policy gradient methods in the continuous time setting, the goal of opti-
mizing the policy πθ is to find the parameters θ that maximize the average rate
of return per time step [SB18]. The performance measure J for the policy πθ
in the continuing setting is defined in terms of the average rate of reward per
time step as
J(πθ ) = ∫S dπ (s) ∫A r(s, a)πθ (a|s) da ds
       = Es∼dπ ,a∼πθ [r(s, a)]    (4.15)

where dπ (s) = lim_{t→∞} P r{st = s|a0:t ∼ πθ } is the steady-state distribution under the policy πθ .
Policy optimization aims to find the parameters θ that maximize the per-
formance measure J. Gradient ascent is used as the optimization algorithm
for the policy. The policy parameter θ is moved in the direction suggested by
the gradient of J to maximize the return, yielding the following gradient ascent
update
θt+1 = θt + α ∇̂θ J(πθt )    (4.16)

where α is the step-size and ∇̂θ J(πθt ) is a stochastic estimate whose expectation
approximates the gradient of J with respect to θ [SB18].
The policy gradient theorem8 for the continuing case provides the following
expression for the gradient
∇θ J(πθ ) = ∫S dπ (s) ∫A Qπ (s, a)∇θ πθ (a|s) da ds
          = Es∼dπ ,a∼πθ [Qπ (s, a)∇θ log πθ (a|s)]    (4.17)

Even though the steady-state distribution dπ depends on the policy parameters
θ, the gradient of the performance measure does not involve the gradient of dπ ,
8 For the full proof see chapter 13.6 in [SB18]
allowing the agent to simulate paths and update the policy parameter at every
step [SB18].

4.6.1 REINFORCE
REINFORCE is an on-policy direct policy optimization algorithm derived using
the policy gradient theorem [SB18]. The algorithm is on-policy. Consequently,
the agent will encounter the states in the proportions specified by the steady-
state distribution. Using the policy gradient theorem, the calculation of the
policy gradient reduces to a simple expectation. The only problem is estimating
the action-value function Qπ (s, a). REINFORCE solves this problem by using
the sampled return Gt as an unbiased estimate of the action-value function
Qπ (st , at ). Observing that the action-value is equal to the expectation of the
sampled return, i.e., Eπ [Gt |st , at ] = Qπ (st , at ), the following expression for the
policy gradient can be defined

∇θ J(πθ ) = Es∼dπ ,a∼πθ [Qπ (s, a)∇θ log πθ (a|s)]
          = Es∼dπ ,a∼πθ [Gt ∇θ log πθ (a|s)]    (4.18)

This expression can be sampled on each time step t, and its expectation equals
the gradient. The gradient ascent policy parameter update for REINFORCE is
defined as
θt+1 = θt + αGt ∇θ log πθt (at |st ) (4.19)
where α is the step size. The direction of the gradient is in the parameter space
that increases the probability of repeating action at on visits to st in the future
the most [SB18]. The higher the return, the more the agent wants to repeat that
action. The update is inversely proportional to the action probability to adjust
for different frequencies of visits to states, i.e., some states might be visited
often and have an advantage over less visited states.
While REINFORCE is unbiased and only requires estimating the policy, it
might exhibit high variance due to the high variability of sampled returns (if
the trajectory space is large). High variance leads to unstable learning updates
and slower convergence. Furthermore, the stochastic policy used to estimate
the gradient can be disadvantageous in critical domains such as health care or
finance. Thankfully, both these problems can be solved by a class of policy
gradient methods called actor-critic methods.
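
To illustrate the update (4.19) in the immediate-reward setting used in this thesis, the sketch below runs REINFORCE for a Gaussian policy (4.14) with a linear mean and a fixed standard deviation; the toy state distribution, reward function, and hyperparameters are assumptions made purely for the example.

import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(4)            # parameters of the linear mean mu_theta(s) = theta . s
sigma, alpha = 0.5, 0.01       # fixed exploration noise and step size

for _ in range(2000):
    s = rng.normal(size=4)                      # observe state
    mu = theta @ s                              # policy mean
    a = rng.normal(mu, sigma)                   # sample action a ~ N(mu, sigma^2)
    G = -(a - s.sum()) ** 2                     # immediate reward (toy objective)
    grad_log_pi = (a - mu) / sigma**2 * s       # grad_theta log pi_theta(a|s)
    theta = theta + alpha * G * grad_log_pi     # REINFORCE update (4.19)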

4.6.2 Actor-critic
Policy-based reinforcement learning is effective in high-dimensional and con-
tinuous action space, while value-based RL is more sample efficient and more
convenient for online learning. Actor-Critic (AC) methods seek to combine the
best of both worlds where a policy-based actor chooses actions, and the value-
based critic critique those actions. The actor optimizes the policy parameters
using stochastic gradient ascent in the direction suggested by the critic. The
critic’s value function is optimized using stochastic gradient descent to minimize
the loss to the target. This use of a critic introduces bias since the critique is an
approximation of the return and not actual observed returns like in actor-based
algorithms like REINFORCE. There are numerous actor-critic algorithms like
advantage actor-critic (A2C) [SB18], asynchronous advantage actor-critic (A3C)
[MBM+ 16], and proximal policy optimization (PPO) [SWD+ 17], that have ex-
hibited impressive performance in a variety of applications. These methods rely
on stochastic policies and computing the advantage function. For critical do-
mains such as finance, a deterministic policy directly optimized by a learned
action-value function might be more appropriate. Fortunately, the policy gra-
dient framework can be extended to deterministic policies [SLH+ 14, LHP+ 15].
The idea behind deterministic actor-critic algorithms is based on Q-learning,
where a network Q(s, a) approximates the return. Q-learning can be extended
to high-dimensional state spaces by defining the Q-network as a function ap-
proximator Qϕ (s, a) : S × A → R, parameterized by ϕ ∈ Rb . If the Q-network
is optimal (Q∗ϕ ), finding the optimal action (a∗ ) in a small discrete action space
is trivial; a∗ (s) = arg maxa Q∗ϕ (s, a). However, the exhaustive computations re-
quired for this process are not feasible in high-dimensional or continuous action
spaces due to the curse of dimensionality. This problem can be bypassed by
learning a deterministic policy µθ (s) : S → A, parameterized by θ ∈ Rd , as an
approximator to a(s), such that maxa Qϕ (s, a) ≈ Qϕ (s, µ(s)).

Deterministic policy gradient Let µθ : S → A be the deterministic policy
parameterized by θ ∈ Rd . The performance measure J for the deterministic
policy µθ in the continuous time average reward setting is defined as
Z
J(µθ ) = dµ (s)r(s, µθ (s))ds
S
= Es∼dµ [r(s, µθ (s))] (4.20)

Initially, there was a belief that the deterministic policy gradient did not exist;
however, it was proven by Silver et al. [SLH+ 14], which provides the following
expression for the gradient
∇θ J(µθ ) = ∫S dµ (s) ∇θ µθ (s) ∇a Qµ (s, a)|a=µθ (s) ds
          = Es∼dµ [∇θ µθ (s) ∇a Qµ (s, a)|a=µθ (s) ]    (4.21)

The deterministic policy gradient theorem holds for both on-policy and off-
policy methods. Deterministic policies only require integrating over the state
space and not both the state and action space like stochastic policies. The true
action-value can be approximated by a parameterized critic, i.e., Qϕ (s, a) ≈
Qµ (s, a).

Off-policy learning Learning a deterministic policy in continuous action
spaces on-policy will generally not ensure sufficient exploration and can lead
to sub-optimal solutions. To solve this problem, the deterministic actor-critic
algorithm learns off-policy by introducing an exploration policy µ′θ defined as
µ′θ (s) = µθ (s) + W (4.22)
where W is sampled noise from a noise-generating function. The exploration
policy µ′θ explores the environment and generates trajectories that optimize the
target policy µθ and Q-network Qϕ .

Q-network optimization Let Qϕ (s, a) : S × A → R be the Q-network pa-
rameterized by ϕ ∈ Rb . The Q-network is iteratively updated to fit a target
defined by the recursive relationship y = r +γ maxa′ Q(s′ , a′ ) known as the Bell-
man equation [SB18]. The Bellman equation reduces to the immediate reward
in an immediate reward environment, where γ = 0. The goal is to find the
weights ϕ that minimize the loss (usually MSE) to the target
L(Qϕ ) = Es∼dµ′ ,a∼µ′ ,r∼E [(Qϕ (s, a) − y)2 ] (4.23)

where dµ is the steady-state distribution under the exploration policy µ′θ , and
E is the environment. The gradient of the loss function with respect to the
Q-network parameter weights ϕ is defined as
∇ϕ L(Qϕ ) = Es∼dµ′ ,a∼µ′ ,r∼E [(Qϕ (s, a) − y)∇ϕ Qϕ (s, a)] (4.24)
and is used to calculate the backward pass in the Q-network’s stochastic gradient
descent optimization algorithm.
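
The sketch below puts the two updates together for the immediate-reward case (γ = 0), using a linear deterministic actor and a linear critic over hand-crafted quadratic features so that ∇a Qϕ is informative; the features, toy reward, learning rates, and noise scale are all assumptions made for illustration.

import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(3)                            # actor: mu_theta(s) = theta . s
phi = np.zeros(6)                              # critic weights over the features below
alpha_actor, alpha_critic, noise = 0.01, 0.05, 0.3

def critic_features(s, a):
    # Quadratic terms let the linear critic capture how Q changes with a
    return np.array([s[0], s[1], s[2], a, a * s[0], a * a])

for _ in range(5000):
    s = rng.normal(size=3)
    a = theta @ s + noise * rng.normal()       # exploration policy (4.22)
    r = -(a - s[0]) ** 2                       # immediate reward (toy example)
    f = critic_features(s, a)
    phi += alpha_critic * (r - phi @ f) * f    # critic step on the loss (4.23), y = r
    a_mu = theta @ s                           # action under the target policy
    dq_da = phi[3] + phi[4] * s[0] + 2 * phi[5] * a_mu
    theta += alpha_actor * dq_da * s           # deterministic policy gradient (4.21)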

Replay memory Learning policies and Q-networks with large nonlinear func-
tion approximators is generally considered difficult and unstable and do not
come with convergence guarantees. Another challenge of combining deep neural
networks with reinforcement learning is that most ML optimization algorithms
assume that samples are independent and identically distributed (IID). The IID
assumption is rarely valid for RL agents sequentially exploring the state space.
Furthermore, minibatch learning is advantageous as it efficiently utilizes hard-
ware optimization. The introduction of replay memory [MKS+ 13] addresses
these problems and trains large nonlinear function approximators stably and
robustly. A replay memory D = {τt−k+1 , τt−k+2 , ..., τt } is a finite cache storing
the past k transitions τt = (st , at , rt ). A minibatch B ⊆ D of |B| > 0 transitions
are randomly sampled from the replay memory and used to update both the
policy and Q-network.
Randomly sampled batches are ineffective for training recurrent neural net-
works, which carry forward hidden states through the mini-batch. Deep Re-
current Q-Network (DRQN) [HS15] is an extension of DQN for recurrent neu-
ral networks. DRQN uses experience replay like DQN; however, the sampled
batches are in sequential order. The randomly sampled batch B ⊆ D consists
of the transitions B = {τi , τi+1 , ..., τi+|B|−2 , τi+|B|−1 }, where i is some random
starting point for the batch. The RNNs initial hidden state is zeroed at the
start of the mini-batch update but then carries forward through the mini-batch.
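To make the replay-memory mechanics concrete, the following Python sketch (an illustration only, not any particular library's API) implements a finite transition cache that supports both uniformly random sampling, as in DQN-style methods, and DRQN-style sampling of a contiguous window for stateful recurrent networks. The class and method names are assumptions made for this example.

import random
from collections import deque, namedtuple

# A transition in an immediate-reward setting: (state, action, reward).
Transition = namedtuple("Transition", ["state", "action", "reward"])

class ReplayMemory:
    """Finite cache storing the most recent transitions."""

    def __init__(self, capacity: int):
        # Old transitions are evicted automatically once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward):
        self.buffer.append(Transition(state, action, reward))

    def sample(self, batch_size: int):
        """Uniformly random mini-batch; breaks temporal correlation between samples."""
        return random.sample(self.buffer, batch_size)

    def sample_sequence(self, batch_size: int):
        """Contiguous mini-batch in time order, suitable for stateful RNNs (DRQN-style)."""
        start = random.randint(0, len(self.buffer) - batch_size)
        return [self.buffer[i] for i in range(start, start + batch_size)]

    def __len__(self):
        return len(self.buffer)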

Part II
Methodology

5 Problem Setting
In reinforcement learning, the agent learns through interaction with the envi-
ronment. Thus, developing a model of the environment, in this case, the com-
modity market, is necessary to optimize an algorithmic trading agent through
reinforcement signals. Commodities trading involves sequential decision-making
in a stochastic and nonstationary environment to achieve some objective out-
lined by the stakeholder. This chapter describes a discrete-time Markov decision
process that models this environment 9 . Neither the strong assumption of count-
able state-actions nor the assumption of full environment observability can be
satisfied. Thus, based on the previously proposed financial markets dynami-
cal system [ZZR20, Hua18, LXZ+ 18], this chapter presents an infinite partially
observable MDP for commodities trading.

5.1 Assumptions
Since the model will be tested ex-post by backtesting, described in section 2.9,
it is necessary to make a couple of simplifying assumptions about the markets
the agent operates in:

1. No slippage, i.e., there is sufficient liquidity in the market to fill any orders
placed by the agent, regardless of size, at the quoted price. In other
words, someone is always willing to take the opposite side of the agent’s
trade. This assumption relates to external factors that may affect the
price between the time the agent is quoted the price and the time the
order is filled10 .
2. No market impact, i.e., the money invested by the agent is not signifi-
cant enough to move the market. This assumption relates to the agent’s
own trades’ impact on the market. The reasonability of this assumption
depends on the depth of the market.

5.2 Time Discretization


Financial trading involves continuously reallocating capital in one or more fi-
nancial assets. The problem does not naturally break into sub-sequences with
terminal states. Therefore, this MDP is in the continuous-time setting. A dis-
cretization operation is applied to the continuous timeline to study the reinforce-
ment learning-based algorithmic trading described in this thesis, discretizing the
timeline into steps t = 0, 1, 2, .... As described in section 2.8, sampling at fixed
time intervals is unsatisfactory in markets where activity varies throughout the
day and exhibits undesirable statistical properties like non-normality of returns
and heteroskedasticity. Instead of the traditional time-based constant duration
∆t, the observations are sampled as a function of dollar volume based on the
9 Although this thesis focuses on commodities, the model’s general concepts apply to other

financial markets.
10 In reality, prices may significantly change between receiving a quote and placing an order.

ideas from Mandelbrot and Taylor [Man97, MT67], and Clark [Cla73] presented
in section 2.8. Dollar volume-based sampling provides better statistical prop-
erties for the agent and can, without human supervision, adapt to changes in
market activity.
In practice, observations are sampled by sequentially summing the product
of the volume vi and price pi of every trade in the market and then sampling
a new observation once this sum breaches a predefined threshold δ > 0 before
starting the process over again. Define the sum of the total transacted dollar
volume from the past sampled point k to point i as
\chi_i = \sum_{j=k+1}^{i} v_j \cdot p_j    (5.1)

where i ≥ k + 1. Once χi breaches the threshold, i.e., χi > δ, the sub-sampling


scheme samples the trade at time i as a new observation, k = i + 1, and resets
the sum of dollar volume χi+1 = 0.
Due to the increasing volume in the energy futures markets in recent years,
defining an appropriate threshold δ is complicated. On the one hand, the pur-
pose of using this sampling scheme is that the sampling frequency will deviate
throughout the day and weeks depending on the transacted dollar volume. How-
ever, if structural changes in the market significantly alter the transacted dollar
volume over long periods, e.g., three months, it would be advantageous for the
threshold to adjust to that change. A constant threshold will therefore be un-
satisfactory as it would not be reactive to these structural changes over long
periods. A more robust alternative is a threshold that adjusts itself without
human supervision. Therefore, the threshold δ is defined using a simple mov-
ing average over the daily dollar volume of the past 90 days, avoiding lookahead
bugs. The threshold is tuned using one parameter, the target number of samples
per day, defined as tgt ∈ R+ . The threshold δ is defined as
\delta = \frac{\mathrm{SMA}_{90d}(v \cdot p)}{tgt}    (5.2)
which is the threshold needed to achieve the target number of samples per day
in the past 90 days. The threshold continuously updates as trades occur in
the market. There is no guarantee that the threshold will lead to the desired
amount of samples per day, as it is computed from historical data. Nonetheless,
it does achieve satisfactory results, even in unstable markets.
The time discretization scheme presented in this section represents progress
in the research area from fixed time-interval-based sampling, providing better
statistical properties while being more robust and adaptive to changing market
environments.
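As an illustration of the scheme above, the following Python sketch sub-samples a stream of (price, volume) trades into dollar-volume bars. The function and variable names are assumptions, and the 90-day average of daily dollar volume is passed in as a precomputed value and held fixed here for brevity; in the scheme described above it updates continuously as trades occur.

def dollar_volume_bars(trades, daily_dollar_volume_sma90, samples_per_day=5):
    """Sub-sample a trade stream into dollar-volume bars.

    trades: iterable of (price, volume) tuples in time order.
    daily_dollar_volume_sma90: 90-day simple moving average of daily dollar volume,
        computed from data up to the current trade only (no lookahead).
    samples_per_day: target number of samples per day (tgt).
    """
    threshold = daily_dollar_volume_sma90 / samples_per_day  # delta in equation (5.2)
    bars, chi = [], 0.0                                      # chi as in equation (5.1)
    for price, volume in trades:
        chi += price * volume
        if chi > threshold:      # threshold breached: sample a new observation
            bars.append(price)   # only the closing price of the bar is kept here
            chi = 0.0            # reset the accumulated dollar volume
    return bars

# Usage sketch on a synthetic trade stream, targeting 5 samples per day.
synthetic_trades = [(25.0 + 0.01 * i, 100.0) for i in range(1000)]
print(len(dollar_volume_bars(synthetic_trades, daily_dollar_volume_sma90=50_000.0)))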

5.3 State Space


The universe of possible investments is limited to one instrument. The state
space of a financial market includes all market participants and their attitudes

toward the financial instrument. Thus, the state space S is continuous and
partially observable. Representing the environment state st to an algorithmic
trading agent is impossible, so it needs to be approximated by an agent state,
i.e., sat ≈ st . This thesis adopts the philosophy of technical traders described
in section 2.5. It uses past trades, specifically their price and volume, as ob-
servations ot of the environment. Let k ∈ N+ be the number of trades for the
instrument during the period (t − 1, t]. An observation ot at time t is defined
as
ot = [pt , vt ] (5.3)
where
• pt ∈ R^k are the prices of all k trades during the period (t − 1, t]. The opening price is denoted pt .
• vt ∈ R^k are the volumes of all k trades during the period (t − 1, t].

A single observation ot is not a Markovian state signal, and the agent state
can be defined by the entire history ht . However, this alternative is not scal-
able. Section 5.2 introduced the time discretization scheme for this environment,
which is a form of sub-sampling. However, the computational and memory re-
quirements still grow linearly with the number of samples, so a history cut-off
is also employed. In other words, the agent will only have access to the past
n ∈ N+ observations ot−n+1:t . In addition, the recursive mechanism of consider-
ing the past action as a part of the internal state of the environment introduced
by Moody et al. [MWLS98] is adopted to consider transaction costs. The agent
state is formed by concatenating the external state consisting of stacking the
n most recent observations with the internal state consisting of the past action
at−1 , i.e.,
sat = {ot−n+1:t , at−1 } (5.4)
The dimension of the agent state vector is dim (sat ) = 2kn + 1.

5.4 Action Space


At every time step t, the agent can buy or sell the instrument on the market. The
opening price of a period pt , the price the agent can buy or sell the instrument
for at time t, is the last observed price, i.e., the closing price of the previous
period (t − 1, t]. The no slippage assumption from section 5.1 implies that the
instrument can be bought or sold in any quantity for the time t at the opening
price of that period pt .
Some trading environments allow the agent to output the trade directly;
e.g., at = −5 corresponds to selling five contracts, or at = +10 corresponds
to purchasing ten contracts. However, despite its intuitive nature, it can be
problematic because the agent must maintain a continuous record of the number
of contracts it holds and the amount of available balance at all times in order to
avoid making irrational decisions such as selling contracts they do not own or
purchasing more contracts than they can afford. Adding this layer of complexity

complicates the learning process. Instead, a more straightforward approach is
to have the agent output its desired position weight. In this case, a trade is not
directly outputted but inferred from the difference between the agent’s current
position and its chosen next position.
At every step t, the agent performs an action at ∈ [−1, 1], representing the
agent’s position weight during the period (t, t + 1]. The weight represents the
type and size of the position the agent has selected, where
• at > 0 indicates a long position, where the agent bets the price will rise
from time t to time t + 1. The position is proportional to the size of the
weight, where at = 1 indicates that the agent is maximally long.

• at = 0 indicates no position.
• at < 0 indicates a short position, where the agent bets the price will fall.
at = −1 indicates that the agent is maximally short. This thesis assumes
that there is no additional cost or restriction on short-selling.

The trading episode starts and ends (if it ends) with no position, i.e., a0 = aT =
0.
The weight at represents a fraction of the total capital available to the agent
at any time. For this problem formulation, it is irrelevant if at = 1 represents
$1 or $100 million. However, this requires that any fraction of the financial
instrument can be bought and sold. E.g., if the agent has $100 to trade and
wants to take the position at = 0.5, i.e., a long position worth $50, the price
might not be a factor of 50, meaning that the agent would not get the exact
position it selected. The fractional trading assumption is less reasonable the
smaller the amount of capital available to the agent. On the other hand, the
assumptions made in section 5.1 are less reasonable the higher the amount of
capital.

5.5 Reward Function


As noted in section 2.6, the goal of an algorithmic trading agent should not be
to minimize forecast loss but to maximize returns, as it is more in line with the
ultimate goal of the trader. Transaction costs represent a non-trivial expense
that must be accounted for to generalize to real-world markets. Moreover, sec-
tion 2.3 introduced the philosophy of modern portfolio theory, which advocates
maximizing risk-adjusted returns. An advantage of reinforcement learning is
that the trading agent can be directly optimized to maximize returns while con-
sidering transaction costs and risk. This section introduces a reward function
sensitive to transaction costs and risk.
The reward rt is realized at the end of the period (t − 1, t] and includes the
return of the position at−1 held during that interval. The objective of financial
trading is generally to maximize future returns, or in more vernacular terms; to
buy when the price is low and sell when the price is high. The multiplicative

return of a financial instrument at time t is defined as the relative change in
price from time t − 1 to t
y_t = \frac{p_t}{p_{t-1}} - 1    (5.5)
Multiplicative returns, unlike additive returns, have the advantage that they
are insensitive to the size of the capital traded. Logarithmic returns log (yt + 1)
are typically used in algorithmic trading for their symmetric properties [JXL17,
Hua18, ZZW+ 20]. The gross log return realized at time t is

r_t^{gross} = \log(y_t + 1)\, a_{t-1}    (5.6)

At the end of the period (t − 1, t], due to price movements yt in the market, the weight at−1 evolves into

a'_t = \frac{a_{t-1}\, p_t / p_{t-1}}{a_{t-1}\, y_t + 1}    (5.7)

where a′t ∈ R. At the start of the next period t, the agent must rebalance
the portfolio from its current weight a′t to its chosen weight at . As noted in
section 2.1, the subsequent trades resulting from this rebalancing are subject to
transaction costs. The size of the required rebalancing at time t is represented
by ||at − a′t ||. The log-return net of transaction costs at time t is defined as

r_t^{net} = r_t^{gross} - \lambda_c \lVert a_{t-1} - a'_{t-1} \rVert    (5.8)

where λc ∈ [0, 1] is the transaction cost fraction, assumed to be identical for buying and selling.
The log-return net of transaction costs assumes that the trader is risk-
neutral, which is rarely true. The Sharpe ratio is the most common measure of
risk-adjusted return; however, as noted in section 2.6, directly optimizing the
Sharpe ratio might not be optimal. Instead, this thesis adopts the variance over
returns [ZZW+ 20] as a risk term

\sigma^2\!\left( r_i^{net} \,\middle|\, i = t - L + 1, \ldots, t \right) = \sigma_L^2\!\left( r_t^{net} \right)    (5.9)

where L ∈ N+ is the lookback window to calculate the variance of returns. In


this thesis, the lookback window is L = 60. In conclusion, subtracting the risk
term defined in equation 5.9 from the net returns defined in equation 5.8 gives
the risk-adjusted log-return net of transaction costs rt , defined as

r_t = r_t^{net} - \lambda_\sigma\, \sigma_L^2\!\left( r_t^{net} \right)    (5.10)

where λσ ≥ 0 is a risk-sensitivity term that can be considered a trade-off hyper-


parameter for the stochastic gradient descent optimizer. If λσ = 0, the agent is
risk-neutral. The reinforcement learning agents are optimized using the reward
function defined in equation 5.10.
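For concreteness, the reward of equations 5.5 to 5.10 can be computed from a price series and an action sequence roughly as in the following sketch. The helper below is an assumption made for illustration and is not the thesis implementation; cost_frac and risk_lambda correspond to λc and λσ.

import numpy as np

def risk_adjusted_rewards(prices, actions, cost_frac=2e-4, risk_lambda=0.01, lookback=60):
    """Risk-adjusted log-returns net of transaction costs (equations 5.5-5.10).

    prices:  opening prices p_0, ..., p_T.
    actions: position weights a_0, ..., a_{T-1} in [-1, 1].
    """
    rewards, net_history = [], []
    drifted = 0.0  # a'_t, the previous weight after price drift (equation 5.7)
    for t in range(1, len(prices)):
        y = prices[t] / prices[t - 1] - 1.0                            # equation (5.5)
        gross = np.log(y + 1.0) * actions[t - 1]                       # equation (5.6)
        net = gross - cost_frac * abs(actions[t - 1] - drifted)        # equation (5.8)
        net_history.append(net)
        risk = np.var(net_history[-lookback:])                         # equation (5.9)
        rewards.append(net - risk_lambda * risk)                       # equation (5.10)
        drifted = actions[t - 1] * (y + 1.0) / (actions[t - 1] * y + 1.0)  # equation (5.7)
    return np.array(rewards)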

6 Reinforcement learning algorithm
This chapter presents two model-free reinforcement learning algorithms that
solve the trading MDP defined in chapter 5. There are three types of rein-
forcement learning algorithms: critic-based, actor-based, and actor-critic-based.
Despite the popularity of critic-based algorithms, such as Q-learning, they are
unsuitable in this context due to their inability to handle high-dimensional or
continuous action spaces. Actor-based and actor-critic-based methods, known
as policy gradient methods (4.6), are appropriate since they can handle continu-
ous action and state spaces. Furthermore, policy gradient methods are suitable
for continuing tasks like trading. As both actor-based and actor-critic-based
methods have advantages and disadvantages, it remains to be determined which
methodology is most appropriate for this problem. Actor-based RL methods like
REINFORCE are generally successful in stochastic continuous action spaces and
have been applied to both single instrument trading and portfolio optimization
[JXL17, ZZR20]. However, actor-based RL suffers high variance in learning and
tends to be unstable and inconvenient in online learning. Actor-critic methods
like Deep Deterministic Policy Gradient (DDPG) [LHP+ 15] have become popu-
lar lately and have been applied to several RL trading and portfolio optimization
problems [LXZ+ 18, YLZW20]. Deterministic policies can be appropriate for fi-
nancial trading, and off-policy learning combined with replay memory can be
practical for online learning. However, training two neural networks is generally
deemed to be unstable. Thus, the selection of a reinforcement learning algo-
rithm is non-trivial. This chapter presents an actor-based algorithm (6.2) and
an actor-critic-based algorithm (6.3) for solving the trading MDP.

6.1 Immediate reward environment


The zero market impact assumption in chapter 5.1 implies that the agent’s
participation in the market will not affect future prices p. In other words,
the zero market impact assumption implies that the agent's actions will not affect the
future external state of the environment. However, actions performed at the
start of period t affect the transaction costs paid by the agent at the start of
the subsequent period t + 1. The reward rt+1 depends on transaction costs
incurred at time t, and thus the agent’s previous action at−1 will affect the
following action. In this framework, this influence is encapsulated by adopting
the recursive mechanism introduced by Moody et al. [MWLS98] of considering
the past action as a part of the internal state of the environment. Consequently,
large position changes are discouraged.
The goal of the policy gradient agent is to find the policy parameters θ that
maximize the average rate of reward per time step. All rewards are equally
important to the final return through commutativity. Since the agent does not
affect the subsequent state of the environment, the goal is to maximize the
expected immediate reward E[rt+1 ], exactly expressed in equation 5.10 as the
expected cumulative logarithmic return net of transaction costs and the risk-
sensitivity term. Therefore, the action-value of the action at is its immediate

reward rt+1 , i.e.,
Q(st , at ) = rt+1 (6.1)
∀st ∈ S, at ∈ A(st ). As an immediate reward process, the reward function can
be directly optimized by the policy gradient from rewards.
The actor-based direct policy gradient method introduced in section 6.2
optimizes the policy by using the immediate reward directly. In contrast, the
actor-critic method introduced in section 6.3 optimizes the policy using critique
from a Q-network optimized to minimize the loss to the immediate reward.

6.2 Direct policy gradient


The first actor-based reinforcement learning algorithm is a direct policy gradient
method inspired by the REINFORCE algorithm. Instead of computing learned
probabilities for each action, the direct policy gradient method stochastically
samples actions from a Gaussian distribution. Let πθ,ϵ : S → ∆(A) be the

stochastic policy parameterized by the weights θ ∈ Rd . The policy is defined
as a normal probability density over a real-valued scalar action
\pi_{\theta,\epsilon}(a \mid s) = \frac{1}{\epsilon \sqrt{2\pi}}\, e^{-\frac{(a - \mu_\theta(s))^2}{2\epsilon^2}}    (6.2)

where the mean is given by a parametric function approximator µθ (s) : R|s| →


[−1, 1] that depends on the state and outputs an independent mean for the
Gaussian distribution. The standard deviation is given as an exploration rate
ϵ. The exploration rate ϵ ∈ R is positive and decays at λϵ ∈ [0, 1] to encourage
exploration of the action space in early learning epochs. The rate has a minimum
ϵmin ≥ 0 such that ϵ ≥ ϵmin , ∀t. After each episode, the exploration rate updates
according to the following update rule

ϵ ← max (λϵ ϵ, ϵmin ) (6.3)

At every step t, the agent samples an action at ∼ πθ from the policy and clips
the action to the interval [−1, 1].
The novel idea of using the exploration rate ϵ as a controlled, decaying stan-
dard deviation of the stochastic policy represents progress in the research area.
As ϵ approaches 0, the policy becomes more or less deterministic to the mean
given by the parametric function approximation µθ , which is advantageous in
critical domains such as financial trading. However, being a stochastic policy,
the stochastic sampling required for the REINFORCE update is still available,
blending the best of both worlds for an algorithmic trading agent in an imme-
diate reward environment.

Optimization As the model should be compatible with pre-trade training and


online learning, optimization is defined in an online stochastic batch learning

scheme. Trajectories are divided into mini-batches [ts , te ], where ts < te . The
policy’s performance measure on a mini-batch is defined as
" t #
X e

J(πθ,ϵ )[ts ,te ] = Eπθ,ϵ rt (6.4)


t=ts +1

i.e., the expected sum of immediate rewards during the mini-batch [ts , te ] when
following the policy πθ,ϵ . Using the policy gradient theorem, the gradient of the
performance measure J with respect to the parameter weights θ is defined as
" t #
X e

∇θ J(πθ,ϵ )[ts ,te ] = Eπθ,ϵ rt ∇θ log πθ,ϵ (at |st ) (6.5)


t=ts +1

This expectation is empirically estimated from rollouts under πθ,ϵ . The param-
eter weights are updated using a stochastic gradient ascent pass

θ ← θ + α∇θ J(πθ,ϵ )[ts ,te ] (6.6)

Pseudocode The pseudocode for the actor-based algorithm is given in Algo-


rithm 2.
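A minimal PyTorch-style sketch of the mini-batch update in Algorithm 2 is given below. It assumes a policy_net module that maps agent states to the Gaussian mean in [−1, 1] and an already constructed optimizer; both names are assumptions for this illustration, not the exact thesis code.

import torch
from torch.distributions import Normal

def pg_minibatch_update(policy_net, optimizer, states, actions, rewards, epsilon):
    """One REINFORCE-style gradient ascent step on a mini-batch of immediate rewards.

    states:  agent states, shape (batch, ...).
    actions: the actions that were executed in the environment, shape (batch,).
    rewards: the immediate rewards obtained for those actions, shape (batch,).
    epsilon: current exploration rate, used as the Gaussian standard deviation (equation 6.2).
    """
    mu = policy_net(states).squeeze(-1)                # state-dependent Gaussian mean
    log_probs = Normal(mu, epsilon).log_prob(actions)  # log pi(a_t | s_t)
    loss = -(rewards * log_probs).mean()               # negative of the estimate in equation (6.5)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                   # gradient ascent step, equation (6.6)
    return loss.item()

After each episode, the caller would decay the standard deviation, e.g. epsilon = max(0.9 * epsilon, 0.01), mirroring the update rule in equation 6.3.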

6.3 Deterministic actor-critic


There are several different actor-critic algorithms available, like asynchronous
advantage actor-critic (A3C) [MBM+ 16] and proximal policy optimization (PPO)
[SWD+ 17]. However, due to the critical nature of the problem, actor-critic al-
gorithms that utilize deterministic policies are of interest, as stochasticity may
negatively affect the model’s performance. Thus, the second RL algorithm is an
off-policy deterministic actor-critic algorithm inspired by the deep deterministic
policy gradient algorithm [LHP+ 15].

Let µθ : S → A be the deterministic policy parameterized by θ ∈ R^d. This
is the same function approximator used to generate the mean for the Gaussian
action selection in the direct policy gradient algorithm 6.2. However, the func-
tion approximator is now used to map states to actions directly and not as the
mean in a probability distribution over the actions. Instead of optimizing the
policy using the log probabilities of the sampled action scaled by the reward,
the deterministic policy is optimized using a learned action-value critic relying
on the deterministic policy gradient theorem. Let Qϕ (s, a) : S × A → R be the

Q-network critic parameterized by ϕ ∈ Rb . Although exploring the state space
is unnecessary in this problem, as the external environment is not affected by
the agent’s actions, it can be advantageous to explore the action space in the
early stages of learning to provide the agent with training examples from the
entire action space. To ensure the sufficient exploration of the action space, the
algorithm is trained off-policy with an exploration policy µ′θ defined as

µ′θ (s) = µθ (s) + ϵW (6.7)

Algorithm 2 Actor-Based Algorithm for Trading
Input: a differentiable stochastic policy parameterization πθ,ϵ (a|s)
Algorithm parameters: learning rate αθ > 0, mini-batch size b > 0, initial
exploration rate ϵ ≥ 0, exploration decay rate λϵ ∈ [0, 1], exploration minimum
ϵmin ≥ 0
Initialize: empty list B of size b
repeat
Receive initial state of the environment s0 ∈ S
repeat
for t = 0,1,...,T-1 do
Sample action at ∼ πθ,ϵ (·|st )
Execute action at in the environment and observe rt and st+1
Store pair of reward rt and log-probabilities ∇θ ln πθ,ϵ (at |st ) in B
if |B| == b or st is terminal then
Update the policy πθ,ϵ by one step of gradient ascent using:
\nabla_\theta J(\pi_{\theta,\epsilon}) \approx \sum_{B} r_t\, \nabla_\theta \ln \pi_{\theta,\epsilon}(a_t \mid s_t)

Reset B to empty list


end if
end for
until terminal state
Update the exploration rate ϵ = max (ϵλϵ , ϵmin )
until convergence

where W ∼ U[−1,1) is sampled noise from a uniform distribution. The explo-
ration parameters ϵ, ϵmin , λϵ are defined similarly for the direct policy gradient
algorithm 6.2. Clipping agents’ actions to the interval [−1, 1] prevents them
from taking larger positions than their available capital.

Optimization Both the actor and critic networks are updated using ran-
domly sampled mini-batches B from a replay memory D. The replay memory
provides random batches in sequential order for stateful RNNs, and random
batches not in sequential order that minimize correlation between samples for
non-stateful DNNs. The exploration policy µ′θ explores the environment and
generates transitions τ stored in the replay memory D.
The objective function J for the deterministic policy µθ is defined as

J(µθ ) = Es∼B [Qϕ (s, µθ (s))] (6.8)

and its gradient is given as

∇θ J(µθ ) = Es∼B [∇θ µθ (s)∇a Qϕ (s, a)|a=µθ (s) ] (6.9)

Since the environment is an immediate reward environment, the target for


the Q-network updates is the immediate reward, i.e., y = r. MSE loss is used
as a loss function as the outliers are of critical importance to the success of the
trading agent. The loss function L(ϕ) for the Q-network Qϕ is defined as

L(Qϕ ) = Es,a,r∼B [(Qϕ (s, a) − r)2 ] (6.10)

and its gradient is given as

∇ϕ L(Qϕ ) = Es,a,r∼B [(Qϕ (s, a) − r)∇ϕ Qϕ (s, a)] (6.11)
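The two updates can be sketched in PyTorch roughly as follows, assuming actor and critic modules with the interfaces described in chapter 7 and a mini-batch drawn from the replay memory; the function and argument names are assumptions for this illustration.

import torch
import torch.nn.functional as F

def actor_critic_update(actor, critic, actor_opt, critic_opt, batch):
    """One deterministic actor-critic update on a replay-memory mini-batch.

    batch: (states, actions, rewards) tensors sampled from the replay memory D.
    In this immediate-reward setting the critic target is simply r (gamma = 0).
    """
    states, actions, rewards = batch

    # Critic: regress Q(s, a) onto the immediate reward, equations (6.10)-(6.11).
    q_values = critic(states, actions).squeeze(-1)
    critic_loss = F.mse_loss(q_values, rewards)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: ascend the critic's value of the actor's own actions, equations (6.8)-(6.9).
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    return critic_loss.item(), actor_loss.item()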

Pseudocode The pseudocode for the deterministic actor-critic algorithm is


given in Algorithm 3.

Algorithm 3 Actor-Critic Algorithm for Trading
Input: a differentiable deterministic policy parameterization µθ (s)
Input: a differentiable state-action value function parameterization Qϕ (s, a)
Algorithm parameters: learning rates αθ > 0, αϕ > 0, mini-batch size b > 0,
replay memory size d ≥ b, initial exploration rate ϵ ≥ 0, exploration decay rate
λϵ ∈ [0, 1], exploration minimum ϵmin ≥ 0
Initialize empty replay memory cache D
repeat
Receive initial state of the environment s0 ∈ S
for t = 1,...,T do
Select action at = µθ (st ) + ϵW from the exploration policy
Execute at in the environment and observe rt and st+1
Store transition τt = (st , at , rt ) in the replay memory D
Sample a random mini-batch B of |B| transitions τ from D
Update the Q-network by one step of gradient descent using
\nabla_\phi \frac{1}{|B|} \sum_{(s,a,r) \in B} \left( Q_\phi(s, a) - r \right)^2

Update the policy by one step of gradient ascent using


\nabla_\theta \frac{1}{|B|} \sum_{s \in B} Q_\phi(s, \mu_\theta(s))

end for
Update the exploration rate ϵ = max (ϵ · λϵ , ϵmin )
until convergence

7 Network topology
The reinforcement learning algorithms introduced in chapter 6 utilize function
approximation to generalize over a continuous state and action space. Section
2.5 introduced function approximators for extracting predictive patterns from
financial data, where empirical research suggested the superiority of deep learn-
ing methods. Thus, the function approximators introduced in this chapter rely
on deep learning techniques introduced in 3. In the research presented in section
2.5, the function approximators based on convolutional neural networks (3.3.10)
and those based on the LSTM (3.3.11) consistently performed well. Thus, this
section introduces two function approximators based on CNNs and LSTMs, re-
spectively. The sequential information layer, presented in section 7.4, leverages
these techniques to extract predictive patterns from the data. Furthermore,
the decision-making layer that maps forecasts to market positions, presented
in section 7.5, employs the recursive mechanism introduced by Moody et al.
[MWLS98], enabling the agent to consider transaction costs.
The direct policy gradient algorithm presented in chapter 6.2 is an actor-
based RL algorithm that only uses a parameterized policy network. The de-
terministic actor-critic algorithm presented in chapter 6.3 uses a parameterized
policy network and a parameterized critic network. This chapter outlines these
function approximators, which fortunately consist of many of the same compo-
nents. Section 7.2 describes the policy network, while section 7.3 describes the
Q-network. The last section 7.6 describes the optimization and regularization
of the networks.

7.1 Network input


The first step is to specify the input into the networks. Section 5.3 defined the
agent state sat of the partially observable environment. This section describes the

modified version of the agent state sat , which both the policy and critic agents
receive as input. The modified agent state applies two forms of processing to the
network input; the first is extracting the relevant observations from the agent
state, which ensures that the network input is of fixed size, and the second is
normalizing the network input, which is advantageous for the non-linear function
approximators introduced in this chapter. This thesis adopts the philosophy of
technical traders of the price reflecting all necessary information, and therefore
the past price is used to represent the agent state. The primary reason for
selecting the price series alone as the state representation is to examine the
ability of a general deep reinforcement learning model to extract patterns from
raw, relatively unprocessed data. Although additional data sources could aid
the agent in discovering predictive patterns, that is beyond the scope of this
thesis.
Adopting the ideas of Jiang et al. [JXL17], the agent state is down-sampled
by extracting the three most relevant prices from a period; the closing price,
the highest price, and the lowest price. Thus, the price tensor used to represent

the external agent state at time t is defined as
\hat{p}_t = \left[ p_t,\; p_t^{high},\; p_t^{low} \right]    (7.1)

Normalizing input data for neural networks speeds up learning [GBC16] and
is beneficial for reinforcement learning as well [ARS+ 20]. However, normalizing
the whole time series ex-ante is a form of lookahead. The normalization scheme
can only use data up to time ≤ t for the observation pt ∀t. The choice of instru-
ment weights depends on relative log returns rather than absolute price changes.
The price tensor p̂t is normalized using the log-returns from the previous clos-
ing price pt−1 . Additionally, adopting the ideas from Zhang et al. [ZZR20], the
input is further normalized by dividing by a volatility term defined as
\sigma^2\!\left( \log\frac{p_i}{p_{i-1}} \,\middle|\, i = t - L + 1, \ldots, t \right) \sqrt{L} = \sigma_{L,t}^2 \sqrt{L}    (7.2)
where L ∈ N+ is the lookback window to calculate the volatility of the closing
price, which is set to L = 60 similarly as [ZZR20]. The normalized price tensor
at time t is thus defined as
\bar{p}_t = \log\left( \hat{p}_t \oslash p_{t-1} \right) \oslash \left( \sigma_{L,t}^2 \sqrt{L} \right)    (7.3)

As a precaution against outliers in volatile markets, which can be detrimental


to the performance of DNNs, the normalized price tensor p̄t is clipped to the
interval [−1, 1].
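The normalization steps above can be summarized by the following sketch, a simplified illustration of equations 7.1 to 7.3; the helper name, the array layout, and the externally supplied rolling variance are assumptions of this example.

import numpy as np

def normalize_observation(close, high, low, prev_close, sigma_sq_L, L=60):
    """Build the normalized price vector of equation (7.3) for one observation.

    close, high, low: the three prices extracted from the current bar (equation 7.1).
    prev_close:       the previous bar's closing price p_{t-1}.
    sigma_sq_L:       rolling variance of log-returns over the past L bars (equation 7.2),
                      computed from data up to time t only, so there is no lookahead.
    """
    p_hat = np.array([close, high, low])                             # equation (7.1)
    p_bar = np.log(p_hat / prev_close) / (sigma_sq_L * np.sqrt(L))   # equation (7.3)
    return np.clip(p_bar, -1.0, 1.0)                                 # guard against outliers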
Stacking the past n observations produces the approximated environment
state. Thus, the final price tensor is adjusted to contain the n most recent
observations p̄t−n+1:t ∈ R3×n . The stacked price tensor is considered the ex-
ternal agent state and defined as x^S_t = p̄_{t−n+1:t}. The networks also adopt the
recursive mechanism introduced by Moody et al. [MWLS98] of considering the
past action as a part of the internal environment, allowing the agent to take the
effects of transaction costs into account. The instrument weight from the previ-
ous period at−1 is inserted into the final decision-making layer after extracting
the sequential features in the sequential information layer. The modified agent
state thus approximates the state of the environment

sat = (xSt , at−1 ) (7.4)

The policy networks and Q-network receive this modified agent state sat as
input. As an action-value function, the Q-network also takes the current action
at as input.

7.2 Policy network


The deterministic policy 6.3 and the mean-generating parametric function ap-
proximator in the stochastic policy 6.2 are the same function approximator

µθ : R|S| → [−1, 1] parameterized by θ ∈ Rd , and will in this chapter be re-
ferred to as the policy network. The policy network consists of a sequential

information layer, a decision-making layer, and a tanh function. The input to

the policy network is the modified agent state sat . The external part of the agent
S
state xt , i.e., the price tensor of stacked observations, is input into the sequential
information layer. The sequential information layer output is concatenated with
the previous action at−1 to produce input into the decision-making layer. The
output from the decision-making layer maps to a tanh function that produces
the action constrained to the action space [−1, 1].

Figure 7.1: Policy network architecture

7.3 Q-network
The Q-network Qϕ : R|S| × R|A| → R is a function approximator parameterized

by ϕ ∈ Rb . It is an action-value function that assigns the value of performing
a specific action in a specific state and thus takes two arguments, the modified

agent state sat and the action at . Other than that, there are two differences
between the critic and policy networks. Firstly, the Q-network has an additional
layer before the sequential information net that concatenates the agent state sat
and the current action at and maps it through a fully-connected layer into
a leaky-ReLU activation function with negative slope 0.01 and dropout with
probability 0.2. The second difference is that the output after the decision-
making layer does not map to a tanh function since the Q-network outputs
action-values, which are not constrained to any specific interval.

Figure 7.2: Q-network architecture

7.4 Sequential information layer
In essence, an algorithmic trading agent places bets on the relative price change,
or returns, of financial instruments. The agent’s success ultimately depends on
its ability to predict the future. However, doing so in highly competitive and
efficient markets is non-trivial. To remain competitive in continuously evolving
markets, the agent must learn to recognize patterns and generate rules based
on past experiences. The sequential information layer extracts the sequential
features from the input data and is arguably the most integral part of the
model. Let xIt be the input into the sequential information net (for the policy
network xIt = xSt ). The sequential information layer is a parametric function
approximator that takes the input xIt and outputs a feature vector gt , defined
as
f S (xIt ) = gt (7.5)
The choice of the appropriate function approximator for this task is non-
trivial. The inductive bias of the model must align with that of the problem
for the model to generalize effectively. Therefore, selecting a model that cap-
tures the problem’s underlying structure while also being efficient and scalable
is imperative. Research on financial time series forecasting found that deep
learning models, specifically those based on the CNN and LSTM architecture,
consistently outperformed traditional time series forecasting methods such as
the ARIMA and GARCH [XNS15, MRC18, SNN18, SGO20]. The universal ap-
proximation theorem (3.3.8) establishes that there are no theoretical constraints
on feedforward networks’11 expressivity. However, feedforward networks are not
as naturally well-suited to processing sequential data as CNNs and LSTMs.
Therefore, they may not achieve the same level of performance, even though it
is theoretically possible. Additionally, feedforward networks may require signif-
icantly more computing power and memory to achieve the same performance as
CNNs or LSTMs on sequential data. Transformers were also considered due to
their effectiveness in forecasting time series [MSA22]. Transformers employ an
encoder-decoder architecture and rely on attention mechanisms to capture long-
term dependencies. Thus, they do not require a hidden state, like RNNs, and
are relatively easy to parallelize, enabling efficient training on large datasets. A
variant called decision transformers [CLR+ 21] has been applied to offline rein-
forcement learning. However, it is unclear how to apply the transformer in its
conventional encoder-decoder topology to online reinforcement learning. There-
fore, the transformer is, for the moment, unsuitable for this problem. The gated
recurrent unit (GRU) is a newer version of the recurrent neural network that is
less computationally complex than the LSTM. However, LSTMs are generally
considered superior for forecasting financial data [SGO20].
This section defines two distinct DNN topologies for the sequential informa-
tion layer; the first is based on convolutional neural networks, while the second is
based on recurrent neural networks, specifically the LSTM. The two sequential
information topologies both consist of two hidden layers, which is enough for
11 Of arbitrary width or height.

the vast majority of problems. Performance is usually not improved by adding
additional layers.

7.4.1 Convolutional neural network


The CNN-based sequential information layer topology includes two 1-dimensional
convolutional layers. In the absence of established heuristics, determining the
parameters for a CNN can be challenging. Thus, the parameters chosen for
these layers are partly informed by research on CNNs in financial time series
forecasting [SGO20] and partly determined through experimentation. The first
convolutional layer has kernel size 3 and stride 1 and processes each of the 3
columns in the input xIt as separate channels of size 1×n, where n is the number
of stacked observations. It outputs 32 feature maps of size 1 × (n − 2). The second convolutional layer has kernel size 3 and stride 1 and outputs 32 feature maps of size 1 × (n − 4). Batch norm is used after both convolutional layers on the fea-
ture maps to stabilize and speed up learning. The CNN uses the Leaky-ReLU
activation function with a negative slope of 0.01 after the batch norm layers to
generate the activation maps. Dropout with probability p = 0.2 is used between
the layers. Max pooling with kernel size 2 and stride 2 is applied after the final
convolutional layer to down-sample the output before all activation maps are
concatenated into one big activation map.

Figure 7.3: Convolutional sequential information layer architecture
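A PyTorch sketch of this convolutional sequential information layer is given below, with the layer sizes described above; the module name, the forward signature, and the channel ordering are assumptions of this illustration.

import torch
import torch.nn as nn

class ConvSequentialLayer(nn.Module):
    """CNN-based sequential information layer with two 1-D convolutions."""

    def __init__(self, n_obs: int = 20):
        super().__init__()
        self.net = nn.Sequential(
            # 3 input channels: closing, highest, and lowest prices over n stacked observations.
            nn.Conv1d(3, 32, kernel_size=3, stride=1),   # -> 32 feature maps of length n - 2
            nn.BatchNorm1d(32),
            nn.LeakyReLU(0.01),
            nn.Dropout(p=0.2),
            nn.Conv1d(32, 32, kernel_size=3, stride=1),  # -> 32 feature maps of length n - 4
            nn.BatchNorm1d(32),
            nn.LeakyReLU(0.01),
            nn.MaxPool1d(kernel_size=2, stride=2),       # down-sample to length (n - 4) // 2
            nn.Flatten(),                                # concatenate all activation maps
        )
        self.out_features = 32 * ((n_obs - 4) // 2)

    def forward(self, x):
        # x: (batch, 3, n_obs)
        return self.net(x)

# Usage sketch: for n = 20 stacked observations the output feature vector has 256 elements.
features = ConvSequentialLayer(n_obs=20)(torch.randn(8, 3, 20))  # -> shape (8, 256)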

7.4.2 Long Short-Term Memory


The second sequential information network topology introduces memory through
a recurrent neural network to solve the partially observable MDP. Long Short-
term Memory (LSTM) is the go-to solution for environments where memory
is required and are good at modeling noisy nonstationary data. The LSTM
sequential information net architecture consists of two stacked LSTM layers.
Both LSTM layers have 128 units in the hidden state, which was chosen ex-
perimentally. Following both LSTM layers, the network employs dropout with
dropout-probability p = 0.2. The LSTM cell contains three sigmoid functions
and one hyperbolic tangent function, so inserting an activation function after
the LSTM layer is superfluous. Batchnorm is incompatible with RNNs, as the

recurrent part of the network is not considered when computing the normaliza-
tion statistic and is, therefore, not used.

Figure 7.4: LSTM sequential information layer architecture
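A corresponding PyTorch sketch of the LSTM-based layer, with two stacked layers of 128 hidden units and dropout 0.2, follows; the module name and input layout are assumptions of this illustration.

import torch
import torch.nn as nn

class LSTMSequentialLayer(nn.Module):
    """LSTM-based sequential information layer with two stacked layers of 128 units."""

    def __init__(self, n_features: int = 3, hidden_size: int = 128):
        super().__init__()
        # Dropout is applied between the two stacked LSTM layers ...
        self.lstm = nn.LSTM(input_size=n_features, hidden_size=hidden_size,
                            num_layers=2, batch_first=True, dropout=0.2)
        # ... and once more on the extracted features.
        self.dropout = nn.Dropout(p=0.2)

    def forward(self, x, hidden=None):
        # x: (batch, n_obs, n_features); hidden carries the LSTM state between calls if desired.
        out, hidden = self.lstm(x, hidden)
        features = self.dropout(out[:, -1, :])  # features from the last time step
        return features, hidden

# Usage sketch: 20 stacked observations with 3 features each yield a 128-dimensional feature vector.
features, _ = LSTMSequentialLayer()(torch.randn(8, 20, 3))  # -> shape (8, 128)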

7.5 Decision-making layer


In the decision-making layer, the previous action at−1 is concatenated into the
features gt , i.e., the output from the sequential information layer, and mapped
to a single output value. The previous action weight at−1 allows the agent to
consider transaction costs when making trading decisions (policy network) or
when assigning value to actions in states (Q-network). The mapping comprises
a fully-connected layer between the features and the previous action to a single
output value. This output value is the action (or Gaussian mean) for the policy
(after mapping it to a tanh function) or the action-value for the Q-network. The
input to the decision-making layer is defined as

x^D_t = (g_t, a_{t-1})    (7.6)

The decision-making layer is a dot product between a weight vector w^D ∈ \mathbb{R}^{|x^D_t|} and the input x^D_t, defined as

f_D(x^D_t) = (w^D)^\top x^D_t    (7.7)
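As a sketch only, the decision-making layer together with the previous action and the tanh head used by the policy network might look as follows; the module name is an assumption, and nn.Linear adds a bias term for convenience, whereas equation 7.7 is a pure dot product.

import torch
import torch.nn as nn

class DecisionMakingLayer(nn.Module):
    """Maps the features g_t and the previous action a_{t-1} to a single output value."""

    def __init__(self, n_features: int, squash: bool = True):
        super().__init__()
        self.linear = nn.Linear(n_features + 1, 1)  # +1 for the previous action, equation (7.6)
        self.squash = squash                        # tanh head for the policy network only

    def forward(self, features, prev_action):
        x = torch.cat([features, prev_action.unsqueeze(-1)], dim=-1)  # equation (7.6)
        out = self.linear(x).squeeze(-1)                              # equation (7.7)
        return torch.tanh(out) if self.squash else out

# Usage sketch: a batch of 8 feature vectors of size 256 and the previous actions.
action = DecisionMakingLayer(256)(torch.randn(8, 256), torch.zeros(8))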

7.6 Network optimization


The weights θ and ϕ of the policy network µθ and Q-network Qϕ are initialized using Kaiming initialization, which keeps the initial weights appropriately scaled with small variance; this helps prevent the network from getting stuck in local minima and improves generalization. Additionally, unlike conventional initialization schemes, Kaiming initialization accounts for the type of activation function used in the network.
The weight initialization scheme centers the initial output distribution of the
networks around 0 with a small standard deviation regardless of the input. The
weights are updated using the Adam stochastic gradient descent algorithm on
mini-batches; Adam adapts per-parameter learning rates and typically converges faster and more stably than plain SGD. The gradient norm for each mini-batch

is clipped to 1 to prevent exploding gradients. There are many potential ac-
tivation functions for neural networks, including the default recommendation,
the ReLU. To combat the “dying ReLU problem”, the leaky-ReLU activation
function is used in the networks. The negative slope, or the “leak”, is set to the
standard value of 0.01.
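The optimization setup described here can be sketched as follows; this is an illustration using values given in this chapter (the weight decay constant is introduced in section 7.6.1) rather than the exact training script, and only linear and convolutional layers are re-initialized.

import torch
import torch.nn as nn

def configure_network(model: nn.Module, learning_rate: float = 1e-4):
    """Kaiming-initialize leaky-ReLU layers and build an Adam optimizer with weight decay."""
    for module in model.modules():
        if isinstance(module, (nn.Linear, nn.Conv1d)):
            nn.init.kaiming_normal_(module.weight, a=0.01, nonlinearity="leaky_relu")
            if module.bias is not None:
                nn.init.zeros_(module.bias)
    return torch.optim.Adam(model.parameters(), lr=learning_rate, weight_decay=0.001)

def optimization_step(model: nn.Module, optimizer, loss):
    """One gradient step with the mini-batch gradient norm clipped to 1."""
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()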

7.6.1 Regularization
Machine learning research generally focuses on problems with complex struc-
tures and high signal-to-noise ratios, such as image classification. For these
problems, complicated non-linear models like neural nets have demonstrated
their effectiveness. However, in a high-noise environment such as financial fore-
casting, where the R-squared is often of order 10−4 [Isi21], anything beyond
linear regression poses a significant overfitting risk. An overfitted network will
likely perform well on the training set but generalize poorly out-of-sample. An
algorithmic trading agent that performs well on the training set is of little use,
and it is imperative to reduce the generalization error. Therefore, regulariza-
tion is needed to mitigate the risk of overfitting and reduce the generalization
error. A description of the regularization techniques used for these networks is
provided in section 3.3.6.
For ML models to be generalizable, they must learn data patterns rather
than individual data points to identify a bigger picture agnostic of noisy details.
Regularization techniques such as weight decay limit the capacity of the net-
works by adding a parameter norm penalty to the objective function. Weight
decay uses the L2 norm; other norms, such as the L1 norm, can also be used.
The L2 norm is appropriate since it punishes outliers harsher and is easier to op-
timize with gradient-based methods. The parameter λwd controls the degree of
penalization, balancing the tradeoff between increased bias and decreased vari-
ance. The network optimizer introduced in this section uses weight decay with
the constant parameter λwd = 0.001 to mitigate the network’s overfitting risk.
Experimentally, this value delivered the optimal balance for the bias-variance
tradeoff. Although increasing the weight decay penalty could further reduce
overfitting risk, this was too restrictive for the networks.
It is important to note that weight decay reduces, but does not eliminate, the
risk of overfitting. Dropout is another explicit regularization technique almost
universally used in deep neural networks. Dropout forces the network to learn
multiple independent data representations, resulting in a more robust model.
When training networks on noisy financial data, dropout effectively ensures
the network ignores the noise. Similarly to weight decay, the dropout rate is a
tradeoff. There is no established heuristic for choosing the dropout rate; instead,
it is usually chosen through experimentation. In this case, a dropout rate of
0.2 provided a suitable regularizing effect where the model generalized well and
produced accurate predictions. Dropout is used between all hidden layers in the
networks.
Although explicit regularizers such as weight decay and dropout reduce over-
fitting risk, it remains tricky to determine the optimal training duration. This

challenge is addressed with early stopping, which functions as an implicit regu-
larizer. The networks are trained in an early stopping scheme, with testing on
the validation set every 10th epoch. As reinforcement learning involves random
exploration, the models are tested slightly less frequently than conventional to
prevent premature stopping.

Part III
Experiments

8 Experiment and Results
Experiments play a vital role in science and provide the basis for scientific knowl-
edge. This chapter presents the experiments and results where the methods
presented in part II are tested on historical market data using the backtesting
framework described in section 2.9. The backtest requires simplifying market
assumptions, specified in chapter 5. Section 8.1 details the experiment setting.
The results of the experiment are presented and discussed in sections 8.2 and
8.3. Finally, the overall approach is discussed in section 8.4. The experiment
aims to answer the research questions posed at the start of this thesis.
1. Can the risk of algorithmic trading agents operating in volatile markets
be controlled?

2. What reinforcement learning algorithms are suitable for optimizing an


algorithmic training agent in an online, continuous time setting?
3. What deep learning architectures are suitable for modeling noisy, non-
stationary financial data?

8.1 Materials and Methods


Chapter 6 described two reinforcement learning algorithms to solve the com-
modity trading problem; the direct policy gradient (PG) and the deterministic
actor-critic (AC). Chapter 7 described two sequential information layers, one
based on CNN architecture and the other LSTM-based. Both the actor and
critic in the actor-critic algorithm are modeled using the same architecture. In
total, that leaves four combinations which are specified below with their respec-
tive abbreviations
• PG-CNN: Direct policy gradient algorithm where the policy network is
modeled using the CNN-based sequential information layer.

• PG-LSTM: Direct policy gradient algorithm where the policy network


is modeled using the LSTM-based sequential information layer.
• AC-CNN: Deterministic actor-critic algorithm where the policy and Q-
network are modeled using the CNN-based sequential information layer.

• AC-LSTM: Deterministic actor-critic algorithm where the policy and Q-


network are modeled using the LSTM-based sequential information layer.

8.1.1 Baselines
Defining a baseline can be helpful when evaluating the performance of the meth-
ods presented in part II. A challenge with testing algorithmic trading agents is
the lack of established baselines. However, the by far most common alternative
is the buy-and-hold baseline [MWLS98][ZZR20][ZZW+ 20]. The buy-and-hold

baseline consists of buying and holding an instrument throughout the experi-
ment, i.e., at = 1, ∀t. Compared to a naive buy-and-hold baseline, an intelligent
agent actively trading a market should be able to extract excess value and reduce
risk.

8.1.2 Hyperparameters
Table 1 shows the hyperparameters used in this experiment. The learning rates
for the policy network and Q-network are denoted αactor and αcritic , respec-
tively, and were tuned experimentally. |B| is the batch size, and |D| is the
replay memory size. Large batch sizes are necessary to obtain reliable gradient
estimates. However, large batch sizes also result in less frequent updates to
the agent and updates that may contain outdated information. As a result of
this tradeoff, the batch and replay memory sizes used in this experiment were
selected as appropriate values. The transaction cost fraction λc is set to a rea-
sonable value that reflects actual market conditions. The initial exploration rate
is denoted ϵ, with decay rate λϵ and minimum ϵmin . The number of stacked
past observations is given by n, considered a reasonable value for the agent to
use for short-term market prediction.

Model αactor αcritic |B| |D| λc ϵ λϵ ϵmin n


PG 0.0001 - 128 - 0.0002 1 0.9 0.01 20
AC 0.0001 0.001 128 1000 0.0002 1 0.9 0.01 20

Table 1: Hyperparameters

8.1.3 Training scheme


The dataset is split into three parts; a training set, a validation set, and a test
set, in fractions of 1/4, 1/4, and 1/2, respectively. The RL agents train on
the training set, the first 1/4 of the dataset, and then validate on the valida-
tion set, the next 1/4. Early stopping is used, with testing every 10th epoch.
The early stopping frequency is low because the RL agents exhibit randomness
and stochasticity, especially in the early epochs. Setting a high early stopping
frequency can cause premature convergence. The weight initialization scheme,
described in section 7.6, causes the initial action distribution of the policy to be
centered around 0 with a small standard deviation. However, the agents learn
faster when exploring the edge values of the state space in the early stages.
Exploration of the action space is controlled, for both RL algorithms, by the
three exploration parameters ϵ, λϵ , ϵmin . During training, the exploration rate
starts at ϵ = 1 and decays at λϵ = 0.9 per episode to a minimum exploration
rate of ϵmin = 0.01.
When the agent has finished training, it tests once out-of-sample on the test
set, the last 1/2 of the dataset. Leaving half of the dataset for final testing en-
sures the test set is sufficiently large to evaluate the trading agents. Exploring
the action space is no longer necessary after initial training. Therefore, ϵ = 0
for the out-of-sample test. According to the optimization strategies specified in

their respective pseudocodes 2 and 3, the RL agents continuously refit them-
selves as they observe transitions. The results section (8.2) presents results from
these backtests.

8.1.4 Performance metrics


The objective of the algorithmic trading agent is described by modern portfolio
theory (2.3) of maximizing risk-adjusted returns, usually represented by the
Sharpe ratio. Thus, the Sharpe ratio defined in equation 2.1 will be the primary
performance metric for the backtest. The reward function defined in equation
5.10 is not a comparable performance measure to related work. Instead, the
standard method for assessing performance by linear returns net of transaction
costs is adopted. In a backtest, the agent interacts with the environment and
generates a sequence of actions {a0 , a1 , ..., aT −1 , aT }. The linear net return after
R_T = \prod_{t=1}^{T} \left[ (y_t \cdot a_{t-1}) + 1 - \lambda_c \lVert a'_{t-1} - a_{t-1} \rVert \right]    (8.1)
t=1

where yt , at , a′t , λc are defined in chapter 5. The return RT is used to calculate


the Sharpe ratio. As there is randomness in the models, either through stochas-
tic action selection or random mini-batch sampling, along with random weight
initialization, performance is averaged over 10 runs. In addition to the Sharpe
ratio, additional performance metrics can help paint a clearer picture of the per-
formance of an algorithmic trading agent. This thesis adopts some of the perfor-
mance metrics most frequently found in related work [JXL17, ZZW+ 20, ZZR20].
The performance metrics used in this thesis are defined as
1. E[R]: the annualized expected rate of linear trade returns.
2. Std(R): the annualized standard deviation of linear trade returns.
3. Sharpe: a measure of risk-adjusted returns defined in equation 2.1. The
risk-free rate is assumed to be zero, and the annualized Sharpe ratio is
thus E[R]/Std(R).
4. MDD: Maximum Drawdown (MDD), the maximum observed loss from
any peak.
5. Hit: the rate of positive trade returns.
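For reference, these metrics can be computed from a sequence of per-trade linear returns roughly as in the following sketch; the annualization factor, taken here as the target of five trades per day over roughly 252 trading days, is an assumption of this illustration.

import numpy as np

def performance_metrics(trade_returns, trades_per_year=252 * 5):
    """Annualized performance metrics from linear per-trade returns net of costs (cf. equation 8.1)."""
    r = np.asarray(trade_returns)
    mean_r = r.mean() * trades_per_year           # E[R], annualized expected return
    std_r = r.std() * np.sqrt(trades_per_year)    # Std(R), annualized standard deviation
    equity = np.cumprod(1.0 + r)                  # cumulative wealth curve
    drawdown = 1.0 - equity / np.maximum.accumulate(equity)
    return {
        "E[R]": mean_r,
        "Std(R)": std_r,
        "Sharpe": mean_r / std_r,                 # risk-free rate assumed to be zero
        "MDD": drawdown.max(),                    # maximum observed loss from any peak
        "Hit": float((r > 0).mean()),             # rate of positive trade returns
    }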

8.1.5 Dataset
The dataset consists of the front-month TTF Natural Gas Futures contracts
from 2011 to 2022. The observations are sampled according to the transacted
euro volume on the exchange, defined in section 5.2. Larger sample sizes are de-
sirable to ensure statistical significance, especially for highly overparameterized
approximators such as neural networks. In addition, predictability is generally

higher over shorter horizons [Isi21]. However, as sampling frequency (and there-
fore trading frequency) increases, simplifying assumptions, such as no impact
and perfect liquidity, become increasingly consequential. Thus, an appropriate
target number of samples per day is tgt = 5, which provides a little over 20 000
total samples. The data processing is limited to what is described in section
7.1.
The first quarter of the dataset consisting of trades from 01/01/2011 to
01/01/2014 makes up the training set. The validation set is the second quarter
of the dataset from 01/01/2014 to 01/01/2017. Finally, the test set is the second
half of the dataset from 01/01/2017 to 01/01/2023. Figure 8.1 illustrates the
training-validation-test split.

Figure 8.1: The training-validation-test split

8.2 Results
This section presents the results of the experiments described in the previous
section. The models are tested using four different values of the risk-sensitivity
term λσ (0, 0.01, 0.1, and 0.2), and the results of all four values are presented.
The results are visualized using three tools; a table and two types of plots, and
they are briefly described below
• The table consists of the performance metrics (described in section 8.1.4)
of each model (described in section 8.1) from the backtests.
• A standard line plot illustrates the performance of the models against the
baseline indexed in time of the cumulative product of logarithmic trade
returns, where the trade returns are defined in equation 8.1.
• A boxplot illustrates the distribution of the monthly logarithmic returns12
of each model and the baseline. Boxplots summarize a distribution by
its sampled median, the first quartile (Q1), and the third quartile (Q3), represented by the box. The upper whisker extends to the largest observed value within Q3 + (3/2) IQR, and the lower whisker extends to the smallest observed value within Q1 − (3/2) IQR, where the interquartile range (IQR) is
Q3 − Q1 . Dots represent all values outside of the whiskers (outliers).
The plots display the performance of all models and the baseline and are grouped
by risk-sensitivity terms.
12 Again, trade returns are defined in equation 8.1 and resampled to produce monthly values.

The logarithmic monthly returns are then calculated based on these values.

Table 2 below shows the results of the backtests averaged over 10 runs. The
variation between runs was small enough to warrant the level of precision of the
results given in the table.

E[R] Std(R) Sharpe MDD Hit


Buy & Hold 0.271 0.721 0.376 0.877 0.524
λσ = 0
PG-CNN 0.403 0.558 0.722 0.753 0.529
PG-LSTM 0.297 0.502 0.591 0.726 0.527
AC-CNN 0.302 0.610 0.495 0.724 0.538
AC-LSTM 0.226 0.694 0.325 0.637 0.541
Average 0.307 0.591 0.533 0.710 0.534
λσ = 0.01
PG-CNN 0.401 0.437 0.918 0.665 0.537
PG-LSTM 0.258 0.326 0.791 0.540 0.526
AC-CNN 0.346 0.471 0.735 0.601 0.545
AC-LSTM 0.251 0.300 0.837 0.443 0.535
Average 0.314 0.383 0.820 0.562 0.536
λσ = 0.1
PG-CNN 0.371 0.356 1.042 0.591 0.537
PG-LSTM 0.235 0.264 0.890 0.373 0.524
AC-CNN 0.091 0.239 0.380 0.392 0.539
AC-LSTM 0.110 0.190 0.579 0.261 0.525
Average 0.202 0.262 0.723 0.404 0.531
λσ = 0.2
PG-CNN 0.243 0.298 0.815 0.410 0.533
PG-LSTM 0.179 0.247 0.725 0.373 0.537
AC-CNN 0.136 0.198 0.687 0.454 0.531
AC-LSTM 0.114 0.229 0.498 0.341 0.522
Average 0.168 0.243 0.681 0.394 0.531

Table 2: Backtest results

A pair of plots (line plot and boxplot) are grouped by risk-term values (0,
0.01, 0.1, and 0.2, respectively).

Figure 8.2: Cumulative logarithmic trade returns for λσ = 0

Figure 8.3: Boxplot of monthly logarithmic trade returns for λσ = 0

Figure 8.4: Cumulative logarithmic trade returns for λσ = 0.01

Figure 8.5: Boxplot of monthly logarithmic trade returns for λσ = 0.01

Figure 8.6: Cumulative logarithmic trade returns for λσ = 0.1

Figure 8.7: Boxplot of monthly logarithmic trade returns for λσ = 0.1

Figure 8.8: Cumulative logarithmic trade returns for λσ = 0.2

Figure 8.9: Boxplot of monthly logarithmic trade returns for λσ = 0.2

8.3 Discussion of results
This section will discuss the experiment results and how they relate to the three
research questions posed at the start of this thesis.

8.3.1 Risk/reward
The first research question posed at the start of this thesis was:

Can the risk of algorithmic trading agents operating in volatile mar-


kets be controlled?
Note some general observations about the buy-and-hold baseline against which
the models are compared. The baseline mirrors the direction of the natural gas
market during the out-of-sample test. Due to the volatility and upwards price
pressure stemming from the energy crisis in 2021-2022, the buy-and-hold base-
line has a high annualized expected return but also a high annualized standard
deviation of returns. As a result, the Sharpe ratio, the primary performance
metric in this experiment, is relatively low. The primary goal of the deep re-
inforcement learning models is to increase the Sharpe ratio. Increasing the
Sharpe ratio is achieved by increasing the expected returns or decreasing the
standard deviation. The reward function 5.10 defines this trade-off, where the
risk-sensitivity term λσ functions as a trade-off hyperparameter. Low values
of λσ make the agent more risk-neutral, i.e., more concerned with increasing
expected returns and less concerned with decreasing the standard deviation of
returns. Conversely, high values of λσ make the agent more risk-averse, i.e.,
more concerned with decreasing the standard deviation of returns and less con-
cerned with increasing the expected returns. The experiments in this thesis use
four risk-sensitivity terms; 0, 0.01, 0.1, and 0.2. For λσ = 0, the agent is risk-
neutral and only concerned with maximizing the expected return. For values
exceeding 0.1, the agent becomes so risk-averse that it hardly participates in
the market. This trade-off is evident in the results where the annualized ex-
pected return and the standard deviation are on average 83% and 143% higher,
respectively, for λσ = 0 compared to λσ = 0.2. The boxplots (figures 8.3, 8.5,
8.7, 8.9) illustrate how the monthly logarithmic returns are more concentrated
as λσ increase. This phenomenon is also observed in the line plots (figures 8.2,
8.4, 8.6, 8.8), where it is apparent that the variability of returns decreases as
λσ increases. The agent’s action, i.e., the position size (and direction), is its
only means to affect the reward. By definition, a risk-averse agent will choose
outcomes with low uncertainty, so a preference for smaller positions is a natural
consequence of increasing risk sensitivity. Consequently, the risk-averse deep
RL agents have a significantly lower maximum drawdown than the baseline,
but they do not fully capitalize on the increasing prices from mid-2020 to 2023.
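To make this trade-off concrete, the following is a minimal sketch of a mean-variance-style reward of the kind described here (the exact form is given by equation 5.10; the function name, the use of a fixed window of trade returns, and the example numbers are illustrative assumptions):

import numpy as np

def risk_adjusted_reward(trade_returns, lambda_sigma):
    # Sketch: mean logarithmic trade return penalized by the standard
    # deviation of those returns, weighted by the risk-sensitivity term.
    r = np.asarray(trade_returns)
    return r.mean() - lambda_sigma * r.std()

# lambda_sigma = 0 gives a risk-neutral objective; larger values increasingly
# favor low-dispersion outcomes over high-mean outcomes.
window = [0.012, -0.004, 0.021, -0.015, 0.007]
for lam in (0.0, 0.01, 0.1, 0.2):
    print(lam, risk_adjusted_reward(window, lam))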
Averaged over all models and risk-sensitivity terms, the deep RL agents
produce an 83% higher Sharpe ratio than the baseline. This gain comes primarily
from the standard deviation of returns, which is reduced by 49%, whereas the
return is only slightly reduced, by 8%. Looking at the various risk-sensitivity terms, the
risk-neutral agents (i.e., those where λσ = 0), on average, increase the returns
by 13% compared to the baseline. Although they have no risk punishment,
they also decrease the standard deviation of returns by 18%. This last point is
surprising but could be a byproduct of an intelligent agent trying to maximize
returns. For λσ = 0.01, the deep RL agents, on average, produce 16% increased
returns and 47% reduced standard deviation of returns compared to the baseline.
This combination results in a 118% higher Sharpe ratio. For λσ = 0.1, the
agents on average produce 25% lower returns; however, the standard deviation
of returns is reduced by a larger 64%. Thus, the Sharpe ratio is increased by 92%
compared to the baseline. The most risk-averse agents (i.e., those where λσ =
0.2) on average produce 38% lower returns with 66% lower standard deviation
of returns, yielding an 83% increase in Sharpe compared to the baseline. The
risk-sensitivity term λσ = 0.01 produces the highest Sharpe ratio on average.
Thus, the backtests suggest that of the four risk-sensitivity options tested in
this thesis, λσ = 0.01 strikes the best risk/reward balance.
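Since these percentages follow mechanically from how the Sharpe ratio is computed from a return series, a minimal sketch is included for reference (the zero risk-free rate and the annualization factor are assumptions that depend on the backtest conventions and the sampling frequency):

import numpy as np

def annualized_sharpe(returns, periods_per_year=252):
    # Annualized Sharpe ratio of per-period returns, with the risk-free
    # rate assumed to be zero as is common in backtests.
    r = np.asarray(returns)
    return np.sqrt(periods_per_year) * r.mean() / r.std()

Halving the standard deviation of returns doubles the ratio just as doubling the mean return would, which is why the risk-averse agents can give up a substantial share of the baseline's return and still improve on its Sharpe ratio.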

8.3.2 RL models
The second research question posed at the start of this thesis was:
What reinforcement learning algorithms are suitable for optimizing
an algorithmic trading agent in an online, continuous time setting?
A curious result from the experiment is that, for three out of four risk-sensitivity
terms (λσ = 0, 0.01, and 0.1), the model with the highest hit-rate has the lowest Sharpe. In other
words, the model making the highest rate of profitable trades also produces the
lowest risk-adjusted returns. This result illustrates the complexity of trading
financial markets and justifies the methods chosen in this thesis. Firstly, there
is no guarantee that a higher percentage of correct directional calls will result in
higher returns. Therefore, a forecast-based supervised learning approach opti-
mized for making correct directional calls may not align with the stakeholder’s
ultimate goal of maximizing risk-adjusted returns. For an algorithmic trad-
ing agent to achieve the desired results, e.g., making trades that maximize the
Sharpe ratio, it should be optimized directly for that objective. However, doing
so in a supervised learning setting is not straightforward. Reinforcement learn-
ing, on the other hand, provides a convenient framework for learning optimal
sequential behavior under uncertainty. Furthermore, discrete position sizing, a
drawback of value-based reinforcement learning, can expose the agent to high
risks. However, the agent can size positions based on confidence through the
continuous action space offered by policy gradient methods, allowing for more
effective risk management. Section 2.6 presented research arguing that, for al-
gorithmic trading, reinforcement learning is superior to supervised learning and
policy gradient methods are superior to value-based methods, and this result
supports those arguments.
Although previous research supported policy gradient methods, there was no
consensus on which one was superior in this context. Chapter 6 presented two
policy gradient methods: one based on an actor-only framework and the other
based on an actor-critic framework, and discussed their respective advantages
and disadvantages. The previous section (8.2) presented results from the back-
tests where both algorithms were tested out-of-sample. For all risk-sensitivity
terms, the direct policy gradient algorithm, on average, produces a 49% higher
Sharpe ratio than the deterministic actor-critic algorithm. Comparing the two
algorithms using the same network architecture and risk sensitivity term reveals
that the actor-based algorithm outperforms the actor-critic-based algorithm in
7 out of 8 combinations. The only case where the actor-critic-based algorithm
performs better (PG-LSTM vs. AC-LSTM for λσ = 0.01) is the case with the smallest
performance gap. Furthermore, for both network architectures, the Sharpe ratio of
the actor-only direct policy gradient method increases monotonically with the
risk-sensitivity parameter λσ up to a maximum at λσ = 0.1. The actor-critic method does not follow this pattern,
suggesting it fails to achieve its optimization objective.
The performance gap between the actor-based and actor-critic-based algo-
rithms is significant enough to warrant a discussion. An explanation for the
performance gap could be that the actor-critic-based algorithm optimizes the
policy using a biased Q-network reward estimate instead of the observed unbi-
ased reward. As a data-generating process, the commodity market is complex
and non-stationary. If the Q-network closely models the data-generating distri-
bution, using reward estimates from sampled experience from a replay memory
is an efficient method for optimizing the policy. On the other hand, it is also
clear that a policy that is optimized using Q-network reward estimates that are
inaccurate will adversely affect performance. The direct policy gradient algo-
rithm optimizes the policy using the observed unbiased reward and avoids this
problem altogether. Given that the reward function is exactly expressed, opti-
mizing it directly, as the direct policy gradient method does, is the most efficient
approach. Many typical RL tasks work well with the actor-critic framework, but
the backtests indicate that financial trading is not one of them.
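To make the distinction concrete, the sketch below contrasts the two update styles in PyTorch-style code. It is a simplification under assumed interfaces (a differentiable reward function and a deterministic actor), not the exact algorithms specified in chapter 6:

def actor_only_step(policy, optimizer, state, reward_fn):
    # Direct policy gradient: the observed reward is a differentiable
    # function of the action, so the policy ascends the unbiased reward.
    action = policy(state)                 # position in [-1, 1]
    loss = -reward_fn(action)              # maximize the observed reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def actor_critic_step(actor, critic, actor_optimizer, state):
    # Deterministic actor-critic: the actor ascends the critic's estimate
    # of the reward, i.e., a learned and potentially biased Q-value.
    action = actor(state)
    loss = -critic(state, action).mean()
    actor_optimizer.zero_grad()
    loss.backward()
    actor_optimizer.step()

The structural difference is what the actor's gradient flows through: the realized reward in the first case and the Q-network in the second, which is where the bias discussed above enters.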

8.3.3 Networks
The third and final research question posed at the start of this thesis was:
What deep learning architectures are suitable for modeling noisy,
non-stationary financial data?

In the research presented in section 2.5, two types of deep learning architectures
stood out; the long short-term memory and the convolutional neural network.
Chapter 7 presented two types of parametric function approximators based on
the CNN- and LSTM-architecture, respectively. The previous section (8.2) pre-
sented results from the backtests where both these function approximators are
tested out-of-sample. On average, the CNN-based models produce over 5%
higher Sharpe than those based on the LSTM, which is surprising, as LSTMs
are generally viewed as superior in sequence modeling and, due to their memory,
are the preferred option when modeling POMDPs. In contrast to the CNN, the
LSTM can handle long-term dependencies, but it seems the lookback window
provides enough historical information for the CNN to make trade decisions.
However, the performance gap is not large enough to draw firm conclusions,
and the LSTM outperforms the CNN in some tests, so it is unclear which is
most suitable.
One interesting observation is that the CNN-based models produce higher
returns and standard deviation of returns compared to the LSTM. On average,
the CNN-based models produce 37% higher returns and 15% higher standard
deviation of returns. The line plots in figures 8.2, 8.4, 8.6, and 8.8 suggest
a possible explanation: the LSTM-based models prefer smaller position sizes
than the CNN-based models. One potential reason
for this phenomenon involves the difference in how the CNN and LSTM are
optimized. Generally speaking, the CNN-based model is far easier and quicker
to optimize than the LSTM-based model, partly due to batch norm, which in
its conventional form is incompatible with RNNs. Another reason is that when
the LSTM is trained for long sequences, the problem of vanishing gradients
makes back-propagating the error difficult and slow. Increasing the learning
rate leads to exploding gradients. The CNN-based model with batch norm
quickly and effectively adjusts its parameters to take full advantage of newly
observed information during out-of-sample tests. The LSTM-based model, on
the other hand, adjusts its parameters much more slowly. As a result, the actions it
selects often end up near the middle of the action space, causing smaller
position sizes, lower returns, and lower standard deviation of returns. For that
reason, the author of this thesis theorizes that the performance gap between the
CNN-based and LSTM-based models would increase with time.
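For reference, a minimal sketch of the two kinds of function approximators discussed here is given below. The layer sizes, the single tanh output for a single-instrument position, and the class names are illustrative assumptions rather than the exact architectures of chapter 7:

import torch.nn as nn

class CNNPolicy(nn.Module):
    def __init__(self, n_features):
        super().__init__()
        # Convolutions over the lookback window, with batch normalization.
        self.net = nn.Sequential(
            nn.Conv1d(n_features, 32, kernel_size=3), nn.BatchNorm1d(32), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=3), nn.BatchNorm1d(32), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(1), nn.Tanh())   # position in [-1, 1]

    def forward(self, x):   # x: (batch, n_features, lookback)
        return self.net(x)

class LSTMPolicy(nn.Module):
    def __init__(self, n_features, hidden_size=32):
        super().__init__()
        # Recurrence over the lookback window; no batch normalization.
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden_size, 1), nn.Tanh())

    def forward(self, x):   # x: (batch, lookback, n_features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])   # act on the last hidden state

The contrast between the batch-normalized convolutional stack and the plain recurrent stack mirrors the optimization differences described above.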

8.4 Discussion of model


Following the discussion of the results, it is worth taking a step back for a more
general discussion of the model, including its strengths and weaknesses as well
as its potential applications and limitations.

8.4.1 Environment
Solving complex real-world problems with reinforcement learning generally re-
quires creating a simplified version of the problem that lends itself to analytical
tractability. Usually, this involves removing some of the frictions and constraints
of the real-world problem. In the context of financial trading, the environment
described in chapter 5 makes several simplifying assumptions, including no
market impact, no slippage, the ability to purchase or sell any number of
contracts at the exact quoted price, no additional costs or restrictions on
short-selling, and fractional trading. It is imperative to note that these
assumptions do not necessarily reflect real-world conditions. As such, it is
crucial to know the problem formulation’s limitations and how they negatively
affect the model’s generalizability to the real-world problem. Poorly designed
environments, where agents learn to exploit design flaws rather than the ac-
tual problem, are a frequent problem in reinforcement learning [SB18]. At the
same time, these simplifying assumptions allow for a clean theoretical analysis
of the problem. Furthermore, the environment introduces some friction through
transaction costs, an improvement over many existing models.
Lookahead bias in the input data is avoided by using the price series alone
as input, as described in section 7.1. The price series of a financial instrument
is generally the most reliable predictor of future prices. However, price series
only provide a limited view of the market and do not consider the broader
economic context and the potential impact of external factors. As a result,
an agent relying solely on price series may miss out on meaningful predictive
signals. Furthermore, since the model learns online, an effective data gover-
nance strategy is required to ensure the quality and integrity of the real-time
input data stream, as data quality issues can harm the model’s performance.
The dollar bars sampling scheme described in section 5.2 has solid theoretical
foundations for improving the statistical properties of the sub-sampled price
series compared to traditional time-based sampling. When using this sampling
scheme, however, the agent cannot be certain of the prediction horizon, which
complicates forecasting.
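As an illustration, a minimal sketch of dollar-bar construction is given below; the pandas-based implementation and the column names are assumptions made for the example, while the scheme actually used is defined in section 5.2:

import pandas as pd

def dollar_bars(ticks: pd.DataFrame, bar_size: float) -> pd.DataFrame:
    # Group ticks into bars that each contain roughly `bar_size` of traded
    # dollar value (price * volume), rather than a fixed span of time.
    dollars = (ticks["price"] * ticks["volume"]).cumsum()
    bar_id = (dollars // bar_size).astype(int)
    return ticks.groupby(bar_id).agg(
        open=("price", "first"), high=("price", "max"),
        low=("price", "min"), close=("price", "last"),
        volume=("volume", "sum"))

Because a bar closes only once enough value has traded, bars arrive quickly in volatile periods and slowly in quiet ones, which is exactly why the prediction horizon in wall-clock time becomes uncertain.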
Commodity trading firms often conduct asset-backed trading in addition
to paper trading, which incurs additional costs for booking pipeline capacity,
storage capacity, or LNG tankers. The model does not currently include these
costs, but the environment could be adjusted to include them.

8.4.2 Optimization
Statistical learning relies on an underlying joint feature-target distribution F(x, y),
with non-vanishing mutual information. The algorithmic trading agent approx-
imates this function by learning the distribution through historical data. As
financial markets are nonstationary, statistical distributions constantly change
over time, partly because market participants learn the market dynamics and
adjust their trading accordingly. In order to remain relevant for the near future,
the model must be continuously refitted using only data from the immediate
past, at the expense of statistical significance. On the other hand, training a
complex model using only a relatively small set of recent data is challenging in a
high-noise setting such as financial forecasting, often resulting in poor predictive
power. This tradeoff between timeliness and statistical significance is known as
the timeliness-significance tradeoff [Isi21]. The timeliness-significance tradeoff
highlights a central challenge in optimizing algorithmic trading models.
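A minimal sketch of the walk-forward refitting this tradeoff refers to is shown below, assuming a generic fit/predict interface; shrinking the window keeps the model timely at the cost of fitting on a smaller sample:

def walk_forward(model, series, window, step):
    # Repeatedly refit on only the most recent `window` observations,
    # then predict the next `step` observations out-of-sample.
    for t in range(window, len(series) - step + 1, step):
        model.fit(series[t - window:t])         # only the immediate past
        yield model.predict(series[t:t + step])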
This thesis investigates the use of reinforcement learning in algorithmic trad-
ing, a field traditionally dominated by supervised learning-based approaches.
Supervised learning is a straightforward method for easily labeled tasks, such
as forecasting financial prices. Reinforcement learning, on the other hand, is
better suited to complex problems, such as sizing positions, managing risk and
transaction costs. In fact, with only minor modifications, the model outlined in
this thesis can optimize an agent trading a portfolio of arbitrary size. For this
reason, reinforcement learning was chosen as the algorithmic trading agents’ op-
timization framework. Temporal credit assignment is one of the main strengths
of reinforcement learning, making it ideal for game playing and robotics, in-
volving long-term planning and delayed reward. In this problem, however, tem-
poral credit assignment does not apply since trading is an immediate reward
process. Furthermore, the complexity of reinforcement learning compared to
supervised learning comes at a price. As well as requiring a model of the en-
vironment in which the agent interacts and learning to associate context with
reward-maximizing actions, reinforcement learning adds further complexity
through the exploration-exploitation tradeoff. With more moving parts,
reinforcement learning can be significantly more challenging to implement and
tune and is generally less sample efficient than supervised learning. The learn-
ing process is more convoluted as the agent learns through reinforcement signals
generated by interaction with the environment and involves stochasticity. Con-
sequently, the model can display unstable behavior where the policy diverges, or
the agent overfits to noise. E.g., if the market experiences a sustained downward
trend, the agent can be deceived into believing that the market will continue to
decline indefinitely. As a result, the agent may adjust its policy to always short
the market, which will have disastrous effects once the market reverses. The
phenomenon is caused by the temporal correlation of sequential interactions
between RL agents and the market, and that reinforcement learning is sample
inefficient, making it difficult to obtain good gradient estimates. Replay memory
can ensure that gradient estimates are derived from a wide variety of market con-
ditions. However, replay memory introduces biased gradient estimates, which,
according to backtests, is a poor tradeoff. The timeliness-significance tradeoff
further complicates this problem of obtaining suitable gradient estimates. A
supervised learning framework is more straightforward and avoids much of the
complexity associated with reinforcement learning. Thus, it is unclear whether
reinforcement learning or supervised learning is the most appropriate optimiza-
tion framework for algorithmic trading.

8.4.3 Interpretability and trust


Interpretability is the ability to determine the cause and effect of a model. Due
to their over-parameterized nature, deep neural networks possess a remarkable
representation capacity, enabling them to solve a wide range of complex machine
learning problems, but at the cost of being difficult to interpret. Neural net-
works are black boxes, acceptable for low-risk commercial applications such as
movie recommendations and advertisements. Interpreting these predictions is
of little concern as long as the models have sufficient predictive power. However,
deep learning’s opaque nature prevents its adoption in critical applications, as
the failure of a commodity trading model could result in substantial financial
losses and threaten global energy security. Setting aside the fact that people are
generally risk-averse and unwilling to bet large sums of money on a black box, is
the aversion to applying deep learning in critical domains such as commodity trading
reasonable? Does understanding the model even matter as long as it delivers
satisfactory backtest performance?
This question can be answered by reviewing statistical learning theory. Gen-
erally, machine learning models are tested under the assumption that observa-
tions are drawn from the same underlying distribution, the data-generating dis-
tribution, and that observations are IID. In this setting, the test error serves as
a proxy for the generalization error, i.e., the expected error on new observations.
However, the dynamics of the financial markets are constantly changing. Fierce
competition in financial markets creates a cycle in which market participants
attempt to understand the underlying price dynamics. As market participants
better understand market dynamics, they adjust their trading strategies to ex-
ploit that knowledge, further changing market dynamics. Due to the constantly
changing dynamics of the market, models that worked in the past may no longer
work in the future as inefficiencies are arbitraged away (volatility, for example,
used to be a reliable indicator of future returns [Isi21]). Therefore, it is impor-
tant to be cautious when interpreting backtest errors as generalization errors,
as it is unlikely that observations sampled at different points in time are drawn
from the same probability distribution. Even if one disregards all the flaws
of a backtest (e.g., not accounting for market impact and lookahead bias), the
backtest, at best, only reflects performance on historical
data. In no way is this intended to discourage backtesting. However, naively
interpreting backtest performance as an assurance of future performance is dan-
gerous. Referring back to section 2.9: a backtest is a historical simulation of how
the model would have performed should it have been run over a past period.
Even exceptional results from the most flawlessly executed backtest can never
guarantee that the model generalizes to the current market. Furthermore, the
results should be interpreted cautiously if no ex-ante logical foundation exists
to explain them. Deep neural networks are highly susceptible to overfitting to
random noise when trained on noisy financial time series. It is, however, difficult
to determine if the agent has detected a legitimate signal if the model is not
interpretable. Even if the model detects a legitimate signal in the backtests,
other market participants may discover the same signal and render the model
obsolete. Again, determining this is difficult without knowing what inefficiencies
the model exploits, and deploying it until it displays a sustained period of poor
performance will be costly.
In response to the question of whether or not understanding a model matters
if it performs well on a backtest, the answer is an emphatic yes. Blindly taking
backtest performance of a black box as an assurance of future performance in
a noisy and constantly changing environment can prove costly. Thus, the aver-
sion to adopting deep learning in algorithmic trading is reasonable. Ensuring
that the trading decisions are explainable and the models are interpretable is
essential for commercial and regulatory acceptance. To address this challenge,
models should be created with a certain degree of interpretability. This way,
stakeholders can get insight into which inefficiencies the model exploits, evalu-
ate its generalizability, and identify its obsolescence before incurring significant
losses. The use of deep learning in algorithmic trading can still be viable with
techniques such as explainable AI and model monitoring.

9 Future work
The methods presented in this thesis leave room for improvement in further
work. More investigation should be done to evaluate the effectiveness of exist-
ing methods in different contexts. Further investigation will enable a deeper
understanding of the model and its generalizability and provide an opportunity
to identify potential areas for improvement. Considering the lack of real-world
market data, one option is to use generative adversarial networks (GANs) to
generate synthetic markets [YJVdS19]. GANs can generate unlimited data,
which can be used to train and test the model and its generalizability. Addi-
tionally, the lack of established baselines could be improved upon. While the
buy-and-hold baseline is well understood and trusted, it is unrealistic in this
context, as futures contracts expire. Although it presents its own challenges,
developing a baseline more appropriate for futures trading would improve the
current model. Furthermore, a greater level of interpretability is required to
achieve real-world adoption. Therefore, combining algorithmic trading research
with explainable AI is imperative to improve existing methods’ interpretability.
Incorporating non-traditional data sources, such as social media sentiment or
satellite images, may prove beneficial when forecasting market returns. Alterna-
tive data can provide a more comprehensive and holistic view of market trends
and dynamics, allowing for more accurate predictions. By leveraging alternative
data, algorithmic trading agents can gain an edge over their competitors and
make better-informed decisions. Using deep learning techniques such as natural
language processing and computer vision to analyze text or image data in an al-
gorithmic trading context is promising. Neural networks are generally effective
in problems with complex structures and high signal-to-noise ratios. Thus, it
may be more appropriate to use deep learning to extract features from images
or text rather than analyzing price series.
Lastly, the methods presented in this thesis are limited to trading a single
instrument. They are, however, compatible with portfolio optimization with
minimal modifications. Further research in this area would be interesting, as
it better utilizes the potential of the reinforcement learning framework and the
scalability of data-driven decision-making.

10 Conclusion
This thesis investigates the effectiveness of deep reinforcement learning methods
in commodities trading. Previous research in algorithmic trading, state-of-the-
art reinforcement learning, and deep learning algorithms was examined, and the
most promising methods were implemented and tested. This chapter summa-
rizes the thesis’ most important contributions, results, and conclusions.
This thesis formalizes the commodities trading problem as a continuing
discrete-time stochastic dynamical system. The system employs a novel time-
discretization scheme that is reactive and adaptive to market volatility, provid-
ing better statistical properties of the sub-sampled financial time series. Two
policy gradient algorithms, an actor-based and an actor-critic-based, are pro-
posed to optimize a transaction-cost- and risk-sensitive agent. Reinforcement
learning agents parameterized using deep neural networks, specifically CNNs
and LSTMs, are used to map observations of historical prices to market posi-
tions.
The models are backtested on the front month TTF Natural Gas futures
contracts from 01-01-2017 to 01-01-2023. The backtest results indicate the vi-
ability of deep reinforcement learning agents in commodities trading. On av-
erage, the deep reinforcement learning agents produce an 83% higher Sharpe
ratio out-of-sample than the buy-and-hold baseline. The backtests suggest that
deep RL models can adapt to the unprecedented volatility caused by the en-
ergy crisis during 2021-2022. Introducing a risk-sensitivity term functioning as
a trade-off hyperparameter between risk and reward produces satisfactory re-
sults, where the agents reduce risk as the risk-sensitivity term increases. The
risk-sensitivity term allows the stakeholder to control the risk of an algorithmic
trading agent in volatile markets. The direct policy gradient algorithm produces
significantly higher Sharpe (49% on average) than the deterministic actor-critic
algorithm, suggesting that an actor-based policy gradient method is more suited
to algorithmic trading in an online, continuous time setting. The parametric
function approximator based on the CNN architecture performs slightly better
(5% higher Sharpe on average) than the LSTM, possibly due to the problem of
vanishing gradients for the LSTM.
The algorithmic trading problem is made analytically tractable by simpli-
fying assumptions that remove market frictions. Performance may be inflated
due to these assumptions and should be viewed with a high degree of caution.

Acronyms
AC Actor-Critic. 41
AMH Adaptive Market Hypothesis. 7
ANN Artificial Neural Network. 19

CNN Convolutional Neural Network. 29

CV Cross-Validation. 12

DL Deep Learning. 29
DNN Deep Neural Network. 19
DQN Deep Q-Network. 39
DRQN Deep Recurrent Q-Network. 39

EMH Efficient Market Hypothesis. 7

FFN Feedforward Network. 19

IID Independent and Identically Distributed. 12

LSTM Long Short-Term Memory. 32

MDD Maximum Drawdown. 67

MDP Markov Decision Process. 34
ML Machine Learning. 16
MPT Modern Portfolio Theory. 6
MSE Mean Squared Error. 9

PG Policy Gradient. 39
POMDP Partially Observable Markov Decision Process. 36

ReLU Rectified Linear Unit. 25

RL Reinforcement Learning. 34
RNN Recurrent Neural Network. 30

SGD Stochastic Gradient Descent. 22

SL Supervised Learning. 17

References
[AAC+ 19] Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz
Litwin, Bob McGrew, Arthur Petron, Alex Paino, Matthias
Plappert, Glenn Powell, Raphael Ribas, et al. Solving rubik’s
cube with a robot hand. arXiv preprint arXiv:1910.07113, 2019.
[AG00] Thierry Ané and Hélyette Geman. Order flow, transaction
clock, and normality of asset returns. The Journal of Finance,
55(5):2259–2284, 2000.
[AHM19] Rob Arnott, Campbell R Harvey, and Harry Markowitz. A back-
testing protocol in the era of machine learning. The Journal of
Financial Data Science, 1(1):64–74, 2019.

[ARS+ 20] Marcin Andrychowicz, Anton Raichuk, Piotr Stańczyk, Manu
Orsini, Sertan Girgin, Raphael Marinier, Léonard Hussenot,
Matthieu Geist, Olivier Pietquin, Marcin Michalski, et al. What
matters in on-policy reinforcement learning? a large-scale em-
pirical study. arXiv preprint arXiv:2006.05990, 2020.

[BHMM19] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal.
Reconciling modern machine-learning practice and the classical
bias–variance trade-off. Proceedings of the National Academy of
Sciences, 116(32):15849–15854, 2019.
[BSF94] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning
long-term dependencies with gradient descent is difficult. IEEE
transactions on neural networks, 5(2):157–166, 1994.
[CGLL19] Raymond H Chan, Yves ZY Guo, Spike T Lee, and Xun Li.
Financial Mathematics, Derivatives and Structured Products.
Springer, 2019.

[Cla73] Peter K Clark. A subordinated stochastic process model with
finite variance for speculative prices. Econometrica: journal of
the Econometric Society, pages 135–155, 1973.
[CLR+ 21] Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya
Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor
Mordatch. Decision transformer: Reinforcement learning via
sequence modeling. Advances in neural information processing
systems, 34:15084–15097, 2021.
[Cyb89] George Cybenko. Approximation by superpositions of a sig-
moidal function. Mathematics of control, signals and systems,
2(4):303–314, 1989.
[DBK+ 16] Yue Deng, Feng Bao, Youyong Kong, Zhiquan Ren, and Qiong-
hai Dai. Deep direct reinforcement learning for financial signal
representation and trading. IEEE transactions on neural net-
works and learning systems, 28(3):653–664, 2016.
[DHS11] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgra-
dient methods for online learning and stochastic optimization.
Journal of machine learning research, 12(7), 2011.
[DP18] Marcos Lopez De Prado. Advances in financial machine learn-
ing. John Wiley & Sons, 2018.

[eur] A European Green Deal. https://commission.europa.eu/strategy-and-policy/priorities-2019-2024/european-green-deal_en.
[Fam70] Eugene F Fama. Efficient capital markets: A review of the-
ory and empirical work. The journal of Finance, 25(2):383–417,
1970.
[FFLa] Fei-Fei Li, Ranjay Krishna, and Danfei Xu. Lecture 10: Recurrent neural networks. http://cs231n.stanford.edu/slides/2020/lecture_10.pdf.
[FFLb] Fei-Fei Li, Ranjay Krishna, and Danfei Xu. Lecture 4: Neural networks and backpropagation. http://cs231n.stanford.edu/slides/2020/lecture_4.pdf.
[FFLc] Fei-Fei Li, Ranjay Krishna, and Danfei Xu. Lecture 5: Convolutional neural networks. http://cs231n.stanford.edu/slides/2020/lecture_5.pdf.

[Fis18] Thomas G Fischer. Reinforcement learning in financial markets-
a survey. Technical report, FAU Discussion Papers in Economics,
2018.
[GBC16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep
learning. MIT press, 2016.

[GD34] Benjamin Graham and David Dodd. Security analysis. McGraw-
Hill Book Co., 1934.
[HA18] Rob J Hyndman and George Athanasopoulos. Forecasting: prin-
ciples and practice. OTexts, 2018.

[Has07] Joel Hasbrouck. Empirical market microstructure: The institu-
tions, economics, and econometrics of securities trading. Oxford
University Press, 2007.
[HGMS18] Ma Hiransha, E Ab Gopalakrishnan, Vijay Krishna Menon, and
KP Soman. Nse stock market prediction using deep-learning
models. Procedia computer science, 132:1351–1362, 2018.
[Hor91] Kurt Hornik. Approximation capabilities of multilayer feedfor-
ward networks. Neural networks, 4(2):251–257, 1991.

[HS97] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term
memory. Neural computation, 9(8):1735–1780, 1997.
[HS15] Matthew Hausknecht and Peter Stone. Deep recurrent q-learning
for partially observable mdps. In 2015 aaai fall symposium se-
ries, 2015.
[HTFF09] Trevor Hastie, Robert Tibshirani, and Jerome H Friedman. The elements of statistical learning: data mining, inference, and prediction, volume 2. Springer, 2009.
[Hua18] Chien Yi Huang. Financial trading as a game: A deep rein-
forcement learning approach. arXiv preprint arXiv:1807.02787,
2018.
[HZRS15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delv-
ing deep into rectifiers: Surpassing human-level performance on
imagenet classification. In Proceedings of the IEEE international
conference on computer vision, pages 1026–1034, 2015.
[Int20] Mordor Intelligence. Algorithmic trading market—growth,
trends, covid-19 impact, and forecasts (2021-2026), 2020.
[IS15] Sergey Ioffe and Christian Szegedy. Batch normalization: Ac-
celerating deep network training by reducing internal covariate
shift. In International conference on machine learning, pages
448–456. PMLR, 2015.
[Isi21] M Isichenko. Quantitative portfolio management: The art and
science of statistical arbitrage, 2021.

[JES16] Olivier Jin and Hamza El-Saawy. Portfolio management using
reinforcement learning. Stanford University, 2016.
[JXL17] Zhengyao Jiang, Dixing Xu, and Jinjun Liang. A deep reinforce-
ment learning framework for the financial portfolio management
problem. arXiv preprint arXiv:1706.10059, 2017.

[KB14] Diederik P Kingma and Jimmy Ba. Adam: A method for
stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[LB14] Michael Lewis and Dylan Baker. Flash boys. Simon & Schuster
Audio, 2014.
[LHP+ 15] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nico-
las Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wier-
stra. Continuous control with deep reinforcement learning. arXiv
preprint arXiv:1509.02971, 2015.
[Lo04] Andrew W Lo. The adaptive markets hypothesis. The Journal
of Portfolio Management, 30(5):15–29, 2004.
[LPW+ 17] Zhou Lu, Hongming Pu, Feicheng Wang, Zhiqiang Hu, and Liwei
Wang. The expressive power of neural networks: A view from
the width. Advances in neural information processing systems,
30, 2017.
[LXZ+ 18] Xiao-Yang Liu, Zhuoran Xiong, Shan Zhong, Hongyang Yang,
and Anwar Walid. Practical deep reinforcement learning ap-
proach for stock trading. arXiv preprint arXiv:1811.07522, 2018.
[Man97] Benoit B Mandelbrot. The variation of certain speculative prices.
In Fractals and scaling in finance, pages 371–418. Springer, 1997.
[Mar68] Harry M Markowitz. Portfolio selection. In Portfolio selection.
Yale university press, 1968.
[MBM+ 16] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza,
Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and
Koray Kavukcuoglu. Asynchronous methods for deep reinforce-
ment learning. In International conference on machine learning,
pages 1928–1937. PMLR, 2016.
[MKS+ 13] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex
Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Ried-
miller. Playing atari with deep reinforcement learning. arXiv
preprint arXiv:1312.5602, 2013.
[MM97] Tom M Mitchell. Machine learning, volume 1. McGraw-Hill, New York, 1997.
[MRC18] Sean McNally, Jason Roche, and Simon Caton. Predicting the
price of bitcoin using machine learning. In 2018 26th euromi-
cro international conference on parallel, distributed and network-
based processing (PDP), pages 339–343. IEEE, 2018.
[MS01] John Moody and Matthew Saffell. Learning to trade via di-
rect reinforcement. IEEE transactions on neural Networks,
12(4):875–889, 2001.
[MSA22] S Makridakis, E Spiliotis, and V Assimakopoulos. The M5 accuracy competition: Results, findings and conclusions. 2020. URL: https://www.researchgate.net/publication/344487258, 2022.
[MT67] Benoit Mandelbrot and Howard M Taylor. On the distribution
of stock price differences. Operations research, 15(6):1057–1062,
1967.
[MW97] John Moody and Lizhong Wu. Optimization of trading systems
and portfolios. In Proceedings of the IEEE/IAFE 1997 com-
putational intelligence for financial engineering (CIFEr), pages
300–307. IEEE, 1997.
[MWLS98] John Moody, Lizhong Wu, Yuansong Liao, and Matthew Saffell.
Performance functions and reinforcement learning for trading
systems and portfolios. Journal of Forecasting, 17(5-6):441–470,
1998.
[Nar13] Rishi K Narang. Inside the black box: A simple guide to quan-
titative and high frequency trading, volume 846. John Wiley &
Sons, 2013.
[Pet22] Martin Peterson. The St. Petersburg Paradox. In Edward N.
Zalta, editor, The Stanford Encyclopedia of Philosophy. Meta-
physics Research Lab, Stanford University, Summer 2022 edi-
tion, 2022.
[Pla22] Aske Plaat. Deep Reinforcement Learning. Springer, 2022.
[PMR+ 17] Tomaso Poggio, Hrushikesh Mhaskar, Lorenzo Rosasco, Brando
Miranda, and Qianli Liao. Why and when can deep-but not
shallow-networks avoid the curse of dimensionality: a review.
International Journal of Automation and Computing, 14(5):503–
519, 2017.
[RHW85] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams.
Learning internal representations by error propagation. Techni-
cal report, California Univ San Diego La Jolla Inst for Cognitive
Science, 1985.
[SB18] Richard S Sutton and Andrew G Barto. Reinforcement learning:
An introduction. MIT press, 2018.
[SGO20] Omer Berat Sezer, Mehmet Ugur Gudelek, and Ahmet Murat
Ozbayoglu. Financial time series forecasting with deep learn-
ing: A systematic literature review: 2005–2019. Applied soft
computing, 90:106181, 2020.
[Sha98] William F Sharpe. The sharpe ratio. Streetwise–the Best of the
Journal of Portfolio Management, pages 169–185, 1998.
[SHK+ 14] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya
Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way
to prevent neural networks from overfitting. The journal of ma-
chine learning research, 15(1):1929–1958, 2014.
[SHM+ 16] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Lau-
rent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioan-
nis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al.
Mastering the game of go with deep neural networks and tree
search. nature, 529(7587):484–489, 2016.
[SHS+ 17] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis
Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Lau-
rent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering
chess and shogi by self-play with a general reinforcement learning
algorithm. arXiv preprint arXiv:1712.01815, 2017.
[SLH+ 14] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan
Wierstra, and Martin Riedmiller. Deterministic policy gradient
algorithms. In International conference on machine learning,
pages 387–395. PMLR, 2014.
[SNN18] Sima Siami-Namini and Akbar Siami Namin. Forecasting eco-
nomics and financial time series: Arima vs. lstm. arXiv preprint
arXiv:1803.06386, 2018.
[SRH20] Simon R Sinsel, Rhea L Riemke, and Volker H Hoffmann. Chal-
lenges and solution technologies for the integration of variable re-
newable energy sources—a review. renewable energy, 145:2271–
2285, 2020.
[SSS00] Robert H Shumway and David S Stoffer. Time series analysis and its applications, volume 3. Springer, 2000.
[SSS+ 17] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis
Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas
Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of
go without human knowledge. nature, 550(7676):354–359, 2017.
[SVL14] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to se-
quence learning with neural networks. Advances in neural infor-
mation processing systems, 27, 2014.
[SWD+ 17] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford,
and Oleg Klimov. Proximal policy optimization algorithms.
arXiv preprint arXiv:1707.06347, 2017.
[Tal97] Nassim Nicholas Taleb. Dynamic hedging: managing vanilla and
exotic options, volume 64. John Wiley & Sons, 1997.
[TOdSMJZ20] Danilo Tedesco-Oliveira, Rouverson Pereira da Silva, Walter
Maldonado Jr, and Cristiano Zerbato. Convolutional neural
networks in predicting cotton yield from images of commercial
fields. Computers and Electronics in Agriculture, 171:105307,
2020.
[VSP+ 17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,
Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polo-
sukhin. Attention is all you need. Advances in neural informa-
tion processing systems, 30, 2017.
[XNS15] Ruoxuan Xiong, Eric P Nichols, and Yuan Shen. Deep learn-
ing stock volatility with google domestic trends. arXiv preprint
arXiv:1512.04916, 2015.
[YJVdS19] Jinsung Yoon, Daniel Jarrett, and Mihaela Van der Schaar.
Time-series generative adversarial networks. Advances in neural
information processing systems, 32, 2019.

[YLZW20] Hongyang Yang, Xiao-Yang Liu, Shan Zhong, and Anwar Walid.
Deep reinforcement learning for automated stock trading: An
ensemble strategy. In Proceedings of the first ACM international
conference on AI in finance, pages 1–8, 2020.
[ZBH+ 21] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht,
and Oriol Vinyals. Understanding deep learning (still) re-
quires rethinking generalization. Communications of the ACM,
64(3):107–115, 2021.
[ZS10] Wenbin Zhang and Steven Skiena. Trading strategies to exploit
blog and news sentiment. In Proceedings of the international
AAAI conference on web and social media, volume 4, pages 375–
378, 2010.
[Zuc19] Gregory Zuckerman. The man who solved the market: How Jim
Simons launched the quant revolution. Penguin, 2019.

[ZZR20] Zihao Zhang, Stefan Zohren, and Stephen Roberts. Deep rein-
forcement learning for trading. The Journal of Financial Data
Science, 2(2):25–40, 2020.
[ZZW+ 20] Yifan Zhang, Peilin Zhao, Qingyao Wu, Bin Li, Junzhou Huang,
and Mingkui Tan. Cost-sensitive portfolio selection via deep
reinforcement learning. IEEE Transactions on Knowledge and
Data Engineering, 34(1):236–248, 2020.