MScFE 650 MLF - Video - Transcripts - M1
Revised: 08/19/2019
© 2019 - WorldQuant University – All rights reserved.
Welcome Video
Hello and welcome to Machine Learning in Finance, the seventh course in the WorldQuant
University Master’s in Financial Engineering. My name is Jacques Joubert and I will be your
lecturer for the duration of the Machine Learning in Finance course.
The previous course looked at single- and multi-period portfolio theory, including the application
of Bayes’ theorem to modern portfolio theory and offered an introduction to stochastic dynamic
control. It also examined theory and methods for the pricing of stocks and bonds, considering
single-period and multiple-period scenarios. Lastly, it engaged with passive management, which is
the alternative to active management as well as market factors affecting the efficiency of markets.
In Machine Learning in Finance, students will be introduced to the applications of machine
learning within a financial context.
The Machine Learning in Finance course consists of seven interrelated modules.
Like other courses in the MSc program, Machine Learning in Finance primarily consists of short
lecture videos like this one, as well as supplementary notes that you can download.
Remember that every module includes two multiple-choice quizzes, and that modules 1-6 include
collaborative review tasks for submission every Sunday for the next six weeks.
If you have any questions about the course content, remember to post them on the “Ask your
lecturer” forum. Every Monday, the forum’s most upvoted questions will be addressed in a live
lecture hosted here on the platform.
For any technical concerns, you can visit the Frequently Asked Questions page or contact the
support team via the Technical support form.
When you are ready, let's begin with the first module in the Machine Learning in Finance course,
Introduction to Machine Learning. Good luck!
In this video we will be taking a brief look at what exactly machine learning is and how it developed
– starting from its philosophical inception in the 1950s to the sophisticated systems that we use
today.
An early definition of machine learning by Arthur Samuel is that it is “the field of study that gives
computers the ability to learn without being explicitly programmed”.
In 1950, Alan Turing published his landmark paper “Computing Machinery and Intelligence”,
which marked the first serious proposal in the philosophy of artificial intelligence (AI). In it, Turing
speculated about the possibility of thinking machines, and conceptualized the Turing Test, also
known as the Imitation Game, which could be used to determine the point at which a machine
could be thought of as ‘thinking’. From this, he concluded that thinking machines were at least
plausible.
However, in the 1950s, algorithms weren’t nearly as refined as they are today; and even with
modern advances we are still very far away from ‘thinking machines’.
Now, you may be wondering ‘why use machine learning in the first place?’ The only real alternative
to a machine learning algorithm is a rule-based system that uses ‘if X, do Y’ functions. This
alternative has critical shortcomings that prevent it from competing with machine learning in
certain applications: every ‘if X, do Y’ rule has to be specified by hand, and the resulting system
cannot adapt when conditions change. Machine learning techniques, and the algorithms they use,
can counteract many of these shortcomings, as they learn their rules directly from data and can be
updated as new data arrives.
AI is split into two forms – namely, strong form and weak form. Strong form (or artificial general
intelligence) refers to a machine that can successfully perform multiple tasks, much like a human
can. This leads to the very exciting idea of thinking machines. We are very far away from achieving
artificial general intelligence. To put this in more practical terms: my Tesla Roadster may be able to
self-drive, but the same algorithm can’t be used to play chess, let alone discuss what it means to
play chess.
Weak form (or narrow AI) is focused on one narrow task – such as the example of identifying a cat
in a photo. Machine learning is a subdomain of narrow AI.
It is important to note that all machine learning is AI, but not all AI is machine learning. Now that
we have a better understanding of where machine learning falls within AI, we can begin to look at
the classes of algorithms that fall within machine learning. There are three main classes: supervised
learning, unsupervised learning, and reinforcement learning.
The following are formal definitions of supervised, unsupervised and reinforcement learning:
“Unsupervised learning is where you only have input data (X) and no corresponding
output variables. The goal for unsupervised learning is to model the underlying
structure or distribution in the data in order to learn more about the data. These are
called unsupervised learning because, unlike supervised learning above, there are no
correct answers and there is no teacher. Algorithms are left to their own devices to
discover and present the interesting structure in the data.” – Brownlee (2016)
In this video we briefly touched on what machine learning is and how it relates to artificial
intelligence. We also briefly discussed the formal definitions of the three main classes of machine
learning. In the provided set of notes, we discuss the history and definition of machine learning in
detail.
Hi – in this lecture video we will be diving into the machine learning community by investigating
some of the important events, resources and open-source projects. It’s important to point out
that, at present – 2018 – Python is widely considered the de facto language for data science and
machine learning: not R, not Matlab, and definitely not SAS.
Competitions
Kaggle is a company that hosts data science competitions and projects, giving you an opportunity
to use machine learning to solve real-world problems. Competitions range from object detection
for self-driving cars to particle tracking in CERN detectors for high-energy physics.
Kaggle has interesting datasets, notebook tutorials, and a strong community. You can log on,
download a dataset and start hacking away on a machine learning problem. Due to the strong
community, it’s easy to find a friend to help you as you progress.
Two Sigma, Winton and BattleFin are examples of a few financial companies that have launched
competitions on machine learning, many for the purpose of hiring.
Two Sigma hosted the Financial Modeling Challenge and Rental Listing Inquiries; Winton, the
Stock Market Challenge and Observing Dark Worlds; and BattleFin, the Big Data Combine.
Kaggle also interviews the winners of competitions, during which the winners share some of the
techniques that helped them to win. It’s recommended that you read these interviews – a good
place to start is with the Two Sigma competitions.
Conferences
MLSS (the Machine Learning Summer School) was started in 2002 to promulgate modern
methods of statistical machine learning and inference. It was motivated by the observation that,
while many students are keen to learn about machine learning and an increasing number of
researchers want to apply machine learning methods to their research, only a few machine
learning courses are taught at universities. Machine learning summer schools present topics at
the core of modern machine learning, from fundamentals to state-of-the-art practice, and the
speakers are leading experts in their field who talk with enthusiasm about their subjects.
One conference that requires special mention for students is QuantCon. QuantCon is hosted by
Quantopian, a company that aims to “inspire talented people everywhere to write investment
algorithms”; selected authors may license their algorithms to Quantopian and get paid based on
their performance.
Quantopian is famous for democratizing Wall Street by making a full quantitative research
environment available for anybody out there who wants to build a strategy. They have the
infrastructure, the data, the community, notebooks, and access to a backtesting engine. For those
of you looking to start your own fund, setting up the infrastructure from scratch is a monumental
task – Quantopian offers a great alternative.
QuantCon is a quantitative finance conference hosted twice a year – once in Singapore and once in
New York. While not strictly a machine learning conference, at the 2018 event it was clear that
some of the most interesting questions in finance revolve around machine learning and
alternative data sources.
The final keynote speaker at QuantCon 2018 was Dr Marcos López de Prado. Dr López de Prado
recently published Advances in Financial Machine Learning, the first postgraduate textbook in the
field. While this book is not required reading for this course, I highly recommend reading it.
We will cover some of the topics mentioned in the book throughout this course, and – courtesy of
Bloomberg – we have included a guest lecture seminar by Dr López de Prado himself. In our final
module, Module 7: Machine Learning in Finance, we are specifically going to address some of the
early chapters in this textbook and some of their solutions.
In this video we looked at the machine learning community, events and resources. The provided
notes contain many additional resources and links for students to explore this rich community.
In this video we will be looking at machine learning in financial contexts. We will specifically
discuss the seven reasons most machine learning funds fail, according to Dr Marcos López de
Prado.
Many courses cater to the general computer scientist who works in the tech industry; however,
we are going to give special attention to the question of how you can apply these algorithms to
finance.
One of the problems faced with teaching such a course is that most financial data is not free or
easily available. Many stock exchanges charge royalties on their data and it can get expensive. It is
for this reason that we will have to rely on academic papers to help build up the intuition for how
to apply our new skills.
You will get many opportunities to implement the algorithms that we cover in class as well as
receive good exposure to some ground-breaking techniques, like Hierarchical Risk Parity for
portfolio optimization. We will include compulsory reading of academic papers which will aid in
building an intuition on how to apply these algorithms within the context of finance.
Dr Marcos López de Prado gave the final keynote lecture at QuantCon 2018, entitled The 7
Reasons Most Machine Learning Funds Fail. In this lecture video I will briefly go over the points
that he mentioned, so that as this course progresses you are aware of some of the difficulties that
machine learning funds face.
If you turn to the next set of notes, you can go through the presentation by Dr López de Prado. The
link to the presentation can also be found there and at the bottom of this screen –
https://ssrn.com/abstract=3031282. For those of you who would like to read his full paper, follow
this link: http://ssrn.com/abstract=3104816. However, if you wait until the last week of this
course, our guest lecture is by Dr López de Prado himself.
According to Dr López de Prado, the seven reasons most machine learning funds fail are as
follows:
1 The Sisyphus paradigm
In Greek mythology, Sisyphus was punished by being forced to roll a boulder up a hill, only to have
it roll back down as it neared the top, repeating this task for the rest of eternity. The myth of
Sisyphus highlights the futility of repeating hopeless labor.
Discretionary firms have attributed much of their success to having portfolio managers work in
isolation so that they generate unique trading ideas. The same is not true for quantitative finance,
where it takes specialized teams working together to identify new strategies. Dr López de Prado
suggests setting up a research lab in much the same way that top US national laboratories have.
Berkeley Lab, as illustrated in this video, sets a good example.
2 Integer differentiation
After the course on Econometrics, students will be familiar with differencing the log prices in
order to work with an invariant process.
Preprocessing features in this way makes the series stationary, but at the cost of removing all
memory. As you will see in this week’s paper, memory is an important feature.
In this chart, the green line shows the E-mini S&P 500 futures trade bars and the blue line shows
the fractionally differentiated series. Over a short time span the differentiated series resembles
returns, but over a longer time frame it resembles price levels.
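To make this concrete, here is a minimal sketch of fixed-width-window fractional differencing in Python. It assumes a pandas Series of log prices; the weight recursion follows the binomial expansion of (1 − B)^d, and the choice of d = 0.4 and window = 100 is purely illustrative (the full treatment in Advances in Financial Machine Learning also covers expanding-window and threshold-based variants):

```python
import numpy as np
import pandas as pd

def frac_diff_weights(d, size):
    # Binomial weights of (1 - B)^d, truncated after `size` terms:
    # w_0 = 1, and w_k = -w_{k-1} * (d - k + 1) / k.
    w = [1.0]
    for k in range(1, size):
        w.append(-w[-1] * (d - k + 1) / k)
    return np.array(w)

def frac_diff(series, d, window=100):
    # Fixed-width-window fractional differencing of a (log-)price series.
    w = frac_diff_weights(d, window)[::-1]  # oldest observation first
    x = series.to_numpy(dtype=float)
    out = np.full(len(x), np.nan)
    for i in range(window - 1, len(x)):
        out[i] = w @ x[i - window + 1 : i + 1]
    return pd.Series(out, index=series.index)

# Illustrative use: a simulated log-price series, differentiated with d = 0.4.
log_prices = pd.Series(np.cumsum(np.random.normal(0, 0.01, 500)))
stationary_with_memory = frac_diff(log_prices, d=0.4)
```

With d = 0 the weights reduce to the original series and with d = 1 to ordinary first differences; a fractional d in between trades off stationarity against memory.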
3 Inefficient sampling
Information doesn’t arrive in the market at constant, fixed time intervals, so why do we sample it
that way? Using fixed time intervals means that we over-sample during quiet periods and under-
sample during busy ones.
Figure 11: Partial recovery of normality through a price sampling process subordinated to a
volume clock (adapted from López de Prado, 2012)
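As a sketch of the alternative, the snippet below builds simple volume bars: ticks are accumulated until a volume threshold is reached, and each bar records the open, high, low and close over that span. The DataFrame column names ('price', 'volume') are assumptions for illustration:

```python
import pandas as pd

def volume_bars(ticks, volume_per_bar):
    # Group time-ordered ticks into bars of roughly equal traded volume.
    # `ticks` is assumed to be a DataFrame with 'price' and 'volume' columns.
    bars, cum_vol, start = [], 0.0, 0
    for i, vol in enumerate(ticks['volume'].to_numpy()):
        cum_vol += vol
        if cum_vol >= volume_per_bar:
            window = ticks.iloc[start : i + 1]
            bars.append({
                'open': window['price'].iloc[0],
                'high': window['price'].max(),
                'low': window['price'].min(),
                'close': window['price'].iloc[-1],
                'volume': cum_vol,
            })
            cum_vol, start = 0.0, i + 1
    return pd.DataFrame(bars)
```

Because each bar carries roughly the same amount of trading activity, the resulting returns tend to be closer to normal than returns from fixed time bars.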
4 Wrong labeling
In most papers you will find that the author makes use of some machine learning algorithm in a
classification setting, trying to predict the next period’s directional move.
Dr López de Prado introduces several techniques which offer better results – for example, the
triple-barrier method and meta-labeling.
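The following is a deliberately simplified sketch of triple-barrier labeling, assuming a price Series with a plain integer index: each event gets an upper (profit-take) and lower (stop-loss) barrier set as multiples of recent volatility, plus a vertical barrier after a fixed horizon, and the label records which barrier is touched first. The version in Advances in Financial Machine Learning is considerably richer (trade sides, concurrency, meta-labeling):

```python
import numpy as np
import pandas as pd

def triple_barrier_labels(close, events, pt=1.0, sl=1.0, horizon=10, span=50):
    # close   : pd.Series of prices with a plain integer (Range) index
    # events  : integer positions at which positions are entered
    # pt, sl  : profit-take / stop-loss barriers, in multiples of volatility
    # horizon : vertical barrier, in bars
    vol = np.log(close).diff().ewm(span=span).std()  # rough volatility estimate
    labels = {}
    for t in events:
        sigma = vol.iloc[t]
        if np.isnan(sigma):
            continue
        # Cumulative log return along the path up to the vertical barrier.
        path = np.log(close.iloc[t : t + horizon + 1] / close.iloc[t])
        hit_up = path[path >= pt * sigma]
        hit_dn = path[path <= -sl * sigma]
        first_up = hit_up.index[0] if len(hit_up) else None
        first_dn = hit_dn.index[0] if len(hit_dn) else None
        if first_up is not None and (first_dn is None or first_up <= first_dn):
            labels[t] = 1    # profit-take barrier touched first
        elif first_dn is not None:
            labels[t] = -1   # stop-loss barrier touched first
        else:
            labels[t] = 0    # vertical barrier: neither horizontal barrier hit
    return pd.Series(labels)
```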
5 Weighting of non-IID samples
For most machine learning tasks, it is assumed that your data is generated by an IID process.
This, however, is not true in finance, as the “labels are decided by the outcomes and the outcomes
are decided over multiple observations, because labels overlap in time, we cannot be certain about
what observed features caused an effect” (López de Prado, 2018).
To help address this, observations are weighted by uniqueness. See chapter 4 in Advances in
Financial Machine Learning for more detail.
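A minimal sketch of the idea, assuming each label is described by the (start, end) bar interval it spans: the concurrency of labels is counted at every bar, and each label’s weight is the average of 1/concurrency over its own lifespan, so heavily overlapping labels receive less weight:

```python
import numpy as np
import pandas as pd

def average_uniqueness(spans, n_bars):
    # spans  : list of (start, end) bar indices (end inclusive), one per label
    # n_bars : total number of bars in the sample
    concurrency = np.zeros(n_bars)
    for start, end in spans:
        concurrency[start : end + 1] += 1
    # Each label 'owns' 1/concurrency of the information at each bar it spans;
    # its weight is the average of that share over its lifespan.
    return pd.Series(
        [np.mean(1.0 / concurrency[start : end + 1]) for start, end in spans]
    )

# Illustrative use: three overlapping labels over ten bars.
weights = average_uniqueness([(0, 4), (2, 6), (5, 9)], n_bars=10)
```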
6 Cross-validation leakage
In some papers you will notice authors making use of k-fold cross-validation. The reason this
doesn’t work in finance is that observations can’t be assumed to be drawn from an IID process,
which leads to data leakage between the train and test sets.
A possible solution is to make use of purged and embargoed k-fold cross-validation.
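Here is a minimal sketch of the split logic, assuming time-ordered observations; it simply embargoes a fraction of observations immediately after each test fold, and a full implementation would additionally purge any training observation whose label interval overlaps the test set:

```python
import numpy as np

def purged_kfold_indices(n, n_splits=5, embargo_frac=0.01):
    # Yield (train, test) index arrays for k-fold CV on time-ordered data,
    # with an embargo dropped from training immediately after each test fold.
    embargo = int(n * embargo_frac)
    bounds = np.linspace(0, n, n_splits + 1, dtype=int)
    for i in range(n_splits):
        test = np.arange(bounds[i], bounds[i + 1])
        train_before = np.arange(0, bounds[i])
        train_after = np.arange(min(bounds[i + 1] + embargo, n), n)
        yield np.concatenate([train_before, train_after]), test

# Illustrative use on 1,000 observations:
for train_idx, test_idx in purged_kfold_indices(1000):
    pass  # fit on train_idx, evaluate on test_idx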
7 Backtest overfitting
Most backtest overfitting can be attributed to selection bias and multiple testing. This leads to a
higher likelihood of the backtest being a false discovery.
A tool to help combat this is the Deflated Sharpe Ratio (DSR), which, according to Dr López de
Prado, “computes the probability that the Sharpe Ratio (SR) is statistically significant, after
controlling for the inflationary effect of multiple trials, data dredging, non-normal returns and
shorter sample lengths” (2018).
The DSR can be used to determine the probability that a discovered strategy is a false positive. The
key is to record all trials and to determine correctly the number of effectively independent trials.
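As a sketch based on the published formulas (Bailey and López de Prado’s Probabilistic Sharpe Ratio, combined with the expected maximum Sharpe Ratio across N unskilled trials): the code below is illustrative only, the Sharpe Ratios are per-period rather than annualized, and all the example numbers are made up:

```python
import numpy as np
from scipy.stats import norm

def probabilistic_sharpe_ratio(sr, sr_benchmark, n_obs, skew=0.0, kurt=3.0):
    # Probability that the true SR exceeds sr_benchmark, adjusting for
    # sample length and non-normal returns (per-period, not annualized).
    num = (sr - sr_benchmark) * np.sqrt(n_obs - 1)
    den = np.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr**2)
    return norm.cdf(num / den)

def expected_max_sharpe(n_trials, var_sr):
    # Expected maximum SR among n_trials unskilled (zero-SR) strategies.
    gamma = 0.5772156649015329  # Euler-Mascheroni constant
    return np.sqrt(var_sr) * (
        (1 - gamma) * norm.ppf(1 - 1.0 / n_trials)
        + gamma * norm.ppf(1 - 1.0 / (n_trials * np.e))
    )

# DSR = PSR evaluated against the expected-max benchmark (made-up numbers).
sr0 = expected_max_sharpe(n_trials=100, var_sr=0.05)
dsr = probabilistic_sharpe_ratio(sr=0.9, sr_benchmark=sr0, n_obs=252,
                                 skew=-0.2, kurt=4.0)
```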
This video introduced how students will get exposure to financial contexts via academic papers
and highlighted some of the challenges faced by practitioners. Study the notes on the subject for a
more detailed discussion on the topic.
In this lecture video we will be reviewing an academic paper. The provided set of notes will also
summarize some of its most important points, but I would highly recommend that you read the
paper itself.
As previously discussed in this module, the aim of introducing academic papers is to help you to
make the most of this course and develop a nuanced understanding of how to apply machine
learning to financial data.
When it comes to doing a literature review on a topic, it is a good idea to start with a couple of
survey papers. By getting an idea of what has been tried before, you can then decide which papers
require deeper study.
The paper that we are going to review is “An Empirical Comparison of Machine Learning Models
for Time Series Forecasting”. The paper is publicly available and you can access it on
ResearchGate using this link (at the bottom of the screen).
This paper presents one of the few large-scale comparisons of major machine learning models for
time series forecasting, framed as a regression problem. By now you are familiar with classical
econometric techniques like ARIMA, but our goal is to introduce you to models that can capture
non-linearities in the data.
The paper starts with an introduction to the M3 Competition, which tests various algorithms
against business time series data. The M4 Competition takes place in 2018 and involves more
than 100,000 time series (link: https://www.m4.unic.ac.cy/).
The two best-reported algorithms in the study were the Multilayer Perceptron (another term for a
neural network) and the Gaussian Process model. The study notes that while the Multilayer
Perceptron performs better, it takes much longer to train than the Gaussian Process model.
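To make the comparison concrete, here is a minimal scikit-learn sketch – not the paper’s actual experimental setup – that fits both model families on lagged values of a toy series and compares one-step-ahead test error; the series, lag count and hyperparameters are all illustrative assumptions:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.gaussian_process import GaussianProcessRegressor

# Toy series: a noisy sine wave; features are the previous `lags` values.
rng = np.random.default_rng(0)
y = np.sin(np.linspace(0, 20, 400)) + 0.1 * rng.normal(size=400)
lags = 5
X = np.column_stack([y[i : len(y) - lags + i] for i in range(lags)])
target = y[lags:]

split = 300  # simple chronological train/test split
models = {
    "MLP": MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
    "GP": GaussianProcessRegressor(),
}
for name, model in models.items():
    model.fit(X[:split], target[:split])
    mse = np.mean((model.predict(X[split:]) - target[split:]) ** 2)
    print(name, "test MSE:", round(float(mse), 4))
```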
It is seen throughout the literature that Support Vector Machines (SVMs) and variants of random
forests perform very well on classification-style problems. However, the Support Vector
Regression model does not rank highly in this study.
The study also looks at the different effects of preprocessing. In particular, it notes that time series
differencing results in worse model performance. This is in line with the presentation by Dr López
de Prado that we covered in the previous lecture video. A technique such as fractional
differentiation will add value here.