Machine Learning in Trading
Contents

Part I: Introduction
Preface
1 Introduction to Machine Learning

Part II: Machine Learning Implementation Framework
2 Defining the Problem Statement, Target and Feature
3 Target and Features in Python
4 Train and Test Split
5 Training and Forecasting using Classification Model
6 Metrics to Evaluate Classifier Model
7 Backtesting and Live Trading
8 Challenges in Live Trading

Part III: Supervised Learning: Regression and Classification
9 The Linear Regression Model
10 Logistic Regression
11 Naive Bayes Model
12 Decision Trees
13 Random Forest Algorithm
14 XGBoost Algorithm
15 Neural Networks

Part IV: Unsupervised Learning, Natural Language Processing, and Reinforcement Learning
16 Unsupervised Learning
17 K-Means Clustering
18 Hierarchical Clustering
19 Principal Component Analysis
20 Natural Language Processing
21 Reinforcement Learning
Part I: Introduction

Preface
“If you invent a breakthrough in artificial intelligence, so machines can learn,
that is worth 10 Microsofts.”
— Bill Gates

In 1950, Alan Turing devised a test, now called the "Turing Test", to judge whether a machine was 'human' enough. The machine had to interact with a real person; if the person found the machine's behaviour indistinguishable from a human's, the machine passed the test.
Arthur Samuel wrote the first computer learning program to play checkers. It
studied the previous moves and incorporated them in its moves when it
played against an opponent. Today, Google's DeepMind AI has defeated a human player in the board game Go, a feat that was earlier thought to be impossible. All this, thanks to machine learning.
Obviously, a human will not be able to keep up with the data overload, and conventional computer programs might not be able to handle such large amounts of data. Consider the tasks involved in trading:
• Downloading and managing price, fundamental and other alternative data from multiple data vendors, in different formats.
• Pre-processing and cleaning the data.
• Forming a hypothesis for a trading strategy and backtesting it.
• Automating the trading strategy so that emotions don't get in the way of your trading.
We found out that we can outsource most of these tasks to the machine. And
this is the reason you, the reader, have a book which talks about machine
learning and its applications in trading.
Why was this book written? Machine learning is a vast topic if you look at
the various disciplines originating from it. You will also hear buzzwords such
as AI, Neural Networks, Deep learning, AI Engineering being associated with
machine learning.
Our aim in this book is to demystify these concepts and provide clarity on
how machine learning is different from conventional programming. And
further, how machine learning can be used to gain an edge in the trading
domain.
We have structured the book in such a way that initially, you will learn about
the various tasks carried out by a machine learning algorithm.
Once the basics of ML tasks are covered, we move further and go through
various ML algorithms, one chapter at a time. We will build upon the
concepts and give you a step by step guide on how an idea can be converted
to rules, and apply machine learning to test this idea on real-world data. And
if you are satisfied with the results, move further and implement them in real
life.
From the outset, we believe that theory alone is not enough to retain knowledge. You need to know how you can apply this knowledge in the real
world. Thus, our book contains lots of real world examples, especially in the
field of trading. But rest assured that these concepts can be transferred to any
other discipline which requires data analysis.
How to read this? The ideal way to go through this book is one chapter at a
time. But you can try other options as well, such as:
1. Skim through the book and stop at a chapter that catches your eye. And
then go back if you want some preliminary knowledge.
2. If you have worked with machine learning algorithms before, feel free to jump to any chapter you like and see whether it adds to your understanding.
Where else can you find all this? Machine learning has been around for more than half a century now. In fact, the word "robot", indicating a human-like machine, was first mentioned in a Czech play in 1920!
Thus, you can find a variety of information related to machine learning in the
form of blogs, courses, videos and other mediums. And there are free
resources too. The idea of this book is to present the core concepts and principles of machine learning in easy-to-understand language, and in a compact form.
Code and Data Used in the Book You can download all the code and data files from the GitHub link: 'https://github.com/quantra-go-algo/ml-trading-ebook'. We will also keep the code and data files updated at the same link.
Why should you use machine learning? The world is moving towards
automation and machine learning, and the finance industry is no different.
Zest AI, which develops AI solutions for lenders, helps auto lenders screen
potential clients and advises them on the correct lending rate to be charged.
Underwrite.ai claims that after using their ML model to screen clients, they
cut the first payment default rate from 32% to 8%.
Let’s take a simple example to illustrate this concept. First, the ML algorithm
receives the details of all the clients and their respective portfolios as input.
The ML algo then learns the relationship between the portfolio advised and
the characteristics of the client (from our input data). After this, the model is
ready to make recommendations for the portfolio of any client based on her
characteristics (like age, income, employment status, etc.). For example, if the client is a 25-year-old female with a high risk appetite, the model, based on previously learnt patterns, would suggest a portfolio inclined towards technology stocks such as Apple and Facebook.
The ML algorithms are really fast and are likely to do this in a matter of
seconds. The advantage of speed is harnessed by hedge funds that are using
algorithmic trading. Renaissance Technologies, just one example among hundreds of AI-driven funds, is known to use machine learning in its trading strategies and has about $55 billion in assets under management. Various hedge funds and high-frequency trading firms rely on machine learning to create their trading strategies.
Source: https://www.preqin.com/insights/research/blogs/the-rise-of-themachines-ai-funds-are-outperforming-the-hedge-fund-benchmark
Is machine learning only for hedge funds and large institutions? That was
the case for a long time but not anymore. With the advent of technology and
open source data science resources, retail traders and individuals have also
started to learn and use machine learning. Let’s take an example here.
Assume that you have an understanding of the market trend. You have a
simple trading strategy using a few technical indicators which help you to
predict the market trend and trade accordingly. Meanwhile, another trader
uses machine learning to implement the same trading strategy. However, instead of a few familiar technical indicators, he lets the machine go through hundreds of technical indicators and decide which ones perform best at predicting the market trend. While the regular trader might have settled for RSI or a moving average, the ML algo can evaluate many more technical indicators and pick the best ones in terms of predictive power.
This book gives you a detailed and step-by-step guide to create a machine
learning trading strategy.
1. Use Python libraries, which will help you read the data.
2. Use machine learning to find patterns in the data.
3. Generate signals on whether you should buy or sell the asset.
4. Learn to analyse the performance of the model.
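The four steps above can be sketched end to end. The sketch below is a minimal illustration only: the price series is randomly generated, and the features, labels, and model settings are hypothetical stand-ins, not the book's actual strategy.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Step 1: read the data (a random-walk price series stands in for a real CSV)
rng = np.random.default_rng(42)
close = pd.Series(100 * np.cumprod(1 + rng.normal(0, 0.01, 500)), name='close')

# Step 2: build simple features and a target to find patterns in
features = pd.DataFrame({'ret_1': close.pct_change(),
                         'ret_5': close.pct_change(5)})
target = (close.pct_change().shift(-1) > 0).astype(int)  # 1 = buy, 0 = don't

# Drop rows made invalid by the lookback and lookahead windows
X = features.iloc[5:-1]
y = target.iloc[5:-1]

# Step 3: train on the first 80% of the data and generate signals on the rest
split = int(len(X) * 0.8)
model = RandomForestClassifier(random_state=0)
model.fit(X.iloc[:split], y.iloc[:split])
signals = model.predict(X.iloc[split:])

# Step 4: analyse the performance of the model
accuracy = accuracy_score(y.iloc[split:], signals)
print(f"Accuracy: {accuracy:.2f}")
```

Since the prices here are pure noise, the accuracy will hover around 50%; the point is only the shape of the workflow, which the following chapters build out properly.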
Recall how often Facebook (FB) prompts you to tag the person in the picture!
Among billions of users, FB ML algos are able to correctly match different
pictures of the same person and identify her! Tesla’s autopilot feature is
another example.
How do Machines learn? Well, the simple answer is, just like humans!
First, we receive information about a certain thing, which is kept in our mind.
If we encounter the same thing in the future, our brain is able to identify it.
Also, past experiences help us in making better decisions in the future. Our
brain trains itself by identifying the features and patterns in knowledge/data
received, thus enabling itself to successfully identify or differentiate between
various things.
There are many key industries where ML is making a huge impact: financial services, logistics, marketing and sales, health care, etc. It is expected that in a couple of decades, most mechanical, repetitive tasks will be automated. Machine learning and improvements in artificial intelligence techniques have made the impossible possible, from self-driving cars to cancer cell detection.
The mathematical models are divided into two categories, depending on their
training data: Supervised and unsupervised learning models.
1.1 Supervised Learning
Think of supervised learning as a kid learning multiplication tables. Tell the kid the table of 2 up to 2 times 5. The kid deduces that they have to add 2 to get the next answer in the table, and then says that 2 times 6 is 12. When building supervised learning models,
the training data contains the required answers or the expected output. These
required answers are called labels. For example, if the training data contains technical indicators such as RSI and ADX, as well as the trading position to take (such as buy or sell), then it is known as a supervised learning approach.
With enough data points, the machine learning algorithm will be able to
classify the trading signal correctly more often than not. Supervised learning
models can also be used to predict continuous numeric values such as the
share price of Disney. These models are known as regression models. In this
case, the labels would be the share price of Disney.
Some commonly used regression models are:
• Linear Regression
• Lasso Regression
• Ridge Regression
There are various evaluation methods to find out the performance of these
models. We will discuss these models and the evaluation methods in greater
detail in later chapters.
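To make the idea of a regression label concrete, here is a minimal sketch of the three models listed above, fit on toy data (the features and coefficients are made up for illustration; this is not stock data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, Ridge

# Toy regression data: the "label" is a continuous value built as
# 2*x1 + 3*x2 plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 2 * X[:, 0] + 3 * X[:, 1] + rng.normal(0, 0.1, 200)

# Each model fits coefficients mapping the features to the continuous label
for model in (LinearRegression(), Lasso(alpha=0.01), Ridge(alpha=1.0)):
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 1))
```

All three recover coefficients close to the true values of 2 and 3; Lasso and Ridge additionally shrink coefficients, which matters when there are many noisy features.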
Binary Classification
This type of classification has only two categories, usually Boolean values: 1 or 0 (sometimes expressed as True/False, or High/Low). Some examples where such a classification could be used are cancer detection and email spam detection. In cancer detection, the labels would be positive or negative for cancer. Similarly, the labels for spam detection would be spam or not spam. You can also frame trading as a binary classification problem with labels such as buy and sell, or buy and no position.
Multi-class Classification
Multi-class classifiers or multinomial classifiers can distinguish between more than two classes. For example, you can have three labels such as buy, no position, or sell.

Multi-label Classification
The assumption in such a system is that the clusters discovered will match reasonably well with an intuitive classification. For example, clustering stocks based on historical data will tend to group together stocks that belong to the same sector or industry. There may also be some surprises: Facebook and Twitter can end up in different groups even though they belong to the same sector.
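A minimal sketch of this idea, using K-Means (covered in a later chapter) on simulated return profiles; the two "sectors" and their return statistics are made up for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical daily-return profiles for 10 stocks over 20 days:
# the first five behave like one "sector", the last five like another
rng = np.random.default_rng(1)
group_a = rng.normal(0.02, 0.01, size=(5, 20))
group_b = rng.normal(-0.01, 0.01, size=(5, 20))
returns = np.vstack([group_a, group_b])

# K-Means discovers the two groups with no labels provided
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(returns)
print(labels)
```

The algorithm is never told which stock belongs where, yet it assigns the first five stocks to one cluster and the last five to the other, purely from the structure of the data.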
Mr. Rob Read, a trader from Manhattan, is looking to use a machine learning
algorithm that can guide him when to go long on J. P. Morgan stock. He goes
to a data vendor to get the price data.
He then makes sure that the data quality is good. He does this by checking for
missing and duplicate values in the data. He performs other analysis to ensure
that the data can be used with the ML algorithm.
But what is required as input to predict the target variable? The inputs to the machine learning algorithm are called feature variables. He uses indicators such as RSI and MACD, and even trend-strength indicators such as ADX.
How does he evaluate if the machine learning model is effective? For that,
he splits the data into train and test.
The train data is then used by the algorithm to learn the relationship between
the feature variables and the target. This relationship is then referred to as the
ML model.
This way the algorithm will learn how the feature variables and target
variables are related by using the train dataset. And he verifies the
performance of the model on the test dataset.
Which metrics can be used to evaluate the model? Since he has created the
target variable beforehand, he has the actual signals. He can simply compare
the predicted signals of the ML model with the actual data.
For example, the ML model predicted a signal of 1 on 8th July 2021. This
means that the ML model thinks that the price on 9th July 2021 would be
higher than 8th July 2021.
He checked the price and sure enough, it had increased on 9th July. This
means the ML program was correct in predicting the price moves for that
day.
Finally, he generates signals through the machine learning model and
backtests them. He plots the equity curve and drawdown to analyse the
performance further.
This was a brief overview of the ML tasks. Let us deep dive into each of
these steps and implement them in Python.
For the purpose of this book, our problem statement will be, “Whether to buy
J. P. Morgan’s stock at a given time or not?”
In technical jargon, the possible answers to this question form the target variable.
future_returns    signal
0.003             Buy
0.007             Buy
-0.002            Sell
0.02              Buy
-0.005            Sell

This column marked with buy and sell is called the target variable. In machine learning, the target variable is conventionally denoted by the letter "y".
Similarly, you can create a target variable based on the different problem
statements you will be solving.
2.3 Features
Let’s assume there is no machine learning model and Rob has to make this
decision on his own.
What data points or information would Rob need in order to make this
decision? Can you help Rob?
Rob could probably look at data points such as the stock's recent price moves and its volatility.
Let's come back to the ML world. The same information that Rob needs to decide whether to buy or sell J. P. Morgan's stock is used by the ML model to make that decision. Rob passes all this information to the ML model, which in turn processes it and tells Rob to buy or sell. The pieces of information, such as volatility, that Rob passes to the ML model are called features.
Rob was excited and started collecting all the data he had, preparing it to pass to the ML model. On a whim, he also decided to collect the price history of apples (the fruit) for the same time period to use as a feature in his ML model.
How useful will this data be? It won’t be useful at all. The price of the fruit
will have no predictive power for Apple Inc. the company’s stock price.
Therefore, before passing any data to ML model, Rob needs to ensure that
each feature is relevant and has some predictive power.
But even if the data is relevant, there is one important thing to check. Let’s
see what this point is in the next section.
Let's look at one more example. An ML model learns that as the S&P 500 index increases, the volatility index (VIX) decreases, and vice versa. A basic machine learning algorithm, on seeing a constant increase in the S&P 500 price over the years, would start predicting the VIX to crash to 0 and then go into negative territory, whereas this will never be the case.
Rob has now understood this concept and has almost decided the features to
be added. He finally wanted to include the stock’s volatility in his model. But
what should he include in his model, weekly volatility or monthly volatility,
or both? Since these two features, weekly and monthly volatility, are highly correlated with each other, adding both of them will not add much extra information to the model. Instead, keeping both of them in the model might give undue weight to the information each carries individually. Hence, Rob decides to drop one of them. Thus, to get reliable results, you need to make sure that the above conditions are satisfied.
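Rob's intuition can be checked numerically. In this sketch the daily returns are simulated with a slowly drifting volatility (a stand-in for the volatility clustering seen in real markets, not actual market data), and the two rolling volatilities come out strongly correlated:

```python
import numpy as np
import pandas as pd

# Simulated daily returns whose volatility drifts slowly over time,
# mimicking the volatility clustering seen in real markets
rng = np.random.default_rng(2)
base_vol = 0.01 * (1 + 0.5 * np.sin(np.arange(750) / 40))
returns = pd.Series(rng.normal(0, 1, 750) * base_vol)

# Rolling volatility over ~weekly (5-day) and ~monthly (21-day) windows
weekly_vol = returns.rolling(5).std()
monthly_vol = returns.rolling(21).std()

# The two features carry largely the same information
print(round(weekly_vol.corr(monthly_vol), 2))
```

Both windows track the same underlying volatility, so the correlation is high, which is exactly why keeping both features adds little information.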
In the next chapter, we will see how you can create these targets and features
in Python.
You have decided that you want to trade in J. P. Morgan. More specifically,
you want to design an ML algorithm that will help you in deciding whether to
go long in J. P. Morgan at a given point in time. Thus, the problem statement
is:
All the data modules used in the book are uploaded on the following github
link: https://github.com/quantra-go-algo/ml-trading-ebook. The 15-minute
OHLCV data of J.P. Morgan stock price is stored in a CSV file JPM.csv in
the data_modules directory. The data ranges from January 2017 to December
2019.
To read a CSV file, you can use the read_csv method of pandas. The syntax
is shown below:
Syntax:
import pandas as pd
pd.read_csv(filename)
filename: Complete path of the file and file name in string format.
[2]: # The data is stored in the directory 'data_modules'
path = "../data_modules/"

# Read the data
data = pd.read_csv(path + 'JPM_2017_2019.csv', index_col=0)
data.index = pd.to_datetime(data.index)
Going back to our problem statement, Whether to buy J.P. Morgan’s stock
or not?, we will create a column, signal. The signal column will have two
labels, 1 and 0.
Whenever the label is 1, the model indicates a buy signal, and whenever the label is 0, the model indicates a do-not-buy signal. We will assign 1 to the signal column whenever the future returns are greater than 0.
Syntax:
DataFrame[column].pct_change().shift(period)
import numpy as np

# Create the future returns column: the next period's percentage change
data['future_returns'] = data['close'].pct_change().shift(-1)

# Create the signal column
data['signal'] = np.where(data['future_returns'] > 0, 1, 0)
data.head()
future_returns signal
2017-01-03 09:45:00+00:00 -0.002289 0
2017-01-03 10:00:00+00:00 0.001262 1
2017-01-03 10:15:00+00:00 0.000916 1
2017-01-03 10:30:00+00:00 -0.002861 0
2017-01-03 10:45:00+00:00 -0.006312 0
As you can see in the above table, the future return in the second row is positive (the close price increases from the second row to the third), and therefore the signal column in the second row is marked as 1. In other words, if you buy when the signal column is 1, it should result in positive returns for you. Our aim is to develop a machine learning model which can accurately forecast when to buy!
Features In order to predict the signal, we will create the input variables for
the ML model. These input variables are called features. The features are
referred to as X. You can create features in such a way that each feature in
your dataset has some predictive power.
import talib as ta

ta.RSI(data_close, timeperiod)
ta.ADX(data_high, data_low, data_close, timeperiod)
Store the signal column in y and features in X. The columns in the variable X
will be the input for the ML model and the signal column in y will be the
output that the ML model will predict.
[9]: # Target
y = data[['signal']].copy()

# Features
i = 1
plt.subplot(nrows, 2, i)

# Plot the feature
Thus, you can see that open, high, low, close, and sma are not stationary.
They are dropped from the dataset.
Correlation If two features are highly correlated, one of them is essentially redundant. Removing that feature can improve the model's learning.
[11]: import seaborn as sns

plt.figure(figsize=(8, 5))
sns.heatmap(X.corr(), annot=True, cmap='coolwarm')
plt.show()

[12]: def get_pair_above_threshold(X, threshold):
    # Correlation of all column pairs, flattened into a series
    correl = X.corr().unstack()

    # Recurring & redundant pairs (self-pairs and duplicates)
    pairs_to_drop = set()
    cols = X.corr().columns
    for i in range(0, X.corr().shape[1]):
        for j in range(0, i + 1):
            pairs_to_drop.add((cols[i], cols[j]))

    # Drop the recurring & redundant pairs
    correl = correl.drop(labels=pairs_to_drop) \
        .sort_values(ascending=False)

    return correl[correl > threshold].index
print(get_pair_above_threshold(X, 0.7))
MultiIndex([('volatility', 'volatility2')])
[13]: # Drop the highly correlated column
X = X.drop(columns=['volatility2'])
Once we have removed the non-stationary and highly correlated features, we will display the features we have selected.
Display the Final Features
[14]: list(X.columns)
[14]: ['pct_change',
'pct_change2',
'pct_change5',
'rsi',
'adx',
'corr',
'volatility']
Great! We have the features and target variable with us. What next? In the
next chapter, you will learn why it is important to split the data in test and
train data set.
In machine learning, the train-test split means splitting the data into two parts:
1. Training data (train_data)
2. Testing data (test_data)
Let us split our data into train and test data. In the first step, we will import
the libraries and the dataset.
Import Libraries
[1]: # For data manipulation
import pandas as pd

# Import sklearn's train-test split module
from sklearn.model_selection import train_test_split

# For plotting
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn-darkgrid')
If the exam covers 10 chapters and the student attempts it after studying only 2 chapters, this would probably result in a poor performance. In the machine learning context, this situation is analogous to under-learning. Under-learning is when the model is trained on very little train_data.
Say the train-test split is 80%-20%. This means 80% of the original data is the train_data and the remaining 20% is the test_data. The 80%-20% proportion is a popular way to split the data, but there is no rule of thumb that you always have to use it.
You can also try other popular proportion choices like 90%-10%, 75%-25%.
We will use a ready-made method called train_test_split from the sklearn
module to perform the train-test split.
Syntax:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size, shuffle)
Parameters:
Returns:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.80)

# Print the dimensions of the variables
1. The dimensions of the original dataset show that there were 7 features and
19318 observations in the feature dataset (X). The target variable (y) has one
column and the same number of observations as X.
2. The dimensions of the train_data show that X_train has 7 features and
15454 observations. That is 80% of 19318, rounded down to the nearest
integer. The target variable for the train data (y_train) has one column and the
same number of observations as X_train.
3. The dimensions of the test_data show that X_test has 7 features and 3864 observations. That is the remaining 20% of 19318. The target variable for the test data (y_test) has one column and the same number of observations as X_test.
Do you want randomly shuffled data for our train and test datasets? The
answer depends on what type of data you are handling. If you are handling
discrete observations, like the number of faulty products in a factory
production line, then you can shuffle the indices for the train-test split.
But as seen in the above illustration, we are dealing with financial time-series
data.
For time-series data, the order of indices matters and you cannot do random
shuffling. This is because the indices in time series data are timestamps that
occur one after the other (in sequence).
The reason for that is simple. You cannot use the data from 2021 to train your
model, and then use the model to predict the prices in 2017. It is not possible
in real life as we do not have access to future data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.80, shuffle=False)

# Plot the data
plt.figure(figsize=(8, 5))
Now that you have learnt how to split the data, let’s try to train a model and
use the trained model to make some predictions in the upcoming chapters.
As you have seen in countless movies, the hero always trains hard before facing the competitor. Similarly, we have to train the ML algorithm too, so that when it goes into the real world, it will perform spectacularly.
We will use the X_train and y_train to train the machine learning model. The
model training is also referred to as “fitting” the model.
After the model is fit, the X_test will be used with the trained machine
learning model to get the predicted values (y_pred).
Import Libraries
[]: # For data manipulation
import pandas as pd

X_test = pd.read_csv(
    path + "JPM_features_testing_2017_2019.csv", index_col=0,
    parse_dates=True)

y_train = pd.read_csv(
    path + "JPM_target_training_2017_2019.csv", index_col=0,
    parse_dates=True)
Syntax:
RandomForestClassifier(n_estimators, max_features, max_depth,
random_state) Parameters:
Returns:
A RandomForestClassifier type object that can be fit on the train data, and then used for making forecasts.
We have set the values for the parameters. These are for illustration and can
be changed.
[]: # Create the machine learning model
rf_model = RandomForestClassifier(
    n_estimators=3, max_features=3, max_depth=2, random_state=4)

Train the Model
Now it is time for the model to learn from the X_train and y_train. We call
the fit function of the model and pass the X_train and y_train datasets.
Syntax:
model.fit(X_train, y_train)
Parameters:
Returns:
The fit function trains the model using the data passed to it. The trained
model is stored in the model object where the fit function was applied.
[]: # Fit the model on the training data
rf_model.fit(X_train, y_train['signal'])
[]: RandomForestClassifier(max_depth=2, max_features=3, n_estimators=3,
random_state=4)
Forecast Data
The model is now ready to make forecasts. We can now pass the unseen data (X_test) to the model and obtain the predicted values (y_pred). To make a forecast, the predict function is called and the unseen data is passed as a parameter.
Syntax:
model.predict(X_test)
As we can see, the model correctly predicts the first three values of the
test_data. But how do we know the accuracy of the model prediction for the
entire dataset? We need to learn some metric for measuring the model
performance. This will be covered in the next chapter.
One of the easiest ways to find out is to check how many times the predictions made by the ML model were correct. For example, if the ML model is right only 50% of the time, then it is not useful at all, as you can be right 50% of the time by making random decisions too.
For instance, the task is to predict the heads or tails in a coin toss. The model
can predict heads every time and be right 50% of the time. Since there are
only two choices, heads or tails, even a random guess will be right 50% of the
time. Therefore, you need the model to be right more than 50% of the times.
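Counting correct predictions is easy to sketch with scikit-learn's `accuracy_score` (the signals below are made up for illustration):

```python
from sklearn.metrics import accuracy_score

# Hypothetical actual signals vs ML predictions for 10 days
actual    = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
predicted = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

# 8 of the 10 predictions match the actual signals
accuracy = accuracy_score(actual, predicted)
print(accuracy)  # 0.8
```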
On the other hand, if the ML model was right 8 out of 10 times, then it is right 80% of the time, and we can say that it is indeed able to find some patterns mapping the input to the output, rather than being random. In technical jargon, this score of 80% is called accuracy.
What should be the desired threshold of the accuracy while using ML models
for trading?
# Libraries for plotting
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn-darkgrid')

import seaborn as sns
import matplotlib.colors as clrs
Accuracy is nothing but the total correct predictions divided by the total
predictions. We plot the data to see how the correct and incorrect predictions
are distributed. The green points are where the predictions were correct and
the red points are where the predictions were incorrect.
# Accuracy percentage
accuracy_percentage = round(
    100 * accuracy_data.sum() / len(accuracy_data), 2)

plt.yticks([])
plt.scatter(x=y_test.index, y=[1] * len(y_test),
            c=(accuracy_data != True).astype(float),
            marker='.', cmap=cmap)
The accuracy is calculated as seen above. These calculations for the accuracy
and other performance metrics can be done using the ready-made
classification_report method. You will learn about the classification_report
method in the latter part of the chapter.
To get an answer to that question, you have to get the accuracy number label-wise, that is, for each action the ML model can take. In this case, that means the accuracy for each buy and sell prediction. This will give you a more granular view of how the model is performing.
It turns out that the ML model predicted 100 times that the price would go up; the price actually went up 90 times, and 10 times it actually fell.
So for predicting the “Buy” signal, the model is 90% accurate. That’s pretty
good.
Actual price movement    ML Predicted Buy
Price does not go up     10
Price goes up            90
But what about the sell signal? The model predicted 50 times that the price would go down. The price went down only 20 times, and 30 times it went up. The model is only 40% accurate in predicting the sell signal. Therefore, placing a sell order based on the model's recommendation is bound to be disastrous.
Sometimes, jargon is used to describe the matrix labels: False Positive, True Positive, False Negative, and True Negative. In the above example, let's say Buy is Positive and Sell is Negative.
The occasions where the ML model predicted buy correctly are called True Positives. The occasions where the ML model was wrong in predicting Buy (the positive label) are called False Positives: false because the ML model was wrong in forecasting the positive label.
Similarly, can you tell what is a True Negative and False Negative?
Let us see the confusion matrix of our machine learning model now. Syntax:
confusion_matrix(y_test, y_pred)
Parameters:
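A minimal sketch with made-up signals (1 = buy/positive, 0 = sell/negative); scikit-learn arranges the matrix with actual labels as rows and predicted labels as columns:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical actual vs predicted signals
y_actual = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred   = [1, 0, 1, 0, 1, 1, 0, 1]

# Layout for binary labels:
# [[True Negative, False Positive],
#  [False Negative, True Positive]]
cm = confusion_matrix(y_actual, y_pred)
print(cm)
```

Here the model produced 4 True Positives, 2 True Negatives, 1 False Positive, and 1 False Negative.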
The confusion matrix helps us understand the effectiveness of the model, but
it has its own limitation.
If you have more labels to classify, the confusion matrix grows. For example, with 3 labels like buy, no position, and sell, it looks like this:

Actual price movement    ML Predicted: Buy    No position    Sell
Price goes up            10                   25             40
Price stays the same     90                   25             20
Price goes down          80                   40             30
And if your labels are the quantity of shares to buy, which can range from 0 to 10, then it will be an 11 by 11 table.
Thus, as the number of labels grows, it becomes difficult to interpret.
Are there other reasons we should not keep accuracy as the only metric?
Let's take a slight diversion here and imagine that Rob wants to create an ML algorithm that would be helpful in times of stress or market falls, such as during the outbreak of the COVID-19 pandemic. He analysed the daily returns of the S&P 500 for the past 40 years. Rob built an ML algorithm to predict when the S&P 500's daily returns will be less than -5%. Based on the ML model's prediction, Rob will short S&P 500 futures. This ML model showed an accuracy of 99.8%.
Whoa! He was on cloud 9!
But Mary, his friend and colleague, was suspicious and took a deep dive into the data. The ML algorithm was run on 10,000 days from 1980 to 2021, and it was correct 9980 times, resulting in an accuracy of 99.8%. She found that the S&P 500 fell by more than 5% on only 20 days, and that the algorithm predicted not to short S&P 500 futures on all days, including those 20. So on all 20 days where the S&P 500 fell by more than 5%, the model was incorrect.
Mary explained the problem to Rob and advised him that there is a better
metric to use here. It is to check how many times the model predicted that the
market will fall by 5% correctly.
                       ML prediction: not below -5%    ML prediction: below -5%
Actual: not below -5%  9980                            0
Actual: below -5%      20                              0

Now consider the opposite extreme: a model that always predicts that the market will fall below -5%.

                       ML prediction: not below -5%    ML prediction: below -5%
Actual: not below -5%  0                               9980
Actual: below -5%      0                               20

The recall for the market falling by more than 5% will then be 20/20, or 100%: it catches every fall. But you know this model is not useful either.
Mary said that you should also check, out of all the times the model gave the sell signal, how many times it turned out right. The always-predict-a-fall model above predicted a fall on all 10,000 days and was right on only 20 of them. Rob realised that this value is 20 out of 10,000, which is 0.002. This value is called precision.
Rob realised that both recall and precision are equally important.
Mary said yes, and there is a single metric that combines them: the f1 score.
You can use the f1 score to understand the overall performance of the
algorithm. The f1 score is the harmonic mean of precision and recall.
f1 score = 2 * (precision * recall) / (precision + recall)
Rob quickly calculated the f1 score for this model. It came out to roughly
0.004:

f1 score = 2 * (1 * 0.002) / (1 + 0.002) = 0.004 / 1.002 ≈ 0.004
Let us look at the formulae for the different performance metrics:

Recall = (Number of times the algorithm predicted an outcome correctly) / (Total number of the actual outcomes)

Precision = (Number of times the algorithm predicted an outcome correctly) / (Total number of said outcomes predicted by the algorithm)

f1-score = 2 * (precision * recall) / (precision + recall)
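These formulae can be checked on the numbers from Rob's example using sklearn. The sketch below reconstructs the 10,000-day scenario with the always-predict-a-fall model; the arrays are illustrative, not real market data:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# 9980 days with no large fall (label 0) and 20 days where the
# market fell by more than 5% (label 1)
y_true = np.array([0] * 9980 + [1] * 20)

# A model that predicts a fall on every single day
y_pred = np.ones(10000, dtype=int)

recall = recall_score(y_true, y_pred)        # 20 / 20 = 1.0
precision = precision_score(y_true, y_pred)  # 20 / 10000 = 0.002
f1 = f1_score(y_true, y_pred)                # 2 * (1 * 0.002) / 1.002
```

The numbers reproduce the chapter's hand calculation: perfect recall, near-zero precision, and an f1 score of roughly 0.004.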
In the left-most column, you can see the values 0.0 and 1.0. These represent
the position as follows:
1. 0.0 means no position.
2. 1.0 means a long position.
So from the table, you can say that the ML Model has an overall accuracy
score of 0.52. The accuracy we calculated was 51.55% which is
approximately 0.52. Apart from accuracy, you can identify the precision,
recall, and f1-score for the signals as well.
The accuracy score tells you how the ML model performed in total.
What are macro and weighted average?
Sometimes, the signal values might not be balanced. There could be instances
where the number of occurrences for 0.0 is barely 50 while the number of
occurrences for 1.0 is 500. In this scenario, the weighted average will give
more weight to signal 1.0, in proportion to its support. In contrast, the
macro average takes a simple, unweighted average of the per-class metrics.
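The difference can be seen with the average parameter of sklearn's metric functions. The label counts below mirror the 50-vs-500 example; the predictions are hypothetical, chosen so that half of each class is classified correctly:

```python
import numpy as np
from sklearn.metrics import f1_score

# Imbalanced labels: 50 occurrences of class 0 and 500 of class 1
y_true = np.array([0] * 50 + [1] * 500)
# A prediction that gets half of each class right
y_pred = np.array([0] * 25 + [1] * 25 + [0] * 250 + [1] * 250)

macro = f1_score(y_true, y_pred, average='macro')        # simple mean of per-class f1
weighted = f1_score(y_true, y_pred, average='weighted')  # weighted by class support
```

Because class 1 has ten times the support, the weighted average is pulled towards its (higher) f1 score, while the macro average treats both classes equally.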
Thus, the machine learning model’s performance can be analysed using the
metrics you have learned in this notebook. Great! Now you know the metrics
to analyse the performance, but what happens if we use this algorithm in the
real world?
Let’s backtest this model and see how it would have performed!
Why are we backtesting the model? Think about it, before you buy
anything, be it a mobile phone or a car, you would want to check the history
of the brand, its features, etc. You check if it is worth your money. The same
principle applies to trading, and backtesting helps you with it.
Do you know that the majority of traders in the market lose money?
They lose money not because they lack understanding of the market, but
simply because their trading decisions are not based on sound research and
tested trading methods.
They make decisions based on emotions and suggestions from friends, and take
excessive risks in the hope of getting rich quickly. If they remove emotions
and instincts from trading and backtest their ideas before trading, then the
chance of trading profitably in the market increases.
In the previous chapters, you took the 15-minute, 30-minute, and 75-minute
prior percentage change as your features and expected that these features will
help you predict the future returns. This is your hypothesis.
How would you test this hypothesis? How would you know whether the
strategy will work in the market or not?
By using historical data, you can backtest and see whether your hypothesis is
true or not. It helps assess the feasibility of a trading strategy by discovering
how it performs on the historical data.
If you backtest your strategy on the historical data and it gives good returns,
you will be confident to trade using it. If the strategy is performing poorly on
the historical data, you will discard or re-evaluate the hypothesis.
We will go through a few terms and concepts which will help us analyse our
strategy. But we will do this simultaneously to see how our strategy
performs. First, let us read the data files.
In the previous chapters, you have used a random forest classifier to generate
the signal whether to buy J.P. Morgan’s stock or not.
The signals are stored in a CSV file JPM_predicted_2019.csv. We will use the
read_csv method of pandas to read this CSV and store it in a dataframe
strategy_data.
Also, we will read the close price of J.P. Morgan stored in the column close
in the CSV file JPM_2017_2019.csv and store it in a close column in
strategy_data. While reading the close price data, we will slice the period
to match the signal data.
# For plotting
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn-darkgrid')

# Read the close price
strategy_data['close'] = pd.read_csv(
    path + "JPM_2017_2019.csv", index_col=0)\
    .loc[strategy_data.index[0]:]['close']
strategy_data.index = pd.to_datetime(strategy_data.index)
The first thing you want to know from your strategy is what the returns are.
Only then will you think if the trade made sense. So you will first calculate
the strategy returns, as shown below.
The strategy returns help us understand the returns on a granular level. But
now you want to visualise how the portfolio value has changed over a period
of time. You can use the equity curve for this purpose.
Plot the Equity Curve You can plot the cumulative_returns column of
strategy_data to obtain the equity curve.
[]: # Calculate the cumulative returns
strategy_data['cumulative_returns'] = (
    1 + strategy_data['strategy_returns']).cumprod()
From the above output, you can see that the strategy generated a cumulative
return of 28.10% in seven months. That is impressive. But this is not the
only metric we need to see. Let us dive further to understand the performance
of your strategy in detail.
You can analyse the returns generated by the strategy and the risk associated
with them using different performance metrics.
Annualised Returns = (Cumulative Returns) ^ ((252 * 6.5 * 4) / no. of 15-minute data points) - 1
Note: There are approximately 252 trading days in a year, and 6.5 trading
hours in a day. Since we are working with 15-minute data, the number of
trading periods in a year is 252 * 6.5 * 4.
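As a quick sketch, the formula can be applied directly in Python. The number of data points and the cumulative-return multiple below are hypothetical, chosen only to illustrate the calculation:

```python
# Hypothetical inputs: about 7 months of 15-minute bars, and a
# portfolio that grew to 1.2810x its starting value
n_datapoints = 3640
cumulative_return = 1.2810

# 15-minute bars in a trading year: 252 days * 6.5 hours * 4 bars/hour
periods_per_year = 252 * 6.5 * 4

# Compound the cumulative return up to a full year
annualised_return = cumulative_return ** (periods_per_year / n_datapoints) - 1
```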
This is interesting. But volatility treats positive and negative returns the
same. You would like to know how far the portfolio can go into negative
territory. You can check that using the maximum drawdown.
Maximum Drawdown = (Trough Value - Peak Value) / Peak Value
[]: # Calculate the running maximum
running_max = np.maximum.accumulate(
    strategy_data['cumulative_returns'].dropna())

# Calculate the drawdown in percentage terms
drawdown = ((strategy_data['cumulative_returns'].dropna() - running_max)
            / running_max) * 100

# Calculate the maximum drawdown
max_dd = drawdown.min()
print("The maximum drawdown is {0:.2f}%.".format(max_dd))

plt.tight_layout()
plt.show()
The maximum drawdown is -7.94%.
From the above output, you can see that the maximum drawdown is -7.94%. This
means that the maximum value the portfolio lost from its peak was 7.94%. As
with any investment, you can also calculate the Sharpe ratio to understand
how well the strategy performs.
Sharpe ratio = (Rp - Rf) / σp

where,
• Rp is the return of the portfolio.
• Rf is the risk-free rate.
• σp is the volatility (standard deviation) of the portfolio returns.
A portfolio with a higher Sharpe ratio will be preferred over a portfolio with a
lower Sharpe ratio.
Note that to keep the chapter simple, transaction costs and slippage were
not considered while analysing the performance of the strategy, although
these are important concepts too. Also, since the risk-free rate depends on
the region as well as the time period, we have preferred to keep it at 0. You
can of course plug it into the formula and calculate it yourself.
Volatility and maximum drawdown are the standard measures of risk. If you
are concerned about the maximum loss a strategy can incur over a period of
time, then you can use maximum drawdown.
Traders also use the Sharpe ratio as it provides information about the returns
per unit risk. So, it is using both factors, risk and returns.
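A minimal sketch of the Sharpe calculation on per-period returns. The return series below is hypothetical, Rf is taken as 0 as in the text, and the ratio is annualised with the same 252 * 6.5 * 4 period count used above:

```python
import numpy as np

# Hypothetical 15-minute strategy returns
returns = np.array([0.0010, -0.0005, 0.0020, 0.0000, -0.0010, 0.0015])

periods_per_year = 252 * 6.5 * 4

# Sharpe ratio with Rf = 0, annualised by the square root of
# the number of periods in a year
sharpe = np.sqrt(periods_per_year) * returns.mean() / returns.std()
```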
Next, you repeat the optimisation using data from years 2–4, and validate
using year 5. You keep repeating this process until you’ve reached the end
of the data. You then collate the performance on all the out-of-sample data
from year 4 to year 10, which is your out-of-sample performance.
Paper Trading & Live Trading You created the strategy and analysed the
performance of the strategy.
If you are satisfied with the backtesting strategy performance, then you can
start paper trading. If not, you should tweak the strategy until the
performance is acceptable to you. And once the paper trading results are
satisfactory, you can start live trading.
You get the best result on the historical dataset, but when you deploy the
same model on the unseen dataset, it might fail to give the same result.
The best way to avoid overfitting is to validate the model on out-of-sample
data that was not used during training.
While devising a strategy, you have access to the entire dataset. Thus, there
might be situations where you include future data that was not available in
the time period being tested. This is known as lookahead bias.
It’s a simple fact: after the year 2000, the companies which survived did well
because their fundamentals were strong, and hence your strategy would not
be including the whole universe of stocks that existed at the time. This is
known as survivorship bias. Thus, your backtesting result might not give the
whole picture.
Some of the common backtesting software and live trading software are:
1. Blueshift
2. MetaTrader
3. Amibroker
4. QuantConnect
5. Quanthouse, etc.
While this has already been said before, it is not necessary that the past is
always representative of the future.
The rules identified on the historical data might not have performed well
during this pandemic.
Great! You have successfully built your own machine learning algorithm, and
not only have you tested it on historical data, but you have also analysed
its results.
Before you go ahead and start live trading, there might be some things you
need to take care of first. In the next section, we will answer some of the
most common questions when it comes to applying a machine learning model in
live markets.
How to save a model? Training machine learning models takes a lot of time
and resources. So, most traders train their models on weekends if the model
works on daily data, or after the market closes if the model works on
intraday data.
In both the cases, traders use the latest data available to train their models and
then save it. These models are later retrieved and used to make predictions
while trading. This process saves both time and resources. You can use the
pickle library to save a model once it is trained.
When you save any object using pickle, that object will be converted into a
byte stream of 1s and 0s, and then saved. When you want to load a pickled
object then an inverse operation takes place, whereby a byte stream is
converted back into an object. There are a couple of things that you need to
remember when saving an object using pickle.
When (de)serialising objects you need to use the same version of Python, as
the process differs between versions and this might result in errors.
Security
Pickle data can be altered to insert malicious code, so it is not recommended
to restore data from unauthenticated sources.
Let us save the model. Before you save the model, you need to decide two
parameters: What do you want to save? How do you want to save it?
In the code shown here, we have created a simple function called save_model
which takes these two parameters as inputs and saves the model.
Right now, we will assume we are saving the file on a local system.
The save_model function opens a file on the local machine using the variable
model_pickled_name. Here the keyword ‘wb’, or ‘write binary’, implies that
Python will overwrite the file if it already exists, or create a new one if
it doesn’t. Then the dump command of the pickle library is used to write the
model to the specified destination.
Apart from the pickle library you can also use joblib and JSON libraries to
save your models. The joblib library is very efficient compared to the pickle
library when saving objects containing large data. On the other hand, JSON
saves a model in a string format which is easier for humans to read.
Load a Model Once you have saved the model, you can access the model on
your local machine by using the load_model function. This function takes the
name of the pickled model as its input and loads that model.
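The save_model and load_model functions described above could look like this. This is a sketch; the exact function bodies in the book may differ:

```python
import pickle

def save_model(model, model_pickled_name):
    # 'wb' (write binary): overwrite the file if it already
    # exists, or create a new one if it doesn't
    with open(model_pickled_name, 'wb') as f:
        pickle.dump(model, f)

def load_model(model_pickled_name):
    # 'rb' (read binary): read the byte stream back and
    # rebuild the original object
    with open(model_pickled_name, 'rb') as f:
        return pickle.load(f)
```

A typical weekend workflow would then be save_model(trained_model, 'rf_model.pkl'), and later model = load_model('rf_model.pkl') before the trading session (the file name here is illustrative).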
How to handle the data? To train machine learning based trading models
we require a lot of data. Downloading data from an online source every time
you want to train a model takes a lot of time. To avoid this, the old data
that you used to train the initial model should be saved on your local
machine, and the new data can be added to this file at the end of trading
every day. The new data can be appended to the existing data using pandas.
On the left you can see the old data file, and on the right you can see the
updated data file after adding the new data.
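A sketch of this daily update step. Note that DataFrame.append was removed in recent pandas versions, so pd.concat is used here; the column name, dates, and file name are illustrative:

```python
import pandas as pd

# Old data already saved on the local machine
old_data = pd.DataFrame({'close': [100.0, 101.5]},
                        index=pd.to_datetime(['2021-01-01', '2021-01-04']))

# New data downloaded at the end of the trading day
new_data = pd.DataFrame({'close': [102.3]},
                        index=pd.to_datetime(['2021-01-05']))

# Append the new rows and overwrite the saved file
updated = pd.concat([old_data, new_data])
updated.to_csv('price_data.csv')
```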
When do you retrain the model? We need to retrain a model whenever its
performance deteriorates.
You can decide when to retrain a model based on its performance metrics
such as:
1. Capital Loss Let us say that you want to retrain a model based on its
capital loss. Then you need to track the profit and loss (or PnL) of the
strategy at every time period, such as every day or every minute.
If the PnL falls below a certain limit, then you will retrain the model.
Say the model initially made a profit of 100 dollars and has since lost 5
dollars, which is the cutoff criterion in this case.
After the cutoff criterion is triggered, we will stop trading and then
retrain the model. This cutoff criterion is decided by a trader, depending
on his or her own risk appetite.
2. Accuracy This is another criterion that can be used to decide whether to
retrain a model or not.
Let us say that you have set 55% accuracy as the criterion for retraining a
model. Whenever the model’s accuracy falls below the 55% mark, you
retrain it.
In addition to these two approaches, you can retrain your model as often as
possible, regardless of the model’s performance. However, make sure your
model is not overfitted.
This will create a model that is trained on the latest available data at all times.
When you want to retrain a model, you need to perform many tasks such as
creating the features, training the model and saving it.
To do these multiple tasks, we created a simple function called
create_new_model. This function takes the raw data and the saved name of
the model as input.
In this way, you should take care that your machine learning model is
performing according to your expectations. Remember, there might be
occasions where your model’s performance might start deteriorating. Do not
hesitate in pausing your trading until you have modified the strategy to
perform as per your expectations.
Great! We have finally implemented a machine learning model from start
to end.
So far, you have studied the classification based machine learning model.
This is a type of supervised learning algorithm. This brings us to the end of
the second part of the book. In the next part, you will see other types of
machine learning algorithms.
Additional Reading
1. A Practical Guide to Feature Engineering in Python - https://heartbeat.fritz.ai/a-practical-guide-to-feature-engineering-in-python8326e40747c8
How does linear regression work or how to find the relationship? We will
work with a simplified example to learn the workings of linear regression.
There are two time series x and y.
Date x y
1    2 4
2    4 8
3    8 16
4    3 ?

You can easily see here that y is double of x. So you can say that the price
of Y will be $6. In other words, the mathematical representation of the
relationship is y = 2x. You can also plot the first three points on a graph.
To predict the price of Y on day 4, given that the price of X is $3, you can
draw the line as shown to get the price of $6 for Y.
This graph is also called a scatter plot.
Let’s take another example. From the table or the scatter plot shown below,
what will be the price of y on day 4, given that the price of X is $6?

Date x y
1    2 9
2    3 11
3    4 13
4    6 ?

It is easy to find from the scatter plot that the price would be $17. The
relationship here is y = 2x + 5. This equation can also be used to describe
the line on the scatter plot. Here, 2 is the slope of the line and 5 is the
intercept, or the value of y when x is 0.
This is the basic principle behind linear regression, where you fit a line to
connect the data points. Further, you can use this line to predict the value of
Y if you know the value of X.
In the above two examples, all the points were on the line.
But this is far from true for real-world data. If we change the x and y to J. P.
Morgan and Bank of America, then the points would fall either above or
below the line. The distance between the point and the line is called the
forecasting error.
There are various regression models which work to minimise this error. Let
us see how linear regression can be used on real world data and how to judge
its performance.
Syntax:
import statsmodels.api as sm
model = sm.OLS(y, X).fit()
1. y: the dependent variable (target).
2. X: the independent variable(s).
The following methods/properties are used:
1. add_constant(): Adds a constant (intercept) column to X; if it is not
added, the intercept is taken to be 0.
2. summary(): View the details of the regression model.
[Output of model.summary(): the OLS regression results table. Key values
include R-squared 0.816, Adj. R-squared 0.815, an F-statistic of 1101, and a
const coefficient of -10.8576 (std err 3.764, t -2.885, P>|t| 0.004, 95%
interval [-18.270, -3.445]).]
[]: model.params
9.2 R-squared
R-squared is also known as the Coefficient of Determination and is
denoted by R2.
You have already seen that the R-squared explains the percentage of variation
in the dependent variable, y, that is described by the independent variable, X.
It is equal to the square of correlation. Mathematically,
R² = (Variance Explained by the Model) / (Total Variance)
   = 1 - (Unexplained Variance) / (Total Variance)
   = 1 - (Sum of Squared Regression Errors) / (Sum of Squared Total Errors)
   = 1 - Σᵢ₌₁ⁿ (yᵢ - ŷᵢ)² / Σᵢ₌₁ⁿ (yᵢ - ȳ)²

Where, yᵢ is the observed value for the ith row. - ŷᵢ is the predicted value
for the ith row. - ȳ is the average of the observed y’s.
Now, since you understand the math behind R², what is the range of R²? The
value of R² always lies between 0 and 1.
Refer back to the formula above, and you will see that R² will be 1 if and
only if all the yᵢ’s are exactly equal to the respective ŷᵢ’s. That means R²
will be 1 when the model is able to predict all the values precisely, and R²
will be 0 when the model is not able to capture any relationship between the
X’s and the y’s.
Calculate R-squared Let us calculate the R-squared for the above data using
the formula.
[]: r_sq = 1 - ((data['Observed JPM'] - data['Predicted JPM_BAC'])
                ** 2).sum() / ((data['Observed JPM']
                - data['Observed JPM'].mean()) ** 2).sum()
print('The R-squared is %.2f' % r_sq)
The R-squared is 0.82
The above value of R-squared means that 82% of the variance in the stock
price of J.P. Morgan is explained by the variance in the stock price of Bank
of America. You can also calculate the same using the sklearn library. The
sklearn library has an r2_score function which can be imported as below:
from sklearn.metrics import r2_score
data_nestle.tail()
Can you interpret the value of R2 above? The above value of R-squared, 0.35,
means that only 35% of the variance in the stock price of J.P. Morgan is
explained by the variance in the stock price of Nestle.
Let us have a look at the plot of the JPM Price vs BAC Price and JPM Price
vs Nestle Price.
[]: fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(6, 5*2))

# Plot of JPM Price vs BAC Price
ax1.scatter(data['BAC Close'], data['Observed JPM'], color="green")
ax1.set_title('R-squared = %.2f'
              % r2_score(data['Observed JPM'],
                         data['Predicted JPM_BAC']), fontsize=14)
ax1.set_xlabel('BAC Price', fontsize=12)
ax1.set_ylabel('JPM Price', fontsize=12)
Of the two values of the R2 that we have seen, which one do you think is
better?
Yes, the first value (82%) is better as it explains more variance. In the
first graph, you can see that the points are closer to the line of best fit,
while they are more scattered in the second graph. This explains the
difference between the two values of R².
Intuitively, the change in the stock price of J.P. Morgan would be highly
correlated to the change in the stock price of Bank of America, and it would
not be correlated to the change in the stock price of Nestle. This is because
both J.P. Morgan and Bank of America belong to the same sector of Banking,
and are very sensitive to factors like interest rates. Nestle belongs to a
different sector, Fast Moving Consumer Goods. The dynamics of this sector
are different and are affected by different factors.
R-squared does not allow us to see if the predictions are biased. This can be
done by analysing the residuals: any pattern in the residual plot will help
us identify bias in the model, if any. Hence, a high R² alone is not always
a good statistic.
Using R-squared, you concluded how the stock price of Bank of America was
able to explain better the variance in the stock price of J.P. Morgan. Can you
think of any other stock price that can also help you explain this variance?
The stock price of any other bank or investment institute might do the job for
you.
Now it’s your turn. You can download any such data from
finance.yahoo.com and apply what you learned.
10 Logistic Regression
Logistic regression falls under the category of supervised learning. It
measures the relationship between the categorical dependent variable, and
one or more independent variables by estimating probabilities using a
logistic/sigmoid function.
In spite of the name ‘logistic regression’, it is not used for machine
learning regression problems where the task is to predict a real-valued
output. It is a classification algorithm, used to predict a binary
outcome (1/0, -1/1, True/False) given a set of independent variables.
The aim of linear regression is to estimate values for the model coefficients
c,w1,w2,w3....wn and fit the training data with minimal squared error and
predict the output y.
Logistic regression does the same thing, but with one addition. The logistic
regression model computes a weighted sum of the input variables similar to
the linear regression, but it runs the result through a special non-linear
function, the logistic function or sigmoid function to produce the output y.
Here, the output is binary or in the form of 0/1 or -1/1.
y = logistic(c + w1*x1 + w2*x2 + .... + wn*xn)
As you can see in the graph, it is an S-shaped curve that gets closer to 1 as the
value of input variable increases above 0, and gets closer to 0 as the input
variable decreases below 0. The output of the sigmoid function is 0.5 when
the input variable is 0.
Thus, if the output is more than 0.5, we can classify the outcome as 1 (or
positive) and if it is less than 0.5, we can classify it as 0 (or negative).
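The sigmoid function and the 0.5 threshold can be sketched in a few lines:

```python
import numpy as np

def sigmoid(z):
    # Maps any real number into the open interval (0, 1)
    return 1 / (1 + np.exp(-z))

# Output is 0.5 at z = 0, approaches 1 for large positive z
# and 0 for large negative z
p = sigmoid(np.array([-4.0, 0.0, 4.0]))
labels = np.where(p > 0.5, 1, 0)  # classify using the 0.5 threshold
```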
Now, we have a basic intuition behind the logistic regression and the sigmoid
function. We will learn how to implement logistic regression in Python and
predict the stock price movement using the above condition.
# Technical indicators
import talib as ta

# Machine learning
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
Import Dataset We will use the same data which we used in the second part
of the book.
[]: # The data is stored in the directory 'data_modules'
path = "../data_modules/"

# Read the data
data = pd.read_csv(path + 'JPM_2017_2019.csv', index_col=0)
data.index = pd.to_datetime(data.index)
[]:    0             1
0  pct_change    [0.015386536081786934]
1  pct_change2   [-0.08026259453839449]
2  pct_change5   [0.024335207683013584]
3  rsi           [0.0052866122780786074]
4  adx           [0.013464920611383466]
5  corr          [-0.001639525259172929]
6  volatility    [0.016321785785599726]
Predict Class Labels Next, we will predict the class labels using predict
function for the test dataset.
If you print ‘predicted’ variable, you will observe that the classifier is
predicting 1, when the probability in the second column of variable
‘probability’ is greater than 0.5. When the probability in the second column is
less than 0.5, then the classifier is predicting -1.
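This relationship between predict_proba and predict can be verified on a small toy fit. The single feature and training points below are hypothetical, but the -1/1 labels match the book's setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature training data with -1/1 labels
X = np.array([[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
probability = clf.predict_proba(X)  # column 0: P(y=-1), column 1: P(y=1)
predicted = clf.predict(X)

# predict returns 1 exactly where the second column exceeds 0.5
agrees = (probability[:, 1] > 0.5) == (predicted == 1)
```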
Create Trading Strategy Using the Model We will predict the signal on the
test dataset. We will calculate the cumulative strategy return based on the
signal predicted by the model in the test dataset. We will also plot the
cumulative returns.
# Calculate the strategy returns
strategy_data['strategy_returns'] = \
    strategy_data['predicted_signal'].shift(1) * \
    strategy_data['pct_change']

# Drop the missing values
strategy_data.dropna(inplace=True)
strategy_data.head()
[]: [Output: the first few rows of strategy_data with the pct_change,
predicted_signal and strategy_returns columns.]

The maximum drawdown is -3.80%.
The Sharpe ratio is 2.75.
But in a chapter about Naive Bayes, why are we talking about razors?
Actually, Naive Bayes implicitly incorporates this belief, because it really
is a simple model. Let’s see how a simple model like the Naive Bayes model
can be used in trading.
What is Naive Bayes? Let’s take a short detour and understand what the
“Bayes” in Naive Bayes means. There are basically two schools of thought
when it comes to probabilities. One school suggests that the probability of
an event can be deduced by enumerating all the possible outcomes and then
calculating the probability of the event you are interested in.
For example, in the case of a coin toss experiment, you know that the
probability of heads is ½ because there are only two possibilities here,
heads or tails.
Now, we will find the probability that the price rises the next day if the RSI is
below 40 by the same formula.
Here, B is similar to a feature we define in machine learning. It can also
be called the evidence.
But hold on! What if we want to check when the RSI is below 40, as well as
the “slow k” of stochastic oscillator is more than its “slow d”?
P(A|B,C) = P(B|A) * P(C|A) * P(A) / (P(B) * P(C))
While this looks simple enough to compute, if you add more features to the
model, the complexity increases. Here is where the Naive part of the Naive
Bayes model comes into the picture.
Assumption of Naive Bayes Model The Naive Bayes model assumes that
both B and C are independent events. Further the denominator is also
dropped. This simplifies the model to a great extent and we can simply write
the equation as:
P(A|B,C) = P(B|A) * P(C|A) * P(A)
You must remember that this assumption can be incorrect in real life.
Logically speaking, both the RSI and stochastic indicators are calculated
using the same variable, i.e. price data. Thus, they are not exactly
independent.
However, the beauty of Naive Bayes model is that even though this
assumption is not true, the model still performs well in various scenarios.
Wait, is there just one type of Naive Bayes model? Actually there are three.
Let’s find out in the next section.
Types of Naive Bayes models Depending on the requirement, you can pick
the model accordingly. These models are based on the input data you are
working on:
Multinomial: This model is used when we have discrete data and are working
on its classification. A simple example: we can have the weather (cloudy,
sunny, raining) as our input and we want to predict in which weather a
tennis match is played.
The great thing about Python is that the sklearn library incorporates all these
models. Shall we try to use it for building our own Naive Bayes model? Why
not try it out.
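A minimal sketch with sklearn's GaussianNB. The feature values are toy numbers standing in for an RSI reading and a slow-k minus slow-d spread, not real indicator data:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy features: [RSI, slow_k - slow_d]
X = np.array([[30, 5], [35, 3], [38, 4],
              [60, -2], [65, -4], [70, -3]])
y = np.array([1, 1, 1, -1, -1, -1])  # 1: price rises the next day

model = GaussianNB().fit(X, y)
prediction = model.predict([[32, 4]])  # a low-RSI observation
```

Because the new observation sits close to the low-RSI cluster, the model classifies it as a rise (label 1).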
# Technical indicators
import talib as ta

# Read the data
data = pd.read_csv(path + 'JPM_2017_2019.csv', index_col=0)
data.index = pd.to_datetime(data.index)
How was the accuracy of our model? Let’s find out.

[]: from data_modules.utility import get_metrics
get_metrics(y_test, predicted)

   precision  recall  f1-score  support
0       0.51    0.64      0.56     1936
1       0.51    0.38      0.44     1928
# Calculate the strategy returns
strategy_data['strategy_returns'] = \
    strategy_data['predicted_signal'].shift(1) * \
    strategy_data['pct_change']

# Drop the missing values
strategy_data.dropna(inplace=True)
strategy_data.head()

[]:                            pct_change  predicted_signal  strategy_returns
2019-05-28 12:15:00+00:00        0.000732                 0          0.000732
There is obviously room for improvement here, but this was just a
demonstration of how a Naive Bayes model works. But are there special
occasions when the model should be used? Let’s find out in the next section.
• The main advantage of the Naive Bayes model is its simplicity and fast
computation time. This is mainly due to its strong assumption that all events
are independent of each other.
• They can work on limited data as well.
• Their fast computation is leveraged in real time analysis when quick
responses are required.
Although this speed comes at a price. Let’s find out how in the next section.
Disadvantages of Naive Bayes Model
• Since Naive Bayes assumes that all events are independent of each other, it
cannot compute the relationship between the two events
• The Naive Bayes model is fast, but this comes at the cost of accuracy.
Naive Bayes is sometimes called a bad estimator.
• The equation for Naive Bayes shows that we are multiplying the various
probabilities. Thus, if one feature returns a probability of 0, it can turn
the whole result into 0. There are, however, various methods to overcome
this. One of the more famous ones is Laplace correction (or Laplace
smoothing), in which a small count is added to every feature-class
combination, which ensures that we never get a zero probability.
Conclusion The Naive Bayes model, despite the fact that it is naive, is pretty
simple and effective in a large number of use cases in real life. While it is
mostly used for text analysis, it has been used as a verification tool in the
field of trading.
The Naive Bayes model can also be used as a stepping stone towards more
precise and complex classification based machine learning models.
12 Decision Trees
Decision Trees are a supervised machine learning method used in
classification and regression problems, also known as CART (Classification
and Regression Trees).
A regression problem tries to forecast a number, such as the return for the
next day. Although classification and regression problems have different
objectives, the trees have the same structure:
If we look at the first four steps, they are common operations for data
processing.
Steps 5 and 6 are related to the ML algorithms for the decision trees
specifically. As we will see, the implementation in Python will be quite
simple. However, it is fundamental to understand well the parameterisation
and the analysis of the results.
Getting the Data The raw material for any algorithm is data. In our case, it
would be the time series of financial instruments, such as indices, stocks, etc.
and it usually contains details like the opening price, maximum, minimum,
closing price and volume.
There are multiple data sources to download the data, free and premium. The
most common sources for free daily data are Quandl, Yahoo, Google, or any
other data source we trust.
For the sake of uniformity, we are going to use the same dataset which we
used in the previous part of the book. This helps us because we don’t need to
spend time on getting the data, but rather, we can focus on the machine
learning algorithm and its working.
# Plotting graphs
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('seaborn-darkgrid')

split = int(0.8*len(X))
X_train, X_test, y_train, y_test = X[:split], X[split:], \
    y[:split], y[split:]
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
                       max_features=None, max_leaf_nodes=None,
                       min_samples_leaf=5, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
These are the parameters with which the algorithm builds the tree. Because
it follows a recursive approach, we must set some limits on the tree's
growth.
criterion: For classification decision trees, we can choose Gini impurity or
Entropy (with Information Gain). These criteria define the loss function used
to evaluate the quality of each split, and they are the most used for
classification algorithms. Although a detailed comparison is beyond the scope
of this chapter, the criterion guides how the algorithm builds the tree and
lets us adjust the accuracy of the model: the algorithm stops evaluating the
branches in which no improvement is obtained according to the loss function.
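Both criteria can be checked numerically in a few lines. This sketch computes Gini impurity and entropy from the class proportions of a node; a pure node scores zero under both measures.

```python
# Node impurity measures used as split criteria in classification trees
from math import log2

def gini(proportions):
    # Gini impurity: 1 - sum(p_i^2); 0 for a pure node
    return 1 - sum(p ** 2 for p in proportions)

def entropy(proportions):
    # Shannon entropy: -sum(p_i * log2(p_i)); 0 for a pure node
    return -sum(p * log2(p) for p in proportions if p > 0)

# Both are 0 for a pure node (all samples in one class)
print(gini([1.0]), entropy([1.0]))
# A 50/50 node is maximally impure for two classes
print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # 0.5 1.0
```

The tree chooses the split that reduces this impurity the most, which is what "Information Gain" refers to.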
Now we need to make forecasts with the model on unknown data. For this, we
will use the 20% of the data that we had kept reserved for testing, and
finally, evaluate the performance of the model. But first, let’s take a
graphical look at the classification decision tree that the ML algorithm has
automatically created for us.
from sklearn.tree import export_graphviz
graph = export_graphviz(model,
                        out_file=None,
                        filled=True,
                        feature_names=X_train.columns)
Note that the graph only shows the most significant nodes. In this graph, we
can see all the relevant information in each node:
We can observe a pair of pure nodes that allows us to deduce possible trading
rules. For example, you can see how the nodes are split depending on the
feature values which the algorithm has given more preference to.
Forecast
Now let’s make predictions with the dataset that was reserved for testing.
This is the part that will tell us whether the algorithm is reliable on data
it did not see during training.
# Calculate the strategy returns
strategy_data['strategy_returns'] = \
    strategy_data['predicted_signal'].shift(1) * \
    strategy_data['pct_change']

# Drop the missing values
strategy_data.dropna(inplace=True)
strategy_data.head()
                           pct_change  predicted_signal  strategy_returns
2019-05-28 12:15:00+00:00    0.000732                 0               0.0

The maximum drawdown is -3.80%.
The Sharpe ratio is 2.46.
Decision Trees for Regression
Now let’s create the regression decision tree using the DecisionTreeRegressor
function from the sklearn.tree library. Although the DecisionTreeRegressor
function has many parameters that we invite you to explore and experiment
with (help(DecisionTreeRegressor)), here we will see the basics to create the
regression decision tree.
DecisionTreeRegressor(criterion='mse', max_depth=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=400, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort=False,
                      random_state=None, splitter='best')
These are the parameters with which the algorithm builds the tree. Because
tree construction follows a recursive approach, we must set some limits on
how far the tree can grow.
criterion: For regression decision trees, we can choose Mean Absolute Error
(MAE) or Mean Square Error (MSE). These criteria are related to the loss
function used to evaluate the performance of a machine learning algorithm and
are the most used for regression algorithms. The criterion serves to adjust
the accuracy of the model and guides the algorithm as it builds the tree: it
stops evaluating the branches in which no improvement is obtained according
to the loss function. Here, we left the default parameter, Mean Square Error
(MSE).
max_depth: Maximum number of levels the tree will have. We have left it at
the default value of ‘None’.
min_samples_leaf: This parameter can be optimised and indicates the minimum
number of samples required at a leaf node of the tree.
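These parameters can be tried out on a toy dataset before touching market data. The sketch below is illustrative only: the two synthetic features and the noisy target are made up, and min_samples_leaf keeps leaves from fitting tiny, noisy groups of samples.

```python
# Illustrative sketch of DecisionTreeRegressor with a min_samples_leaf limit
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 2))                      # two synthetic features
y_toy = X_toy[:, 0] + rng.normal(scale=0.1, size=500)  # noisy target

# min_samples_leaf limits how small a leaf can be, which regularises the tree
dtr = DecisionTreeRegressor(max_depth=None, min_samples_leaf=50)
dtr.fit(X_toy, y_toy)
print(dtr.get_depth(), dtr.predict(X_toy[:3]))
```

Raising min_samples_leaf produces a shallower, smoother tree; lowering it lets the tree chase noise.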
Now we are going to train the model with the training datasets.
[]: y = X['pct_change'].shift(-1)
split = int(0.8*len(X))
X_train, X_test, y_train, y_test = X[:split], X[split:], \
                                   y[:split], y[split:]
model = dtr.fit(X_train, y_train)
Visualise the Model
To visualise the tree, we again use the graphviz library that gives us an
overview of the regression decision tree for analysis.
from sklearn.tree import export_graphviz
graph = export_graphviz(model,
                        out_file=None,
                        filled=True,
                        feature_names=X_train.columns)
13 Random Forest Algorithm
Ensemble methods built on decision trees include:
1. Bagging
2. Random Subspace
3. Random Forest
One of the solutions to overcome the overfitting issue of decision trees is
to use Random Forest. Let us see how.
What is a Random Forest?
Random forest is a supervised machine learning algorithm which uses an
ensemble method. Simply put, a random forest is made up of numerous decision
trees and helps to tackle the problem of overfitting in decision trees. These
decision trees are randomly constructed by selecting random features from the
given dataset.
How to select features from the dataset to construct decision trees for the
Random Forest?
Once the features are selected, the trees are constructed based on the best
split. Each tree gives an output which is considered as a ‘vote’ from that tree
to the given output. The output which receives the maximum ‘votes’ is
chosen by the random forest as the final output/result or in case of continuous
variables, the average of all the outputs is considered as the final output.
For example, in the above diagram, we can observe that each decision tree
has voted or predicted a specific class. The final output or class selected by
the Random Forest will be the Class N, as it has majority votes or is the
predicted output by two out of the four decision trees.
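The voting step described above can be sketched in a few lines: each tree's prediction is a "vote", and the class with the most votes wins (for regression, the tree outputs would be averaged instead). The votes below are hypothetical.

```python
# Majority voting across the trees of a (hypothetical) random forest
from collections import Counter

tree_predictions = ['Class N', 'Class M', 'Class N', 'Class N']  # votes from 4 trees

def majority_vote(votes):
    # most_common(1) returns the (label, count) pair with the highest count
    return Counter(votes).most_common(1)[0][0]

print(majority_vote(tree_predictions))  # Class N
```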
# Technical indicators
import talib as ta

# Plotting graphs
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('seaborn-darkgrid')

# Read the data
data = pd.read_csv(path + 'JPM_2017_2019.csv', index_col=0)
data.index = pd.to_datetime(data.index)
Creating Input and Output Dataset
In this step, we will use the same target and feature variables as we have
taken in the previous chapters.
Training the Machine Learning Model
All set with the data! Let’s train a random forest classifier model. The
RandomForestClassifier function from the sklearn.ensemble library is stored
in the variable clf, and then the fit method is called on it with the X_train
and y_train datasets as the parameters so that the classifier model can learn
the relationship between input and output.
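The training step can be sketched as follows. The book's X_train and y_train come from the earlier chapters; here we substitute a small synthetic dataset (X_demo, y_demo are made up) so the snippet is self-contained.

```python
# Sketch of training a random forest classifier on synthetic stand-in data
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 4))                         # synthetic features
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)     # synthetic signal

# n_estimators is the number of decision trees in the forest
clf = RandomForestClassifier(n_estimators=100, random_state=42)
model = clf.fit(X_demo, y_demo)
print(model.score(X_demo, y_demo))
```

In the book's workflow, X_demo and y_demo would simply be replaced by the feature and target variables built earlier.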
Strategy Returns
[]: # Calculate the percentage change
strategy_data = X_test[['pct_change']].copy()

# Predict the signals
strategy_data['predicted_signal'] = model.predict(X_test)

# Calculate the strategy returns
strategy_data['strategy_returns'] = \
    strategy_data['predicted_signal'].shift(1) * \
    strategy_data['pct_change']

# Drop the missing values
strategy_data.dropna(inplace=True)
strategy_data.head()
The maximum drawdown is -7.42%.
The Sharpe ratio is 1.63.
The output displays the strategy returns and daily returns according to the
code for the Random Forest Classifier. In the next chapter, we will look at the
XGBoost algorithm, which is the weapon of choice for most machine
learning enthusiasts and competition winners alike.
14 XGBoost Algorithm
XGBoost stands for eXtreme Gradient Boosting and is developed on the
framework of gradient boosting. We like the sound of that, Extreme! Sounds
more like a supercar than an ML model, actually.
But that is exactly what it does, boosts the performance of a regular gradient
boosting model.
“XGBoost used a more regularized model formalization to control
overfitting, which gives it better performance.”
-Tianqi Chen, the author of XGBoost
Let’s break down the name to understand what XGBoost does.
In the above image example, the train dataset is passed to classifier 1. The
yellow background indicates that the classifier predicted hyphen, and the blue
background indicates that it predicted plus. The classifier 1 model incorrectly
predicts two hyphens and one plus. These are highlighted with a circle. The
weights of these incorrectly predicted data points are increased and sent to the
next classifier. That is to classifier 2.
Classifier 2 correctly predicts the two hyphens, which classifier 1 was not
able to. But classifier 2 also makes some other errors. This process continues
and we have a combined final classifier which predicts all the data points
correctly.
Classifier models can be added until all the items in the training dataset
are predicted correctly or a maximum number of classifier models is reached.
The optimal maximum number of classifier models to train can be determined
using hyperparameter tuning.
Let’s take baby steps here and understand where XGBoost fits in the bigger
scheme of things.
In a nutshell
With decision tree models, Bayesian models and the like, we hit a roadblock.
The prediction rate for certain problem statements was dismal when we used
only one model. Apart from that, for decision trees, we realised that we had
to live with bias, variance, as well as noise in the models.
This led to the idea of combining models. This was and is called ‘ensemble
learning’. But here, we can use much more than one model to create an
ensemble. Gradient boosting was one such method of ensemble learning.
Gradient boosting is an approach where new models are created that predict
the residuals or errors of prior models, and then added together to make the
final prediction.
The objective that XGBoost minimises can be written as Obj = L + W, where L
is the loss function which controls the predictive power, and W is the
regularization component which controls simplicity and overfitting.
The loss function (L) which needs to be optimised can be Root Mean Squared
Error for regression, Logloss for binary classification, or mlogloss for multi-
class classification. The regularization component (W) is dependent on the
number of leaves and the prediction score assigned to the leaves in the tree
ensemble model.
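The residual-fitting idea behind gradient boosting can be sketched with the simplest possible weak learner, a one-split regression stump. Everything here is illustrative: the data, the number of rounds, and the learning rate are all made up, and real libraries use far more refined learners.

```python
# Gradient boosting in miniature: each new stump fits the current residuals
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

def fit_stump(x, residual):
    # One-split stump: pick the threshold minimising squared error of
    # mean-predictions on each side of the split
    best = None
    for t in np.linspace(-3, 3, 31):
        left, right = residual[x <= t], residual[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, t, left.mean(), right.mean())
    _, t, lmean, rmean = best
    return lambda z, t=t, l=lmean, r=rmean: np.where(z <= t, l, r)

prediction = np.zeros_like(y)
for _ in range(20):
    residual = y - prediction        # errors of the ensemble so far
    stump = fit_stump(x, residual)   # new model fits those errors
    prediction += 0.5 * stump(x)     # add it (scaled by a learning rate)

print(np.mean((y - prediction) ** 2))  # far below the variance of y
```

Each round the ensemble's error shrinks, which is exactly the "new models predict the residuals of prior models" idea stated above.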
While the actual logic is somewhat lengthy to explain, one of the main things
about XGBoost is that it has been able to parallelise the tree building
component of the boosting algorithm. This leads to a dramatic gain in terms
of processing time as we can use more cores of a CPU, or even go on and
utilise cloud computing as well. While machine learning algorithms have
support for tuning and can work with external programs, XGBoost has built-in
parameters for regularization and cross-validation to make sure both bias
and variance are kept to a minimum. The advantage of in-built parameters is
that it leads to faster implementation.
1. If you know that a certain feature is more important than others, you would
put more attention to it and try to see if you can improve your model further.
2. After you have run the model, you will see if dropping a few features
improves the model.
3. Initially, if the dataset is small, the time taken to run a model is not a
significant factor while we are designing a system. But if the strategy is
complex and requires a large dataset to run, then the computing resources and
the time taken to run the model becomes an important factor.
4. The good thing about XGBoost is that it contains an inbuilt function to
compute the feature importance and we don’t have to worry about coding it
in the model.
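Point 4 above can be sketched without the full trading pipeline. This uses scikit-learn's gradient boosting model as a stand-in on made-up data; XGBoost's XGBClassifier exposes the same idea through its feature_importances_ attribute and the plot_importance helper.

```python
# Built-in feature importance of a boosted-tree model (synthetic data)
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X_imp = rng.normal(size=(400, 3))
# Only feature 0 drives the label; features 1 and 2 are pure noise
y_imp = (X_imp[:, 0] > 0).astype(int)

model_imp = GradientBoostingClassifier(n_estimators=50, random_state=0)
model_imp.fit(X_imp, y_imp)
print(model_imp.feature_importances_)  # feature 0 should dominate
```

Dropping the near-zero-importance features and retraining is exactly the workflow described in points 1 and 2.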
Import Libraries
[]: import warnings
warnings.simplefilter('ignore')

# Import XGBoost
import xgboost

# XGBoost Classifier
from xgboost import XGBClassifier

# To plot the graphs
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('seaborn-darkgrid')
Define Parameters
We have defined the list of stocks, and the start and end dates, which we
will be working with in this chapter.
[]: #Set thestocklist
stock_list = ['AAPL', 'AMZN', 'NFLX', 'WMT', 'MSFT']
Get the Data, Create Features and Target Variable
We define a list of features from which the model will pick the best ones.
Here, we have the percentage change and the standard deviation with different
time periods as the features.
The target variable is the next day’s return. If the next day’s return is positive
we label it as 1, and if it is negative then we label it as -1. You can also try to
create the target variables with three labels such as 1, 0, and -1 for long, no
position and short respectively.
Let’s see the code now.
# Create the features
predictor_list = []
for r in range(10, 60, 5):
    df['pct_change_' + str(r)] = \
        df.daily_pct_change.rolling(r).sum()
    df['std_' + str(r)] = df.daily_pct_change.rolling(r).std()
    predictor_list.append('pct_change_' + str(r))
    predictor_list.append('std_' + str(r))

# Target variable
df['return_next_day'] = df.daily_pct_change.shift(-1)
df['actual_signal'] = np.where(df.return_next_day > 0, 1, -1)
df = df.dropna()

# Add the data to the dictionary
stock_data_dictionary.update({stock_name: df})
# Get the target variable
y = stock_data_dictionary[stock_name].actual_signal
XGBClassifier(..., interaction_constraints=None,
              learning_rate=None, max_delta_step=None, max_depth=2,
              min_child_weight=None, missing=nan,
              monotone_constraints=None,
              n_estimators=30, n_jobs=None, num_parallel_tree=None,
              random_state=None, reg_alpha=None, reg_lambda=None,
              scale_pos_weight=None, subsample=None, tree_method=None,
              validate_parameters=None, verbosity=None)
Create the Model
We will train the XGBoost classifier using the fit method.
[]: # Fit the model
model.fit(X_train, y_train)

[05:44:56] WARNING:
C:/Users/Administrator/workspace/xgboostwin64_release_1.4.0/src/learner.cc:1095:
Starting in XGBoost 1.3.0, the default evaluation metric used with the
objective 'binary:logistic' was changed from 'error' to 'logloss'.
Explicitly set eval_metric if you'd like to restore the old behavior.
XGBClassifier(..., gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=2,
              min_child_weight=1, missing=nan,
              monotone_constraints='()',
              n_estimators=30, n_jobs=8, num_parallel_tree=1,
              random_state=0,
              tree_method='exact', validate_parameters=1,
              verbosity=None)
Since we are using most of the default parameters, you might get certain
warnings depending on updates in the XGBoost library.
Feature Importance
We have plotted the top 7 features, sorted by their importance.
Hold on! We are almost there. Let’s see what XGBoost tells us right now.
That’s interesting. The f1-score for the long side is much higher than for
the short side. We could modify the model and make it a long-only strategy.
Let’s try another way to gauge how well XGBoost performed.
Confusion Matrix
[]: from sklearn.metrics import confusion_matrix
confusion_matrix_data = confusion_matrix(y_test, y_pred)

# Plot the data
fig, ax = plt.subplots(figsize=(6, 4))
sns.heatmap(confusion_matrix_data, fmt="d",
            cmap='Blues', cbar=False, annot=True, ax=ax)

# Set the axes labels and the title
ax.set_xlabel('Predicted Labels', fontsize=12)
ax.set_ylabel('Actual Labels', fontsize=12)
ax.set_title('Confusion Matrix', fontsize=14)
ax.xaxis.set_ticklabels(['No Position', 'Long Position'])
ax.yaxis.set_ticklabels(['No Position', 'Long Position'])

# Display the plot
plt.show()
Individual Stock Performance
Let’s see how the XGBoost based strategy returns held up against the normal
daily returns, i.e. the buy and hold strategy. We will plot a comparison
graph between the strategy returns and the daily returns for all the
companies we had mentioned before. The code is as follows:
# Get the data
df = stock_data_dictionary[stock_name]

# Store the features in X
X = df[predictor_list]
(Charts: strategy returns vs. buy-and-hold returns for AAPL, AMZN, NFLX, WMT
and MSFT.)
Performance of Portfolio
[]: # Drop the missing values
portfolio.dropna(inplace=True)
Now let us look at another interesting concept, called neural networks, in
the next chapter.
15 Neural Networks
Neural network studies started in an effort to map the human brain and
understand how humans make decisions. But algorithmic trading tries to remove
human emotions altogether from the trading aspect. Why should we learn about
neural networks then?
We sometimes fail to realise that the human brain is quite possibly the most
complex machine in this world and has been known to be quite effective at
coming to conclusions in record time.
Think about it. If we could harness the way our brain works and apply it in
the machine learning domain (Neural networks are after all a subset of
machine learning), we could possibly take a giant leap in terms of processing
power and computing resources.
Before we dive deep into the nitty-gritty of neural network trading, we should
understand the working of the principal component, i.e. the neuron.
There are three components to a neuron: The dendrites, axon and the main
body of the neuron. The dendrites are the receivers of the signal and the axon
is the transmitter. Alone, a neuron is not of much use, but when it is
connected to other neurons, it does several complicated computations and
helps operate the most complicated machine on our planet, the human body.
There are inputs to the neuron marked with blue lines, and the neuron emits
an output signal after some computation.
The input layer resembles the dendrites of the neuron and the output signal is
the axon.
Each input signal is assigned a weight, wi. This weight is multiplied by the
input value, and the neuron stores the weighted sum of all the input
variables. These weights are computed during the training phase of the neural
network through concepts called gradient descent and backpropagation, which
we will cover later on.
The input layer consists of the parameters that will help us arrive at an
output value or make a prediction. Our brains essentially have five basic
input parameters, which are our senses of touch, hearing, sight, smell and
taste.
Therefore, there are two layers of computations in this case before making a
decision.
The first layer takes in the five senses as inputs and results in emotions and
feelings, which are the inputs to the next layer of computations, where the
output is a decision or an action.
Hence, in this extremely simplistic model of the working of the human brain,
we have one input layer, two hidden layers, and one output layer. Of course
from our experiences, we all know that the brain is much more complicated
than this, but essentially this is how the computations are done in our brain.
The 3 neurons in the hidden layer will have different weights for each of the
five input parameters and might have different activation functions, which
will activate the input parameters according to various combinations of the
inputs.
For example, the first neuron might be looking at the volume and the
difference between the Close and the Open price, and might be ignoring the
High and Low prices as well. In this case, the weights for High and Low
prices will be zero.
Based on the weights that the model has trained itself to attain, an activation
function will be applied to the weighted sum in the neuron, this will result in
an output value for that particular neuron.
Similarly, the other two neurons will result in an output value based on their
individual activation functions and weights. Finally, the output value or the
predicted value of the stock price will be the sum of the three output values of
each neuron. This is how the neural network will work to predict stock prices.
Now that you understand the working of a neural network, we will move to
the heart of the matter of this chapter, and that is learning how the Artificial
Neural Network will train itself to predict the movement of a stock price.
Training the Neural Network
To simplify things in neural networks, we can say that there are two ways to
code a program for performing a specific task.
1. Define all the rules required by the program to compute the result given
some input to the program.
2. Develop the framework upon which the code will learn to perform the
specific task. This task is carried out by training itself on a dataset, and
adjusting the result it computes to be as close to the actual results which have
been observed.
The second process is called training the model which is what we will be
focussing on. Let’s look at how our neural network will train itself to predict
stock prices.
The neural network will be given the dataset, which consists of the OHLCV
data as the input, as well as the output, which is the close price of the next
day. This output variable is the value that we want our model to learn to
predict. The actual value of the output will be represented by ‘y’ and the
predicted value will be represented by ˆy.
The training of the model involves adjusting the weights of the variables for
all the different neurons present in the neural network. This is done by
minimizing the ‘Cost Function’. The cost function, as the name suggests is
the cost of making a prediction using the neural network. It is a measure of
how far off the predicted value, ˆy, is from the actual or observed value, y.
There are many cost functions that are used in practice, the most popular one
is computed as half of the sum of squared differences between the actual and
predicted values for the training dataset.
C = Σ 1/2 (ŷ - y)²
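As a quick numeric check of this cost function, half the sum of squared differences can be computed directly; the predicted and actual values below are made up.

```python
# Cost = half the sum of squared differences between prediction and actual
import numpy as np

y_actual    = np.array([1.0, 0.0, 1.0, 1.0])
y_predicted = np.array([0.8, 0.2, 0.6, 0.9])

cost = 0.5 * np.sum((y_predicted - y_actual) ** 2)
print(cost)  # 0.5 * (0.04 + 0.04 + 0.16 + 0.01) ≈ 0.125
```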
The way the neural network trains itself is by first computing the cost
function for the training dataset for a given set of weights for the neurons.
Then it goes back and adjusts the weights, followed by computing the cost
function for the training dataset based on the new weights.
The process of sending the errors back to the network for adjusting the
weights is called backpropagation.
This is repeated several times till the cost function has been minimised. We
will look at how the weights are adjusted and the cost function is minimised
in more detail next. The weights are adjusted to minimise the cost function.
One way to do this is through brute force.
Suppose we take 1000 values for the weights, and evaluate the cost function
for these values.
When we plot the graph of the cost function, we will arrive at a graph as
shown below.
The best value for weights would be the cost function corresponding to the
minima of this graph.
The time it will require to train such a model will be extremely large even on
the world’s fastest supercomputer. For this reason, it is essential to develop a
better, faster methodology for computing the weights of the neural network.
This process is called Gradient Descent. We will look into this concept in the
next part.
Gradient Descent
Gradient descent involves analysing the slope of the curve of the cost
function. Based on the slope, we adjust the weights to minimise the cost
function in steps, rather than computing the values for all possible
combinations.
1. Batch Gradient Descent: The slope of the cost function is computed and the
weights are adjusted after evaluating the entire training dataset in each
iteration.
2. Stochastic Gradient Descent: The slope of the cost function and the
adjustments of weights are done after each data entry in the training dataset.
This is extremely useful to avoid getting stuck at a local minima if the curve
of the cost function is not strictly convex. Each time you run the stochastic
gradient descent, the process to arrive at the global minima will be different.
Batch gradient descent may result in getting stuck with a suboptimal result if
it stops at local minima.
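The core idea, stepping against the slope instead of evaluating the cost at every possible weight, can be shown on a one-parameter convex cost curve. The curve C(w) = (w - 3)² and the learning rate here are illustrative only.

```python
# Gradient descent on C(w) = (w - 3)^2, whose minimum is at w = 3
def gradient(w):
    # derivative of (w - 3)^2
    return 2 * (w - 3)

w = 0.0
learning_rate = 0.1
for _ in range(100):
    w -= learning_rate * gradient(w)  # move downhill along the slope

print(w)  # converges towards the minimum at w = 3
```

A hundred small steps recover the minimum that brute force would have needed an exhaustive search to find.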
While we can dive deep into Gradient descent, we are afraid it will be outside
the scope of this chapter. Hence, let us move forward and understand how
backpropagation works to adjust the weights according to the error which had
been generated.
Backpropagation
Backpropagation is an advanced algorithm which enables us to update all the
weights in the neural network simultaneously.
This drastically reduces the complexity of the process to adjust weights. If we
were not using this algorithm, we would have to adjust each weight
individually by figuring out what impact that particular weight has on the
error in the prediction. Let us look at the steps involved in training the neural
network with Stochastic Gradient Descent:
1. Initialise the weights to small numbers very close to 0 (but not 0).
2. Forward propagation: The neurons are activated from left to right, by using
the first data entry in our training dataset, until we arrive at the predicted
result y.
3. Measure the error which will be generated.
4. Backpropagation: The error generated will be backpropagated from right to
left, and the weights will be adjusted according to the learning rate.
5. Repeat the previous three steps, forward propagation, error computation,
and backpropagation on the entire training dataset.
This would mark the end of the first epoch. The successive epochs will begin
with the weight values of the previous epoch. We can stop this process when
the cost function converges within a certain acceptable limit.
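The five steps above can be sketched for a single sigmoid neuron trained with stochastic gradient descent. The toy dataset, learning rate, and epoch count are all made up; a real network repeats this across many neurons and layers.

```python
# One sigmoid neuron trained with SGD: forward pass, error, weight update
import numpy as np

rng = np.random.default_rng(0)
X_nn = rng.normal(size=(200, 2))
y_nn = (X_nn[:, 0] + X_nn[:, 1] > 0).astype(float)  # toy target

# Step 1: initialise the weights close to 0 (but not 0)
w = rng.normal(scale=0.01, size=2)
b = 0.0
lr = 0.5

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

for epoch in range(5):
    for xi, yi in zip(X_nn, y_nn):
        y_hat = sigmoid(xi @ w + b)   # Step 2: forward propagation
        error = y_hat - yi            # Step 3: measure the error
        w -= lr * error * xi          # Step 4: adjust weights by the
        b -= lr * error               #         backpropagated gradient
    # Step 5: repeat over the entire training dataset (one epoch per pass)

accuracy = np.mean((sigmoid(X_nn @ w + b) > 0.5) == (y_nn == 1))
print(accuracy)
```

After a few epochs the weights have moved from near zero to values that classify the toy data well, which is the whole point of the training loop.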
We will now learn how to develop our own Artificial Neural Network to
predict the movement of a stock price.
You will understand how to code a strategy using the predictions from a
neural network that we will build from scratch.
# Plotting graphs
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('seaborn-darkgrid')
Random will be used to initialise the seed to a fixed number so that every
time we run the code we start with the same seed.
Import dataset
[]: # The data is stored in the directory 'data_modules'
path = "../data_modules/"

# Read the data
data = pd.read_csv(path + 'JPM_2017_2019.csv', index_col=0)
data.index = pd.to_datetime(data.index)
Define Target, Features and Split the Data
[]: import sys
sys.path.append("..")
from data_modules.utility import get_target_features
y, X = get_target_features(data)
split = int(0.8*len(X))
X_train, X_test, y_train, y_test = X[:split], X[split:], \
                                   y[:split], y[split:]
Feature Scaling
[]: from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_test_original = X_test.copy()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
Another important step in data preprocessing is to standardise the dataset.
This process makes the mean of all the input features equal to zero and also
converts their variance to 1. This ensures that there is no bias while training
the model due to the different scales of all input features. If this is not done,
the neural network might get confused and give a higher weight to those
features which have a higher average value than others.
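What the StandardScaler does can be verified by hand: subtract each feature's mean and divide by its standard deviation. The numbers below are made up, with deliberately different scales.

```python
# Manual standardisation: zero mean and unit variance per feature
import numpy as np

X_sample = np.array([[10.0, 200.0],
                     [20.0, 400.0],
                     [30.0, 600.0]])

X_sample_scaled = (X_sample - X_sample.mean(axis=0)) / X_sample.std(axis=0)
print(X_sample_scaled.mean(axis=0))  # ~[0, 0]
print(X_sample_scaled.std(axis=0))   # [1, 1]
```

Note that in the code above the scaler is fitted on the training set only and then applied to the test set, which avoids leaking test-set statistics into training.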
Now we will import the functions which will be used to build the Artificial
Neural Network. We import the Sequential method from the keras.models
library. This will be used to sequentially build the layers of the neural
network. The next method that we import is the Dense function from the
keras.layers library.
This method will be used to build the layers of our Artificial Neural Network.
[]: classifier = Sequential()
[]: classifier.add(Dense(
units = 128,
kernel_initializer = 'uniform',
activation = 'relu',
input_dim = X.shape[1]
))
To add layers into our Classifier, we make use of the add() function. The
argument of the add function is the Dense() function, which in turn has the
following arguments:
Units: This defines the number of nodes or neurons in that particular layer.
We have set this value to 128, meaning there will be 128 neurons in our
hidden layer.
Kernel_initializer: This defines the starting values for the weights of the
different neurons in the hidden layer. We have defined this to be ‘uniform’,
which means that the weights will be initialised with values from a uniform
distribution.
Activation: This is the activation function for the neurons in the particular
hidden layer. Here we define the function as the Rectified Linear Unit
function, or ‘relu’.
Input_dim: This defines the number of inputs to the hidden layer. We have
defined this value to be equal to the number of columns of our input feature
dataframe. This argument will not be required in the subsequent layers, as
the model will know how many outputs the previous layer produced.
[]: classifier.add(Dense(
units = 128,
kernel_initializer = 'uniform',
activation = 'relu'
))
We then add a second layer, with 128 neurons, with a uniform kernel
initialiser and ‘relu’ as its activation function. We are only building two
hidden layers in this neural network.
[]: classifier.add(Dense(
units = 1,
kernel_initializer = 'uniform',
activation = 'sigmoid'
))
The next layer that we build will be the output layer, from which we require a
single output. Therefore, the units passed are 1, and the activation function is
chosen to be the Sigmoid function because we would want the prediction to
be a probability of market moving upwards.
[]: classifier.compile(
    optimizer = 'adam',
    loss = 'mean_squared_error',
    metrics = ['accuracy']
)
Epoch 1/7
773/773 [==============================] - 1s 1ms/step - loss: 0.2487 - accuracy: 0.5323
Epoch 2/7
773/773 [==============================] - 1s 1ms/step - loss: 0.2483 - accuracy: 0.5354
Epoch 3/7
773/773 [==============================] - 1s 1ms/step - loss: 0.2482 - accuracy: 0.5319
Epoch 4/7
773/773 [==============================] - 1s 1ms/step - loss: 0.2479 - accuracy: 0.5374
Epoch 5/7
773/773 [==============================] - 1s 1ms/step - loss: 0.2477 - accuracy: 0.5422
Epoch 6/7
773/773 [==============================] - 1s 1ms/step - loss: 0.2474 - accuracy: 0.5412
Epoch 7/7
773/773 [==============================] - 1s 1ms/step - loss: 0.2473 - accuracy: 0.5416
# Calculate the strategy returns
strategy_data['strategy_returns'] = \
    strategy_data['predicted_signal'].shift(1) * \
    strategy_data['pct_change']

# Drop the missing values
strategy_data.dropna(inplace=True)
strategy_data.head()
This is interesting! While the strategy returns were less than the benchmark
returns, you should note that the drawdown percentage is relatively lower.
Nevertheless, the neural network strategy can be improved by letting the
algorithm process more data to improve its performance scores.
Part IV
Unsupervised Learning, Natural Language
Processing, and Reinforcement Learning 16
Unsupervised Learning
In the previous chapters, we examined supervised learning algorithms like
classification and regression in detail. The common element of these
algorithms is that you know the labels and target variable in supervised
learning models.
But if you think about it, humans have another approach towards learning
too. A baby starts with zero knowledge of the surroundings and is introduced
to his parents who are humans. Even though the baby might not be able to
speak, he knows how to differentiate between things. When he sees a pet dog,
he will first think if it is some other being, which is different from his parents.
When he sees another pet dog, he sees that this pet dog has more things in
common, such as a tail, with the first dog than a human. In this way, while he
does not know that he is seeing a ‘dog’, he has learnt to put dogs in a
different box than humans.
He repeats the same process when he sees a cat. He knows it is different from
both dogs and humans. This process, when replicated in machine learning, is
called clustering, which is a part of unsupervised learning.
Suppose you want to analyse the stocks that constitute the S&P 500 index and
find two similar stocks. This will help you apply a pairs trading strategy on
the similar stocks identified. But analysing the fundamental and price data
of 500 stocks, and finding similarities, is a very tedious and time-consuming
task.
You can use an unsupervised learning algorithm here. Just pass the relevant
data for 500 stocks to the algorithm. In return, it will cluster similar stocks
together. For example, it will create a cluster of Google and Facebook, and
another cluster of Citigroup and Bank of America, and so on. You can then
pick a pair of Citigroup and Bank of America, and create a statistical
arbitrage trading strategy on it.
This can work not only on 500 stocks, but also on stocks from different
geographies or even from different asset classes such as FX, commodities or
bonds. The unsupervised algorithm will group them together in a matter of
seconds.
This is useful not only for pairs trading: you can also use these groups to
create diversified portfolios. Remember, the securities which are similar to
each other are placed together in a single cluster. And a security in one
cluster is dissimilar to the securities in another cluster.
You can pick the top-performing security from each cluster and create a
diversified portfolio. For example, you can pick Bank of America, Facebook, a
Gold ETF, EURUSD, and a US Treasury and create a diversified portfolio.
Let’s take a small example here to drive home the point.
Name of the Company    Sector
Facebook               Tech
Bank of America        Bank
Amazon                 Tech
Google                 Tech
Microsoft              Tech
Ford                   Auto
J.P. Morgan            Bank
Wells Fargo            Bank
Intel                  Tech
Tesla                  Auto
You want to divide these companies into two groups. One of the many
possible ways to do this can be as shown below:
• Group 1: Facebook, Amazon, Google, Microsoft, Intel
• Group 2: Bank of America, J.P. Morgan, Wells Fargo, Ford, Tesla
Can you think of how this grouping was done?
Group 1 comprises all the tech companies in the list. And group 2 comprises
all the non-tech companies.
Similarly, if you had to divide these companies into three groups, it could be
done as follows:
Here, group 1 comprises the tech companies, group 2 is banks, and group 3 is
automobiles. From these groupings, you can conclude that the stocks within a
group are similar to each other. And they differ across the different groups.
Each group is called a cluster.
The next question that arises is what if instead of ten, you have a list of
thousand companies? What if, not only the sector, but you have many other
features in the dataset?
You can read the data row-by-row and create the groups. But as the size of
the data increases, creating the groups with human intervention will become
impossible. You can utilise the clustering algorithm to do this task. Clustering
is a technique to divide the data into similar groups. You simply need to pass
meaningful data for all the companies, and the clustering algorithm will
create groups for you.
In the given dataset, the feature is the sector. Hence, the clustering algorithm
creates the groups based on the sectors. You can also pass the historical data,
fundamental data, etc. for a more detailed grouping. You don’t have to
specify any rules to create the groups; the groups are created based on
similarity. The basis of that similarity is not known in advance, so
you have to analyse each group and add a meaningful label to it.
For example, we analysed the cluster above and found that the clusters were
made based on the sectors. Also note, the only feature passed to the clustering
algorithm was the sector. That means that the output generated by the
algorithm is based on the input provided to the algorithm.
One thing must have come to your mind here. In the previous chapters, we
had looked at supervised learning which included classification. Is it similar
to clustering? In a way, yes, but there are certain differences here.
You will get the historical data. In order to train a classification model, you
have to label the historical data. For days, when the next day returns are
positive, you label it as buy. And days where the next day returns are
negative, you label it as no position. You also use two features as input to the
machine learning model. The image shows a scatter plot of these two
features. The blue dots correspond to a buy signal and green to no position.
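The labelling described above can be sketched as follows (a minimal illustration with made-up prices; 1 encodes buy and 0 encodes no position):

```python
import pandas as pd

# Made-up closing prices; label each day by the sign of the NEXT day's return
prices = pd.Series([100, 102, 101, 103, 106, 104])
next_day_returns = prices.pct_change().shift(-1)

# 1 = buy (next-day return positive), 0 = no position
labels = (next_day_returns > 0).astype(int)
print(labels.tolist())
```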
The classification model will create a boundary to separate blue dots from
green dots. It will then use this boundary to predict a Buy or Don't Buy
position on unseen or future data points. If a data point falls in the
left region, it will be classified as a Buy signal. And if it falls in the right
region, it will be classified as Don't Buy, which means no position is taken.
Here, you can see that a few points fall in the incorrect region. This
means that these points are misclassified.
Also, it is very difficult to create the boundary if the classes are not well
separated. Let us now see how we can approach the same problem with a
clustering algorithm.
In clustering, you will simply pass the features to the algorithm without any
expected outcome such as buy or no position.
That is, the clustering algorithm has access only to the features and does not
know the groups in which it has to classify the data. Since there are no
separate labels for buy, the algorithm will not create any boundary to separate
buy and don’t buy. But it will create groups of similar data points.
You can analyse the groups and identify in which group you will buy the
asset. Based on your analysis, you can buy Apple stock if the data points fall
in cluster 1.
As you can see, all the data points falling in cluster 1 are blue in colour
indicating that the next day returns are positive.
A limitation of clustering is that you don’t have any control over the clusters
created by the algorithm. Some of the clusters created might not be useful to
us, whereas some might be. For example, cluster 2 and cluster 3 have a
combination of buy and no position, making it difficult to add a
meaningful label to them.
That was all about the key differences between classification and clustering.
Let us summarise them.
Classification                               Clustering
Requires labelled historical data            Works on unlabelled data
Learns a boundary to separate the classes    Groups similar data points into clusters
Predicts a predefined label for new data     Groups must be analysed and labelled manually
17 K-Means Clustering
K-means clustering is used when we have unlabeled data, i.e. data without
defined categories or groups. This algorithm finds groups/clusters in the data,
with the number of groups represented by the variable ‘k’ (hence the name).
The great thing about k-means is that we simply give the algorithm the data
and the number of clusters to be formed, and then tell it
to go play! That’s all the algorithm gets.
1. The algorithm starts with randomly choosing ‘K’ data points as ‘centroids’,
where each centroid defines a cluster.
2. Each data point is assigned to a cluster defined by a centroid such that the
distance between that data point and the cluster’s centroid is minimum.
3. The centroids are recalculated by taking the mean of all data points that
were assigned to that cluster in the previous step. The algorithm iterates
between steps 2 and 3 until a stopping criterion is met, such as a pre-defined
maximum number of iterations being reached, or data points no longer changing
clusters.
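The three steps above can be sketched directly in NumPy (a minimal illustration, not the book's code; the data and the value of K are made up for the example):

```python
import numpy as np

def k_means(points, k, n_iters=100, seed=0):
    """Minimal k-means: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly choose K data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        new_centroids = np.array(
            [points[labels == j].mean(axis=0) if np.any(labels == j)
             else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):  # points stopped moving
            break
        centroids = new_centroids
    return centroids, labels

# Two obvious groups of 2-D points
data = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                 [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
centroids, labels = k_means(data, k=2)
```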
Let us see this in action now. Often, traders and investors want to group
stocks based on similarities in certain features.
For example, a trader who wishes to trade a pair-trading strategy, where she
simultaneously takes a long and shorts position in two similar stocks, would
ideally want to scan through all the stocks and find those which are similar to
each other in terms of industry, sector, market-capitalisation, volatility, or any
other features.
Going through each and every stock manually and then forming groups is a
tedious and time-consuming process. Instead, one could use a clustering
algorithm such as the k-means clustering algorithm to group/cluster stocks
based on a given set of features.
We start with importing the necessary libraries and fetching the required data
using the following commands.
Import Dataset We will use the dataset which contains the data for 12
companies and is stored in the sample_stocks.csv file. You can download this
file from the GitHub link provided in the “About The Book” section of this book.
# Read the data
DF = pd.read_csv(path + 'sample_stocks.csv', index_col=0)
DF
As seen above, we have downloaded the data for the 12 stocks successfully.
We will now create a copy (df) of the original data and work with it. The first
step is to pre-process the data so that it can be fed to a k-means clustering
algorithm. This involves converting the data in a NumPy array format and
scaling it.
Scaling amounts to subtracting the column mean from each data point in that
column and dividing by the column standard deviation.
For scaling, we use the StandardScaler class of scikit-learn library as follows:
[]: # Making a copy to work with
df = DF.copy()

# Scaling the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_values = scaler.fit_transform(df.values)

# Printing the pre-processed data
print(df_values)
[[ 1.29583307  0.59517359]
 [-1.00653849 -1.27029587]
 [ 0.51349787  0.40862664]
 [-1.21731025 -0.81725329]
 [ 0.47010369  1.4746092 ]
 [-0.39654021  0.70177185]
 [ 0.66723728 -0.39086027]
 [-0.92842895 -1.08374893]
 [ 2.02733507  0.11548144]
 [ 0.36595764  1.90100222]
 [-0.77592938 -0.47080896]
 [-1.01521733 -1.16369762]]
The next step is to import the ‘KMeans’ class from scikit-learn and fit a
model, with the value of the hyperparameter ‘K’ (called n_clusters in
scikit-learn) set to 2 (chosen arbitrarily for now), to our pre-processed data
‘df_values’.
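That fitting step can be sketched as follows (a minimal sketch; the array here is a small made-up stand-in for the scaled stock data):

```python
from sklearn.cluster import KMeans
import numpy as np

# Stand-in for the scaled stock data 'df_values' from the previous step
df_values = np.array([[1.3, 0.6], [-1.0, -1.3], [0.5, 0.4],
                      [-1.2, -0.8], [0.5, 1.5], [-0.9, -1.1]])

# Fit a k-means model with K = 2 to the scaled data
model = KMeans(n_clusters=2, random_state=0, n_init=10)
model.fit(df_values)
print(model.labels_)  # cluster assigned to each stock
```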
Now that we have the assigned clusters, we will visualise them using the
matplotlib and seaborn libraries as follows:
[]: # Set graph size
plt.figure(figsize=(12, 8))

# Plot the graph, colouring each stock by its assigned cluster
ax = sns.scatterplot(x=df['Beta'], y=df['ROE(%)'],
                     hue=model.labels_, palette='bright')
plt.xlabel('Beta', size=17)
plt.ylabel('ROE(%)', size=17)
plt.setp(ax.get_legend().get_texts(), fontsize='17')
plt.setp(ax.get_legend().get_title(), fontsize='17')
plt.title('Clusters from k-means algorithm with k = 2', fontsize='x-large')

# Label individual elements
for i in range(0, df.shape[0]):
    plt.text(df.Beta[i] + 0.07, df['ROE(%)'][i] + 0.01, df.index[i],
             horizontalalignment='right',
             verticalalignment='bottom', size='small',
             color='black', weight='semibold')
We can clearly see the difference between the two clusters in the graph
above.
Can you analyse these clusters? Cluster 1 largely consists of
public utility companies, which have a low ROE and low beta compared
to the high-growth tech companies in cluster 0.
Although we did not tell the k-means algorithm about the industry sectors to
which the stocks belonged, it was able to discover that structure in the data
itself. Therein lies the power and appeal of unsupervised learning.
The next question that arises is how to decide the value of hyperparameter k
before fitting the model?
Let’s say you took k = 8. Now, the k-means algorithm will compulsorily try to
create 8 clusters. But you know that your dataset contains only 12 companies,
which means some clusters will have only one company.
One of the ways of doing this is to check the model’s ‘inertia’, which
represents the distance of points in a cluster from its centroid. As more and
more clusters are added, the inertia keeps on decreasing, creating what is
called an ‘elbow curve’. We select the value of k beyond which we do not see
much benefit (i.e., decrement) in the value of inertia. Think about it, if there
were 8 clusters, then in the cluster with one company, the inertia would be 0
and there would be no point in having that cluster. Below we plot the inertia
values for k-means models with different values of ‘k’:
inertia = []
for k in range(1, 9):
    model = KMeans(n_clusters=k)
    model.fit(df_values)
    inertia.append(model.inertia_)
As we can see, the inertia value shows only a marginal decrement after k = 3, so a
k-means model with k = 3 (three clusters) is the most suitable for this task. You
can try running the same algorithm but change the parameter to 3 and see
how it changes.
In the next chapter, we will look at another clustering algorithm, i.e. the
hierarchical clustering algorithm.
18 Hierarchical Clustering
We looked at K-means clustering in the previous chapter, but did you
realise one limitation of K-means? You have to specify the number of clusters,
k, before fitting the model. Hierarchical (agglomerative) clustering does not
require this. It works as follows:
Step 1: Treat each data point as an individual cluster.
Step 2: Identify the two clusters that are most similar and merge them into one cluster.
Step 3: Repeat the process until only a single cluster remains.
If (x1, y1) and (x2, y2) are points in 2-dimensional space, then the
Euclidean distance between them is:
distance = √((x2 − x1)² + (y2 − y1)²)
Other than the Euclidean distance, several other metrics have been developed to
measure distance, such as the Hamming distance, Manhattan distance (Taxicab or
City Block distance), and Minkowski distance.
The choice of distance metric should be based on the field of study or the
problem that you are trying to solve. For example, if you are trying to
measure the distance between objects on a uniform grid, like a chessboard or city
blocks, then the Manhattan distance would be an apt choice.
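These distance metrics can be computed with SciPy (a small sketch; the two points are made up):

```python
from scipy.spatial import distance

a, b = (1, 2), (4, 6)

# Euclidean: straight-line distance, sqrt((4-1)^2 + (6-2)^2) = 5.0
d_euc = distance.euclidean(a, b)

# Manhattan (city block): |4-1| + |6-2| = 7
d_man = distance.cityblock(a, b)

# Minkowski with p=3 generalises both
d_min = distance.minkowski(a, b, p=3)

print(d_euc, d_man, d_min)
```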
The observations E and F are closer to each other than any other pair of points.
So, they are combined into one cluster, and the height of the link that joins
them is the smallest. The next observations that are closest to each
other are A and B, which are combined together.
This can also be observed in the dendrogram, as the height of the block
between A and B is slightly bigger than that between E and F. Similarly, D can be
merged into the E and F cluster, and then C can be combined with that. Finally,
the A and B cluster is combined with C, D, E, and F to form a single cluster.
Important points to note while reading a dendrogram:
1. Height of the blocks represents the distance between clusters.
2. Distance between observations represents dissimilarities.
But the question still remains the same. How do we find the number of
clusters using a dendrogram or where should we stop merging the clusters?
If you think about it, you can see that while A and B can be in one cluster,
they are far from the other cluster made of C, D, E and F. This is seen in the
dendrogram as the height of the line joining these two clusters is the longest.
Thus, you could divide these points into two clusters.
At each iteration, we separate the farthest points or clusters which are not
similar, until each data point is considered an individual cluster. Here we
are dividing a single cluster into n clusters, hence the name divisive
clustering.
# Data manipulation
import pandas as pd

# Technical indicators
import talib as ta

# Plotting graphs
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn-darkgrid')

# Machine learning
# Scaling the data
from sklearn.preprocessing import StandardScaler

# Hierarchical clustering
import scipy.cluster.hierarchy as sc

# Read the data
DF = pd.read_csv(path + 'sample_stocks.csv', index_col=0)
DF
We will scale the data as well.
[]: # Making a copy to work with
df = DF.copy()

scaler = StandardScaler()
df_values = scaler.fit_transform(df.values)

# Printing the pre-processed data
print(df_values)

# Create a dendrogram
sc.dendrogram(sc.linkage(df_values, method='ward'), labels=df.index)
plt.title('Dendrogram')
plt.xlabel('Sample index')
plt.ylabel('Euclidean distance')
plt.show()
You can easily see that you can form two clusters here.
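To turn the dendrogram into explicit cluster labels, you can cut the tree at a chosen number of clusters using SciPy's fcluster (a sketch; the data here is made up for illustration):

```python
import numpy as np
import scipy.cluster.hierarchy as sc

# Two well-separated groups of 2-D points (stand-ins for scaled stock features)
points = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                   [5.0, 5.1], [5.2, 5.0], [5.1, 5.2]])

linkage = sc.linkage(points, method='ward')

# Cut the tree so that exactly two clusters remain
labels = sc.fcluster(linkage, t=2, criterion='maxclust')
print(labels)
```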
19 Principal Component Analysis
A new axis is added to capture the ADX values. The graph becomes a 2D
graph with RSI as the x-axis and ADX as the y-axis.
You can see that due to the addition of ADX, the data points moved farther
apart. These data points are difficult to cluster into two groups. Instead,
they are clustered into four different groups.
Why did the number of clusters increase? At first, we had calculated only
one feature value, which was the RSI. Then, we added more information in
the form of the ADX values. Thus, the points which were similar and
clustered together became distant due to the addition of new features.
Similarly, when we add more features, the information increases and the
points may move farther away.
But what is the problem in adding more information and creating more
clusters?
In the RSI and ADX example, we ended up creating 4 clusters. Each cluster
had only 1 data point. In essence, we ended up where we started: we
started with 4 points and ended with 4 clusters. There was no grouping of
these 4 points. To generalise, too many dimensions or features cause each
point to appear equidistant from other points. If the distances are all
approximately equal, then all the observations appear equally alike (as well as
equally different), and no meaningful clusters can be formed.
Date              RSI    ADX
9 August 2021     40     20
10 August 2021    45     25
11 August 2021    60     45
12 August 2021    65     50
13 August 2021    40     25
14 August 2021    45     20
15 August 2021    60     50
16 August 2021    65     45
17 August 2021    44     30
18 August 2021    65     55
If you draw a scatter plot for these values, it looks as shown below:
The easiest way to reduce the dimensions is to remove one of them. For
example, either remove ADX or remove RSI values. But that would lead to
loss of information.
If only RSI were kept, the data points (40, 20) and (40, 25) would come together.
Earlier, the two points were at a distance of 5 units from each other. But now
it seems like they are the same point. Thus, the information that these two
points are different is lost.
Is there a better way to reduce the dimensions while keeping the loss of
information to a minimum? Yes. Draw a line that lies between the two axes,
and then project all the points onto this line.
You can see that even though the points are closer on the new axis, they can
still be seen distinctly. You simply rotate or straighten this line to convert the
data points into a single dimension. And you managed to preserve some
information of both dimensions. This is essentially what the principal
component analysis does, i.e. dimensionality reduction.
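This projection onto a single new axis is what PCA's first principal component gives you (a minimal sketch with scikit-learn, using a subset of the RSI/ADX values from the table above):

```python
import numpy as np
from sklearn.decomposition import PCA

# Two correlated features (RSI-like and ADX-like values)
X = np.array([[40, 20], [45, 25], [60, 45], [65, 50],
              [40, 25], [45, 20], [60, 50], [65, 45]])

# Keep only the first principal component: 2 dimensions -> 1
pca = PCA(n_components=1)
X_1d = pca.fit_transform(X)

print(X_1d.shape)                     # each point is now a single number
print(pca.explained_variance_ratio_)  # fraction of information retained
```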
Of course, in the real world, there would be some information loss, and thus
we use the principal component analysis to make sure that it is kept to a
minimum. That sounds really interesting, but how do we do this exactly?
The answer is Eigenvalues. Well, that’s not the whole answer, but in a way it
is. Before we get into the mathematics of it, let us refresh our concepts about
matrices (no, not the movie!).
Av = λv
The entire set of such vectors which do not change direction when multiplied
by a matrix are called Eigenvectors. The corresponding change in
magnitude is called the Eigenvalue. It is represented by λ here.
Remember how we said that the ultimate aim of the principal component
analysis is to reduce the number of variables. Well, Eigenvectors and
Eigenvalues help in this regard. One thing to note is that Eigenvectors are
defined for a square matrix, i.e. a matrix with an equal number of rows and
columns. (For rectangular matrices, the related singular value decomposition
is used instead.)
While it may sound daunting, there is a simple method to find the
Eigenvectors and Eigenvalues.
Av − λIv = 0
Now, we will assume that the Eigenvector v is non-zero. Thus, we can
find λ by setting the determinant to zero:
|A − λI| = 0
For example, take the matrix:
A = [2  6]
    [8 10]
Then:
(2 − λ)(10 − λ) − (8 × 6) = 0
20 − 2λ − 10λ + λ² − 48 = 0
λ² − 12λ − 28 = 0
(λ − 14)(λ + 2) = 0
λ = 14, −2
Hence, we can say that the Eigenvalues are 14 and −2.
Taking the λ value as −2, we will now find the corresponding Eigenvector.
Recall that Av = λv, with
v = [x]
    [y]
Solving (A − (−2)I)v = 0 gives 4x + 6y = 0, so one Eigenvector is
x = 3, y = −2.
Similarly, for λ = 14, the Eigenvector is x = 1, y = 2.
That’s awesome! We have understood how Eigenvectors and Eigenvalues are
found. Now that we know how to find the Eigenvalues and Eigenvectors, let
us talk about their properties.
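As a quick sanity check, NumPy can reproduce these values (a sketch using the same matrix A as in the example above):

```python
import numpy as np

A = np.array([[2, 6],
              [8, 10]])

# np.linalg.eig returns the eigenvalues and the (column) eigenvectors of A
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)  # 14 and -2, in some order

# Each column v satisfies A @ v = lambda * v (up to floating-point error);
# NumPy returns unit-length vectors, so (1, 2) and (3, -2) appear rescaled.
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(A @ v, lam * v)
```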
Let’s say we have a graph with certain points mapped on the x and y axes.
You can locate each point by giving its (x, y) coordinates, but there can be
a more efficient way to display them meaningfully.
What if we draw a new line such that each point can be located using just its
projection onto this new line? Let’s try that now.
Let us say that the green lines represent the distance from each point to the
new line. Furthermore, the distance from the first point on the line projected,
i.e. A, to the last point’s projection on the new line, i.e. B, is referred to as the
spread here. In the above figure, it is denoted as the red line. The importance
of spread is that it tells us how varied the data set is. The larger the value,
the easier it is for us to differentiate between two individual data points.
We should note that if we had variables in two dimensions, then we would get
two pairs of Eigenvalues and Eigenvectors. The second line would be
perpendicular to the first.
Now, we have a new coordinate axis which is much more suited to the points,
right? Since we are used to seeing the axes in a particular manner, we will
rotate them and keep it as shown in the below diagram.
When we said that the Eigenvalues give us the value by which the points are
stretched, we should add that the Eigenvector with the larger corresponding
Eigenvalue carries more information than the one with the smaller
Eigenvalue.
Oh, wait! This chapter talks about Principal Component Analysis. Well, the
Eigenvectors of the data’s covariance matrix are actually its principal components.
But there is one missing part of the puzzle left. As we have seen the
illustration, the data points were bunched together, and hence the distance
between the points was not that significant.
But what if we were trying to compare stock prices, one whose value was less
than $10 and another which was more than $500? Obviously, the difference in
price scale itself would lead to a different result. Hence, we try to find a way
to standardise the data. And since we are interested in how the variables vary
together, we use the covariance matrix.
19.1.1 Variance
Variance is calculated using the below formula:
Variance (σ²) = Σᵢ(xᵢ − x̄)² / N
For example, there are three points: 155, 146, 152.
The average of these three points is 151. Using the formula, the variance is:
σ² = ((155 − 151)² + (146 − 151)² + (152 − 151)²) / 3 = (16 + 25 + 1) / 3 = 14
Why did we square the difference? The first reason is that if we had simply
taken the difference, it would lead to the numerator becoming 0. Also,
squaring the difference helps us give more weightage to the points which are
farther from the mean. Thus, the point 146 is 5 points away from 151 and that
leads to the value of 25. Whereas, 152 is merely one point away and thus, its
square is only 1.
Another reason is that squaring helps us give the same weightage to points which
are above or below the mean. A data point that is 2 units away from the mean
should have the same weightage irrespective of whether it is above or below
the mean price. Squaring the difference helps us achieve that.
19.1.2 Covariance
Covariance is calculated using the below formula:
Cov(X, Y) = Σᵢ(xᵢ − Xmean)(yᵢ − Ymean) / N
where:
Cov(X, Y) = covariance between X and Y
xᵢ = the various observations of X
Xmean = the average value of X
yᵢ = the various observations of Y
Ymean = the average value of Y
N = the number of observations
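Both formulas can be checked numerically (a sketch using NumPy; the X values are the three points from the variance example, and the Y values are made up):

```python
import numpy as np

x = np.array([155, 146, 152])

# Population variance (divide by N), matching the formula above
print(np.var(x))  # 14.0

# Covariance between two series; bias=True divides by N as in the formula
y = np.array([10, 4, 9])  # made-up second series
cov_matrix = np.cov(x, y, bias=True)
print(cov_matrix[0, 1])
```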
Now, if we find the Eigenvalues and Eigenvectors of the covariance matrix,
the largest Eigenvalue carries the most information about the
covariance matrix. Thus, we can keep only the largest Eigenvalues and their
corresponding Eigenvectors and still retain most of the
information that was present in the covariance matrix.
That’s all there is to it. Now, to take the earlier example: if we kept only the
component with Eigenvalue 14, we would lose some information, but it would
simplify the computation a great deal, and in that sense we would have reduced
the dimension of the variables.
When to use Principal Component Analysis? Suppose you are dealing with
a large dataset which is computationally heavy, then you can use Principal
component Analysis. In fact, it could help us a great deal in the strategy for
pairs trading. Let’s try that, shall we?
Since we are looking for global implementation, we will use the stocks listed
on the NYSE.
Since we will be working on the data for a lot of companies, we have created
a csv file which contains tickers for 20 companies listed on the NYSE. For
the time being, we will import their Adjusted Close Price. The code is as
follows:
# Import the dataset
df = pd.read_csv("pca.csv", index_col=0)
df.head()

Date
2018-01-02  1065.000000  1073.209961  1189.010010  148.906738  282.886383
2018-01-03  1082.479980  1091.520020  1204.199951  150.778992  283.801239
...

Date
2018-01-02  39.286922  200.766891  131.867477  375.250000  49.867218
2018-01-03  39.444405  204.136826  133.980453  383.820007  50.786686
2018-01-04  39.537041  205.207260  135.410294  376.920013  51.372623
2018-01-05  40.046532  208.686172  136.293716  379.010010  51.273472
2018-01-08  39.842739  206.376816  138.188080  391.859985  51.796291
Since we will be working on daily returns, we write the code as follows:
[]: data_daily_returns = df.pct_change()
data_daily_returns.dropna(inplace=True)
If we have to check the number of rows and columns of the array, we use the
following code:
[]: data_daily_returns.shape
[]: (501, 20)
[]: from sklearn.decomposition import PCA

N_PRIN_COMPONENTS = 18
pca = PCA(n_components=N_PRIN_COMPONENTS)
pca.fit(data_daily_returns)
[]: PCA(n_components=18)
You can check the number of rows and columns in the array with the “shape”
command.
[]: pre_process = pca.components_.T
pre_process.shape
[]: (20, 18)
You can see that each stock, which was earlier described by 501 daily returns,
is now described by just 18 component loadings. Now, we have to use this for
trading. First, we will scale the data. The scaled array has the same shape:
(20, 18)
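A minimal sketch of what that scaling step might look like (assuming `pre_process` is the (20, 18) loadings array, and using StandardScaler as before; random values stand in for the real loadings):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in for the (20, 18) array of PCA loadings (random values for the sketch)
rng = np.random.default_rng(0)
pre_process = rng.normal(size=(20, 18))

# Scale each column to zero mean and unit variance, as done earlier in the book
X = StandardScaler().fit_transform(pre_process)
print(X.shape)  # (20, 18)
```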
[]: # Using the k-means algorithm on the dataset
clf = KMeans(n_clusters=4, init='k-means++', max_iter=30,
             n_init=10, random_state=7)
print(clf)

# Using the fit function
clf.fit(X)
labels = clf.labels_
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print("\nClusters discovered: %d" % n_clusters_)

plt.scatter(
    X_tsne[(labels != -1), 0], X_tsne[(labels != -1), 1], s=100,
    alpha=0.85,
    c=labels[labels != -1]
)
plt.scatter(
    X_tsne[(clustered_series_all == -1).values, 0],
    X_tsne[(clustered_series_all == -1).values, 1], s=100,
    alpha=0.05
)
plt.title('t-SNE of all Stocks with KMeans Clusters Noted');
You can see that there are 4 clusters formed. While it may look like they are
spread out, that is because t-SNE has compressed the clusters into a
2-dimensional space for visualisation, so the groups may not appear tightly
packed together.
Once these clusters are formed, you can use them further for analysis or your
own trading strategy.
Great! We have not only seen two unsupervised algorithms, but we have also
seen how to overcome the curse of dimensionality. In the next chapters, we
lean back a bit and try to understand the concepts of natural language
processing and reinforcement learning.
20 Natural Language Processing
How can you use NLP in trading? NLP in trading is mainly used to gauge
the sentiment of the market through Twitter feeds, newspaper articles, RSS
feeds, and Press releases. In this chapter, we will cover the basic structure
needed to solve the NLP problem from a trader’s perspective.
News and NLP Anyone who has traded some sort of a financial instrument
knows that the markets constantly factor in all the news that is pouring in
through various sources.
The cause and effect relationship between impactful news and market
movements can be directly observed when one tries to trade the market
during the release of big news such as the non-farm payrolls data.
Before social media became one of the main sources of information, traders
used to depend on radio or TV announcements for the latest information.
Get the Data To build an NLP model for trading, you need to have a reliable
source of data. There are multiple vendors for this purpose.
How to get news headlines for a specific company?
In the news headline text, you can search for the company ticker or company
name to get the news headlines related to it. For example, to get the news
headlines specific to Apple Inc., you can search for AAPL or Apple. You can
also consider including news headlines which mention Apple products such
as the iPhone or Mac. However, there is a chance that you might miss out on
news which indirectly references Apple. For example, “The biggest price
drop seen in the best-selling smartphone in the US”.
How to get the latest news headline data and sentiment score for them? You
can get the latest news headline data from webhose.io. It provides a Python
API to extract the news headline.
Let us divide the data into two types and try to approach each of them
differently.
Pre-process the Data Unstructured data like Twitter feeds contains a lot of
non-textual data, such as hashtags and mentions. These need to be removed
before measuring the text’s sentiment.
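For instance, stripping hashtags, mentions, and URLs from a tweet could look like this (a simple regex sketch, not a library the book prescribes; the sample tweet is made up):

```python
import re

def clean_tweet(text):
    """Remove mentions, hashtags, and URLs before sentiment scoring."""
    text = re.sub(r'https?://\S+', '', text)   # URLs
    text = re.sub(r'[@#]\w+', '', text)        # @mentions and #hashtags
    return ' '.join(text.split())              # collapse extra whitespace

tweet = "$AAPL looking strong today! #bullish @trader_joe https://t.co/xyz"
print(clean_tweet(tweet))  # "$AAPL looking strong today!"
```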
For structured data, the size of the text can easily cloud its essence. To solve
this, you need to break the text down to individual sentences or apply
techniques such as term frequency-inverse document frequency (tf-idf) to
estimate the importance of words.
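The tf-idf technique mentioned above can be sketched with scikit-learn (illustrative headlines, echoing the Fed example discussed later in this chapter):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

headlines = [
    "inflation expectations are firmly anchored",
    "inflation expectations are stable",
    "tech stocks rally on strong earnings",
]

# Each headline becomes a vector of tf-idf weights; words that appear in
# many headlines get lower weights than words unique to one headline.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(headlines)
print(tfidf.shape)  # (number of headlines, size of the vocabulary)
```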
For structured text, you don’t have any pre-existing libraries that can help
you convert the text to a positive or a negative score. So, you will have to
create a library of your own.
When building such a library of relevant structured data, care should be taken
to consider texts from similar sources and the corresponding market reactions
to this text data.
For example, if the Fed releases a statement saying that “the inflation
expectations are firmly anchored” and changes it to “the inflation
expectations are stable”, then libraries like VADER won’t be able to tell the
difference, but the market will react significantly.
To score the sentiment of such text, you need to develop a word-to-vector
model or a decision tree model using the tf-idf array.
Generate a Trading Model Once you have the sentiment scores of the text,
you can combine this with some kind of technical indicators to filter the noise
and generate the buy and sell signals.
Backtest the Model Once the model is ready, you need to backtest it on the
past data to check whether your model’s performance is within the risk
limitations. While backtesting, make sure that you don’t use the same data
that is used to train the decision tree model.
If the model conforms to your risk management criteria, then you can deploy
the model in live trading.
21 Reinforcement Learning
Initially, we were using machine learning and AI to simulate how humans
think, only a thousand times faster! The human brain is complicated but is
limited in capacity. This simulation was the early driving force of AI
research. But we have reached a point today where humans are amazed at
how AI “thinks”.
While most chess players know that the ultimate objective of chess is to win,
they still try to keep most of the chess pieces on the board. But AlphaZero
understood that it didn’t need all its chess pieces as long as it was able to take
the opponent’s king. Thus, its moves are perceived to be quite risky, but
ultimately they pay off handsomely.
As a kid, you were always given a reward for excelling in sports or studies.
Also, you were reprimanded or scolded for doing something mischievous like
breaking a vase. This was a way to change your behaviour. Suppose you
would get a bicycle or PlayStation for coming first, you would practice a lot
to come first. And since you knew that breaking a vase meant trouble, you
would be careful around it. This is called reinforcement learning. The reward
served as positive reinforcement while the punishment served as negative
reinforcement. In this manner, your elders shaped your learning.
Similarly, the RL algorithm can learn to trade in financial markets on its own
by looking at the rewards or punishments received for the actions.
Like a human, our agents learn for themselves to achieve successful strategies
that lead to the greatest long-term rewards. This paradigm of learning by
trial-and-error, solely from rewards or punishments, is known as
reinforcement learning (RL).
— Google DeepMind
How to apply reinforcement learning in trading? In the realm of trading,
the problem can be stated in multiple ways such as to maximise profit, reduce
drawdowns, or portfolio allocation. The RL algorithm will learn the strategy
to maximise long-term rewards.
For example, the share price of Amazon was almost flat from late 2018 to the
start of 2020. Most of us would think a mean-reverting strategy would work
better here.
But if you see from early 2020, the price picked up and started trending. Thus
from the start of 2020, deploying a mean-reverting strategy would have
resulted in a loss. Looking at the mean-reverting market conditions in the
prior year, most of the traders would have exited the market when it started to
trend.
But if you had gone long and held the stock, it would have helped you in the
long run. In this case, you forgo your present reward for future long-term
gains. This behaviour is similar to the concept of delayed gratification,
which was talked about at the beginning of the chapter.
The RL model can pick up price patterns from the years 2017 and 2018, and
with the bigger picture in mind, the model can continue to hold on to a stock
for outsize profits later on.
How is reinforcement learning different from traditional machine learning
algorithms?
As you can see in the above example, you don’t have to provide labels at
each time step to the RL algorithm. The RL algorithm initially learns to trade
through trial and error, and receives a reward when the trade is closed. And
later optimises the strategy to maximise the rewards. This is different from
traditional ML algorithms which require labels at each time step or at a
certain frequency.
For example, the target label can be percentage change after every hour. The
traditional ML algorithms try to classify the data. Therefore, the delayed
gratification problem would be difficult to solve through conventional ML
algorithms.
Actions The actions depend on the problem the RL algorithm is solving.
If it is solving a trading problem, then the actions would be
Buy, Sell, and Hold. If the problem is portfolio management, then the actions
would be the capital allocations to each of the asset classes. How does the RL
model decide which action to take?
Policy There are two methods, or policies, which help the RL model take
actions. Initially, when the RL agent knows nothing about the game, it can
choose actions randomly and learn from the outcomes. This is called an
exploration policy. Later, the RL agent can use its past experiences to map
each state to the action that maximises the long-term rewards. This is
called an exploitation policy.
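In practice, these two policies are often combined into a single epsilon-greedy rule: explore with a small probability, exploit otherwise. A minimal sketch in Python follows; the action names and Q-values are illustrative assumptions, not from the book.

```python
import random

# Epsilon-greedy policy sketch: with probability epsilon the agent explores
# (picks a random action); otherwise it exploits its current Q-value
# estimates. Actions and Q-values below are hypothetical.
ACTIONS = ["buy", "sell", "hold"]

def choose_action(q_values, epsilon=0.1):
    """q_values: dict mapping each action to its estimated long-term reward."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)       # exploration
    return max(q_values, key=q_values.get)  # exploitation

# With epsilon=0 the agent always exploits the best-known action.
print(choose_action({"buy": 0.2, "sell": -0.1, "hold": 0.05}, epsilon=0.0))
```

Typically epsilon starts high and is decayed over time, so the agent gradually shifts from exploration to exploitation.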
But for proper analysis and execution, the input data should be weakly
predictive and weakly stationary. That the data should be weakly predictive
is simple enough to understand, but what does weakly stationary mean? It
means that the data has a roughly constant mean and variance over time. Why
is this important? The short answer is that machine learning algorithms work
well on stationary data. Alright! How does the RL model learn which action
to take in a given state?
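Before moving on, the weak-stationarity condition above can be eyeballed with a quick check: compare the mean and variance across two halves of the sample. The returns below are synthetic, purely for illustration.

```python
import numpy as np

# Rough weak-stationarity sanity check: split the series in half and compare
# means and variances. Roughly equal values across the halves are consistent
# with a constant mean and variance.
rng = np.random.default_rng(0)
returns = rng.normal(loc=0.0, scale=0.01, size=500)  # synthetic daily returns

first, second = returns[:250], returns[250:]
print("means:    ", round(first.mean(), 4), round(second.mean(), 4))
print("variances:", round(first.var(), 6), round(second.var(), 6))
```

For real price data, one would typically work with returns rather than raw prices, since prices trend while returns are much closer to stationary.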
Rewards A reward can be thought of as the end objective you want to achieve
from your RL system. For example, if the end objective is a profitable
trading system, your reward is profit. If it is the best risk-adjusted
returns, your reward is the Sharpe ratio.
Defining the reward function is critical to the performance of an RL model.
Metrics such as the profit per trade, the Sharpe ratio, or the maximum
drawdown can be used to define the reward.
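As a sketch, a Sharpe-ratio-based reward function could look like the following; the annualisation factor of 252 and the zero risk-free rate are assumptions for illustration.

```python
import numpy as np

# Hypothetical reward function: the annualised Sharpe ratio of the strategy's
# daily returns, assuming a zero risk-free rate and 252 trading days a year.
def sharpe_reward(daily_returns, periods_per_year=252):
    daily_returns = np.asarray(daily_returns, dtype=float)
    if daily_returns.std() == 0:
        return 0.0  # avoid division by zero on constant returns
    return np.sqrt(periods_per_year) * daily_returns.mean() / daily_returns.std()

print(round(sharpe_reward([0.01, -0.005, 0.007, 0.002]), 2))
```

Rewarding the Sharpe ratio rather than raw profit pushes the agent towards strategies with steadier returns, at the cost of penalising volatile but profitable ones.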
RL Agent The agent is the RL model which takes the input features (the
state) and decides which action to take. For example, the RL agent takes the
RSI and the past 10 minutes' returns as input and tells us whether we should
go long on Apple stock, or square off the position if we are already long.
• Environment: For simplicity, we say that the order was placed at the open of
the next trading day, which is July 27. The order was filled at $92. Thus, the
environment tells us that you are long one share of Apple at $92.
• State & Action: You get the next state of the system, created using the
latest available price data. At the close of July 27, the price had reached
$94. The agent analyses the state and gives the next action, say Sell, to the
environment.
• Environment: A sell order is placed, which squares off the long position.
• Reward: A reward of 2.1% is given to the agent.
Date     Closing price  Action  Reward
July 24  $92            Buy     NA
July 27  $94            Sell    2.1%
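The interaction above can be sketched as a minimal agent-environment loop. The hard-coded action rule stands in for the RL agent; this is an illustration, not a backtester.

```python
# Minimal agent-environment loop for the two-day example above. The action
# rule is a stand-in for the RL agent; prices are the July 24 and 27 closes.
prices = [("July 24", 92.0), ("July 27", 94.0)]

position, entry_price, rewards = 0, None, []

for date, price in prices:
    action = "buy" if position == 0 else "sell"  # stand-in for the agent
    if action == "buy":
        position, entry_price = 1, price         # environment fills the order
    elif position == 1:                          # selling squares off the long
        rewards.append(price / entry_price - 1)  # realised return is the reward
        position = 0

print([round(r, 4) for r in rewards])  # → [0.0217], roughly the 2.1% reward
```

In a real system the environment would also model slippage, fees, and partial fills, all of which change the reward the agent actually receives.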
At each time step, the RL agent needs to decide which action to take. What
if the RL agent had a table which told it which action would give the
maximum reward? It could then simply select that action. This table is the
Q-table.
In the Q-table, the rows are the states (in this case, the days) and the
columns are the actions (in this case, hold and sell). The values in this
table are called the Q-values.
Date        Sell   Hold
23-07-2020  0.954  0.966
From the above Q-table, which action would the RL agent take on 23 July?
That's right: a Hold action would be taken, as its Q-value of 0.966 is
greater than the Q-value of 0.954 for the Sell action.
But how do we create the Q-table? Let's build one with the help of an
example. For simplicity's sake, let us take the same example of price data
from July 22 to July 31, 2020. We have added the percentage returns and
cumulative returns, as shown below:
Date        Closing price  % Returns  Cumulative returns
22-07-2020  97.2           -          1.00
23-07-2020  92.8           -4.53%     0.95
24-07-2020  92.6           -0.22%     0.95
27-07-2020  94.8           2.38%      0.98
28-07-2020  93.3           -1.58%     0.96
29-07-2020  95.0           1.82%      0.98
30-07-2020  96.2           1.26%      0.99
31-07-2020  106.3          10.50%     1.09
You bought one share of Apple a few days back and have no more capital left.
The only two choices for you are "hold" or "sell". As a first step, you need
to create a simple reward table.
If we decide to hold, we get no reward until 31 July, when we receive a
reward of 1.09. If we decide to sell on any day, the reward is the
cumulative return up to that day. The reward table (R-table) looks like the
one below. If we let the RL model choose greedily from the reward table, it
will sell the stock at the first opportunity and get a reward of 0.95.
State/Action Sell Hold
22-07-2020 0 0
23-07-2020 0.95 0
24-07-2020 0.95 0
27-07-2020 0.98 0
28-07-2020 0.96 0
29-07-2020 0.98 0
30-07-2020 0.99 0
31-07-2020 1.09 1.09
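The R-table above can be generated directly from the closing prices. A short sketch:

```python
# Build the reward table (R-table) from the closing prices. Selling pays the
# cumulative return up to that day; holding pays nothing until the last day.
dates = ["22-07", "23-07", "24-07", "27-07", "28-07", "29-07", "30-07", "31-07"]
prices = [97.2, 92.8, 92.6, 94.8, 93.3, 95.0, 96.2, 106.3]

cum_returns = [round(p / prices[0], 2) for p in prices]
r_sell = [0.0] + cum_returns[1:]                        # no gain on the entry day
r_hold = [0.0] * (len(prices) - 1) + [cum_returns[-1]]  # reward only at the end

for d, s, h in zip(dates, r_sell, r_hold):
    print(d, s, h)
```

Printing the three lists side by side reproduces the Sell and Hold columns shown above.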
We will first start with the q-value for the Hold action on July 30. A
q-value is the immediate reward for taking an action plus the discounted
maximum q-value achievable on the next day. The immediate reward for
holding, as seen in the R-table, is 0. Let us assume a discount factor of
gamma = 0.98. The maximum Q-value over the sell and hold actions on the next
day, i.e. 31 July, is 1.09. Thus, the q-value for the Hold action on 30 July
is 0 + 0.98 × 1.09 ≈ 1.07.
In this way, we will fill the values for the other rows of the Hold column to
complete the Q table.
State/Action  Sell   Hold
22-07-2020    0      0.947
23-07-2020    0.954  0.966
24-07-2020    0.953  0.985
27-07-2020    0.975  1.005
28-07-2020    0.960  1.026
29-07-2020    0.977  1.047
30-07-2020    0.990  1.068
31-07-2020    1.09   1.09
The RL model will now select the hold action to maximise the Q-value. This
was the intuition behind the Q-table, and this process of updating it is
called Q-learning. Of course, we took a scenario with a limited number of
actions and states. In reality, we have a large state space, and building a
Q-table becomes both time-consuming and resource-intensive.
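The backward-filling procedure described above can be sketched in a few lines. Because this version keeps full precision instead of rounding at each step, its numbers agree with the worked example only up to rounding.

```python
# Build the Q-table by filling the Hold column backwards from the last day.
# Q(sell) is the cumulative return received immediately; Q(hold) is the
# discounted best Q-value of the next day (discount factor gamma = 0.98).
gamma = 0.98
prices = [97.2, 92.8, 92.6, 94.8, 93.3, 95.0, 96.2, 106.3]
cum = [p / prices[0] for p in prices]

q_sell = [0.0] + cum[1:]        # selling on the entry day earns nothing
q_hold = [0.0] * len(prices)
q_hold[-1] = cum[-1]            # on the last day, holding equals selling
for t in range(len(prices) - 2, -1, -1):
    q_hold[t] = gamma * max(q_sell[t + 1], q_hold[t + 1])

for s, h in zip(q_sell, q_hold):
    print(round(s, 3), round(h, 3))
# On every day before the last, Q(hold) exceeds Q(sell), so the model holds.
```

Note how discounting does the work here: each day's Hold value inherits the big final payoff, shrunk slightly by gamma, which is exactly what lets the model forgo a small immediate reward for the larger delayed one.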
To overcome this problem, you can use deep neural networks, also called Deep
Q-Networks (DQN). A deep Q-network learns to approximate the Q-table from
past experiences: given a state as input, it outputs the Q-value for each of
the actions, and we select the action with the maximum Q-value.
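As a toy illustration of the idea, the sketch below runs a tiny network that maps a state vector to one Q-value per action. The weights are random and the two input features are hypothetical; a real DQN would learn the weights from past experience.

```python
import numpy as np

# Toy DQN-style forward pass: a small network maps a two-feature state (say,
# a scaled RSI and a recent return) to one Q-value per action. The weights
# are random for illustration, not trained.
rng = np.random.default_rng(42)
ACTIONS = ["buy", "sell", "hold"]

W1, b1 = rng.normal(size=(2, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 3)), np.zeros(3)

def q_values(state):
    hidden = np.maximum(0.0, state @ W1 + b1)  # ReLU hidden layer
    return hidden @ W2 + b2                    # one Q-value per action

state = np.array([0.55, 0.012])
q = q_values(state)
print(ACTIONS[int(np.argmax(q))])              # greedy action for this state
```

The key difference from the tabular approach is that the network generalises: nearby states produce similar Q-values, so the agent does not need to have visited every state before.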
Issues in Reinforcement Learning There are mainly two issues which you
have to consider while building the RL model. They are as follows:
Type 2 Chaos This might feel like a science fiction concept but it is very
real. While we are training the RL model, we are working in isolation. Here,
the RL model is not interacting with the market. But once it is deployed, we
don’t know how it will affect the market. Type 2 chaos is essentially when
the observer of a situation has the ability to influence the situation. This effect
is difficult to quantify while training the RL model itself. However, it can be
reasonably assumed that the RL model is still learning even when it is
deployed and thus will be able to correct itself accordingly.
Noise in Financial Data There are situations where the RL model could pick
up random noise which is usually present in financial data, and consider it as
input which should be acted upon. This could lead to inaccurate trading
signals. While there are ways to remove noise, we have to be careful of the
tradeoff between removing noise and losing important data.
While these issues are definitely not something to be ignored, there are
various solutions available to reduce them and create a better RL model in
trading.
Conclusion
This brings us to the not-so-happening part of this book, i.e. its
conclusion. We have come a long way together, haven't we?
Today, we have many different ML algorithms used for different purposes,
each with its own strengths and weaknesses. We hope you enjoyed this ride
and gained some knowledge from this book. We also hope it made you curious
about what more is out there in the ML world and its applications in
trading.