
Machine-learning for Portfolio Management and Trading:

An Introduction with Scikit-Learn

Sylvain Champonnois1
March 28, 2022

1 Email: [email protected]. This pdf was prepared for the class Data Science applied to Finance, MSc Data Science for Business X-HEC.
Contents

1 Introduction
  1.1 Outline
  1.2 Trends
    1.2.1 Data deluge
    1.2.2 Data scouting
    1.2.3 ML research
  1.3 MLOps
    1.3.1 Pipelines
    1.3.2 Scikit-learn
    1.3.3 Backtesting
  1.4 The business model of hedge funds

2 Backtesting
  2.1 Markowitz portfolio optimisation
  2.2 Industry momentum backtest
    2.2.1 Industry data
    2.2.2 Backtesting functions
    2.2.3 Scikit-Learn TimeSeriesSplit
  2.3 Empirical results
    2.3.1 Cumulative pnl
    2.3.2 Other backtest statistics

3 Linear estimators
  3.1 Ridge / Lasso / Elastic net
  3.2 Scikit-learn Pipeline and Multi-output
  3.3 Predicting industry returns with linear models
    3.3.1 Linear Regression
    3.3.2 Ridge
    3.3.3 Ridge with feature expansion

4 Appendix: helper functions
  4.1 Metrics
  4.2 Data visualisation
  4.3 Dataset
    4.3.1 Ken French data: industry returns
    4.3.2 Stock returns (2003-2007)
    4.3.3 13F Berkshire Hathaway

Chapter 1

Introduction

This material is an introduction to using machine-learning for portfolio management and trading. Given the centrality of programming in hedge funds today, the concepts are presented using only jupyter notebooks in python. Moreover, we leverage the scikit-learn package (also known as sklearn) to illustrate how machine-learning is used in practice in this context.

1.1 Outline

Outline:

1. Backtesting: Markowitz portfolio optimisation, industry momentum

2. Risk: risk-model shrinkage, non-normal returns

3. Linear models: pipelines, Ridge, Lasso, feature engineering

4. Non-linear models: Boosted trees, Multi-layer perceptron

5. Factors: Value, Momentum, Style Analysis

6. Overfitting: hyperparameter search, validation strategy

7. Transaction costs: turnover, leverage, portfolio optimisation with constraints

8. Mean reversion: profitability of liquidity provision, survivorship-free sample

We are interested in how quantitative hedge funds operate in practice. Today, quantitative hedge funds are essentially consumers of data – they ingest all sorts of datasets and extract information used to systematically buy or sell securities. Researchers and portfolio managers are deeply involved in the process of data ingestion and information extraction, but they do not directly decide which securities are bought or sold – the algorithms do.

Because these processes of data ingestion and information extraction are so central to quantitative hedge fund operations, these funds have become software companies – a lot of the intellectual property (IP) of hedge funds is embedded in the code they write. In that sense, hedge funds are not so different from other data-science-based technology companies. (And in fact the hiring has become very similar, with a lot of interest in profiles from Computer Science, Machine Learning, Data Engineering, Statistics, etc.)

In the rest of this section, we first describe how new datasets have become available and what this implies for machine-learning research and, more particularly, for quantitative hedge funds. We then introduce the structure of this course. In particular, given that this course focuses on code (that is, python code), it is written entirely in jupyter notebooks.

1.2 Trends

1.2.1 Data deluge

There are now sensors everywhere in the physical world and most online interactions are tracked – leading to a “data deluge” (e.g. see Mary Meeker (2018) on internet trends or Lori Lewis (2022)).

1.2.2 Data scouting

Alternative data = web + credit card transactions + geolocation + satellite imaging

Two examples of how data is transforming hedge funds, from the Financial Times and Bloomberg: FT (08/28/2017) and Bloomberg (06/15/2019).

1.2.3 ML research

ML research has become a race, with new ideas coming out at an increasing speed – e.g. as illustrated by the number of papers published on the scientific paper repository arxiv.org (Jeff Dean (06/02/2019)). More precisely, as Francois Chollet (04/03/2019) points out, the issue is how to process (i.e. test empirically) these new ideas as fast as possible to gain a competitive edge.

The success of deep-learning depends on: i) model “capacity”; ii) computational power; iii) dataset size. Sun, Shrivastava, Singh and Gupta (2017) note that the size of the largest dataset has remained somewhat constant over the last few years.

A particular success of deep-learning has been in Natural Language Processing (NLP), and there too, the size of the largest models has increased dramatically (Victor Sanh, Lysandre Debut, Julien Chaumond and Thomas Wolf (2019)).

1.3 MLOps

MLOps (machine-learning operations) is a set of practices for the deployment of ML models in production. For quant hedge funds, there are two main concepts, which we describe here:
• pipelines
• backtests

1.3.1 Pipelines

Pipeline:
A machine-learning pipeline is an end-to-end description of the automated flow of
data from raw inputs to a desired output. Each step represents a transformation of the
data, possibly with a fitted model.
The diagram below illustrates a pipeline for a quant fund. The end point (to the right of the diagram) is the holdings in a set of traded securities – combined with the returns on these securities, the pnl of a given strategy can be computed. The entry point (to the left of the diagram) is a set of features. A set of transformations (pre-determined in the pipeline) is applied to these features to produce the desired holdings. Some transformations in the pipeline are “fixed” while others depend on fitted models (e.g. a ML predictor of returns or a risk model).

In the diagram, we emphasize the timing of these different objects:

• for a pnl at time $t$, the features and target include only information up to $t - 1$, so that the holdings are known at $t - 1$ and can accrue returns over period $t$.

The following equation summarizes this point:

$$\text{pnl}_t = \text{holdings}_{t-1} \times \text{returns}_t.$$

1.3.2 Scikit-learn

The following notebooks and notes are largely based on scikit-learn, an extremely powerful (and widely used) package for machine-learning. In particular, it provides a “grammar” for pipelines where each transformation or estimator class exposes fit/transform/predict functions with arguments (X, y), where X represents the features and y the targets.
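A minimal sketch of this grammar on synthetic data (the toy shapes and the seed are arbitrary; only standard scikit-learn objects are used):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))                             # features
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)   # targets

# intermediate steps implement fit/transform; the final step implements fit/predict
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
pipe.fit(X, y)
y_hat = pipe.predict(X)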

1.3.3 Backtesting

A look-ahead bias occurs when data dated at t includes information only available after t; in contrast, point-in-time data ensures that data dated at t is based only on information up to date t. A backtest is a method to simulate a strategy on point-in-time historical data and evaluate its profitability.

In order to illustrate how to use pipelines à la scikit-learn for quantitative portfolio management, we introduce a thin layer of functions – in particular, the Backtester class. This class allows running a rolling-window simulation so that only information up to date t − 1 is used to determine the holdings at that date.

1.4 The business model of hedge funds

HF = data + compute power + human capital + intangibles


Chapter 2

Backtesting

In this section, we construct a backtest using industry data. More precisely, we use data from Ken
French’s data library to construct a simple industry momentum return predictor.
The goal of a backtest is to assess the validity of a trading predictor at any point in the past. In particular, it is crucial to avoid any forward-looking bias – in which information available only after time t is mistakenly used at time t. In practice, the predictors are estimated over rolling (or expanding) windows. We implement rolling-window estimation with the sklearn TimeSeriesSplit object.
For backtesting, visualisation is very important and we make use of some plotting functions in-
troduced in the Appendix:

[1]: from ml4pmt.plot import line, bar, heatmap

2.1 Markowitz portfolio optimisation

First, a review of mean-variance optimisation for a universe of $N$ assets, where $\alpha = E(r)$ is the return forecast.

Lemma [mean-variance]: the allocation that maximizes the utility $h^T \alpha - \frac{h^T V h}{2\lambda}$ is

$$h = \lambda V^{-1} \alpha,$$

where $\lambda$ is the risk-tolerance.

The ex-ante risk is $h^T V h = \lambda^2 \, \alpha^T V^{-1} \alpha$ and the ex-ante Sharpe ratio is

$$S = \frac{h^T E(r)}{\sqrt{h^T V h}} = \sqrt{\alpha^T V^{-1} \alpha}.$$

Corollary: the maximisation of the Sharpe ratio is equivalent (up to a scaling factor) to the mean-variance optimisation.
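A quick numerical check of the lemma (a sketch with an arbitrary toy covariance matrix and forecast):

import numpy as np

rng = np.random.RandomState(0)
N, lam = 4, 2.0
G = rng.normal(size=(N, N))
V = G @ G.T + N * np.eye(N)       # a positive-definite covariance matrix
alpha = rng.normal(size=N)        # return forecast

h = lam * np.linalg.solve(V, alpha)       # h = lambda V^{-1} alpha
risk = h @ V @ h                          # ex-ante risk
sharpe = h @ alpha / np.sqrt(h @ V @ h)   # ex-ante Sharpe ratio
assert np.isclose(risk, lam ** 2 * alpha @ np.linalg.solve(V, alpha))
assert np.isclose(sharpe, np.sqrt(alpha @ np.linalg.solve(V, alpha)))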


The mean-variance formula can be extended to account for linear constraints

$$Ah = b.$$

To do so, we introduce the Lagrangian $\mathcal{L}$ (with Lagrange multiplier $\xi$):

$$\mathcal{L} = h^T \alpha - \frac{h^T V h}{2\lambda} - (h^T A^T - b^T)\xi.$$

The Lagrange multiplier $\xi$ is a tuning parameter chosen exactly so that the constraint above holds. At the optimal value of $\xi$, the constrained problem boils down to an unconstrained problem with the adjusted return forecast $\alpha - A^T \xi$.

Lemma: the allocation that maximizes the utility $h^T \alpha - \frac{h^T V h}{2\lambda}$ under the linear constraint $Ah = b$ is

$$h = V^{-1} A^T \left( A V^{-1} A^T \right)^{-1} b + \lambda V^{-1} \left[ \alpha - A^T \left( A V^{-1} A^T \right)^{-1} A V^{-1} \alpha \right].$$

Proof: the first-order condition is

$$\frac{\partial \mathcal{L}}{\partial h} = \alpha - \frac{V h}{\lambda} - A^T \xi = 0 \iff h = \lambda V^{-1} \left[ \alpha - A^T \xi \right].$$

The parameter $\xi$ is chosen so that $Ah = b$:

$$b = Ah = \lambda A V^{-1} \left[ \alpha - A^T \xi \right] \;\Rightarrow\; \xi = \left[ A V^{-1} A^T \right]^{-1} \left( A V^{-1} \alpha - \frac{b}{\lambda} \right).$$

The holding vector under the constraint is

$$h_\lambda = \underbrace{V^{-1} A^T \left( A V^{-1} A^T \right)^{-1} b}_{\text{minimum variance portfolio}} + \underbrace{\lambda V^{-1} \left[ \alpha - A^T \left( A V^{-1} A^T \right)^{-1} A V^{-1} \alpha \right]}_{\text{speculative portfolio}}$$

• The first term is what minimises the risk $h^T V h$ under the constraint $Ah = b$ (in particular, it does not depend on expected returns or risk-tolerance).
• The second term is the speculative portfolio (it is sensitive to both inputs).

The efficient frontier is the relation between the expected portfolio return $h^T \alpha$ and the portfolio standard deviation $\sqrt{h^T V h}$ for varying levels of risk-tolerance:

$$(x, y) \mapsto \left( h_\lambda^T \alpha, \sqrt{h_\lambda^T V h_\lambda} \right).$$

When $b = 0$, the efficient frontier between $h_\lambda^T \alpha$ and $\sqrt{h_\lambda^T V h_\lambda}$ is a line through $(0, 0)$; otherwise, it is a parabolic curve.

We focus on pure “alpha views” – that is, long-short “cash-neutral” portfolios where the sum of holdings is zero. In this case $b = 0$ and $A = \mathbf{1}$ where

$$\mathbf{1} = \begin{pmatrix} 1 & \ldots & 1 \end{pmatrix}.$$
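A sketch checking that, with $A = \mathbf{1}$ and $b = 0$, the constrained formula delivers cash-neutral holdings (toy inputs as in the check above):

import numpy as np

rng = np.random.RandomState(1)
N, lam = 4, 1.0
G = rng.normal(size=(N, N))
V = G @ G.T + N * np.eye(N)
alpha = rng.normal(size=N)
A = np.ones(N)

invV = np.linalg.inv(V)
U = invV @ A                       # V^{-1} A^T for a single constraint
c = (A @ invV @ alpha) / (A @ U)   # (A V^{-1} A^T)^{-1} A V^{-1} alpha
h = lam * (invV @ alpha - U * c)   # with b = 0, only the speculative term remains
assert np.isclose(h.sum(), 0.0)    # cash-neutral by construction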

2.2 Industry momentum backtest

The setup for predicting industry returns is the following:

• the assets are industries;
• the return forecast $\alpha$ is estimated using rolling-window returns (over $L$ months, with $L = 12$) preceding a given date;
• no look-ahead bias: at each date, only information up to that date is used;
• such a strategy goes long past “winners” (industries with higher-than-average returns) and goes short past “losers” (industries with lower-than-average returns) ⇒ a Momentum strategy;
• this strategy is often implemented by skipping the most recent month to avoid the 1-month “reversal” effect (a sketch of the signal construction follows below).

See Moskowitz and Grinblatt (1999): “Do Industries Explain Momentum?,” Journal of Finance.
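In pandas, the signal construction can be sketched as follows (assuming ret is the date-by-industry DataFrame of monthly returns loaded below; the skip-month variant is shown for comparison):

L = 12
signal = ret.rolling(L).mean()                     # average return over the last L months
signal_skip = ret.shift(1).rolling(L - 1).mean()   # same, skipping the most recent month

# cross-sectional demeaning: long past winners, short past losers
signal_cs = signal.sub(signal.mean(axis=1), axis=0)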

2.2.1 Industry data

[5]: from ml4pmt.dataset import load_kf_returns

[6]: returns_data = load_kf_returns(cache_dir="data")

INFO:ml4pmt.dataset:logging from cache directory: data/12_Industry_Portfolios


Since the Moskowitz-Grinblatt paper was published in August 1999, we will keep the data after
1999 as out-of-sample and only use the data before 1999.

[7]: ret = returns_data["Monthly"]["Average_Value_Weighted_Returns"][:'1999']

Time convention: holdings $h_t$ and returns $r_t$ are known for period $t$ – i.e. at the end of period $t$:

• to compute the pnl without forward-looking information, the holdings must only depend on information up to $t - 1$;
• in practice, we will have

$$\text{pnl}_t = h_{t-1} \times r_t.$$
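A toy example of this convention (three periods of made-up returns and holdings for two assets):

import pandas as pd

r = pd.DataFrame({'A': [0.01, 0.02, -0.01], 'B': [0.00, -0.01, 0.03]})
h = pd.DataFrame({'A': [1.0, -1.0, 1.0], 'B': [-1.0, 1.0, -1.0]})

# holdings decided at the end of t-1 accrue the returns of period t
pnl = h.shift(1).mul(r).sum(axis=1)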

2.2.2 Backtesting functions

In the next helper file, we introduce three main functions:

• a compute_batch_holdings function that computes mean-variance holdings for batches of predictions;
• a MeanVariance class that follows the sklearn api;
• a fit_predict function to run rolling-window estimations.

[8]: %%writefile ../ml4pmt/backtesting.py

import numpy as np
import pandas as pd
from joblib import Parallel, delayed
from ml4pmt.metrics import sharpe_ratio
from sklearn.base import BaseEstimator, clone
from sklearn.model_selection import TimeSeriesSplit
from sklearn.utils.metaestimators import _safe_split


def compute_batch_holdings(pred, V, A=None, past_h=None, constant_risk=False):
    """
    Compute Markowitz holdings with return prediction "pred" and covariance matrix "V".

    pred: numpy array (shape N * K)
    V: numpy array (N * N)
    """
    N, _ = V.shape
    if isinstance(pred, pd.Series) | isinstance(pred, pd.DataFrame):
        pred = pred.values
    if pred.shape == (N,):
        pred = pred[:, None]
    elif pred.shape[1] == N:
        pred = pred.T

    invV = np.linalg.inv(V)
    if A is None:
        M = invV
    else:
        U = invV.dot(A)
        if A.ndim == 1:
            M = invV - np.outer(U, U.T) / U.dot(A)
        else:
            M = invV - U.dot(np.linalg.inv(U.T.dot(A)).dot(U.T))
    h = M.dot(pred)
    if constant_risk:
        h = h / np.sqrt(np.diag(h.T.dot(V.dot(h))))
    return h.T


class MeanVariance(BaseEstimator):
    def __init__(self, transform_V=None, A=None, constant_risk=True):
        if transform_V is None:
            self.transform_V = lambda x: np.cov(x.T)
        else:
            self.transform_V = transform_V
        self.A = A
        self.constant_risk = constant_risk

    def fit(self, X, y=None):
        self.V_ = self.transform_V(y)
        return self  # return self to follow the sklearn estimator convention

    def predict(self, X):
        if self.A is None:
            T, N = X.shape
            A = np.ones(N)
        else:
            A = self.A
        h = compute_batch_holdings(X, self.V_, A, constant_risk=self.constant_risk)
        return h

    def score(self, X, y):
        return sharpe_ratio(np.sum(X * y, axis=1))


class Backtester:
    def __init__(
        self,
        estimator,
        ret,
        max_train_size=36,
        test_size=1,
        start_date="1945-01-01",
        end_date=None,
    ):
        self.start_date = start_date
        self.end_date = end_date
        self.estimator = estimator
        self.ret = ret[: self.end_date]
        self.cv = TimeSeriesSplit(
            max_train_size=max_train_size,
            test_size=test_size,
            n_splits=1 + len(ret.loc[start_date:end_date]) // test_size,
        )

    def train(self, features, target):
        pred, estimators = fit_predict(
            self.estimator, features, target, self.ret, self.cv, return_estimator=True
        )
        self.estimators_ = estimators
        self.h_ = pred
        self.pnl_ = (
            pred.shift(1).mul(self.ret).sum(axis=1)[self.start_date : self.end_date]
        )
        return self


def _fit_predict(estimator, X, y, train, test, return_estimator=False):
    X_train, y_train = _safe_split(estimator, X, y, train)
    X_test, _ = _safe_split(estimator, X, y, test, train)
    estimator.fit(X_train, y_train)
    if return_estimator:
        return estimator.predict(X_test), estimator
    else:
        return estimator.predict(X_test)


def fit_predict(
    estimator,
    features,
    target,
    ret,
    cv,
    return_estimator=False,
    verbose=0,
    pre_dispatch="2*n_jobs",
    n_jobs=1,
):
    parallel = Parallel(n_jobs=n_jobs, verbose=verbose, pre_dispatch=pre_dispatch)
    res = parallel(
        delayed(_fit_predict)(
            clone(estimator), features, target, train, test, return_estimator
        )
        for train, test in cv.split(ret)
    )
    if return_estimator:
        pred, estimators = zip(*res)
    else:
        pred = res
    cols = ret.columns
    idx = ret.index[np.concatenate([test for _, test in cv.split(ret)])]
    if return_estimator:
        return pd.DataFrame(np.concatenate(pred), index=idx, columns=cols), estimators
    else:
        return pd.DataFrame(np.concatenate(pred), index=idx, columns=cols)

Overwriting ../ml4pmt/backtesting.py

[9]: from ml4pmt.backtesting import compute_batch_holdings, MeanVariance, Backtester

[10]: T, N = ret.shape
A = np.ones(N)

[11]: h = compute_batch_holdings(ret.mean(), ret.cov(), A, past_h=None)

[12]: np.allclose(h.dot(A), [0.])

[12]: True

[13]: A = np.stack([np.ones(N), np.zeros(N)], axis=1)
A[0, 1] = 1

[14]: h = compute_batch_holdings(pred=ret.mean(), V=ret.cov(), A=A, past_h=None)

[15]: np.allclose(h.dot(A), [0., 0.])

[15]: True

2.2.3 Scikit-Learn TimeSeriesSplit

[16]: from sklearn.model_selection import TimeSeriesSplit

Given that the data is monthly, we re-estimate the model every month. This is done by choosing
the parameter n_splits in the class TimeSeriesSplit as the number of months.

[17]: start_date = "1945-01-01"
test_size = 1
params = dict(max_train_size=36, test_size=test_size, gap=0)
params["n_splits"] = 1 + len(ret.loc[start_date:]) // test_size

cv = TimeSeriesSplit(**params)

More precisely, with TimeSeriesSplit:

• the test indices are the dates for which the holdings are computed;
• the train indices are the date range over which a forecasting model is trained;
• gap could be set to −1 because the model can have at most the same information as when making the portfolio decision for the first date in the test window;
• in practice, the target will be shifted by −1 and gap is set to 0;
• we can estimate batches with test_size > 1;
• n_splits is determined so that the backtest starts (just) before a certain start date; a standalone illustration follows below.
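A standalone illustration on a toy array (the test_size and gap arguments assume a recent scikit-learn version, 0.24 or later):

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20)
cv = TimeSeriesSplit(n_splits=4, max_train_size=6, test_size=2, gap=0)
for train, test in cv.split(X):
    print(train, test)   # each test window advances by test_size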

[18]: for train, test in cv.split(ret):
    break
ret.iloc[train].index[-1], ret.iloc[test].index[0]

[18]: (Timestamp('1944-11-01 00:00:00'), Timestamp('1944-12-01 00:00:00'))

2.3 Empirical results

2.3.1 Cumulative pnl

[19]: transform_X = lambda x: x.rolling(12).mean().values
transform_y = lambda x: x.shift(-1).values
features = transform_X(ret)
target = transform_y(ret)

[20]: _h = []
for train, test in cv.split(ret):
    m = MeanVariance()
    m.fit(features[train], target[train])
    _h += [m.predict(features[test])]

cols = ret.columns
idx = ret.index[np.concatenate([test for _, test in cv.split(ret)])]
h = pd.DataFrame(np.concatenate(_h), index=idx, columns=cols)

[21]: pnl = h.shift(1).mul(ret).sum(axis=1)[start_date:]
line(pnl.rename('Industry momentum'), cumsum=True)

[22]: m = Backtester(estimator=MeanVariance(), ret=ret)
m.train(features, target)
h.equals(m.h_), pnl.equals(m.pnl_)

[22]: (True, True)

2.3.2 Other backtest statistics

We can also extract information from the estimators – e.g. in this simple case, recover the covariance matrix fitted by the class MeanVariance().

[23]: estimators = m.estimators_

[24]: V_mean = pd.DataFrame(sum([m.V_ for m in estimators]) / len(estimators),
                      ret.columns, ret.columns)
heatmap(V_mean, title='Average covariance matrix')



[25]: m = Backtester(estimator=MeanVariance(), ret=ret)

pnls = {}
for window in [6, 12, 24, 36]:
    features_ = ret.rolling(window).mean().values
    m.train(features_, target)
    pnls[window] = m.pnl_
line(pnls, cumsum=True, start_date='1945',
     title='Cumulative pnl for different look-back windows (in month)')

[27]: from ml4pmt.metrics import sharpe_ratio

sr = {i: h.shift(1 + i).mul(ret).sum(axis=1).pipe(sharpe_ratio) for i in range(-10, 12)}
bar(sr, baseline=0, sort=False, title='Lead-lag sharpe ratio')

The off-the-top approach is to remove an asset from the tradable set and check whether the
portfolio sharpe ratio decreases (in which case, this asset is a contributor) or increases (in which
case, this asset is a detractor).

[30]: pnls_ott = {}
for c in ret.columns:
    ret_ = ret.drop(c, axis=1)
    features_ = transform_X(ret_)
    target_ = transform_y(ret_)
    pnl_ = Backtester(estimator=MeanVariance(), ret=ret_).train(features_, target_).pnl_
    pnls_ott[c] = pnl_.pipe(sharpe_ratio)

pnls_ott["ALL"] = pnl.pipe(sharpe_ratio)

[31]: bar(pnls_ott, baseline="ALL", title='Industry momentum off-the-top')


Chapter 3

Linear estimators

In this section, we take advantage of some of scikit-learn's powerful features, such as the Pipeline.

3.1 Ridge / Lasso / Elastic net

Ridge regression: the betas $\langle \beta_1, \ldots, \beta_p \rangle$ are chosen to minimize

$$\frac{1}{2} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2.$$

The Ridge regression provides more stable and accurate estimates than a standard residual-sum-of-squares minimization, as the sketch below illustrates.
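To see this, a small sketch with two nearly collinear features (synthetic data, arbitrary seed):

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
x = rng.normal(size=(100, 1))
X = np.hstack([x, x + 1e-3 * rng.normal(size=(100, 1))])   # nearly collinear columns
y = X.sum(axis=1) + rng.normal(size=100)

coef_ols = LinearRegression().fit(X, y).coef_    # typically large, offsetting coefficients
coef_ridge = Ridge(alpha=1.0).fit(X, y).coef_    # shrunk towards each other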
Lasso regression: the betas $\langle \beta_1, \ldots, \beta_p \rangle$ are chosen to minimize

$$\frac{1}{2} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|.$$

The Lasso tends to promote sparse and stable models that are more easily interpretable.

Elastic net: the betas $\langle \beta_1, \ldots, \beta_p \rangle$ are chosen to minimize

$$\frac{1}{2} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \left[ (1 - \alpha) \beta_j^2 + \alpha |\beta_j| \right].$$

“The lasso penalty is not very selective in the choice among a set of strong but correlated predictors, and the ridge penalty is inclined to shrink the coefficients of correlated variables towards each other. The compromise in the elastic net could cause the highly correlated features to be averaged while encouraging a parsimonious model.”
To give an example, we use a diabetes dataset provided by sklearn.

[2]: from sklearn import datasets


[3]: X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)

[4]: from sklearn.linear_model import lasso_path, enet_path
from sklearn import datasets

X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)
X /= X.std(axis=0)
eps = 5e-3

[5]: alphas_lasso, coefs_lasso, _ = lasso_path(X, y, eps=eps)

l1_ratio = 0.5
alphas_enet, coefs_enet, _ = enet_path(X, y, eps=eps, l1_ratio=l1_ratio)

[6]: fig, ax = plt.subplots(1, 2, figsize=(20, 6))
fig.suptitle('Coefficients as a function of the shrinkage factor (in log)')
line(pd.DataFrame(coefs_lasso.T, -1 * np.log(alphas_lasso), columns=X.columns),
     title='Lasso', ax=ax[0])
line(pd.DataFrame(coefs_enet.T, -1 * np.log(alphas_enet), columns=X.columns),
     title=f'Elastic net (l1_ratio={l1_ratio})', ax=ax[1])

[9]: from ml4pmt.estimators import LinearRegression, Ridge

3.2 Scikit-learn Pipeline and Multi-output

[7]: from sklearn.pipeline import make_pipeline


from sklearn.preprocessing import StandardScaler, PolynomialFeatures

The API of scikit-learn is organised around the fit, predict and transform methods. More precisely, estimators like linear regression or Ridge regression can handle multiple outputs, but they do not have a transform method. We add this method so that they can be used as intermediate steps of a Pipeline.

[30]: %%writefile ../ml4pmt/estimators.py

from sklearn.linear_model import LinearRegression, Ridge, RidgeCV
from sklearn.multioutput import MultiOutputRegressor
from sklearn.neural_network import MLPRegressor


class LinearRegression(LinearRegression):
    def transform(self, X):
        return self.predict(X)


class Ridge(Ridge):
    def transform(self, X):
        return self.predict(X)


class RidgeCV(RidgeCV):
    def transform(self, X):
        return self.predict(X)


class MultiOutputRegressor(MultiOutputRegressor):
    def transform(self, X):
        return self.predict(X)


class MLPRegressor(MLPRegressor):
    def transform(self, X):
        return self.predict(X)

Overwriting ../ml4pmt/estimators.py

3.3 Predicting industry returns with linear models

[10]: returns_data = load_kf_returns(cache_dir='data')
ret = returns_data['Monthly']['Average_Value_Weighted_Returns'][:'1999']

transform_X = lambda x: x.rolling(12).mean().fillna(0).values
transform_y = lambda x: x.shift(-1).values
features = transform_X(ret)
target = transform_y(ret)

INFO:ml4pmt.dataset:logging from cache directory: data/12_Industry_Portfolios



[11]: m = Backtester(MeanVariance(), ret).train(features, target)
pnls = {'momentum': m.pnl_}

3.3.1 Linear Regression

It is always a good idea to start with a linear regression as a benchmark.

[12]: estimator = make_pipeline(LinearRegression(), MeanVariance())

[13]: m = Backtester(estimator, ret).train(features, target)
pnls['linear_regression'] = m.pnl_
line(pnls['linear_regression'], cumsum=True, title='Linear Regression')

The linear regression fits an intercept and some coefficients.

[14]: ols_ = m.estimators_[0].named_steps['linearregression']
coef_ = ols_.coef_
intercept_ = ols_.intercept_
vec = ret.mean().values
np.allclose(ols_.predict(vec[None, :]), coef_.dot(vec) + intercept_)

[14]: True

[16]: coefs_ = [m.named_steps['linearregression'].coef_ for m in m.estimators_]
coefs_mean = pd.DataFrame(sum(coefs_) / len(coefs_), ret.columns, ret.columns).T

[17]: heatmap(coefs_mean.loc[coefs_mean.mean(1).sort_values().index,
                       coefs_mean.mean(1).sort_values().index],
        title='Average linear regression coefficients (x-axis: predictors, y-axis=targets)')

[18]: pnls_ = {}
for hl in tqdm([6, 12, 24]):
    features_ = ret.ewm(halflife=hl).mean().fillna(0).values
    pnls_[hl] = Backtester(estimator, ret).train(features_, target).pnl_
line(pnls_, cumsum=True, title='Robustness on feature half-lives')


[19]: pnls_ = {}
for hl in [6, 12, 24]:
    features_ = ret.rolling(window=hl).mean().fillna(0).values
    pnls_[hl] = Backtester(estimator, ret).train(features_, target).pnl_
line(pnls_, cumsum=True, title='Robustness on features with rolling windows')

3.3.2 Ridge

[21]: estimator = make_pipeline(StandardScaler(with_mean=False),
                          Ridge(),
                          MeanVariance())

pnls['ridge'] = Backtester(estimator, ret).train(features, target).pnl_
line(pnls['ridge'], cumsum=True, title='Ridge')

[22]: pnls_ = {}
for alpha in [0.1, 1, 10, 100]:
    estimator_ = make_pipeline(StandardScaler(with_mean=False),
                               Ridge(alpha=alpha),
                               MeanVariance())
    pnls_[alpha] = Backtester(estimator_, ret).train(features_, target).pnl_
line(pnls_, cumsum=True, title='Ridge: Robustness on alpha')

3.3.3 Ridge with feature expansion

We can expand the set of features by using polynomial transforms with PolynomialFeatures.

[23]: PolynomialFeatures(degree=2).fit_transform(ret.iloc[:10]).shape

[23]: (10, 91)

Number of new features: the intercept (=1), the initial features (=12), the squared features (=12), and all cross products of pairs of features (=12 × 11/2 = 66):

[24]: estimator = make_pipeline(StandardScaler(with_mean=False),
                          PolynomialFeatures(degree=2),
                          Ridge(alpha=100),
                          MeanVariance())

[25]: print(f'Number of features generated by degree=2: {1+ 12 + 12 + 6 * 11}')

Number of features generated by degree=2: 91

[26]: pnls['ridge_with_feature_expansion'] = Backtester(estimator, ret).train(features_, target).pnl_

line(pnls['ridge_with_feature_expansion'], cumsum=True, title='Ridge with feature extension')

[27]: pnls_ = {}
for alpha in [0.1, 1, 10, 100, 1000]:
    estimator_ = make_pipeline(StandardScaler(with_mean=False),
                               PolynomialFeatures(degree=2),
                               Ridge(alpha=alpha),
                               MeanVariance())
    pnls_[alpha] = Backtester(estimator_, ret).train(features_, target).pnl_
line(pnls_, cumsum=True, title='Ridge with feature expansion: Robustness on alpha')

Putting all the types of linear predictors together, we can compare the cumulative pnls in the
graph below.

[28]: line(pd.concat(pnls, axis=1).assign(ALL = lambda x: x.mean(axis=1)), cumsum=True)


Chapter 4

Appendix: helper functions

Here we create some helper functions that will be used across notebooks, using the %%writefile magic.
4.1 Metrics

[18]: %%writefile ../ml4pmt/metrics.py

import numpy as np


def test_monthly(df):
    return int(len(df) / len(df.asfreq("M"))) == 1


def test_bday(df):
    return int(len(df) / len(df.asfreq("B"))) == 1


def test_day(df):
    return int(len(df) / len(df.asfreq("D"))) == 1


def sharpe_ratio(df, num_period_per_year=None):
    # infer the annualisation factor from the index frequency when it is not provided
    if num_period_per_year is None:
        if test_monthly(df):
            num_period_per_year = 12
        if test_bday(df):
            num_period_per_year = 260
        if test_day(df):
            num_period_per_year = 365
    if num_period_per_year is None:
        return np.nan
    else:
        return df.mean() / df.std() * np.sqrt(num_period_per_year)


def drawdown(x, return_in_risk_unit=True, window=36, num_period_per_year=12):
    # drawdown = cumulative pnl minus its running maximum
    dd = x.cumsum().sub(x.cumsum().cummax())
    if return_in_risk_unit:
        return dd.div(x.rolling(window).std().mul(np.sqrt(num_period_per_year)))
    else:
        return dd

Overwriting ../ml4pmt/metrics.py
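A quick usage sketch of these metrics on synthetic monthly data (random numbers, so the values themselves are meaningless):

import numpy as np
import pandas as pd
from ml4pmt.metrics import sharpe_ratio, drawdown

idx = pd.date_range('1990-01-31', periods=120, freq='M')
pnl = pd.Series(np.random.normal(0.05, 1.0, size=120), index=idx)

sharpe_ratio(pnl)      # annualised with sqrt(12) since the index is monthly
drawdown(pnl).tail()   # drawdown in rolling-risk units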

4.2 Data visualisation

Data exploration, in particular based on visualisation, is crucial to modern data science. Pandas has a lot of plotting functionality (e.g. see the graph below), but we will find it useful to use a custom set of plotting functions.

[19]: %%writefile ../ml4pmt/plot.py

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from ml4pmt.metrics import sharpe_ratio

plt.style.use("seaborn-whitegrid")


def line(
    df,
    sort=True,
    figsize=(8, 5),
    ax=None,
    title="",
    cumsum=False,
    loc="center left",
    bbox_to_anchor=(1, 0.5),
    legend_sharpe_ratio=None,
    legend=True,
    yscale=None,
    start_date=None,
):
    if loc == "best":
        bbox_to_anchor = None
    if isinstance(df, dict):
        df = pd.concat(df, axis=1)
    if isinstance(df, pd.Series):
        df = df.to_frame()
    if start_date is not None:
        df = df[start_date:]
    if cumsum & (legend_sharpe_ratio is None):
        legend_sharpe_ratio = True
    if legend_sharpe_ratio:
        df.columns = [f"{c}: sr={sharpe_ratio(df[c]): 3.2f}" for c in df.columns]
    if cumsum:
        df = df.cumsum()
    if sort:
        df = df.loc[:, lambda x: x.iloc[-1].sort_values(ascending=False).index]
    if ax is None:
        fig, ax = plt.subplots(1, 1, figsize=figsize)
    ax.plot(df.index, df.values)
    if legend:
        ax.legend(df.columns, loc=loc, bbox_to_anchor=bbox_to_anchor)
    ax.set_title(title)
    if yscale == "log":
        ax.set_yscale("log")


def bar(
    df,
    err=None,
    sort=True,
    figsize=(8, 5),
    ax=None,
    title="",
    horizontal=False,
    baseline=None,
    rotation=0,
):
    if isinstance(df, pd.DataFrame):
        df = df.squeeze()
    if isinstance(df, dict):
        df = pd.Series(df)
    if sort:
        df = df.sort_values()
    if err is not None:
        err = err.loc[df.index]
    labels = df.index
    x = np.arange(len(labels))
    if ax is None:
        fig, ax = plt.subplots(1, 1, figsize=figsize)
    if horizontal:
        ax.barh(x, df.values, xerr=err, capsize=5)
        ax.set_yticks(x)
        ax.set_yticklabels(labels, rotation=rotation)
        if baseline in df.index:
            df_ = df.copy()
            df_[df.index != baseline] = 0
            ax.barh(x, df_.values, color="lightgreen")
    else:
        ax.bar(x, df.values, yerr=err, capsize=5)
        ax.set_xticks(x)
        ax.set_xticklabels(labels, rotation=rotation)
        if baseline in df.index:
            df_ = df.copy()
            df_[df.index != baseline] = 0
            ax.bar(x, df_.values, color="lightgreen")
    ax.set_title(title)


def heatmap(
    df,
    ax=None,
    figsize=(8, 5),
    title="",
    vmin=None,
    vmax=None,
    vcompute=True,
    cmap="RdBu",
):
    labels_x = df.index
    x = np.arange(len(labels_x))
    labels_y = df.columns
    y = np.arange(len(labels_y))
    if vcompute:
        vmax = df.abs().max().max()
        vmin = -vmax
    if ax is None:
        fig, ax = plt.subplots(1, 1, figsize=figsize)
    pos = ax.imshow(
        df.T.values, cmap=cmap, interpolation="nearest", vmax=vmax, vmin=vmin
    )
    ax.set_xticks(x)
    ax.set_yticks(y)
    ax.set_xticklabels(labels_x, rotation=90)
    ax.set_yticklabels(labels_y)
    ax.set_title(title)
    ax.grid(True)
    # use the parent figure of the axes (fig is undefined when ax is passed in)
    ax.figure.colorbar(pos, ax=ax)


def scatter(
    df,
    xscale=None,
    yscale=None,
    xlabel=None,
    ylabel=None,
    xticks=None,
    yticks=None,
    figsize=(8, 5),
    title=None,
):
    fig, ax = plt.subplots(1, 1, figsize=figsize)
    ax.scatter(df, df.index, facecolors="none", edgecolors="b", s=50)
    if xlabel is not None:
        ax.set_xlabel(xlabel)
    if ylabel is not None:
        ax.set_ylabel(ylabel)
    if xscale is not None:
        ax.set_xscale(xscale)
    if yscale is not None:
        ax.set_yscale(yscale)
    if yticks is not None:
        ax.set_yticks(yticks)
        ax.set_yticklabels(yticks)
    if xticks is not None:
        ax.set_xticks(xticks)
        ax.set_xticklabels(xticks)
    ax.set_title(title)

Overwriting ../ml4pmt/plot.py

[3]: from ml4pmt.plot import bar, line, heatmap

[6]: line(pd.Series(np.random.normal(size=50)), cumsum=True, title="This is a graph",
     legend_sharpe_ratio=False)

[7]: bar(pd.Series(np.random.normal(size=50)), baseline=10, horizontal=True)

4.3 Dataset

In particular, we look at two datasets:

• Ken French’s data library (https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html).

• Berkshire Hathaway

[17]: %%writefile ../ml4pmt/dataset.py

import logging
import os
import sys
from io import BytesIO
from pathlib import Path
from zipfile import ZipFile

import numpy as np
import pandas as pd
import requests

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logger = logging.getLogger(__name__)


def clean_kf_dataframes(df, multi_df=False):
    """
    Extract the annual and monthly dataframes from the csv file with its specific formatting.
    """
    idx = [-2] + list(np.where(df.notna().sum(axis=1) == 0)[0])
    if multi_df:
        cols = [" Average Value Weighted Returns -- Monthly"] + list(
            df.loc[df.notna().sum(axis=1) == 0].index
        )
    returns_data = {"Annual": {}, "Monthly": {}}
    for i in range(len(idx)):
        if multi_df:
            c_ = (
                cols[i]
                .replace("-- Annual", "")
                .replace("-- Monthly", "")
                .strip()
                .replace("/", " ")
                .replace(" ", "_")
            )
        if i != len(idx) - 1:
            v = df.iloc[idx[i] + 2 : idx[i + 1] - 1].astype(float)
            v.index = v.index.str.strip()
            if len(v) != 0:
                if len(v.index[0]) == 6:
                    v.index = pd.to_datetime(v.index, format="%Y%m")
                    if multi_df:
                        returns_data["Monthly"][c_] = v
                    else:
                        returns_data["Monthly"] = v
                    continue
                if len(v.index[0]) == 4:
                    v.index = pd.to_datetime(v.index, format="%Y")
                    if multi_df:
                        returns_data["Annual"][c_] = v
                    else:
                        returns_data["Annual"] = v
    return returns_data


def load_kf_returns(
    filename="12_Industry_Portfolios", cache_dir=None, force_reload=False
):
    """
    Load industry returns from Ken French's website:
    https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html
    """
    if filename == "12_Industry_Portfolios":
        skiprows, multi_df = 11, True
    if filename == "F-F_Research_Data_Factors":
        skiprows, multi_df = 3, False
    if filename == "F-F_Momentum_Factor":
        skiprows, multi_df = 13, False
    if cache_dir is None:
        cache_dir = Path(os.getcwd()) / "data"
    if isinstance(cache_dir, str):
        cache_dir = Path(cache_dir)
    output_dir = cache_dir / filename
    if output_dir.is_dir() and not force_reload:
        logger.info(f"logging from cache directory: {output_dir}")
        returns_data = load_dict(output_dir)
    else:
        logger.info("loading from external source")
        path = (
            "http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/"
            + filename
            + "_CSV.zip"
        )
        r = requests.get(path)
        files = ZipFile(BytesIO(r.content))
        df = pd.read_csv(files.open(filename + ".CSV"), skiprows=skiprows, index_col=0)
        returns_data = clean_kf_dataframes(df, multi_df=multi_df)
        logger.info(f"saving in cache directory {output_dir}")
        save_dict(returns_data, output_dir)
    return returns_data


def load_buffets_data(cache_dir=None, force_reload=False):
    if cache_dir is None:
        cache_dir = Path(os.getcwd()) / "data"
    if isinstance(cache_dir, str):
        cache_dir = Path(cache_dir)
    filename = cache_dir / "ffdata_brk13f.parquet"
    if filename.is_file() and not force_reload:
        logger.info(f"logging from cache directory: {filename}")
        df = pd.read_parquet(filename)
    else:
        logger.info("loading from external source")
        path = "https://github.com/slihn/buffetts_alpha_R/archive/master.zip"
        r = requests.get(path)
        files = ZipFile(BytesIO(r.content))
        df = pd.read_csv(
            files.open("buffetts_alpha_R-master/ffdata_brk13f.csv"), index_col=0
        )
        df.index = pd.to_datetime(df.index, format="%m/%d/%Y")
        logger.info(f"saving in cache directory {filename}")
        df.to_parquet(filename)
    return df


symbol_dict = {
    "TOT": "Total",
    "XOM": "Exxon",
    "CVX": "Chevron",
    "COP": "ConocoPhillips",
    "VLO": "Valero Energy",
    "MSFT": "Microsoft",
    "IBM": "IBM",
    "TWX": "Time Warner",
    "CMCSA": "Comcast",
    "CVC": "Cablevision",
    "YHOO": "Yahoo",
    "DELL": "Dell",
    "HPQ": "HP",
    "AMZN": "Amazon",
    "TM": "Toyota",
    "CAJ": "Canon",
    "SNE": "Sony",
    "F": "Ford",
    "HMC": "Honda",
    "NAV": "Navistar",
    "NOC": "Northrop Grumman",
    "BA": "Boeing",
    "KO": "Coca Cola",
    "MMM": "3M",
    "MCD": "McDonald's",
    "PEP": "Pepsi",
    "K": "Kellogg",
    "UN": "Unilever",
    "MAR": "Marriott",
    "PG": "Procter Gamble",
    "CL": "Colgate-Palmolive",
    "GE": "General Electrics",
    "WFC": "Wells Fargo",
    "JPM": "JPMorgan Chase",
    "AIG": "AIG",
    "AXP": "American express",
    "BAC": "Bank of America",
    "GS": "Goldman Sachs",
    "AAPL": "Apple",
    "SAP": "SAP",
    "CSCO": "Cisco",
    "TXN": "Texas Instruments",
    "XRX": "Xerox",
    "WMT": "Wal-Mart",
    "HD": "Home Depot",
    "GSK": "GlaxoSmithKline",
    "PFE": "Pfizer",
    "SNY": "Sanofi-Aventis",
    "NVS": "Novartis",
    "KMB": "Kimberly-Clark",
    "R": "Ryder",
    "GD": "General Dynamics",
    "RTN": "Raytheon",
    "CVS": "CVS",
    "CAT": "Caterpillar",
    "DD": "DuPont de Nemours",
}


def load_sklearn_stock_returns(cache_dir=None, force_reload=False):
    if cache_dir is None:
        cache_dir = Path(os.getcwd()) / "data"
    if isinstance(cache_dir, str):
        cache_dir = Path(cache_dir)
    filename = cache_dir / "sklearn_returns.parquet"
    if filename.is_file() and not force_reload:
        logger.info(f"logging from cache directory: {filename}")
        df = pd.read_parquet(filename)
    else:
        logger.info("loading from external source")
        url = "https://raw.githubusercontent.com/scikit-learn/examples-data/master/financial-data"
        df = (
            pd.concat(
                {
                    c: pd.read_csv(f"{url}/{c}.csv", index_col=0, parse_dates=True)[
                        "close"
                    ].diff()
                    for c in symbol_dict.keys()
                },
                axis=1,
            )
            .asfreq("B")
            .iloc[1:]
        )
        logger.info(f"saving in cache directory {filename}")
        df.to_parquet(filename)
    return df


def save_dict(data, output_dir):
    assert isinstance(data, dict)
    if output_dir.is_dir() is False:
        os.mkdir(output_dir)
    for k, v in data.items():
        if isinstance(v, pd.DataFrame):
            v.to_parquet(output_dir / f"{k}.parquet")
        else:
            save_dict(v, output_dir=output_dir / k)


def load_dict(input_dir):
    data = {}
    for o in os.scandir(input_dir):
        if o.name.endswith(".parquet"):
            k = o.name.replace(".parquet", "")
            data[k] = pd.read_parquet(o)
        elif o.is_dir():  # is_dir is a method: call it (the bare bound method is always truthy)
            data[o.name] = load_dict(o)
    return data

Overwriting ../ml4pmt/dataset.py

4.3.1 Ken French data: industry returns

[9]: from ml4pmt.dataset import load_kf_returns, load_sklearn_stock_returns, load_buffets_data

[10]: %%time
returns_data = load_kf_returns(
filename="12_Industry_Portfolios", cache_dir="data", force_reload=True
)

INFO:ml4pmt.dataset:loading from external source


INFO:ml4pmt.dataset:saving in cache directory data/12_Industry_Portfolios
CPU times: user 277 ms, sys: 23.1 ms, total: 300 ms
Wall time: 1.48 s
Reloading from a cache directory is faster!

[11]: %%time
returns_data = load_kf_returns(
filename="12_Industry_Portfolios", cache_dir="data", force_reload=False
)

INFO:ml4pmt.dataset:logging from cache directory: data/12_Industry_Portfolios


CPU times: user 26.1 ms, sys: 3.09 ms, total: 29.2 ms
Wall time: 28.4 ms

[12]: returns_data_SMB_HML = load_kf_returns(
    filename="F-F_Research_Data_Factors", cache_dir="data"
)

INFO:ml4pmt.dataset:logging from cache directory: data/F-F_Research_Data_Factors

[13]: returns_data_MOM = load_kf_returns(filename="F-F_Momentum_Factor", cache_dir="data")

INFO:ml4pmt.dataset:logging from cache directory: data/F-F_Momentum_Factor



4.3.2 Stock returns (2003-2007)

[14]: %%time
returns_data = load_sklearn_stock_returns(cache_dir="data", force_reload=True)

INFO:ml4pmt.dataset:loading from external source


INFO:ml4pmt.dataset:saving in cache directory data/sklearn_returns.parquet
CPU times: user 1.08 s, sys: 63.5 ms, total: 1.14 s
Wall time: 19.8 s

[15]: from ml4pmt.metrics import sharpe_ratio
from ml4pmt.dataset import symbol_dict

start_date, end_date = returns_data.index[0].strftime('%Y-%m-%d'), returns_data.index[-1].strftime('%Y-%m-%d')

df = returns_data.pipe(sharpe_ratio).rename(index=symbol_dict).sort_values()\
    .pipe(lambda x: pd.concat([x.head(), x.tail()]))
bar(df, horizontal=True, title=f'Annualized stock sharpe ratio: {start_date} to {end_date}')

4.3.3 13F Berkshire Hathaway

[16]: %%time
df = load_buffets_data(cache_dir="data", force_reload=True)

INFO:ml4pmt.dataset:loading from external source


INFO:ml4pmt.dataset:saving in cache directory data/ffdata_brk13f.parquet

CPU times: user 60.8 ms, sys: 6.95 ms, total: 67.8 ms
Wall time: 914 ms
Bibliography

J.-P. Bouchaud, J. Bonart, J. Donier, and M. Gould. Trades, Quotes and Prices: Financial Markets Under the Microscope. Cambridge University Press, 2018.

A. Frazzini, D. Kabiller, and L. H. Pedersen. Buffett's alpha. Financial Analysts Journal, 74(4):35–55, 2018.

T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, volume 2. Springer, 2009.

M. Isichenko. Quantitative Portfolio Management: The Art and Science of Statistical Arbitrage. Wiley, 2022.

S. Jansen. Machine Learning for Algorithmic Trading: Predictive Models to Extract Signals from Market and Alternative Data for Systematic Trading Strategies with Python. Packt Publishing Ltd, 2020.

O. Ledoit and M. Wolf. Honey, I shrunk the sample covariance matrix. The Journal of Portfolio Management, 30(4):110–119, 2004.

A. W. Lo. The statistics of Sharpe ratios. Financial Analysts Journal, 58(4):36–52, 2002.

A. W. Lo and A. C. MacKinlay. When are contrarian profits due to stock market overreaction? The Review of Financial Studies, 3(2):175–205, 1990.

T. J. Moskowitz and M. Grinblatt. Do industries explain momentum? The Journal of Finance, 54(4):1249–1290, 1999.

K. P. Murphy. Probabilistic Machine Learning: An Introduction. MIT Press, 2022.