
Machine-learning for Portfolio Management and Trading:

An Introduction with Scikit-Learn

Sylvain Champonnois1
March 28, 2022

1 Email: [email protected]. This pdf was prepared for the class Data Science applied to Finance, MSc Data Science for Business X-HEC.
Contents

1 Introduction
  1.1 Outline
  1.2 Trends
    1.2.1 Data deluge
    1.2.2 Data scouting
    1.2.3 ML research
  1.3 MLOps
    1.3.1 Pipelines
    1.3.2 Scikit-learn
    1.3.3 Backtesting
  1.4 The business model of hedge funds

2 Backtesting
  2.1 Markowitz portfolio optimisation
  2.2 Industry momentum backtest
    2.2.1 Industry data
    2.2.2 Backtesting functions
    2.2.3 Scikit-Learn TimeSeriesSplit
  2.3 Empirical results
    2.3.1 Cumulative pnl
    2.3.2 Other backtest statistics

3 Linear estimators
  3.1 Ridge / Lasso / Elastic net
  3.2 Scikit-learn Pipeline and Multi-output
  3.3 Predicting industry returns with linear models
    3.3.1 Linear Regression
    3.3.2 Ridge
    3.3.3 Ridge with feature expansion

4 Appendix: helper functions
  4.1 Metrics
  4.2 Data visualisation
  4.3 Dataset
    4.3.1 Ken French data: industry returns
    4.3.2 Stock returns (2003-2007)
    4.3.3 13F Berkshire Hathaway

Chapter 1

Introduction

This material is an introduction to using machine-learning for portfolio management and trading. Given the centrality of programming in hedge funds today, the concepts are presented using only jupyter notebooks in python. Moreover, we leverage the scikit-learn package (also known as sklearn) to illustrate how machine-learning is used in practice in this context.

1.1 Outline

Outline:

1. Backtesting: Markowitz portfolio optimisation, industry momentum

2. Risk: risk-model shrinkage, non-normal returns

3. Linear models: pipelines, Ridge, Lasso, feature engineering

4. Non-linear models: Boosted trees, Multi-layer perceptron

5. Factors: Value, Momentum, Style Analysis

6. Overfitting: hyperparameter search, validation strategy

7. Transaction costs: turnover, leverage, portfolio optimisation with constraints

8. Mean reversion: profitability of liquidity provision, survivorship-free sample

We are interested in how quantitative hedge funds operate in practice. Today, quantitative hedge funds are essentially consumers of data – they ingest all sorts of datasets and extract information used to systematically buy or sell securities. Researchers and portfolio managers are deeply involved in the process of data ingestion and information extraction, but they do not directly decide which securities are bought or sold – the algorithms do.

Because these processes of data ingestion and information extraction are so central to quantitative hedge fund operations, these funds have become software companies – a lot of the intellectual property (IP) of hedge funds is embedded in the code they write. In that sense, hedge funds are not so different from other data-science-based technology companies. (And in fact the hiring has become very similar, with a lot of interest in profiles from Computer Science, Machine Learning, Data Engineering, Statistics, etc.)

In the rest of this section, we first describe how new datasets have become available and what this implies for machine-learning research and, more particularly, for quantitative hedge funds. We then introduce the structure of this course. In particular, given that this course focuses on code (that is, python code), it is written entirely in jupyter notebooks.

1.2 Trends

1.2.1 Data deluge

There are now sensors everywhere in the physical world and most online interactions are tracked – leading to a “data deluge” (e.g. see Mary Meeker (2018) on internet trends or Lori Lewis (2022)).

1.2.2 Data scouting

Alternative data = web + credit card transactions + geolocation + satellite imaging

Two examples of how data is transforming hedge funds, from the Financial Times and Bloomberg: FT (08/28/2017) and Bloomberg (06/15/2019).

1.2.3 ML research

ML research has become a race, with new ideas coming out at an increasing speed – e.g. as illustrated by the number of papers published on the scientific paper repository arxiv.org (Jeff Dean (06/02/2019)). More precisely, as Francois Chollet (04/03/2019) points out, the issue is how to process (i.e. test empirically) these new ideas as fast as possible to gain a competitive edge.

The success of deep-learning depends on: i) model “capacity”; ii) computational power; iii) dataset size. Sun, Shrivastava, Singh and Gupta (2017) note that the size of the largest dataset has remained somewhat constant over the last few years.

A particular success of deep-learning has been in Natural Language Processing (NLP), and there too, the size of the largest models has increased dramatically (Victor Sanh, Lysandre Debut, Julien Chaumond and Thomas Wolf (2019)).

1.3 MLOps

MLOps (machine-learning operations) is a set of practices for the deployment of ML models in production. For quant hedge funds, there are two main concepts, which we describe here:
• pipelines
• backtests

1.3.1 Pipelines

Pipeline:
A machine-learning pipeline is an end-to-end description of the automated flow of
data from raw inputs to a desired output. Each step represents a transformation of the
data, possibly with a fitted model.
The diagram below illustrates a pipeline for a quant fund. The end point (to the right of the diagram) is the holdings in a set of traded securities – combined with the returns on these securities, the pnl of a given strategy can be computed. The entry point (to the left of the diagram) is a set of features. A set of transformations (pre-determined in the pipeline) is applied to these features to produce the desired holdings. Some transformations in the pipeline are “fixed” while others depend on fitted models (e.g. a ML predictor of returns or a risk model).

In the diagram, we emphasize the timing of these different objects:

• for a pnl at time $t$, the features and target include only information up to $t - 1$, so that the holdings are known at $t - 1$ and can accrue returns over period $t$.

The following equation summarizes this point:

$$\text{pnl}_t = \text{holdings}_{t-1} \times \text{returns}_t.$$

1.3.2 Scikit-learn

The following notebooks and notes are largely based on scikit-learn, an extremely powerful (and widely used) package for machine-learning. In particular, it provides a “grammar” for pipelines where each transformation or estimator class exposes fit/transform/predict functions with arguments (X, y), where X represents the features and y the targets.
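A minimal sketch of this grammar on synthetic data (the toy shapes and the seed are arbitrary; only standard scikit-learn objects are used):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))                             # features
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)   # targets

# intermediate steps implement fit/transform; the final step implements fit/predict
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
pipe.fit(X, y)
y_hat = pipe.predict(X)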

1.3.3 Backtesting

A look-ahead bias occurs when data dated at t includes information only available after t; in contrast, point-in-time data ensures that data dated at t is based only on information up to date t. A backtest is a method to simulate a strategy on point-in-time historical data and evaluate its profitability.

In order to illustrate how to use pipelines à la scikit-learn for quantitative portfolio management, we introduce a thin layer of functions – in particular, the Backtester class. This class allows running a rolling-window simulation so that only information up to date t − 1 is used to determine the holdings at that date.

1.4 The business model of hedge funds

HF = data + compute power + human capital + intangibles


Chapter 2

Backtesting

In this section, we construct a backtest using industry data. More precisely, we use data from Ken
French’s data library to construct a simple industry momentum return predictor.
The goal of a backtest is to assess the validity of a trading predictor at any point in the past. In particular, it is crucial to avoid any forward-looking bias – in which information available only after time t is mistakenly used at time t. In practice, the predictors are estimated over rolling (or expanding) windows. We implement rolling-window estimation with the sklearn TimeSeriesSplit object.
For backtesting, visualisation is very important and we make use of some plotting functions in-
troduced in the Appendix:

[1]: from ml4pmt.plot import line, bar, heatmap

2.1 Markowitz portfolio optimisation

First, a review of mean-variance optimisation for a universe of $N$ assets, where $\alpha = E(r)$ is the return forecast.

Lemma [mean-variance]: the allocation that maximizes the utility $h^T \alpha - \frac{h^T V h}{2\lambda}$ is

$$h = \lambda V^{-1} \alpha,$$

where $\lambda$ is the risk-tolerance.

The ex-ante risk is $h^T V h = \lambda^2 \, \alpha^T V^{-1} \alpha$ and the ex-ante Sharpe ratio is

$$S = \frac{h^T E(r)}{\sqrt{h^T V h}} = \sqrt{\alpha^T V^{-1} \alpha}.$$

Corollary: the maximisation of the Sharpe ratio is equivalent (up to a scaling factor) to the mean-variance optimisation.
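A quick numerical check of the lemma (a sketch with an arbitrary toy covariance matrix and forecast):

import numpy as np

rng = np.random.RandomState(0)
N, lam = 4, 2.0
G = rng.normal(size=(N, N))
V = G @ G.T + N * np.eye(N)       # a positive-definite covariance matrix
alpha = rng.normal(size=N)        # return forecast

h = lam * np.linalg.solve(V, alpha)       # h = lambda V^{-1} alpha
risk = h @ V @ h                          # ex-ante risk
sharpe = h @ alpha / np.sqrt(h @ V @ h)   # ex-ante Sharpe ratio
assert np.isclose(risk, lam ** 2 * alpha @ np.linalg.solve(V, alpha))
assert np.isclose(sharpe, np.sqrt(alpha @ np.linalg.solve(V, alpha)))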


The mean-variance formula can be extended to account for linear constraints

$$Ah = b.$$

To do so, we introduce the Lagrangian $\mathcal{L}$ (with Lagrange multiplier $\xi$):

$$\mathcal{L} = h^T \alpha - \frac{h^T V h}{2\lambda} - (h^T A^T - b^T)\xi.$$

The Lagrange multiplier $\xi$ is a tuning parameter chosen exactly so that the constraint above holds. At the optimal value of $\xi$, the constrained problem boils down to an unconstrained problem with the adjusted return forecast $\alpha - A^T \xi$.

Lemma: the allocation that maximizes the utility $h^T \alpha - \frac{h^T V h}{2\lambda}$ under the linear constraint $Ah = b$ is

$$h = V^{-1} A^T \left( A V^{-1} A^T \right)^{-1} b + \lambda V^{-1} \left[ \alpha - A^T \left( A V^{-1} A^T \right)^{-1} A V^{-1} \alpha \right].$$

Proof: the first-order condition is

$$\frac{\partial \mathcal{L}}{\partial h} = \alpha - \frac{V h}{\lambda} - A^T \xi = 0 \iff h = \lambda V^{-1} \left[ \alpha - A^T \xi \right].$$

The parameter $\xi$ is chosen so that $Ah = b$:

$$b = Ah = \lambda A V^{-1} \left[ \alpha - A^T \xi \right] \;\Rightarrow\; \xi = \left[ A V^{-1} A^T \right]^{-1} \left( A V^{-1} \alpha - \frac{b}{\lambda} \right).$$

The holding vector under the constraint is

$$h_\lambda = \underbrace{V^{-1} A^T \left( A V^{-1} A^T \right)^{-1} b}_{\text{minimum variance portfolio}} + \underbrace{\lambda V^{-1} \left[ \alpha - A^T \left( A V^{-1} A^T \right)^{-1} A V^{-1} \alpha \right]}_{\text{speculative portfolio}}$$

• The first term is what minimises the risk $h^T V h$ under the constraint $Ah = b$ (in particular, it does not depend on expected returns or risk-tolerance).
• The second term is the speculative portfolio (it is sensitive to both inputs).

The efficient frontier is the relation between the expected portfolio return $h^T \alpha$ and the portfolio standard deviation $\sqrt{h^T V h}$ for varying levels of risk-tolerance:

$$(x, y) \mapsto \left( h_\lambda^T \alpha, \sqrt{h_\lambda^T V h_\lambda} \right).$$

When $b = 0$, the efficient frontier between $h_\lambda^T \alpha$ and $\sqrt{h_\lambda^T V h_\lambda}$ is a line through $(0, 0)$; otherwise, it is a parabolic curve.

We focus on pure “alpha views” – that is, long-short “cash-neutral” portfolios where the sum of holdings is zero. In this case $b = 0$ and $A = \mathbf{1}$ where

$$\mathbf{1} = \begin{pmatrix} 1 & \ldots & 1 \end{pmatrix}.$$
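A sketch checking that, with $A = \mathbf{1}$ and $b = 0$, the constrained formula delivers cash-neutral holdings (toy inputs as in the check above):

import numpy as np

rng = np.random.RandomState(1)
N, lam = 4, 1.0
G = rng.normal(size=(N, N))
V = G @ G.T + N * np.eye(N)
alpha = rng.normal(size=N)
A = np.ones(N)

invV = np.linalg.inv(V)
U = invV @ A                       # V^{-1} A^T for a single constraint
c = (A @ invV @ alpha) / (A @ U)   # (A V^{-1} A^T)^{-1} A V^{-1} alpha
h = lam * (invV @ alpha - U * c)   # with b = 0, only the speculative term remains
assert np.isclose(h.sum(), 0.0)    # cash-neutral by construction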

2.2 Industry momentum backtest

The setup for predicting industry returns is the following:

• the assets are industries;
• the return forecast $\alpha$ is estimated using rolling-window returns (over $L$ months, with $L = 12$) preceding a given date;
• no look-ahead bias: at each date, only information up to that date is used;
• such a strategy goes long past “winners” (industries with higher-than-average returns) and goes short past “losers” (industries with lower-than-average returns) ⇒ a Momentum strategy;
• this strategy is often implemented by skipping the most recent month to avoid the 1-month “reversal” effect (a sketch of the signal construction follows below).

See Moskowitz and Grinblatt (1999): “Do Industries Explain Momentum?,” Journal of Finance.
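In pandas, the signal construction can be sketched as follows (assuming ret is the date-by-industry DataFrame of monthly returns loaded below; the skip-month variant is shown for comparison):

L = 12
signal = ret.rolling(L).mean()                     # average return over the last L months
signal_skip = ret.shift(1).rolling(L - 1).mean()   # same, skipping the most recent month

# cross-sectional demeaning: long past winners, short past losers
signal_cs = signal.sub(signal.mean(axis=1), axis=0)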

2.2.1 Industry data

[5]: from ml4pmt.dataset import load_kf_returns

[6]: returns_data = load_kf_returns(cache_dir="data")

INFO:ml4pmt.dataset:logging from cache directory: data/12_Industry_Portfolios


Since the Moskowitz-Grinblatt paper was published in August 1999, we will keep the data after
1999 as out-of-sample and only use the data before 1999.

[7]: ret = returns_data["Monthly"]["Average_Value_Weighted_Returns"][:'1999']

Time convention: holdings $h_t$ and returns $r_t$ are known for period $t$ – i.e. at the end of period $t$:

• to compute the pnl without forward-looking information, the holdings must only depend on information up to $t - 1$;
• in practice, we will have

$$\text{pnl}_t = h_{t-1} \times r_t.$$
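A toy example of this convention (three periods of made-up returns and holdings for two assets):

import pandas as pd

r = pd.DataFrame({'A': [0.01, 0.02, -0.01], 'B': [0.00, -0.01, 0.03]})
h = pd.DataFrame({'A': [1.0, -1.0, 1.0], 'B': [-1.0, 1.0, -1.0]})

# holdings decided at the end of t-1 accrue the returns of period t
pnl = h.shift(1).mul(r).sum(axis=1)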

2.2.2 Backtesting functions

In the next helper file, we introduce three main functions:

• a compute_batch_holdings function that computes mean-variance holdings for batches of predictions;
• a MeanVariance class that follows the sklearn api;
• a fit_predict function to run rolling-window estimations.

[8]: %%writefile ../ml4pmt/backtesting.py

import numpy as np
import pandas as pd
from joblib import Parallel, delayed
from ml4pmt.metrics import sharpe_ratio
from sklearn.base import BaseEstimator, clone
from sklearn.model_selection import TimeSeriesSplit
from sklearn.utils.metaestimators import _safe_split


def compute_batch_holdings(pred, V, A=None, past_h=None, constant_risk=False):
    """
    Compute Markowitz holdings with return prediction "pred" and covariance matrix "V".

    pred: numpy array (shape N * K)
    V: numpy array (N * N)
    """
    N, _ = V.shape
    if isinstance(pred, pd.Series) | isinstance(pred, pd.DataFrame):
        pred = pred.values
    if pred.shape == (N,):
        pred = pred[:, None]
    elif pred.shape[1] == N:
        pred = pred.T

    invV = np.linalg.inv(V)
    if A is None:
        M = invV
    else:
        U = invV.dot(A)
        if A.ndim == 1:
            M = invV - np.outer(U, U.T) / U.dot(A)
        else:
            M = invV - U.dot(np.linalg.inv(U.T.dot(A)).dot(U.T))
    h = M.dot(pred)
    if constant_risk:
        h = h / np.sqrt(np.diag(h.T.dot(V.dot(h))))
    return h.T


class MeanVariance(BaseEstimator):
    def __init__(self, transform_V=None, A=None, constant_risk=True):
        if transform_V is None:
            self.transform_V = lambda x: np.cov(x.T)
        else:
            self.transform_V = transform_V
        self.A = A
        self.constant_risk = constant_risk

    def fit(self, X, y=None):
        self.V_ = self.transform_V(y)
        return self  # return self to follow the sklearn estimator convention

    def predict(self, X):
        if self.A is None:
            T, N = X.shape
            A = np.ones(N)
        else:
            A = self.A
        h = compute_batch_holdings(X, self.V_, A, constant_risk=self.constant_risk)
        return h

    def score(self, X, y):
        return sharpe_ratio(np.sum(X * y, axis=1))


class Backtester:
    def __init__(
        self,
        estimator,
        ret,
        max_train_size=36,
        test_size=1,
        start_date="1945-01-01",
        end_date=None,
    ):
        self.start_date = start_date
        self.end_date = end_date
        self.estimator = estimator
        self.ret = ret[: self.end_date]
        self.cv = TimeSeriesSplit(
            max_train_size=max_train_size,
            test_size=test_size,
            n_splits=1 + len(ret.loc[start_date:end_date]) // test_size,
        )

    def train(self, features, target):
        pred, estimators = fit_predict(
            self.estimator, features, target, self.ret, self.cv, return_estimator=True
        )
        self.estimators_ = estimators
        self.h_ = pred
        self.pnl_ = (
            pred.shift(1).mul(self.ret).sum(axis=1)[self.start_date : self.end_date]
        )
        return self


def _fit_predict(estimator, X, y, train, test, return_estimator=False):
    X_train, y_train = _safe_split(estimator, X, y, train)
    X_test, _ = _safe_split(estimator, X, y, test, train)
    estimator.fit(X_train, y_train)
    if return_estimator:
        return estimator.predict(X_test), estimator
    else:
        return estimator.predict(X_test)


def fit_predict(
    estimator,
    features,
    target,
    ret,
    cv,
    return_estimator=False,
    verbose=0,
    pre_dispatch="2*n_jobs",
    n_jobs=1,
):
    parallel = Parallel(n_jobs=n_jobs, verbose=verbose, pre_dispatch=pre_dispatch)
    res = parallel(
        delayed(_fit_predict)(
            clone(estimator), features, target, train, test, return_estimator
        )
        for train, test in cv.split(ret)
    )
    if return_estimator:
        pred, estimators = zip(*res)
    else:
        pred = res
    cols = ret.columns
    idx = ret.index[np.concatenate([test for _, test in cv.split(ret)])]
    if return_estimator:
        return pd.DataFrame(np.concatenate(pred), index=idx, columns=cols), estimators
    else:
        return pd.DataFrame(np.concatenate(pred), index=idx, columns=cols)

Overwriting ../ml4pmt/backtesting.py

[9]: from ml4pmt.backtesting import compute_batch_holdings, MeanVariance, Backtester

[10]: T, N = ret.shape
A = np.ones(N)

[11]: h = compute_batch_holdings(ret.mean(), ret.cov(), A, past_h=None)

[12]: np.allclose(h.dot(A), [0.])

[12]: True

[13]: A = np.stack([np.ones(N), np.zeros(N)], axis=1)
A[0, 1] = 1

[14]: h = compute_batch_holdings(pred=ret.mean(), V=ret.cov(), A=A, past_h=None)

[15]: np.allclose(h.dot(A), [0., 0.])

[15]: True

2.2.3 Scikit-Learn TimeSeriesSplit

[16]: from sklearn.model_selection import TimeSeriesSplit

Given that the data is monthly, we re-estimate the model every month. This is done by choosing
the parameter n_splits in the class TimeSeriesSplit as the number of months.

[17]: start_date = "1945-01-01"
test_size = 1
params = dict(max_train_size=36, test_size=test_size, gap=0)
params["n_splits"] = 1 + len(ret.loc[start_date:]) // test_size

cv = TimeSeriesSplit(**params)

More precisely, with TimeSeriesSplit:

• the test indices are the dates for which the holdings are computed;
• the train indices are the date range over which a forecasting model is trained;
• gap could be set to −1 because the model can have at most the same information as when making the portfolio decision for the first date in the test window;
• in practice, the target will be shifted by −1 and gap is set to 0;
• we can estimate batches with test_size > 1;
• n_splits is determined so that the backtest starts (just) before a certain start date; a standalone illustration follows below.
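A standalone illustration on a toy array (the test_size and gap arguments assume a recent scikit-learn version, 0.24 or later):

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20)
cv = TimeSeriesSplit(n_splits=4, max_train_size=6, test_size=2, gap=0)
for train, test in cv.split(X):
    print(train, test)   # each test window advances by test_size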

[18]: for train, test in cv.split(ret):
    break
ret.iloc[train].index[-1], ret.iloc[test].index[0]

[18]: (Timestamp('1944-11-01 00:00:00'), Timestamp('1944-12-01 00:00:00'))

2.3 Empirical results

2.3.1 Cumulative pnl

[19]: transform_X = lambda x: x.rolling(12).mean().values
transform_y = lambda x: x.shift(-1).values
features = transform_X(ret)
target = transform_y(ret)

[20]: _h = []
for train, test in cv.split(ret):
    m = MeanVariance()
    m.fit(features[train], target[train])
    _h += [m.predict(features[test])]

cols = ret.columns
idx = ret.index[np.concatenate([test for _, test in cv.split(ret)])]
h = pd.DataFrame(np.concatenate(_h), index=idx, columns=cols)

[21]: pnl = h.shift(1).mul(ret).sum(axis=1)[start_date:]
line(pnl.rename('Industry momentum'), cumsum=True)

[22]: m = Backtester(estimator=MeanVariance(), ret=ret)
m.train(features, target)
h.equals(m.h_), pnl.equals(m.pnl_)

[22]: (True, True)

2.3.2 Other backtest statistics

We can also extract information from the estimators – e.g. in this simple case, recover the covariance matrix fitted by the class MeanVariance().

[23]: estimators = m.estimators_

[24]: V_mean = pd.DataFrame(sum([m.V_ for m in estimators]) / len(estimators),
                      ret.columns, ret.columns)
heatmap(V_mean, title='Average covariance matrix')



[25]: m = Backtester(estimator=MeanVariance(), ret=ret)

pnls = {}
for window in [6, 12, 24, 36]:
    features_ = ret.rolling(window).mean().values
    m.train(features_, target)
    pnls[window] = m.pnl_
line(pnls, cumsum=True, start_date='1945',
     title='Cumulative pnl for different look-back windows (in month)')

[27]: from ml4pmt.metrics import sharpe_ratio

sr = {i: h.shift(1 + i).mul(ret).sum(axis=1).pipe(sharpe_ratio) for i in range(-10, 12)}
bar(sr, baseline=0, sort=False, title='Lead-lag sharpe ratio')

The off-the-top approach is to remove an asset from the tradable set and check whether the
portfolio sharpe ratio decreases (in which case, this asset is a contributor) or increases (in which
case, this asset is a detractor).

[30]: pnls_ott = {}
for c in ret.columns:
    ret_ = ret.drop(c, axis=1)
    features_ = transform_X(ret_)
    target_ = transform_y(ret_)
    pnl_ = Backtester(estimator=MeanVariance(), ret=ret_).train(features_, target_).pnl_
    pnls_ott[c] = pnl_.pipe(sharpe_ratio)

pnls_ott["ALL"] = pnl.pipe(sharpe_ratio)

[31]: bar(pnls_ott, baseline="ALL", title='Industry momentum off-the-top')


Chapter 3

Linear estimators

In this section, we take advantage of some of scikit-learn's powerful features, such as the Pipeline.

3.1 Ridge / Lasso / Elastic net

Ridge regression: the betas $\langle \beta_1, \ldots, \beta_p \rangle$ are chosen to minimize

$$\frac{1}{2} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2.$$

The Ridge regression provides more stable and accurate estimates than a standard residual-sum-of-squares minimization, as the sketch below illustrates.
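To see this, a small sketch with two nearly collinear features (synthetic data, arbitrary seed):

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
x = rng.normal(size=(100, 1))
X = np.hstack([x, x + 1e-3 * rng.normal(size=(100, 1))])   # nearly collinear columns
y = X.sum(axis=1) + rng.normal(size=100)

coef_ols = LinearRegression().fit(X, y).coef_    # typically large, offsetting coefficients
coef_ridge = Ridge(alpha=1.0).fit(X, y).coef_    # shrunk towards each other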
Lasso regression: the betas $\langle \beta_1, \ldots, \beta_p \rangle$ are chosen to minimize

$$\frac{1}{2} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|.$$

The Lasso tends to promote sparse and stable models that are more easily interpretable.

Elastic net: the betas $\langle \beta_1, \ldots, \beta_p \rangle$ are chosen to minimize

$$\frac{1}{2} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \left[ (1 - \alpha) \beta_j^2 + \alpha |\beta_j| \right].$$

“The lasso penalty is not very selective in the choice among a set of strong but correlated predictors, and the ridge penalty is inclined to shrink the coefficients of correlated variables towards each other. The compromise in the elastic net could cause the highly correlated features to be averaged while encouraging a parsimonious model.”
To give an example, we use a diabetes dataset provided by sklearn.

[2]: from sklearn import datasets


[3]: X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)

[4]: from sklearn.linear_model import lasso_path, enet_path
from sklearn import datasets

X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)
X /= X.std(axis=0)
eps = 5e-3

[5]: alphas_lasso, coefs_lasso, _ = lasso_path(X, y, eps=eps)

l1_ratio = 0.5
alphas_enet, coefs_enet, _ = enet_path(X, y, eps=eps, l1_ratio=l1_ratio)

[6]: fig, ax = plt.subplots(1, 2, figsize=(20, 6))
fig.suptitle('Coefficients as a function of the shrinkage factor (in log)')
line(pd.DataFrame(coefs_lasso.T, -1 * np.log(alphas_lasso), columns=X.columns),
     title='Lasso', ax=ax[0])
line(pd.DataFrame(coefs_enet.T, -1 * np.log(alphas_enet), columns=X.columns),
     title=f'Elastic net (l1_ratio={l1_ratio})', ax=ax[1])

[9]: from ml4pmt.estimators import LinearRegression, Ridge

3.2 Scikit-learn Pipeline and Multi-output

[7]: from sklearn.pipeline import make_pipeline


from sklearn.preprocessing import StandardScaler, PolynomialFeatures

The API of scikit-learn is organised around the fit, predict and transform methods. More precisely, estimators like linear regression or Ridge regression can handle multiple outputs, but they do not have a transform method. We add this method so that they can be used as intermediate steps of a Pipeline.

[30]: %%writefile ../ml4pmt/estimators.py

from sklearn.linear_model import LinearRegression, Ridge, RidgeCV
from sklearn.multioutput import MultiOutputRegressor
from sklearn.neural_network import MLPRegressor


class LinearRegression(LinearRegression):
    def transform(self, X):
        return self.predict(X)


class Ridge(Ridge):
    def transform(self, X):
        return self.predict(X)


class RidgeCV(RidgeCV):
    def transform(self, X):
        return self.predict(X)


class MultiOutputRegressor(MultiOutputRegressor):
    def transform(self, X):
        return self.predict(X)


class MLPRegressor(MLPRegressor):
    def transform(self, X):
        return self.predict(X)

Overwriting ../ml4pmt/estimators.py

3.3 Predicting industry returns with linear models

[10]: returns_data = load_kf_returns(cache_dir='data')
ret = returns_data['Monthly']['Average_Value_Weighted_Returns'][:'1999']

transform_X = lambda x: x.rolling(12).mean().fillna(0).values
transform_y = lambda x: x.shift(-1).values
features = transform_X(ret)
target = transform_y(ret)

INFO:ml4pmt.dataset:logging from cache directory: data/12_Industry_Portfolios



[11]: m = Backtester(MeanVariance(), ret).train(features, target)
pnls = {'momentum': m.pnl_}

3.3.1 Linear Regression

It is always a good idea to start with a linear regression as a benchmark.

[12]: estimator = make_pipeline(LinearRegression(), MeanVariance())

[13]: m = Backtester(estimator, ret).train(features, target)
pnls['linear_regression'] = m.pnl_
line(pnls['linear_regression'], cumsum=True, title='Linear Regression')

The linear regression fits an intercept and some coefficients.

[14]: ols_ = m.estimators_[0].named_steps['linearregression']
coef_ = ols_.coef_
intercept_ = ols_.intercept_
vec = ret.mean().values
np.allclose(ols_.predict(vec[None, :]), coef_.dot(vec) + intercept_)

[14]: True

[16]: coefs_ = [m.named_steps['linearregression'].coef_ for m in m.estimators_]
coefs_mean = pd.DataFrame(sum(coefs_) / len(coefs_), ret.columns, ret.columns).T

[17]: heatmap(coefs_mean.loc[coefs_mean.mean(1).sort_values().index,
                       coefs_mean.mean(1).sort_values().index],
        title='Average linear regression coefficients (x-axis: predictors, y-axis=targets)')

[18]: pnls_ = {}
for hl in tqdm([6, 12, 24]):
    features_ = ret.ewm(halflife=hl).mean().fillna(0).values
    pnls_[hl] = Backtester(estimator, ret).train(features_, target).pnl_
line(pnls_, cumsum=True, title='Robustness on feature half-lives')


[19]: pnls_ = {}
for hl in [6, 12, 24]:
    features_ = ret.rolling(window=hl).mean().fillna(0).values
    pnls_[hl] = Backtester(estimator, ret).train(features_, target).pnl_
line(pnls_, cumsum=True, title='Robustness on features with rolling windows')

3.3.2 Ridge

[21]: estimator = make_pipeline(StandardScaler(with_mean=False),
                          Ridge(),
                          MeanVariance())

pnls['ridge'] = Backtester(estimator, ret).train(features, target).pnl_
line(pnls['ridge'], cumsum=True, title='Ridge')

[22]: pnls_ = {}
for alpha in [0.1, 1, 10, 100]:
    estimator_ = make_pipeline(StandardScaler(with_mean=False),
                               Ridge(alpha=alpha),
                               MeanVariance())
    pnls_[alpha] = Backtester(estimator_, ret).train(features_, target).pnl_
line(pnls_, cumsum=True, title='Ridge: Robustness on alpha')

3.3.3 Ridge with feature expansion

We can expand the set of features by using polynomial transforms with PolynomialFeatures.

[23]: PolynomialFeatures(degree=2).fit_transform(ret.iloc[:10]).shape

[23]: (10, 91)

Number of new features: the intercept (=1), the initial features (=12), the squared features (=12), and all cross products of pairs of features (=12 × 11/2 = 66):

[24]: estimator = make_pipeline(StandardScaler(with_mean=False),
                          PolynomialFeatures(degree=2),
                          Ridge(alpha=100),
                          MeanVariance())

[25]: print(f'Number of features generated by degree=2: {1+ 12 + 12 + 6 * 11}')

Number of features generated by degree=2: 91

[26]: pnls['ridge_with_feature_expansion'] = Backtester(estimator, ret).train(features_, target).pnl_

line(pnls['ridge_with_feature_expansion'], cumsum=True, title='Ridge with feature extension')

[27]: pnls_ = {}
for alpha in [0.1, 1, 10, 100, 1000]:
    estimator_ = make_pipeline(StandardScaler(with_mean=False),
                               PolynomialFeatures(degree=2),
                               Ridge(alpha=alpha),
                               MeanVariance())
    pnls_[alpha] = Backtester(estimator_, ret).train(features_, target).pnl_
line(pnls_, cumsum=True, title='Ridge with feature expansion: Robustness on alpha')

Putting all the types of linear predictors together, we can compare the cumulative pnls in the
graph below.

[28]: line(pd.concat(pnls, axis=1).assign(ALL = lambda x: x.mean(axis=1)), cumsum=True)


Chapter 4

Appendix: helper functions

Here we create some helper functions that will be used across notebooks, using the %%writefile magic.
4.1 Metrics

[18]: %%writefile ../ml4pmt/metrics.py

import numpy as np


def test_monthly(df):
    return int(len(df) / len(df.asfreq("M"))) == 1


def test_bday(df):
    return int(len(df) / len(df.asfreq("B"))) == 1


def test_day(df):
    return int(len(df) / len(df.asfreq("D"))) == 1


def sharpe_ratio(df, num_period_per_year=None):
    # infer the annualisation factor from the index frequency when it is not provided
    if num_period_per_year is None:
        if test_monthly(df):
            num_period_per_year = 12
        if test_bday(df):
            num_period_per_year = 260
        if test_day(df):
            num_period_per_year = 365
    if num_period_per_year is None:
        return np.nan
    else:
        return df.mean() / df.std() * np.sqrt(num_period_per_year)


def drawdown(x, return_in_risk_unit=True, window=36, num_period_per_year=12):
    # drawdown = cumulative pnl minus its running maximum
    dd = x.cumsum().sub(x.cumsum().cummax())
    if return_in_risk_unit:
        return dd.div(x.rolling(window).std().mul(np.sqrt(num_period_per_year)))
    else:
        return dd

Overwriting ../ml4pmt/metrics.py
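A quick usage sketch of these metrics on synthetic monthly data (random numbers, so the values themselves are meaningless):

import numpy as np
import pandas as pd
from ml4pmt.metrics import sharpe_ratio, drawdown

idx = pd.date_range('1990-01-31', periods=120, freq='M')
pnl = pd.Series(np.random.normal(0.05, 1.0, size=120), index=idx)

sharpe_ratio(pnl)      # annualised with sqrt(12) since the index is monthly
drawdown(pnl).tail()   # drawdown in rolling-risk units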

4.2 Data visualisation

Data exploration, in particular based on visualisation, is crucial to modern data science. Pandas has a lot of plotting functionality (e.g. see the graph below), but we will find it useful to use a custom set of plotting functions.

[19]: %%writefile ../ml4pmt/plot.py

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from ml4pmt.metrics import sharpe_ratio

plt.style.use("seaborn-whitegrid")


def line(
    df,
    sort=True,
    figsize=(8, 5),
    ax=None,
    title="",
    cumsum=False,
    loc="center left",
    bbox_to_anchor=(1, 0.5),
    legend_sharpe_ratio=None,
    legend=True,
    yscale=None,
    start_date=None,
):
    if loc == "best":
        bbox_to_anchor = None
    if isinstance(df, dict):
        df = pd.concat(df, axis=1)
    if isinstance(df, pd.Series):
        df = df.to_frame()
    if start_date is not None:
        df = df[start_date:]
    if cumsum & (legend_sharpe_ratio is None):
        legend_sharpe_ratio = True
    if legend_sharpe_ratio:
        df.columns = [f"{c}: sr={sharpe_ratio(df[c]): 3.2f}" for c in df.columns]
    if cumsum:
        df = df.cumsum()
    if sort:
        df = df.loc[:, lambda x: x.iloc[-1].sort_values(ascending=False).index]
    if ax is None:
        fig, ax = plt.subplots(1, 1, figsize=figsize)
    ax.plot(df.index, df.values)
    if legend:
        ax.legend(df.columns, loc=loc, bbox_to_anchor=bbox_to_anchor)
    ax.set_title(title)
    if yscale == "log":
        ax.set_yscale("log")


def bar(
    df,
    err=None,
    sort=True,
    figsize=(8, 5),
    ax=None,
    title="",
    horizontal=False,
    baseline=None,
    rotation=0,
):
    if isinstance(df, pd.DataFrame):
        df = df.squeeze()
    if isinstance(df, dict):
        df = pd.Series(df)
    if sort:
        df = df.sort_values()
    if err is not None:
        err = err.loc[df.index]
    labels = df.index
    x = np.arange(len(labels))
    if ax is None:
        fig, ax = plt.subplots(1, 1, figsize=figsize)
    if horizontal:
        ax.barh(x, df.values, xerr=err, capsize=5)
        ax.set_yticks(x)
        ax.set_yticklabels(labels, rotation=rotation)
        if baseline in df.index:
            df_ = df.copy()
            df_[df.index != baseline] = 0
            ax.barh(x, df_.values, color="lightgreen")
    else:
        ax.bar(x, df.values, yerr=err, capsize=5)
        ax.set_xticks(x)
        ax.set_xticklabels(labels, rotation=rotation)
        if baseline in df.index:
            df_ = df.copy()
            df_[df.index != baseline] = 0
            ax.bar(x, df_.values, color="lightgreen")
    ax.set_title(title)


def heatmap(
    df,
    ax=None,
    figsize=(8, 5),
    title="",
    vmin=None,
    vmax=None,
    vcompute=True,
    cmap="RdBu",
):
    labels_x = df.index
    x = np.arange(len(labels_x))
    labels_y = df.columns
    y = np.arange(len(labels_y))
    if vcompute:
        vmax = df.abs().max().max()
        vmin = -vmax
    if ax is None:
        fig, ax = plt.subplots(1, 1, figsize=figsize)
    pos = ax.imshow(
        df.T.values, cmap=cmap, interpolation="nearest", vmax=vmax, vmin=vmin
    )
    ax.set_xticks(x)
    ax.set_yticks(y)
    ax.set_xticklabels(labels_x, rotation=90)
    ax.set_yticklabels(labels_y)
    ax.set_title(title)
    ax.grid(True)
    # use the parent figure of the axes (fig is undefined when ax is passed in)
    ax.figure.colorbar(pos, ax=ax)


def scatter(
    df,
    xscale=None,
    yscale=None,
    xlabel=None,
    ylabel=None,
    xticks=None,
    yticks=None,
    figsize=(8, 5),
    title=None,
):
    fig, ax = plt.subplots(1, 1, figsize=figsize)
    ax.scatter(df, df.index, facecolors="none", edgecolors="b", s=50)
    if xlabel is not None:
        ax.set_xlabel(xlabel)
    if ylabel is not None:
        ax.set_ylabel(ylabel)
    if xscale is not None:
        ax.set_xscale(xscale)
    if yscale is not None:
        ax.set_yscale(yscale)
    if yticks is not None:
        ax.set_yticks(yticks)
        ax.set_yticklabels(yticks)
    if xticks is not None:
        ax.set_xticks(xticks)
        ax.set_xticklabels(xticks)
    ax.set_title(title)

Overwriting ../ml4pmt/plot.py

[3]: from ml4pmt.plot import bar, line, heatmap

[6]: line(pd.Series(np.random.normal(size=50)), cumsum=True, title="This is a graph",
     legend_sharpe_ratio=False)

[7]: bar(pd.Series(np.random.normal(size=50)), baseline=10, horizontal=True)

4.3 Dataset

In particular, we look at two datasets:

• Ken French’s data library (https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html).

• Berkshire Hathaway

[17]: %%writefile ../ml4pmt/dataset.py

import logging
import os
import sys
from io import BytesIO
from pathlib import Path
from zipfile import ZipFile

import numpy as np
import pandas as pd
import requests

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logger = logging.getLogger(__name__)


def clean_kf_dataframes(df, multi_df=False):
    """
    Extract the annual and monthly dataframes from the csv file with its specific formatting.
    """
    idx = [-2] + list(np.where(df.notna().sum(axis=1) == 0)[0])
    if multi_df:
        cols = [" Average Value Weighted Returns -- Monthly"] + list(
            df.loc[df.notna().sum(axis=1) == 0].index
        )
    returns_data = {"Annual": {}, "Monthly": {}}
    for i in range(len(idx)):
        if multi_df:
            c_ = (
                cols[i]
                .replace("-- Annual", "")
                .replace("-- Monthly", "")
                .strip()
                .replace("/", " ")
                .replace(" ", "_")
            )
        if i != len(idx) - 1:
            v = df.iloc[idx[i] + 2 : idx[i + 1] - 1].astype(float)
            v.index = v.index.str.strip()
            if len(v) != 0:
                if len(v.index[0]) == 6:
                    v.index = pd.to_datetime(v.index, format="%Y%m")
                    if multi_df:
                        returns_data["Monthly"][c_] = v
                    else:
                        returns_data["Monthly"] = v
                    continue
                if len(v.index[0]) == 4:
                    v.index = pd.to_datetime(v.index, format="%Y")
                    if multi_df:
                        returns_data["Annual"][c_] = v
                    else:
                        returns_data["Annual"] = v
    return returns_data


def load_kf_returns(
    filename="12_Industry_Portfolios", cache_dir=None, force_reload=False
):
    """
    Load industry returns from Ken French's website:
    https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html
    """
    if filename == "12_Industry_Portfolios":
        skiprows, multi_df = 11, True
    if filename == "F-F_Research_Data_Factors":
        skiprows, multi_df = 3, False
    if filename == "F-F_Momentum_Factor":
        skiprows, multi_df = 13, False
    if cache_dir is None:
        cache_dir = Path(os.getcwd()) / "data"
    if isinstance(cache_dir, str):
        cache_dir = Path(cache_dir)
    output_dir = cache_dir / filename
    if output_dir.is_dir() and not force_reload:
        logger.info(f"logging from cache directory: {output_dir}")
        returns_data = load_dict(output_dir)
    else:
        logger.info("loading from external source")
        path = (
            "http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/"
            + filename
            + "_CSV.zip"
        )
        r = requests.get(path)
        files = ZipFile(BytesIO(r.content))
        df = pd.read_csv(files.open(filename + ".CSV"), skiprows=skiprows, index_col=0)
        returns_data = clean_kf_dataframes(df, multi_df=multi_df)
        logger.info(f"saving in cache directory {output_dir}")
        save_dict(returns_data, output_dir)
    return returns_data


def load_buffets_data(cache_dir=None, force_reload=False):
    if cache_dir is None:
        cache_dir = Path(os.getcwd()) / "data"
    if isinstance(cache_dir, str):
        cache_dir = Path(cache_dir)
    filename = cache_dir / "ffdata_brk13f.parquet"
    if filename.is_file() and not force_reload:
        logger.info(f"logging from cache directory: {filename}")
        df = pd.read_parquet(filename)
    else:
        logger.info("loading from external source")
        path = "https://github.com/slihn/buffetts_alpha_R/archive/master.zip"
        r = requests.get(path)
        files = ZipFile(BytesIO(r.content))
        df = pd.read_csv(
            files.open("buffetts_alpha_R-master/ffdata_brk13f.csv"), index_col=0
        )
        df.index = pd.to_datetime(df.index, format="%m/%d/%Y")
        logger.info(f"saving in cache directory {filename}")
        df.to_parquet(filename)
    return df


symbol_dict = {
    "TOT": "Total",
    "XOM": "Exxon",
    "CVX": "Chevron",
    "COP": "ConocoPhillips",
    "VLO": "Valero Energy",
    "MSFT": "Microsoft",
    "IBM": "IBM",
    "TWX": "Time Warner",
    "CMCSA": "Comcast",
    "CVC": "Cablevision",
    "YHOO": "Yahoo",
    "DELL": "Dell",
    "HPQ": "HP",
    "AMZN": "Amazon",
    "TM": "Toyota",
    "CAJ": "Canon",
    "SNE": "Sony",
    "F": "Ford",
    "HMC": "Honda",
    "NAV": "Navistar",
    "NOC": "Northrop Grumman",
    "BA": "Boeing",
    "KO": "Coca Cola",
    "MMM": "3M",
    "MCD": "McDonald's",
    "PEP": "Pepsi",
    "K": "Kellogg",
    "UN": "Unilever",
    "MAR": "Marriott",
    "PG": "Procter Gamble",
    "CL": "Colgate-Palmolive",
    "GE": "General Electrics",
    "WFC": "Wells Fargo",
    "JPM": "JPMorgan Chase",
    "AIG": "AIG",
    "AXP": "American express",
    "BAC": "Bank of America",
    "GS": "Goldman Sachs",
    "AAPL": "Apple",
    "SAP": "SAP",
    "CSCO": "Cisco",
    "TXN": "Texas Instruments",
    "XRX": "Xerox",
    "WMT": "Wal-Mart",
    "HD": "Home Depot",
    "GSK": "GlaxoSmithKline",
    "PFE": "Pfizer",
    "SNY": "Sanofi-Aventis",
    "NVS": "Novartis",
    "KMB": "Kimberly-Clark",
    "R": "Ryder",
    "GD": "General Dynamics",
    "RTN": "Raytheon",
    "CVS": "CVS",
    "CAT": "Caterpillar",
    "DD": "DuPont de Nemours",
}


def load_sklearn_stock_returns(cache_dir=None, force_reload=False):
    if cache_dir is None:
        cache_dir = Path(os.getcwd()) / "data"
    if isinstance(cache_dir, str):
        cache_dir = Path(cache_dir)
    filename = cache_dir / "sklearn_returns.parquet"
    if filename.is_file() and not force_reload:
        logger.info(f"logging from cache directory: {filename}")
        df = pd.read_parquet(filename)
    else:
        logger.info("loading from external source")
        url = "https://raw.githubusercontent.com/scikit-learn/examples-data/master/financial-data"
        df = (
            pd.concat(
                {
                    c: pd.read_csv(f"{url}/{c}.csv", index_col=0, parse_dates=True)[
                        "close"
                    ].diff()
                    for c in symbol_dict.keys()
                },
                axis=1,
            )
            .asfreq("B")
            .iloc[1:]
        )
        logger.info(f"saving in cache directory {filename}")
        df.to_parquet(filename)
    return df


def save_dict(data, output_dir):
    assert isinstance(data, dict)
    if output_dir.is_dir() is False:
        os.mkdir(output_dir)
    for k, v in data.items():
        if isinstance(v, pd.DataFrame):
            v.to_parquet(output_dir / f"{k}.parquet")
        else:
            save_dict(v, output_dir=output_dir / k)


def load_dict(input_dir):
    data = {}
    for o in os.scandir(input_dir):
        if o.name.endswith(".parquet"):
            k = o.name.replace(".parquet", "")
            data[k] = pd.read_parquet(o)
        elif o.is_dir():  # is_dir is a method: call it (the bare bound method is always truthy)
            data[o.name] = load_dict(o)
    return data

Overwriting ../ml4pmt/dataset.py

4.3.1 Ken French data: industry returns

[9]: from ml4pmt.dataset import load_kf_returns, load_sklearn_stock_returns, load_buffets_data

[10]: %%time
returns_data = load_kf_returns(
filename="12_Industry_Portfolios", cache_dir="data", force_reload=True
)

INFO:ml4pmt.dataset:loading from external source


INFO:ml4pmt.dataset:saving in cache directory data/12_Industry_Portfolios
CPU times: user 277 ms, sys: 23.1 ms, total: 300 ms
Wall time: 1.48 s
Reloading from a cache directory is faster!

[11]: %%time
returns_data = load_kf_returns(
filename="12_Industry_Portfolios", cache_dir="data", force_reload=False
)

INFO:ml4pmt.dataset:logging from cache directory: data/12_Industry_Portfolios


CPU times: user 26.1 ms, sys: 3.09 ms, total: 29.2 ms
Wall time: 28.4 ms

[12]: returns_data_SMB_HML = load_kf_returns(
    filename="F-F_Research_Data_Factors", cache_dir="data"
)

INFO:ml4pmt.dataset:logging from cache directory: data/F-F_Research_Data_Factors

[13]: returns_data_MOM = load_kf_returns(filename="F-F_Momentum_Factor", cache_dir="data")

INFO:ml4pmt.dataset:logging from cache directory: data/F-F_Momentum_Factor



4.3.2 Stock returns (2003-2007)

[14]: %%time
returns_data = load_sklearn_stock_returns(cache_dir="data", force_reload=True)

INFO:ml4pmt.dataset:loading from external source


INFO:ml4pmt.dataset:saving in cache directory data/sklearn_returns.parquet
CPU times: user 1.08 s, sys: 63.5 ms, total: 1.14 s
Wall time: 19.8 s

[15]: from ml4pmt.metrics import sharpe_ratio
from ml4pmt.dataset import symbol_dict

start_date, end_date = returns_data.index[0].strftime('%Y-%m-%d'), returns_data.index[-1].strftime('%Y-%m-%d')

df = returns_data.pipe(sharpe_ratio).rename(index=symbol_dict).sort_values()\
    .pipe(lambda x: pd.concat([x.head(), x.tail()]))
bar(df, horizontal=True, title=f'Annualized stock sharpe ratio: {start_date} to {end_date}')

4.3.3 13F Berkshire Hathaway

[16]: %%time
df = load_buffets_data(cache_dir="data", force_reload=True)

INFO:ml4pmt.dataset:loading from external source


INFO:ml4pmt.dataset:saving in cache directory data/ffdata_brk13f.parquet

CPU times: user 60.8 ms, sys: 6.95 ms, total: 67.8 ms
Wall time: 914 ms
Bibliography

J.-P. Bouchaud, J. Bonart, J. Donier, and M. Gould. Trades, Quotes and Prices: Financial Markets Under the Microscope. Cambridge University Press, 2018.

A. Frazzini, D. Kabiller, and L. H. Pedersen. Buffett's alpha. Financial Analysts Journal, 74(4):35–55, 2018.

T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, volume 2. Springer, 2009.

M. Isichenko. Quantitative Portfolio Management: The Art and Science of Statistical Arbitrage. Wiley, 2022.

S. Jansen. Machine Learning for Algorithmic Trading: Predictive Models to Extract Signals from Market and Alternative Data for Systematic Trading Strategies with Python. Packt Publishing Ltd, 2020.

O. Ledoit and M. Wolf. Honey, I shrunk the sample covariance matrix. The Journal of Portfolio Management, 30(4):110–119, 2004.

A. W. Lo. The statistics of Sharpe ratios. Financial Analysts Journal, 58(4):36–52, 2002.

A. W. Lo and A. C. MacKinlay. When are contrarian profits due to stock market overreaction? The Review of Financial Studies, 3(2):175–205, 1990.

T. J. Moskowitz and M. Grinblatt. Do industries explain momentum? The Journal of Finance, 54(4):1249–1290, 1999.

K. P. Murphy. Probabilistic Machine Learning: An Introduction. MIT Press, 2022.