Intro ML For Quants
Sylvain Champonnois1
March 28, 2022
1 Email:[email protected]. This pdf was prepared for the class Data Science applied to
Finance, MSc Data Science for Business X-HEC.
Contents

1 Introduction
1.1 Outline
1.2 Trends
1.2.1 Data deluge
1.2.2 Data scouting
1.2.3 ML research
1.3 MLOps
1.3.1 Pipelines
1.3.2 Scikit-learn
1.3.3 Backtesting
1.4 The business model of hedge funds

2 Backtesting
2.1 Markowitz portfolio optimisation
2.2 Industry momentum backtest
2.2.1 Industry data
2.2.2 Backtesting functions
2.2.3 Scikit-Learn TimeSeriesSplit
2.3 Empirical results
2.3.1 Cumulative pnl
2.3.2 Other backtest statistics

3 Linear estimators
3.1 Ridge / Lasso / Elastic net
3.2 Scikit-learn Pipeline and Multi-output
3.3 Predicting industry returns with linear models
3.3.1 Linear Regression
3.3.2 Ridge
3.3.3 Ridge with feature expansion
Chapter 1
Introduction
This material is an introduction to using machine-learning for portfolio management and trading. Given the centrality of programming in hedge funds today, the concepts are presented using only jupyter notebooks in python. Moreover, we leverage the scikit-learn package (also known as sklearn) to illustrate how machine-learning is used in practice in this context.
1.1 Outline
We are interested in how quantitative hedge funds operate in practice. Today, quantitative hedge funds are essentially consumers of data – they ingest all sorts of datasets and extract information used to systematically buy or sell securities. Researchers and portfolio managers are deeply involved in the process of data ingestion and information extraction, but they do not directly decide which securities are bought or sold – algorithms do.
Because these processes of data ingestion and information extraction are so central to quan-
titative hedge fund operations, they have become software companies – a lot of the intellectual
property (IP) of hedge funds is embedded in the code they write. And in that sense, hedge funds
are not so different from other data-science based technology companies. (And in fact the hiring has become very similar, with a lot of interest in profiles out of Computer Science, Machine Learning, Data Engineering, Statistics, etc.)
In the rest of this section, we first describe how new datasets have become available and what this implies for machine-learning research and, more particularly, for quantitative hedge funds. We then introduce the structure of this course. In particular, given that this course focuses on code (that is, python code), it is written entirely in jupyter notebooks.
1.2 Trends
1.2.1 Data deluge

There are now sensors everywhere in the physical world and most online interactions are tracked – leading to a “data deluge" (e.g. see Mary Meeker (2018) on internet trends or Lori Lewis (2022)).
1.2.2 Data scouting

Two examples of how data is transforming hedge funds in the Financial Times and Bloomberg: FT (08/28/2017) and Bloomberg (06/15/2019).
1.2.3 ML research
ML research has become a race, with new ideas coming out at an increasing speed – e.g. as illustrated by the number of papers published on the scientific paper repository arxiv.org (Jeff Dean (06/02/2019)). More precisely, as Francois Chollet (04/03/2019) points out, the issue is how to process (i.e. test empirically) these new ideas as fast as possible to gain a competitive edge.
The success of deep-learning depends on: i) model “capacity”, ii) computational power, iii) dataset size. Sun, Shrivastava, Singh and Gupta (2017) note that the size of the largest datasets has remained somewhat constant over the last few years.

A particular success of deep-learning has been in Natural Language Processing (NLP), and there too, the size of the largest models has increased dramatically (Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf (2019)).
1.3 MLOps
MLOps (machine learning operations) represents a set of practices for the deployment of ML mod-
els in production. For quant hedge funds, there are two main concepts that we describe here:
• pipelines
• backtests
1.3.1 Pipelines
Pipeline:
A machine-learning pipeline is an end-to-end description of the automated flow of
data from raw inputs to a desired output. Each step represents a transformation of the
data, possibly with a fitted model.
The diagram below illustrates a pipeline for a quant fund. The end point (to the right of the diagram) is the holdings in a set of traded securities – combined with the returns on these securities, the pnl of a given strategy can be computed. The entry point (to the left of the diagram) is a set of features. A sequence of transformations (pre-determined in the pipeline) is applied to these features to produce the desired holdings. Some transformations in the pipeline are “fixed” while others depend on fitted models (e.g. a ML predictor of returns or a risk model).
In the diagram, we emphasize the timing of these different objects:

• for a pnl at time t, the features and target include only information up to t − 1, so that the holdings are known at t − 1 and can accrue returns over the period t.

The following equation summarizes this point:

$$pnl_t = h_{t-1} \times r_t$$
1.3.2 Scikit-learn
The following notebooks and notes are largely based on scikit-learn. scikit-learn is
an extremely powerful (and widely used) package for machine-learning. In particular, it
provides a “grammar” for pipelines where each transformation or estimator class has the
fit/transform/predict functions with arguments as (X, y) where X represents the features and
y, targets.
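As a toy illustration of this grammar (the data below is simulated and not taken from the course notebooks):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# simulated features X and targets y
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)

# a pipeline chains transformers (fit/transform) with a final estimator (fit/predict)
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
pipe.fit(X, y)
pred = pipe.predict(X)
print(pred.shape)  # (100,)
```

The same fit/predict calls apply unchanged when the final step is replaced by a portfolio-construction estimator, which is what the custom classes below exploit.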
1.3.3 Backtesting
A look-ahead bias occurs when data dated at t includes information only available after t; in
contrast, point-in-time data ensures that data dated at t is based on only information up to date t.
A backtest is a method to simulate a strategy using point-in-time historical data and evaluate its
profitability.
In order to illustrate how to use pipelines à la scikit-learn for quantitative portfolio management, we introduce a thin layer of functions – in particular, the Backtester class. This class allows one to run a rolling-window simulation so that only information up to date t − 1 is used to determine the holdings at that date.
Chapter 2

Backtesting
In this section, we construct a backtest using industry data. More precisely, we use data from Ken
French’s data library to construct a simple industry momentum return predictor.
The goal of a backtest is to assess the validity of a trading predictor at any point in the past. In particular, it is crucial to avoid any forward-looking bias – in which information available only after time t is mistakenly used at time t. In practice, the predictors are estimated over rolling (or expanding) windows. We implement rolling-window estimation with the sklearn TimeSeriesSplit object.
For backtesting, visualisation is very important, and we make use of the plotting functions introduced in the Appendix.
2.1 Markowitz portfolio optimisation

First: a review of mean-variance optimisation for a universe with N assets, where α is the return forecast: α = E(r).

Lemma [mean-variance]: the allocation that maximizes the utility $h^T \alpha - \frac{h^T V h}{2\lambda}$ is

$$h = \lambda V^{-1} \alpha,$$

where λ is the risk-tolerance.

The ex-ante risk is $h^T V h = \lambda^2 \alpha^T V^{-1} \alpha$ and the ex-ante Sharpe ratio is

$$S = \frac{h^T E(r)}{\sqrt{h^T V h}} = \sqrt{\alpha^T V^{-1} \alpha}.$$

Corollary: the maximisation of the Sharpe ratio is equivalent (up to a scaling factor) to the mean-variance optimisation.
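The lemma can be checked numerically (a small sketch with simulated inputs, not part of the original notebooks):

```python
import numpy as np

rng = np.random.default_rng(0)
N, lam = 4, 0.5
alpha = rng.normal(size=N)           # return forecast alpha = E(r)
B = rng.normal(size=(N, N))
V = B @ B.T + N * np.eye(N)          # a positive-definite covariance matrix

h = lam * np.linalg.solve(V, alpha)  # h = lambda V^{-1} alpha

# the ex-ante risk and Sharpe ratio match the closed-form expressions
risk = h @ V @ h
sharpe = (h @ alpha) / np.sqrt(risk)
print(np.isclose(risk, lam**2 * alpha @ np.linalg.solve(V, alpha)))    # True
print(np.isclose(sharpe, np.sqrt(alpha @ np.linalg.solve(V, alpha))))  # True
```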
Consider now the same problem under a linear constraint

$$Ah = b.$$

The Lagrangian is

$$L = h^T \alpha - \frac{h^T V h}{2\lambda} - (h^T A^T - b^T)\xi.$$

The Lagrange multiplier ξ is a tuning parameter chosen exactly so that the constraint above holds. At the optimal value of ξ, the constrained problem boils down to an unconstrained problem with the adjusted return forecast α − A^T ξ.

Lemma: the allocation that maximizes the utility $h^T \alpha - \frac{h^T V h}{2\lambda}$ under the linear constraint Ah = b is

$$h = V^{-1} A^T \left( A V^{-1} A^T \right)^{-1} b + \lambda V^{-1} \left[ \alpha - A^T \left( A V^{-1} A^T \right)^{-1} A V^{-1} \alpha \right].$$

Proof: the first-order condition is

$$\frac{\partial L}{\partial h} = \alpha - \frac{V h}{\lambda} - A^T \xi = 0 \iff h = \lambda V^{-1} \left[ \alpha - A^T \xi \right].$$

Substituting into the constraint pins down ξ:

$$b = Ah = \lambda A V^{-1} \left[ \alpha - A^T \xi \right] \implies \xi = \left( A V^{-1} A^T \right)^{-1} \left( A V^{-1} \alpha - \frac{b}{\lambda} \right),$$

so that

$$h_\lambda = \underbrace{V^{-1} A^T \left( A V^{-1} A^T \right)^{-1} b}_{\text{minimum variance portfolio}} + \underbrace{\lambda V^{-1} \left[ \alpha - A^T \left( A V^{-1} A^T \right)^{-1} A V^{-1} \alpha \right]}_{\text{speculative portfolio}}.$$

• The first term is what minimises the risk $h^T V h$ under the constraint Ah = b (in particular, it does not depend on expected returns or risk-tolerance).

• The second term is the speculative portfolio (it is sensitive to both inputs).
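As a sanity check of the lemma (simulated inputs; the cash-neutral case A = 1, b = 0 used later in the chapter):

```python
import numpy as np

rng = np.random.default_rng(1)
N, lam = 5, 0.5
alpha = rng.normal(size=N)
B = rng.normal(size=(N, N))
V = B @ B.T + N * np.eye(N)
A = np.ones((1, N))  # cash-neutrality written as a 1 x N constraint matrix
b = np.zeros(1)

invV = np.linalg.inv(V)
W = A @ invV @ A.T
h_mv = invV @ A.T @ np.linalg.solve(W, b)  # minimum-variance term
h_spec = lam * invV @ (alpha - A.T @ np.linalg.solve(W, A @ invV @ alpha))
h = h_mv + h_spec

print(np.allclose(A @ h, b))   # True: the constraint holds
print(np.allclose(h_mv, 0.0))  # True: with b = 0 the minimum-variance term vanishes
```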
The efficient frontier is the relation between the expected portfolio return $h^T \alpha$ and the portfolio standard deviation $\sqrt{h^T V h}$ for varying levels of risk-tolerance λ:

$$\lambda \mapsto \left( h_\lambda^T \alpha, \sqrt{h_\lambda^T V h_\lambda} \right).$$

When b = 0, the efficient frontier between $h_\lambda^T \alpha$ and $\sqrt{h_\lambda^T V h_\lambda}$ is a line through (0, 0); otherwise, it is a parabolic curve.
2.2 Industry momentum backtest

We focus on pure “alpha views” – that is, long-short “cash-neutral” portfolios where the sum of holdings is zero. In this case b = 0 and A = 1, where

$$\mathbf{1} = \begin{bmatrix} 1 & \dots & 1 \end{bmatrix}.$$
Time convention: holdings $h_t$ and returns $r_t$ are known for period t – i.e. at the end of period t.

• so to compute a pnl free of forward-looking information, the holdings must only depend on information up to t − 1

• in practice, we will have

$$pnl_t = h_{t-1} \times r_t$$
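A minimal pandas sketch of this convention (the holdings and returns below are made-up numbers for illustration):

```python
import pandas as pd

idx = pd.date_range("2020-01-01", periods=4, freq="MS")
ret = pd.DataFrame([[0.01, -0.02], [0.03, 0.01], [-0.01, 0.02], [0.02, 0.0]],
                   index=idx, columns=["A", "B"])
h = pd.DataFrame([[1.0, -1.0], [0.5, -0.5], [1.0, -1.0], [0.0, 0.0]],
                 index=idx, columns=["A", "B"])

# pnl_t = h_{t-1} x r_t: shift the holdings by one period before applying returns
pnl = h.shift(1).mul(ret).sum(axis=1)
print(round(pnl.iloc[1], 4))  # 0.02 = 1.0 * 0.03 + (-1.0) * 0.01
```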
import numpy as np
import pandas as pd
from joblib import Parallel, delayed
from ml4pmt.metrics import sharpe_ratio
from sklearn.base import BaseEstimator, clone
from sklearn.model_selection import TimeSeriesSplit
from sklearn.utils.metaestimators import _safe_split
"""
N, _ = V.shape
if isinstance(pred, pd.Series) | isinstance(pred, pd.DataFrame):
pred = pred.values
if pred.shape == (N,):
pred = pred[:, None]
elif pred.shape[1] == N:
pred = pred.T
invV = np.linalg.inv(V)
if A is None:
M = invV
else:
U = invV.dot(A)
if A.ndim == 1:
M = invV - np.outer(U, U.T) / U.dot(A)
else:
M = invV - U.dot(np.linalg.inv(U.T.dot(A)).dot(U.T))
h = M.dot(pred)
if constant_risk:
h = h / np.sqrt(np.diag(h.T.dot(V.dot(h))))
return h.T
class MeanVariance(BaseEstimator):
    def __init__(self, transform_V=None, A=None, constant_risk=True):
        if transform_V is None:
            self.transform_V = lambda x: np.cov(x.T)
        else:
            self.transform_V = transform_V
        self.A = A
        self.constant_risk = constant_risk

    def fit(self, X, y=None):
        # the covariance is estimated on the target returns
        self.V_ = self.transform_V(y)
        return self

    def predict(self, X):
        # (the bodies of fit/predict were lost in extraction; this is a minimal
        # reconstruction using the cash-neutral constraint by default)
        if self.A is None:
            T, N = X.shape
            A = np.ones(N)
        else:
            A = self.A
        h = compute_batch_holdings(X, self.V_, A, constant_risk=self.constant_risk)
        return h
class Backtester:
    def __init__(
        self,
        estimator,
        ret,
        max_train_size=36,
        test_size=1,
        start_date="1945-01-01",
        end_date=None,
    ):
        self.start_date = start_date
        self.end_date = end_date
        self.estimator = estimator
        self.ret = ret[: self.end_date]
        self.cv = TimeSeriesSplit(
            max_train_size=max_train_size,
            test_size=test_size,
            n_splits=1 + len(ret.loc[start_date:end_date]) // test_size,
        )

    def train(self, features, target):
        # (the body of `train` was lost in extraction; this reconstruction
        # follows the convention pnl_t = h_{t-1} x r_t used in the text)
        self.h_ = fit_predict(self.estimator, features, target, self.ret, self.cv)
        self.pnl_ = self.h_.shift(1).mul(self.ret).sum(axis=1)[self.start_date :]
        return self
def _fit_predict(estimator, features, target, train, test, return_estimator=False):
    # helper: fit on the train indices and predict on the test indices
    # (the original definition was lost in extraction; this is a reconstruction)
    X_train, y_train = _safe_split(estimator, features, target, train)
    X_test, _ = _safe_split(estimator, features, target, test, train)
    estimator.fit(X_train, y_train)
    if return_estimator:
        return estimator.predict(X_test), estimator
    return estimator.predict(X_test)


def fit_predict(
    estimator,
    features,
    target,
    ret,
    cv,
    return_estimator=False,
    verbose=0,
    pre_dispatch="2*n_jobs",
    n_jobs=1,
):
    parallel = Parallel(n_jobs=n_jobs, verbose=verbose, pre_dispatch=pre_dispatch)
    res = parallel(
        delayed(_fit_predict)(
            clone(estimator), features, target, train, test, return_estimator
        )
        for train, test in cv.split(ret)
    )
    if return_estimator:
        pred, estimators = zip(*res)
    else:
        pred = res
    cols = ret.columns
    idx = ret.index[np.concatenate([test for _, test in cv.split(ret)])]
    if return_estimator:
        return pd.DataFrame(np.concatenate(pred), index=idx, columns=cols), estimators
    return pd.DataFrame(np.concatenate(pred), index=idx, columns=cols)
Overwriting ../ml4pmt/backtesting.py
[10]: T, N = ret.shape
A = np.ones(N)
Given that the data is monthly, we re-estimate the model every month. This is done by setting the parameter n_splits of the TimeSeriesSplit class to the number of months.
cv = TimeSeriesSplit(**params)

• gap could be set to −1 because the model can have at most the same information as when making the portfolio decision for the first date in the test window

• n_splits is determined so that the backtest starts (just) before a certain start date.
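On a stand-in array of 60 monthly observations (hypothetical sizes, chosen only to make the splits easy to read), the rolling windows look as follows:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

ret = np.zeros((60, 2))  # stand-in for 60 months of returns on 2 assets

# rolling 36-month train window, one test month per split
cv = TimeSeriesSplit(max_train_size=36, test_size=1, n_splits=12)
splits = list(cv.split(ret))

train, test = splits[-1]
print(len(train), list(test))  # 36 [59]: the last month is predicted from the previous 36
```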
[20]: _h = []
for train, test in cv.split(ret):
    m = MeanVariance()
    m.fit(features[train], target[train])
    _h += [m.predict(features[test])]
cols = ret.columns
idx = ret.index[np.concatenate([test for _, test in cv.split(ret)])]
h = pd.DataFrame(np.concatenate(_h), index=idx, columns=cols)
We can also extract information from the estimators – e.g. in this simple case, recover the covariance matrix fitted by the MeanVariance() class.

The “off-the-top" approach is to remove an asset from the tradable set and check whether the portfolio sharpe ratio decreases (in which case the asset is a contributor) or increases (in which case the asset is a detractor).
[30]: pnls_ott = {}
for c in ret.columns:
    ret_ = ret.drop(c, axis=1)
    features_ = transform_X(ret_)
    target_ = transform_y(ret_)
    pnl_ = Backtester(estimator=MeanVariance(), ret=ret_).train(features_, target_).pnl_
    pnls_ott[c] = pnl_.pipe(sharpe_ratio)
pnls_ott["ALL"] = pnl.pipe(sharpe_ratio)
Chapter 3

Linear estimators
In this section, we take advantage of some of scikit-learn's powerful features, such as the pipeline.
3.1 Ridge / Lasso / Elastic net

Ridge regression provides more stable and accurate estimates than standard residual-sum-of-squares minimization.

Lasso regression: the betas $\langle \beta_1, \dots, \beta_p \rangle$ are chosen to minimize

$$\frac{1}{2} \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|.$$
The Lasso tends to promote sparse and stable models that can be more easily interpretable.
Elastic net: the betas $\langle \beta_1, \dots, \beta_p \rangle$ are chosen to minimize

$$\frac{1}{2} \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \right)^2 + \lambda \sum_{j=1}^{p} \left[ (1 - \alpha) \beta_j^2 + \alpha |\beta_j| \right].$$
“The lasso penalty is not very selective in the choice among a set of strong but correlated predictors, and the ridge penalty is inclined to shrink the coefficients of correlated variables towards each other. The compromise in the elastic net could cause the highly correlated features to be averaged while encouraging a parsimonious model."
To give an example, we use a diabetes dataset provided by sklearn.
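A minimal sketch on this dataset (the alpha value below is an arbitrary choice for illustration, not a value from the notebooks):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge

X, y = load_diabetes(return_X_y=True)

# ridge shrinks all coefficients towards zero; lasso sets some exactly to zero
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

n_zero = int((abs(lasso.coef_) < 1e-12).sum())
print(n_zero > 0)  # True: the lasso produces a sparse model
```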
3.2 Scikit-learn Pipeline and Multi-output

The API of scikit-learn is built around fit, predict and transform. More precisely, estimators like linear regression or Ridge regression can deal with multi-outputs, but do not have a transform method. We add this method to be able to use them within a pipeline.

from sklearn.linear_model import LinearRegression, Ridge, RidgeCV
from sklearn.multioutput import MultiOutputRegressor
from sklearn.neural_network import MLPRegressor
class LinearRegression(LinearRegression):
def transform(self, X):
return self.predict(X)
class Ridge(Ridge):
def transform(self, X):
return self.predict(X)
class RidgeCV(RidgeCV):
def transform(self, X):
return self.predict(X)
class MultiOutputRegressor(MultiOutputRegressor):
def transform(self, X):
return self.predict(X)
class MLPRegressor(MLPRegressor):
def transform(self, X):
return self.predict(X)
Overwriting ../ml4pmt/estimators.py
3.3 Predicting industry returns with linear models

ret = returns_data['Monthly']['Average_Value_Weighted_Returns'][:'1999']
[18]: pnls_ = {}
for hl in tqdm([6, 12, 24]):
    features_ = ret.ewm(halflife=hl).mean().fillna(0).values
    pnls_[hl] = Backtester(estimator, ret).train(features_, target).pnl_
line(pnls_, cumsum=True, title='Robustness on feature half-lives')
[19]: pnls_ = {}
for hl in [6, 12, 24]:
    features_ = ret.rolling(window=hl).mean().fillna(0).values
    pnls_[hl] = Backtester(estimator, ret).train(features_, target).pnl_
line(pnls_, cumsum=True, title='Robustness on features with rolling windows')
3.3.2 Ridge
[22]: pnls_ = {}
for alpha in [0.1, 1, 10, 100]:
    estimator_ = make_pipeline(StandardScaler(with_mean=False), Ridge(alpha=alpha), MeanVariance())
    # (the rest of this cell was lost in extraction; reconstructed to mirror cell [27])
    pnls_[alpha] = Backtester(estimator_, ret).train(features_, target).pnl_
line(pnls_, cumsum=True, title='Ridge: Robustness on alpha')
3.3.3 Ridge with feature expansion

We can expand the set of features by using polynomial transforms with PolynomialFeatures.
34 CHAPTER 3. LINEAR ESTIMATORS
[23]: PolynomialFeatures(degree=2).fit_transform(ret.iloc[:10]).shape

Number of new features: intercept (=1), initial features (=12), squared features (=12), all cross products of two distinct features (=6 × 11 = 66):
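The count can be checked directly (on a stand-in zero matrix with 12 columns rather than the actual returns):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.zeros((5, 12))  # stand-in for 12 industry features
n_out = PolynomialFeatures(degree=2).fit_transform(X).shape[1]
print(n_out)  # 91 = 1 + 12 + 12 + 66
```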
[27]: pnls_ = {}
for alpha in [0.1, 1, 10, 100, 1000]:
    estimator_ = make_pipeline(StandardScaler(with_mean=False),
                               PolynomialFeatures(degree=2),
                               Ridge(alpha=alpha),
                               MeanVariance())
    pnls_[alpha] = Backtester(estimator_, ret).train(features_, target).pnl_
line(pnls_, cumsum=True, title='Ridge with feature expansion: Robustness on alpha')
Putting all the types of linear predictors together, we can compare the cumulative pnls in the
graph below.
Chapter 4

Appendix: Helper functions

Here we create some helper functions that will be used across notebooks using the magic %%writefile.
4.1 Metrics
import numpy as np
import pandas as pd


def test_monthly(df):
    return int(len(df) / len(df.asfreq("M"))) == 1


def test_bday(df):
    return int(len(df) / len(df.asfreq("B"))) == 1


def test_day(df):
    return int(len(df) / len(df.asfreq("D"))) == 1


def sharpe_ratio(df, num_period_per_year=None):
    # (this function was partially lost in extraction; the frequency detection
    # and DataFrame branch below are a reconstruction around the surviving
    # `else` branch)
    if num_period_per_year is None:
        if test_monthly(df):
            num_period_per_year = 12
        elif test_bday(df):
            num_period_per_year = 260
        elif test_day(df):
            num_period_per_year = 365
    if isinstance(df, pd.DataFrame):
        return df.apply(sharpe_ratio)
    else:
        return df.mean() / df.std() * np.sqrt(num_period_per_year)
Overwriting ../ml4pmt/metrics.py
4.2 Data visualisation

Data exploration, in particular based on visualisation, is crucial to modern data science. Pandas has a lot of plotting functionality (e.g. see the graph below), but we will find it useful to use a custom set of plotting functions.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from ml4pmt.metrics import sharpe_ratio

plt.style.use("seaborn-whitegrid")
def line(
df,
sort=True,
figsize=(8, 5),
ax=None,
title="",
cumsum=False,
loc="center left",
bbox_to_anchor=(1, 0.5),
legend_sharpe_ratio=None,
legend=True,
yscale=None,
start_date=None,
):
if loc == "best":
bbox_to_anchor = None
if isinstance(df, dict):
df = pd.concat(df, axis=1)
if isinstance(df, pd.Series):
df = df.to_frame()
if start_date is not None:
df = df[start_date:]
if cumsum & (legend_sharpe_ratio is None):
legend_sharpe_ratio = True
if legend_sharpe_ratio:
df.columns = [f"{c}: sr={sharpe_ratio(df[c]): 3.2f}" for c in df.columns]
if cumsum:
df = df.cumsum()
if sort:
df = df.loc[:, lambda x: x.iloc[-1].sort_values(ascending=False).index]
if ax is None:
fig, ax = plt.subplots(1, 1, figsize=figsize)
ax.plot(df.index, df.values)
if legend:
ax.legend(df.columns, loc=loc, bbox_to_anchor=bbox_to_anchor)
ax.set_title(title)
if yscale == "log":
ax.set_yscale("log")
def bar(
df,
err=None,
sort=True,
figsize=(8, 5),
ax=None,
title="",
horizontal=False,
baseline=None,
rotation=0,
):
if isinstance(df, pd.DataFrame):
df = df.squeeze()
if isinstance(df, dict):
df = pd.Series(df)
if sort:
df = df.sort_values()
if err is not None:
err = err.loc[df.index]
labels = df.index
x = np.arange(len(labels))
if ax is None:
fig, ax = plt.subplots(1, 1, figsize=figsize)
if horizontal:
ax.barh(x, df.values, xerr=err, capsize=5)
ax.set_yticks(x)
ax.set_yticklabels(labels, rotation=0)
if baseline in df.index:
df_ = df.copy()
df_[df.index != baseline] = 0
ax.barh(x, df_.values, color="lightgreen")
else:
ax.bar(x, df.values, yerr=err, capsize=5)
ax.set_xticks(x)
        ax.set_xticklabels(labels, rotation=rotation)
if baseline in df.index:
df_ = df.copy()
df_[df.index != baseline] = 0
ax.bar(x, df_.values, color="lightgreen")
ax.set_title(title)
def heatmap(
df,
ax=None,
figsize=(8, 5),
title="",
vmin=None,
vmax=None,
vcompute=True,
cmap="RdBu",
):
labels_x = df.index
x = np.arange(len(labels_x))
labels_y = df.columns
y = np.arange(len(labels_y))
if vcompute:
vmax = df.abs().max().max()
vmin = -vmax
if ax is None:
fig, ax = plt.subplots(1, 1, figsize=figsize)
pos = ax.imshow(
df.T.values, cmap=cmap, interpolation="nearest", vmax=vmax, vmin=vmin
)
ax.set_xticks(x)
ax.set_yticks(y)
ax.set_xticklabels(labels_x, rotation=90)
ax.set_yticklabels(labels_y)
ax.set_title(title)
ax.grid(True)
    ax.figure.colorbar(pos, ax=ax)
def scatter(
df,
xscale=None,
yscale=None,
xlabel=None,
ylabel=None,
xticks=None,
yticks=None,
figsize=(8, 5),
title=None,
):
fig, ax = plt.subplots(1, 1, figsize=figsize)
ax.scatter(df, df.index, facecolors="none", edgecolors="b", s=50)
if xlabel is not None:
ax.set_xlabel(xlabel)
if ylabel is not None:
ax.set_ylabel(ylabel)
if xscale is not None:
ax.set_xscale(xscale)
if yscale is not None:
ax.set_yscale(yscale)
if yticks is not None:
ax.set_yticks(yticks)
ax.set_yticklabels(yticks)
if xticks is not None:
ax.set_xticks(xticks)
ax.set_xticklabels(xticks)
ax.set_title(title)
Overwriting ../ml4pmt/plot.py
4.3 Dataset
• Berkshire Hathaway
import logging
import os
import sys
from io import BytesIO
from pathlib import Path
from zipfile import ZipFile
import numpy as np
import pandas as pd
import requests
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logger = logging.getLogger(__name__)
"""
idx = [-2] + list(np.where(df.notna().sum(axis=1) == 0)[0])
if multi_df:
cols = [" Average Value Weighted Returns -- Monthly"] + list(
df.loc[df.notna().sum(axis=1) == 0].index
)
returns_data = {"Annual": {}, "Monthly": {}}
for i in range(len(idx)):
if multi_df:
c_ = (
cols[i]
.replace("-- Annual", "")
.replace("-- Monthly", "")
.strip()
.replace("/", " ")
.replace(" ", "_")
)
if i != len(idx) - 1:
v = df.iloc[idx[i] + 2 : idx[i + 1] - 1].astype(float)
v.index = v.index.str.strip()
if len(v) != 0:
if len(v.index[0]) == 6:
v.index = pd.to_datetime(v.index, format="%Y%m")
if multi_df:
returns_data["Monthly"][c_] = v
else:
44 CHAPTER 4. APPENDIX: HELPER FUNCTIONS
returns_data["Monthly"] = v
continue
if len(v.index[0]) == 4:
v.index = pd.to_datetime(v.index, format="%Y")
if multi_df:
returns_data["Annual"][c_] = v
else:
returns_data["Annual"] = v
return returns_data
def load_kf_returns(
filename="12_Industry_Portfolios", cache_dir=None, force_reload=False
):
"""
load industry returns for Ken French website:
https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html
"""
if filename == "12_Industry_Portfolios":
skiprows, multi_df = 11, True
if filename == "F-F_Research_Data_Factors":
skiprows, multi_df = 3, False
if filename == "F-F_Momentum_Factor":
skiprows, multi_df = 13, False
if cache_dir is None:
cache_dir = Path(os.getcwd()) / "data"
if isinstance(cache_dir, str):
cache_dir = Path(cache_dir)
output_dir = cache_dir / filename
if (output_dir.is_dir()) & (~force_reload):
logger.info(f"logging from cache directory: {output_dir}")
returns_data = load_dict(output_dir)
else:
logger.info("loading from external source")
path = (
"http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/"
+ filename
+ "_CSV.zip"
)
r = requests.get(path)
files = ZipFile(BytesIO(r.content))
# (the def line and cache-checking branch of this loader were lost in
# extraction; they are reconstructed here following the pattern of
# load_kf_returns, and the cache filename is hypothetical)
def load_buffets_data(cache_dir=None, force_reload=False):
    """
    load the Berkshire Hathaway returns used in Frazzini, Kabiller and
    Pedersen's "Buffett's alpha"
    """
    if cache_dir is None:
        cache_dir = Path(os.getcwd()) / "data"
    if isinstance(cache_dir, str):
        cache_dir = Path(cache_dir)
    filename = cache_dir / "ffdata_brk13f.parquet"
    if filename.is_file() and not force_reload:
        logger.info(f"loading from cache file: {filename}")
        df = pd.read_parquet(filename)
    else:
        logger.info("loading from external source")
        path = "https://github.com/slihn/buffetts_alpha_R/archive/master.zip"
        r = requests.get(path)
        files = ZipFile(BytesIO(r.content))
        df = pd.read_csv(
            files.open("buffetts_alpha_R-master/ffdata_brk13f.csv"), index_col=0
        )
        df.index = pd.to_datetime(df.index, format="%m/%d/%Y")
        logger.info(f"saving in cache directory (unknown)")
        df.to_parquet(filename)
    return df
symbol_dict = {
"TOT": "Total",
"XOM": "Exxon",
"CVX": "Chevron",
"COP": "ConocoPhillips",
"VLO": "Valero Energy",
"MSFT": "Microsoft",
"IBM": "IBM",
"TWX": "Time Warner",
"CMCSA": "Comcast",
"CVC": "Cablevision",
"YHOO": "Yahoo",
"DELL": "Dell",
"HPQ": "HP",
"AMZN": "Amazon",
"TM": "Toyota",
"CAJ": "Canon",
"SNE": "Sony",
"F": "Ford",
"HMC": "Honda",
"NAV": "Navistar",
"NOC": "Northrop Grumman",
"BA": "Boeing",
"KO": "Coca Cola",
"MMM": "3M",
"MCD": "McDonald's",
"PEP": "Pepsi",
"K": "Kellogg",
"UN": "Unilever",
"MAR": "Marriott",
"PG": "Procter Gamble",
"CL": "Colgate-Palmolive",
"GE": "General Electrics",
"WFC": "Wells Fargo",
"JPM": "JPMorgan Chase",
"AIG": "AIG",
"AXP": "American express",
"BAC": "Bank of America",
"GS": "Goldman Sachs",
"AAPL": "Apple",
"SAP": "SAP",
"CSCO": "Cisco",
"TXN": "Texas Instruments",
"XRX": "Xerox",
"WMT": "Wal-Mart",
"HD": "Home Depot",
"GSK": "GlaxoSmithKline",
"PFE": "Pfizer",
"SNY": "Sanofi-Aventis",
"NVS": "Novartis",
"KMB": "Kimberly-Clark",
"R": "Ryder",
"GD": "General Dynamics",
"RTN": "Raytheon",
"CVS": "CVS",
"CAT": "Caterpillar",
"DD": "DuPont de Nemours",
}
# (the def line and cache-checking branch of this loader were lost in
# extraction; they are reconstructed here following the pattern of the other
# loaders, and the cache filename is hypothetical)
def load_sklearn_stock_returns(cache_dir=None, force_reload=False):
    """
    load the stock close-price differences from the scikit-learn
    `examples-data` repository
    """
    if cache_dir is None:
        cache_dir = Path(os.getcwd()) / "data"
    if isinstance(cache_dir, str):
        cache_dir = Path(cache_dir)
    filename = cache_dir / "sklearn_returns.parquet"
    if filename.is_file() and not force_reload:
        logger.info(f"loading from cache file: {filename}")
        df = pd.read_parquet(filename)
    else:
        logger.info("loading from external source")
        url = (
            "https://raw.githubusercontent.com/scikit-learn/examples-data/"
            "master/financial-data"
        )
        df = (
            pd.concat(
                {
                    c: pd.read_csv(
                        f"{url}/{c}.csv", index_col=0, parse_dates=True
                    )["close"].diff()
                    for c in symbol_dict.keys()
                },
                axis=1,
            )
            .asfreq("B")
            .iloc[1:]
        )
        df.to_parquet(filename)
    return df
def load_dict(input_dir):
    data = {}
    for o in os.scandir(input_dir):
        if o.name.endswith(".parquet"):
            k = o.name.replace(".parquet", "")
            data[k] = pd.read_parquet(o)
        elif o.is_dir():
            data[o.name] = load_dict(o)
    return data
Overwriting ../ml4pmt/dataset.py
[10]: %%time
returns_data = load_kf_returns(
filename="12_Industry_Portfolios", cache_dir="data", force_reload=True
)
[11]: %%time
returns_data = load_kf_returns(
filename="12_Industry_Portfolios", cache_dir="data", force_reload=False
)
[14]: %%time
returns_data = load_sklearn_stock_returns(cache_dir="data", force_reload=True)
df = returns_data.pipe(sharpe_ratio).rename(index=symbol_dict).sort_values()\
    .pipe(lambda x: pd.concat([x.head(), x.tail()]))
bar(df, horizontal=True, title=f'Annualized stock sharpe ratio: {start_date} to {end_date}')
[16]: %%time
df = load_buffets_data(cache_dir="data", force_reload=True)
CPU times: user 60.8 ms, sys: 6.95 ms, total: 67.8 ms
Wall time: 914 ms
Bibliography
J.-P. Bouchaud, J. Bonart, J. Donier, and M. Gould. Trades, quotes and prices: financial markets under
the microscope. Cambridge University Press, 2018.
A. Frazzini, D. Kabiller, and L. H. Pedersen. Buffett’s alpha. Financial Analysts Journal, 74(4):35–55,
2018.
T. Hastie, R. Tibshirani, and J. H. Friedman. The elements of statistical learning: data mining, inference, and prediction, volume 2. Springer, 2009.
M. Isichenko. Quantitative Portfolio Management: The Art and Science of Statistical Arbitrage. Wiley,
2022.
S. Jansen. Machine Learning for Algorithmic Trading: Predictive models to extract signals from market
and alternative data for systematic trading strategies with Python. Packt Publishing Ltd, 2020.
O. Ledoit and M. Wolf. Honey, I shrunk the sample covariance matrix. The Journal of Portfolio Management, 30(4):110–119, 2004.
A. W. Lo. The statistics of Sharpe ratios. Financial Analysts Journal, 58(4):36–52, 2002.
A. W. Lo and A. C. MacKinlay. When are contrarian profits due to stock market overreaction? The Review of Financial Studies, 3(2):175–205, 1990.
T. J. Moskowitz and M. Grinblatt. Do industries explain momentum? The Journal of Finance, 54(4):1249–1290, 1999.