Quant Developers' Tools and Techniques: Quant Books, #2
About this ebook
QUANT DEVELOPERS' TOOLS AND TECHNIQUES Series
A comprehensive series of books enabling QUANT developers to succeed in this ever-changing and challenging line of business.
The book series is intended for finance professionals, such as controllers, equity analysts, quantitative developers, data scientists, and programmers, as well as anyone interested in trading and winning in the financial markets.
If you plan to become a QUANT developer, the first volume is the prelude to a comprehensive series, presenting and demonstrating the tools and techniques required to succeed in this new data-driven world, such as:
Volume 1
Statistics, Visualization, Simple- & Multiple Linear Regression
...build up the basis for a solid understanding of the covered parts of the QUANT body of knowledge,
in a learning-by-example fashion.
Volume 2
Multiple- & Logistic Regression, Bayes Theorem & Forecasting, Ratios & Risk measures, Monte Carlo Simulation
...further develop your advanced understanding of the tools and techniques required to succeed
in the QUANT profession, again in a learning-by-example format.
Volume 3
SQLite Database Programming, Pandas DataFrames, Time Series Analysis, Autoregression Models
...further broadens your understanding of the tools and techniques required to succeed
in the QUANT profession, again in a learning-by-example format.
Volume 4
Data Management, Clustering & Forecasting Techniques, Stochastic Programming, QUANT Tools, Microstructure Analysis, Trading Strategy Development, Neural Network- & Machine Learning Techniques
...lay the foundation for a very lucrative career in QUANT development. Learning by example is, in my experience, the most effective way to learn and master this complex material, which is essential for the QUANT profession. Becoming a QUANT pro requires an in-depth mastery of the QUANT body of knowledge in order to excel in this ever-evolving and challenging profession (release: TBD).
Book preview
Quant Developers' Tools and Techniques - Manfred Hindering
1. Univariate Analysis
Univariate analysis refers to the study of data observation concerning one single attribute.
As indicated in Wikipedia [2023], there are three common ways to perform univariate analysis on datasets consisting of a single variable:
Summary statistics : measure the center and spread of the values.
Frequency table : describes how often different values occur.
Charts : visualize the distribution and patterns of the values.
In the following Python code segment, we conduct a univariate analysis on a small dataset to demonstrate the process.
import pandas as pd

# DataFrame
df = pd.DataFrame({
    'A': [1, 1, 2, 3.5, 4, 4, 4, 5, 5, 6.5, 7, 7.4, 8, 13, 14.2],
    'B': [5, 7, 7, 9, 12, 9, 9, 4, 6, 8, 8, 9, 3, 2, 6],
    'C': [11, 8, 10, 6, 6, 5, 9, 12, 6, 6, 7, 8, 7, 9, 15]})

# inspect
print('dataset:')
print(df)
print(df.shape)
dataset:
A B C
0 1.0 5 11
1 1.0 7 8
2 2.0 7 10
3 3.5 9 6
4 4.0 12 6
5 4.0 9 5
6 4.0 9 9
7 5.0 4 12
8 5.0 6 6
9 6.5 8 6
10 7.0 8 7
11 7.4 9 8
12 8.0 3 7
13 13.0 2 9
14 14.2 6 15
(15, 3)
1.1. Calculate summary statistics
The mean() and std() functions, which are available on the Pandas DataFrame object, are applied. They have the following signatures:
df['col'].mean()
df['col'].std()
They compute the expected value or mean \mu and the standard deviation \sigma of the specified data column. In Python, we implement:
import numpy as np

# statistical moments
mu = np.float64(df['A'].mean())
std = np.float64(df['A'].std())

# results
print('type:', type(mu))
print(f"mean : {mu:.4f}")
print(f"std  : {std:.4f}")
type: <class 'numpy.float64'>
mean : 5.7067
std : 3.8583
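As a quick cross-check, the same moments can be computed with NumPy directly. This is a minimal sketch; note that pandas' std() uses the sample formula (ddof=1), while NumPy's np.std() defaults to the population formula (ddof=0), so the ddof argument must be set explicitly to reproduce the value above.
import numpy as np

# cross-check: pandas std() is the sample standard deviation (ddof=1)
print(f"mean : {np.mean(df['A']):.4f}")            # matches df['A'].mean()
print(f"std  : {np.std(df['A'], ddof=1):.4f}")     # matches df['A'].std()
print(f"std  : {np.std(df['A']):.4f} (population, ddof=0)")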
For visualization purposes, the plot() and hist() functions available on the DataFrame object can be applied. They have the following signatures:
df['A'].plot(ax=ax)
df['A'].hist(ax=ax)
Both plots are generated in the previously defined axes object called ax.
import matplotlib.pyplot as plt
# plot
fig, ax = plt.subplots(figsize=(5,3))
df['A'].plot(ax=ax)
df['A'].hist(ax=ax)
plt.title('cumulative line chart and histogram')
plt.xlabel('count')
plt.ylabel('value column A')
plt.show()
[figure: cumulative line chart and histogram of column A]
1.2. Compute statistical moments
To learn more about the given dataset, a full set of statistical moments is computed. The following statistical moments provide a condensed view of the dataset. The respective functions are all available on the Pandas DataFrame object.
mean() : the average value of all observations
median() : the value in the middle of the sorted observations
mode() : the most frequently occurring value in the histogram
std() : the standard deviation, or spread, of the observations
skew() : a measure of how symmetrical the histogram is
kurtosis() : a measure of the peakedness of the histogram
In the following Python code segment, their usage is demonstrated. The statistical measures are collected in a Python dictionary data structure for further processing and reporting.
# compute statistical moments
res = dict()
res['mean'] = df['A'].mean()
res['std'] = df['A'].std()
res['median'] = df['A'].median()
res['mode'] = df['A'].mode()[0]  # first value is the mode
res['skew'] = df['A'].skew()
res['kurtosis'] = df['A'].kurtosis()

# inspect
print('moments:')
for k, v in res.items():
    print(f'{k.ljust(8)} : {v:.4f}')
moments:
mean : 5.7067
std : 3.8583
median : 5.0000
mode : 4.0000
skew : 1.0538
kurtosis : 0.8386
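The same moments can also be computed for all columns at once with the agg() function of the DataFrame. A minimal sketch (mode() is left out here, since it may return several values per column):
# all columns at once; the result's index holds the moment names
moments = df.agg(['mean', 'std', 'median', 'skew', 'kurtosis'])
print(moments.round(4))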
1.3. Create a table with value counts
In frequency analysis, counting the values in a given column of the DataFrame gives a more in-depth insight into the properties of the dataset. For that purpose, we make use of the value_counts() function, provided by the Pandas DataFrame object, to produce such a table. The function has the following signature:
df['A'].value_counts()
It returns the number of occurrences of each value within the specified data column.
The raw values for the frequency table are copied in as follows:
countsDf['A'] = df['A'].values
In the following Python code segment, we implement this idea:
# create frequency table
countsDf = pd.DataFrame()
countsDf['A'] = df['A'].values
countsDf['freq'] = df['A'].value_counts()  # Series, aligned on index
print(type(countsDf))  # DataFrame
print(countsDf)

# sorted, no NaN values
counts = df['A'].value_counts()  # Series
print(type(counts))
print('A count')
print(counts)
<class 'pandas.core.frame.DataFrame'>
A freq
0 1.0 NaN
1 1.0 2.0
2 2.0 1.0
3 3.5 NaN
4 4.0 3.0
5 4.0 2.0
6 4.0 NaN
7 5.0 1.0
8 5.0 1.0
9 6.5 NaN
10 7.0 NaN
11 7.4 NaN
12 8.0 NaN
13 13.0 1.0
14 14.2 NaN
<class 'pandas.core.series.Series'>
A count
4.0 3
1.0 2
5.0 2
2.0 1
3.5 1
6.5 1
7.0 1
7.4 1
8.0 1
13.0 1
14.2 1
Name: A, dtype: int64
Interpretation
As shown in the output above, the value 4 occurs 3 times, the value 1 occurs 2 times, the value 5 occurs 2 times, the value 2 occurs 1 time, and so on. These counts represent the frequency of the observations in the analyzed dataset.
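Note that the NaN entries in countsDf stem from index alignment: value_counts() is indexed by the observed values themselves, while countsDf uses a positional index. A minimal sketch of an alternative construction that avoids the NaN values:
# frequency table without alignment NaNs: reset the value index
# into a regular column and rename for clarity
freq = df['A'].value_counts().sort_index().reset_index()
freq.columns = ['A', 'freq']
print(freq)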
1.4. Visualize data with a Boxplot
The boxplot is a visualization in which the median, the quartiles, and the whisker-capped min and max values are presented in a single combined diagram element. Any outlier observations are shown as points beyond the whiskers.
By using the boxplot() function, which is available on the Pandas DataFrame object, we can visualize the data. The boxplot() function has the following signature:
boxplot = df.boxplot(column=['A'], grid=False, color='orange',
                     patch_artist=True, vert=True)
where:
column : specifies the data source column
grid : set to True, specifies the grid to be displayed
color : the color of the box lines
patch_artist : (True) filled box, (False) hollow box
vert : vertical orientation (True), horizontal (False)
It returns the boxplot axes and displays the plot in the next output cell.
To change the style of the produced box plots, we can employ pre-defined style sets. In the following section, we list the first 5 available styles:
import matplotlib.pyplot as plt
# list available styles
plt.style.available[:5]  # first 5 only
['Solarize_Light2',
'_classic_test_patch',
'_mpl-gallery',
'_mpl-gallery-nogrid',
'bmh']
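A style can also be applied temporarily with the plt.style.context() context manager, leaving the global defaults untouched. A minimal sketch:
import matplotlib.pyplot as plt

# apply the 'bmh' style only within the with-block
with plt.style.context('bmh'):
    fig, ax = plt.subplots(figsize=(5, 3))
    df['A'].hist(ax=ax)
    plt.title('histogram with bmh style')
    plt.show()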
import matplotlib.pyplot as plt
plt.style.use('bmh')

# plot
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(8,3))
boxplot1 = df.boxplot(column=['A'], grid=True, color='maroon',
                      patch_artist=False, vert=True, ax=ax1)
boxplot1.set_facecolor('lightblue')
boxplot2 = df.boxplot(column=['B'], grid=False, color='green',
                      patch_artist=False, vert=True, ax=ax2)
boxplot2.set_facecolor('lightyellow')
boxplot3 = df.boxplot(column=['C'], grid=True, color='red',
                      patch_artist=True, vert=True, ax=ax3)
boxplot3.set_facecolor('lightgreen')
plt.suptitle('box plot variations', fontsize=14)
plt.tight_layout()
plt.grid(True)
plt.show()

# inspect
print(df.describe())
[figure: box plot variations]
A B C
count 15.000000 15.000000 15.000000
mean 5.706667 6.933333 8.333333
std 3.858287 2.658320 2.742956
min 1.000000 2.000000 5.000000
25% 3.750000 5.500000 6.000000
50% 5.000000 7.000000 8.000000
75% 7.200000 9.000000 9.500000
max 14.200000 12.000000 15.000000
Interpretation
Column B:
The median value of 7.0 is shown as the horizontal bar within the green box.
The min value of 2.0 is represented by the whisker cap at the bottom of the chart.
The max value of 12.0 is represented by the whisker cap at the top of the chart.
The 25% percentile value of 5.5 is represented by the lower bound of the green box.
The 75% percentile value of 9.0 is represented by the upper bound of the green box.
There are no outlier observations to be indicated; the sketch below confirms this numerically.
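The absence of outliers can be verified with the 1.5·IQR rule, which is matplotlib's default whisker criterion: observations beyond 1.5 times the interquartile range from the box edges are flagged. A sketch of that check for column B:
# 1.5*IQR rule, matplotlib's default whisker criterion
q1, q3 = df['B'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df['B'][(df['B'] < lower) | (df['B'] > upper)]
print(f'fences   : [{lower}, {upper}]')   # [0.25, 14.25]
print('outliers :', list(outliers))       # empty -> none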
1.5. Visualize data with a Bar chart
The hist() function, provided by the DataFrame object, is used to create a histogram of the specified data column. The function has the following signature.
df.hist(column='A', color='lightgreen', edgecolor='red', ax=ax1)
where
column : the data in column A is to be used
color : the fill color of the bars (lightgreen in the signature above)
edgecolor : the frame of the bars is set to red
ax : set to ax1, the prepared axes object
The bar chart is created in the prepared axes object ax1.
In the following Python code segment, we create a bar chart and set its properties:
import matplotlib.pyplot as plt
# plot
fig, ax1 = plt.subplots(figsize=(5,3))
df.hist(column='A', color='orange', edgecolor='black', ax=ax1)
plt.grid(True)
plt.ylabel('count')
plt.xlabel('x')
plt.title('Histogram of column A data')
plt.show()

# inspect
print(type(df['A'].value_counts()))  # Series
print('A count')
print(df['A'].value_counts())
[figure: histogram of column A data]
<class 'pandas.core.series.Series'>
A count
4.0 3
1.0 2
5.0 2
2.0 1
3.5 1
6.5 1
7.0 1
7.4 1
8.0 1
13.0 1
14.2 1
Name: A, dtype: int64
1.6. Plot kernel density estimate
The kernel density estimate (KDE) refers to the estimated frequency distribution, computed from the sample data. In contrast, the probability density function (pdf), is computed from the entire population.
import seaborn as sns
# customize
sns.set_style('whitegrid')

# plot
fig, ax1 = plt.subplots(figsize=(5,3))
sns.kdeplot(df['A'], ax=ax1)  # kde
plt.title('kernel density estimate (KDE)')
plt.show()

# inspect
print(f"mu : {df['A'].mean():.4f}")
print(f"sigma : {df['A'].std():.4f}")
mu : 5.7067
sigma : 3.8583
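Under the hood, seaborn fits a Gaussian kernel density estimate. The same kind of estimate can be evaluated numerically with scipy.stats.gaussian_kde; note that the bandwidth selection may differ from seaborn's defaults, so the values are close but not necessarily identical. A minimal sketch:
from scipy.stats import gaussian_kde
import numpy as np

# evaluate a Gaussian KDE of column A on a small grid
kde = gaussian_kde(df['A'])
grid = np.linspace(df['A'].min(), df['A'].max(), 5)
for x0, d in zip(grid, kde(grid)):
    print(f'x = {x0:6.2f}   density = {d:.4f}')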
Note that in the next section, NumPy is used to generate random values for the population-based view.
1.7. Plot probability density function
Remember, the probability density function (pdf) is computed from the entire population.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats

# parameters
m = 10   # mean
s = 3    # standard deviation
N = 1000

# reproducibility
np.random.seed(1)
x = np.random.normal(m, s, N)
df = pd.DataFrame({'PDF': x})  # note: df is re-assigned here

# sort values
df.sort_values(by=['PDF'], inplace=True)
df_mean = np.mean(df['PDF'])
df_std = np.std(df['PDF'])
pdf = stats.norm.pdf(df['PDF'], df_mean, df_std)

# plot
fig = plt.figure(figsize=(5,3))
plt.plot(df['PDF'], pdf, c='maroon')
l = f'probability density function ($\\mu$={m}, $\\sigma$={s})'
plt.title(l)
plt.xlabel('x')
plt.ylabel('pdf(x)')
plt.grid(True)
plt.show()

# inspect
print(f'mu : {m:.4f}')
print(f'sigma : {s:.4f}')
mu : 10.0000
sigma : 3.0000
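For reference, stats.norm.pdf evaluates the closed-form normal density f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-(x-\mu)^2 / (2\sigma^2)}. A minimal sketch implementing the formula by hand and cross-checking it against scipy:
import numpy as np
from scipy import stats

def normal_pdf(x, mu, sigma):
    # closed-form Gaussian density
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

x0 = 10.0
print(normal_pdf(x0, 10, 3))        # by hand
print(stats.norm.pdf(x0, 10, 3))    # scipy, same value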
1.8. Boxplot examples
For demonstration purposes, a random dataset is generated to produce two variants of box plots:
rectangular boxplot
notched shaped boxplot
In the following Python code segment, the usage is demonstrated.
import matplotlib.pyplot as plt
import numpy as np

# reproducibility
np.random.seed(1)

# random numbers
all_data = [np.random.normal(0, std, size=100) for std in range(1, 4)]
labels = ['x1', 'x2', 'x3']

# plots
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2,
                               figsize=(8,3), sharey=True)

# rectangular box plot
bplot1 = ax1.boxplot(all_data,
                     vert=True,
                     patch_artist=True,
                     labels=labels)
ax1.set_title('Rectangular box plot')

# notch shape box plot
bplot2 = ax2.boxplot(all_data,
                     notch=True,
                     vert=True,
                     patch_artist=True,
                     labels=labels)
ax2.set_title('Notched box plot')

# fill with colors
colors = ['pink', 'lightblue', 'lightgreen']
for bplot in (bplot1, bplot2):
    for patch, color in zip(bplot['boxes'], colors):
        patch.set_facecolor(color)

# adding horizontal grid lines
for ax in [ax1, ax2]:
    ax.yaxis.grid(True)  # horizontal grid
    ax.set_xlabel('Three separate samples')
    ax.set_ylabel('Observed values')
plt.show()
[figure: rectangular and notched box plots]
2. Bivariate Analysis
The purpose of bivariate analysis is to gain an in-depth understanding of the relationship between two variables.
A bivariate analysis studies the relationship between two variables: one independent or regressor variable and one target or response variable. We evaluate the existence and strength of any linear relationship between the regressor and the target variable, and thereby quantify the impact of the independent variable on the target variable.
According to Wikipedia [2023], there are three common ways to perform bivariate analysis:
Scatter plots
Correlation analysis
Simple linear regression
2.1. Analyze Exam scores versus hours studied
The following example demonstrates how to perform each of these types of bivariate analysis. Suppose we attempt to study the impact of learning hours on the final exam score. For that purpose, a sample dataset is set up in Pandas DataFrame format, which contains information about two subject variables.
hours : studying hours spent in preparation for an exam
score : the exam score achieved
The dataset contains 20 data pairs of observations.
import pandas as pd
# create DataFrame
df = pd.DataFrame({
    'hours': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3,
              3, 4, 4, 5, 5, 6, 6, 6, 7, 8],
    'score': [75, 66, 68, 74, 78, 72, 85, 82, 90, 82,
              80, 88, 85, 90, 92, 94, 94, 88, 91, 96]})

# inspect
print('dataset:')
print(df.head(10))
dataset:
hours score
0 1 75
1 1 66
2 1 68
3 2 74
4 2 78
5 2 72
6 3 85
7 3 82
8 3 90
9 3 82
2.2. Scatter plots
The first step in data analytics is to visualize the dataset in a scatter plot. For that purpose, the scatter() function from the matplotlib.pyplot library is applied. It has the following signature:
plt.scatter(df.hours, df.score)
where
df.hours : denotes the independent variable, displayed on the x-axis of the chart
df.score : denotes the target variable, displayed on the y-axis of the chart
resulting in the desired scatter diagram.
In the following Python code segment, this is shown and the result is interpreted.
import matplotlib.pyplot as plt
# plot
fig = plt.figure(figsize=(8,4))
plt.scatter(df.hours, df.score)
plt.title('Hours studied vs. Exam Score')
plt.xlabel('Hours studied')
plt.ylabel('Exam Score')
plt.grid(True)
plt.show()
[figure: scatter plot of hours studied vs. exam score]
Interpretation
On the x-axis, the independent variable hours studied is placed, and on the y-axis, we show the exam score achieved as the target variable. From the plot, we can deduce that there is a somewhat positive relationship between the two variables.
As the hours studied increase, the exam score increases as well, which is suggested by the positive linear relationship between the regressor and response variable.
A more quantitative assessment of the direction and strength of this relationship is to be conducted.
2.3. Compute correlation coefficient
The Pearson correlation coefficient is a way to quantify the linear relationship between variables. To demonstrate this, we use the corr() function, which is available on the Pandas DataFrame object. It has the following signature:
df.corr()
It returns a correlation matrix.
We might want to bring up the help information for the corr() function.
# help(df.corr)
# perform correlation analysis
res = df.corr()

# inspect
print(type(res))
print('correlation:')
print(res)
<class 'pandas.core.frame.DataFrame'>
correlation:
hours score
hours 1.000000 0.891306
score 0.891306 1.000000
Interpretation
The returned correlation matrix reports a Pearson correlation coefficient r of 0.8913 on the off-diagonal elements. This indicates a strong positive relationship between Hours studied and the expected exam Score.
Note that the diagonal elements are reported as 1.0, which indicates a perfect positive correlation. This is expected, since each variable, for example Hours, is perfectly correlated with itself, and likewise Score.
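The reported coefficient can be reproduced from the defining formula r = cov(x, y) / (s_x \cdot s_y), using the sample covariance and sample standard deviations (ddof=1 throughout). A minimal sketch:
import numpy as np

# Pearson r from its definition, cross-checked against df.corr()
cov_xy = np.cov(df['hours'], df['score'])[0, 1]    # sample covariance
r = cov_xy / (df['hours'].std() * df['score'].std())
print(f"r by hand : {r:.6f}")
print(f"r pandas  : {df.corr().loc['hours', 'score']:.6f}")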
2.4. Simple Linear Regression
Simple linear regression is a statistical method to quantify the strength of the linear relationship between two variables.
We can use the OLS() class from the statsmodels package to fit a simple linear regression model for hours studied and exam scores received, resulting in this model.
Simple Linear Regression:
\hat{y} = \beta_0 + \beta_1 \cdot Hours + \epsilon
where
\hat{y} : dependent variable
\beta_0 : intercept coefficient
\beta_1 : Hours coefficient
Hours : independent variable
\epsilon : residual term
The OLS() class is used with the y and x parameters specified. The fit() method is applied to the specified model to obtain the parameter estimates \beta_i for the regression model. The call signature becomes:
model = sm.OLS(y, x).fit()
where
y : the dependent or target variable
x : the regressor or independent variable
results:
model : the model object with the fitted parameter estimates
The following Python code segment demonstrates this approach.
import statsmodels.api as sm
# help(sm.OLS)
from quantlib import print_img

# specify model
y = df['score']          # target
x = df[['hours']]        # regressor
x = sm.add_constant(x)   # add const ('const' becomes the first column)

# fit model
model = sm.OLS(y, x).fit()

# results
print_img(model.summary())
[figure: OLS regression summary]
Interpretation
The fitted regression equation turns out to be
Score = 69.07 + 3.85 \cdot Hours + \epsilon
This equation indicates that each additional Hour studied is associated with an average increase of 3.8471 in the exam Score. We can also use the fitted regression equation to predict the Score a student receives based on the Hours studied.
When, for example, a student studies a total of 3 Hours for a particular exam, what is the best estimate for the final Score? Inserting the values into the regression equation gives Score = 69.07 + 3.85 \cdot 3 = 80.62 . That is, a student who studies for 3 Hours is predicted to receive a Score of about 80.62 in the final exam.
Beware of the limitations of this linear regression model by considering an example where the student studies for 15 hours, resulting in Score = 69.07 + 3.85 \cdot 15 = 126.82 . A score above 100 is not realistic, since the relationship between Hours studied and the resulting Score is - in reality - asymptotic rather than linear. The sketch below reproduces both predictions from the fitted model.
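As a cross-check, the following sketch computes both predictions with the fitted model object from the previous code segment; it assumes model and the column layout of the design matrix ('const' first, then 'hours') are still in scope. With the unrounded coefficients the predictions come out at about 80.61 and 126.78.
import pandas as pd

# exog must match the training design matrix: 'const' first, then 'hours'
new = pd.DataFrame({'const': [1.0, 1.0], 'hours': [3, 15]})
print(model.predict(new))   # ~80.61 for 3 hours, ~126.78 for 15 hours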
The coefficient of determination R^2 is reported as 0.7944, which tells us that the estimated regression equation explains about 79% of the variation in the exam scores.
A more thorough analysis of the model's adequacy is required to arrive at a model that can be trusted for important predictions.
3. Linear Regression on Spector
This chapter considers a linear regression model with independently and identically distributed (IID) errors; see D. Montgomery [1992]. When the error terms exhibit either heteroskedasticity or autocorrelation, an estimate of the regression parameters can be obtained through:
OLS : ordinary least squares, for i.i.d. errors
WLS : weighted least squares, for heteroskedastic errors
GLS : generalized least squares, for arbitrary covariance types
AR(p) : generalized least squares, for autocorrelated errors
The statsmodels library provides four different classes.
GLS : generalized least squares for arbitrary covariance
OLS : ordinary least squares for i.i.d. errors
WLS : weighted least squares for heteroskedastic errors
GLSAR : feasible generalized least squares with autocorrelated AR(p) errors
All regression models define the same methods and follow the same structure, as Greene [2003] points out. In this way, such models can be used interchangeably, although some of the models provide additional model-specific methods and attributes.
The GLS model is a superclass of the other regression classes except for RecursiveLS, RollingWLS, and RollingOLS. In the following code segments, these options are explored.
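As a small, self-contained illustration of why the estimator choice matters, the following sketch fits OLS and WLS to synthetic heteroskedastic data (all numbers here are illustrative, not from the book). Both estimators recover the true coefficients on average, but WLS weights each observation by the inverse error variance and is the efficient choice in this setting.
import numpy as np
import statsmodels.api as sm

# synthetic data: error spread grows with the regressor
rng = np.random.default_rng(1)
n = 200
X = sm.add_constant(rng.uniform(0, 10, n))     # columns: const, x
sigma = 0.5 + 0.5 * X[:, 1]                    # heteroskedastic scale
y = X @ np.array([2.0, 1.5]) + rng.normal(0, sigma)

ols_res = sm.OLS(y, X).fit()                          # i.i.d. assumption
wls_res = sm.WLS(y, X, weights=1.0 / sigma**2).fit()  # reweighted
print('OLS params :', ols_res.params.round(3))
print('WLS params :', wls_res.params.round(3))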
3.1. Load dataset
We load the spector dataset Statsmodels [2023], which is provided by the statsmodels datasets repository.
Experimental data on the effectiveness of the personalized system of instruction (PSI) program. The Spector dataset contains 32 observations and 4 variables.
Variable definition:
GRADE : indicator of whether a student's grade improved (the target variable)
GPA : grade point average
TUCE : score on a test of understanding of college economics
PSI : participation in the PSI program (0/1 dummy)
Observe how we can access the endog and exog variables:
spector_data.endog.name
spector_data.exog.columns.values
In preparation of the dataset, we have to introduce an additional column of ones. This is required by the regression procedures and can be achieved with the add_constant() function of the statsmodels library:
spector_data.exog = sm.add_constant(
    spector_data.exog, prepend=False)
where
spector_data.exog : the exog or regressor variables
prepend : set to False, so the column of ones is appended as the last column
In the following Python code segment, the dataset is retrieved from the online repository and an initial inspection of the data is performed.
import numpy as np
import statsmodels.api as sm

# dataset
spector_data = sm.datasets.spector.load()
print('dataset:')
print('Before:')
print('endog :', spector_data.endog.name)
print('exog :', spector_data.exog.columns.values)
print('shape:', spector_data.exog.shape)

spector_data.exog = sm.add_constant(spector_data.exog,
                                    prepend=False)

# inspect
print()
print('After:')
print('exog :', spector_data.exog.columns.values)
print('shape:', spector_data.exog.shape)
print()
print('endogenous:')
print(spector_data.endog.head(10))
print()
print('exogenous:')
print(spector_data.exog.head(10))
print()
print('describe:')
print(spector_data.data.describe())
print(spector_data.data.shape)
dataset:
Before:
endog : GRADE
exog : ['GPA' 'TUCE' 'PSI']
shape: (32, 3)
After:
exog : ['GPA' 'TUCE' 'PSI' 'const']
shape: (32, 4)
endogenous:
0 0.0
1 0.0
2 0.0
3 0.0
4 1.0
5 0.0
6 0.0
7 0.0
8 0.0
9 1.0
Name: GRADE, dtype: float64
exogenous:
GPA TUCE PSI const
0 2.66 20.0 0.0 1.0
1 2.89 22.0 0.0 1.0
2 3.28 24.0 0.0 1.0
3 2.92 12.0 0.0 1.0
4 4.00 21.0 0.0 1.0
5 2.86 17.0 0.0 1.0
6 2.76 17.0 0.0 1.0
7 2.87 21.0 0.0 1.0
8 3.03 25.0 0.0 1.0
9 3.92 29.0 0.0 1.0
describe:
GPA TUCE PSI GRADE
count 32.000000 32.000000 32.000000 32.000000
mean 3.117188 21.937500 0.437500 0.343750
std 0.466713 3.901509 0.504016 0.482559
min 2.060000 12.000000 0.000000 0.000000
25% 2.812500 19.750000 0.000000 0.000000
50% 3.065000 22.500000 0.000000 0.000000
75% 3.515000 25.000000 1.000000 1.000000
max 4.000000 29.000000 1.000000 1.000000
(32, 4)
Observe the last column const, as it contains the previously added ones.
Additional information on the spector dataset can be obtained with the info() function, which provides a brief description of the dataset and its items.
# inspect
print('info:')
print(spector_data.exog.info())  # info() prints its report and returns None
info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 GPA 32 non-null float64
1 TUCE 32 non-null float64
2 PSI 32 non-null float64
3 const 32 non-null float64
dtypes: float64(4)
memory usage: 1.1 KB
None
3.2. Linear regression model specification
To estimate the coefficient parameters, we make use of the OLS() class of the statsmodels library, which has the following signature:
mod = sm.OLS(spector_data.endog, spector_data.exog)
res = mod.fit()
where
spector_data.endog : the target variable
spector_data.exog : the regressor variables
Observe that a model object and a result object are returned.
The summary() report, which is available from the result object, shows a collection of key performance metrics of the regression analysis.
In the following Python code, we implement this approach.
from quantlib import print_img
# model
mod = sm.OLS(spector_data.endog, spector_data.exog)
res = mod.fit()

# results
print_img(res.summary())
[figure: OLS regression summary for the spector dataset]
Observe that the dataset consists of 32 data rows and 4 columns.
# inspect
print('exog:', spector_data.exog_name)
print('endog:', spector_data.endog_name)
print(mod.endog.shape)
print(mod.exog.shape)
exog: ['GPA', 'TUCE', 'PSI']
endog: GRADE
(32,)
(32, 4)
The regressor variables are GPA, TUCE, PSI and the target variable is GRADE.
3.3. Extract values from the result object
Being able to extract specific values from the model object or the result object facilitates additional computations, analysis, and reporting. The endog_names attribute, for example, can be accessed like so:
res.model.endog_names
By accessing the underlying model object from the result object, we demonstrate how specific values can be extracted. For confirmation purposes, the corresponding portion of the summary().tables[0] is provided as well.
from quantlib import print_img
# confirmation
print_img(res.summary().tables[0])

# left-hand side
print(f'Dep. Variable : {res.model.endog_names:s}')
print('Model :')
print('Method :')
print('Date :')
print('Time :')
print(f'No. Observations : {res.nobs:.0f}')
print(f'Df Residuals : {res.df_resid:.0f}')
print(f'Df Model : {res.df_model:.0f}')
print(f'Covariance Type : {res.cov_type:s}')
print()

# right-hand side
print(f'R-squared : {res.rsquared:.6f}')
print(f'adjustedR2 : {res.rsquared_adj:.6f}')
print(f'F-statistic