
Quant Developers' Tools and Techniques: Quant Books, #2

Ebook · 819 pages · 3 hours


About this ebook

QUANT DEVELOPERS' TOOLS AND TECHNIQUES Series

A comprehensive series of books enabling QUANT developers to succeed in this ever-changing and challenging line of business.

The book series is intended for finance professionals, such as controllers, equity analysts, quantitative developers, data scientists, programmers, and everybody interested in trading and winning in the financial markets.

If you plan to become a QUANT developer, the first volume is the prelude to a comprehensive series, presenting and demonstrating the tools and techniques required to succeed in this new data-driven world:

 

Volume 1

Statistics, Visualization, Simple & Multiple Linear Regression

...builds up the basis for a solid understanding of the covered parts of the QUANT body of knowledge, in a learning-by-example fashion.

 

Volume 2

Multiple & Logistic Regression, Bayes' Theorem & Forecasting, Ratios & Risk Measures, Monte Carlo Simulation

...further develops your advanced understanding of the tools and techniques required to succeed in the QUANT profession, again in a learning-by-example format.

 

Volume 3

SQLite Database Programming, Pandas DataFrames, Time Series Analysis, Autoregression Models

...further broadens your understanding of the tools and techniques required to succeed in the QUANT profession, again in a learning-by-example format.

 

Volume 4

Data Management, Clustering & Forecasting Techniques, Stochastic Programming, QUANT Tools, Microstructure Analysis, Trading Strategy Development, Neural Network & Machine Learning Techniques

...sets up the foundation for a lucrative career in QUANT development. Learning by example is the most effective way I know of to learn and master this complex material, which is essential for the QUANT profession. Becoming a QUANT pro requires an in-depth mastery of the QUANT body of knowledge in order to excel in this ever-evolving and challenging profession (release: TBD).

Language: English
Release date: Jul 3, 2024
ISBN: 9798227801432

    Book preview

    Quant Developers' Tools and Techniques - Manfred Hindering

    1. Univariate Analysis

Univariate analysis refers to the study of observations of one single attribute or variable.

As indicated in Wikipedia [2023], there are three common ways to perform univariate analysis on datasets consisting of a single variable:

Summary statistics : measure the center and spread of the values.

Frequency table : describes how often different values occur.

Charts : visualize the distribution and patterns of the values.

In the following Python code segment, we conduct univariate analysis on a small dataset to demonstrate the process.

import pandas as pd

# create a small sample DataFrame
df = pd.DataFrame({
    'A': [1, 1, 2, 3.5, 4, 4, 4, 5, 5, 6.5, 7, 7.4, 8, 13, 14.2],
    'B': [5, 7, 7, 9, 12, 9, 9, 4, 6, 8, 8, 9, 3, 2, 6],
    'C': [11, 8, 10, 6, 6, 5, 9, 12, 6, 6, 7, 8, 7, 9, 15]
})

# inspect
print('dataset:')
print(df)
print(df.shape)

dataset:
       A   B   C
0    1.0   5  11
1    1.0   7   8
2    2.0   7  10
3    3.5   9   6
4    4.0  12   6
5    4.0   9   5
6    4.0   9   9
7    5.0   4  12
8    5.0   6   6
9    6.5   8   6
10   7.0   8   7
11   7.4   9   8
12   8.0   3   7
13  13.0   2   9
14  14.2   6  15
(15, 3)

    1.1. Calculate summary statistics

The mean() and std() functions, which are available on the Pandas DataFrame object, are applied. They have the following signatures:

df['col'].mean()
df['col'].std()

They compute the mean \mu and the standard deviation \sigma of the specified data column. In Python, we implement:

import numpy as np

# statistical moments
mu = np.float64(df['A'].mean())
std = np.float64(df['A'].std())

# results
print('type:', type(mu))
print(f"mean : {mu:.4f}")
print(f"std  : {std:.4f}")

type: <class 'numpy.float64'>
mean : 5.7067
std  : 3.8583

For visualization purposes, the plot() and hist() functions available on the DataFrame object can be applied. They have the following signatures:

df['A'].plot(ax=ax)
df['A'].hist(ax=ax)

Both plots are generated in the previously defined axes object called ax.

import matplotlib.pyplot as plt

# plot
fig, ax = plt.subplots(figsize=(5, 3))
df['A'].plot(ax=ax)
df['A'].hist(ax=ax)
plt.title('cumulative line chart and histogram')
plt.xlabel('count')
plt.ylabel('value column A')
plt.show()

[Figure: cumulative line chart and histogram of column A]

    1.2. Compute statistical moments

To learn more about the given dataset, a full set of statistical moments is computed. The following statistical moments provide a condensed view of the dataset. The respective functions are all available on the Pandas DataFrame object.

mean() : the average value of all observations

median() : the value in the middle of the sorted observations

mode() : the most frequently occurring value

std() : the standard deviation, or spread, of the observations

skew() : a measure of how asymmetrical the histogram is

kurtosis() : a measure of the peakedness of the histogram

In the following Python code segment, their usage is demonstrated. The statistical measures are collected in a Python dictionary for further processing and reporting; the defining formulas for the two higher moments are given after the output below.

# compute statistical moments
res = dict()
res['mean'] = df['A'].mean()
res['std'] = df['A'].std()
res['median'] = df['A'].median()
res['mode'] = df['A'].mode()[0]  # first value is the mode
res['skew'] = df['A'].skew()
res['kurtosis'] = df['A'].kurtosis()

# inspect
print('moments:')
for k, v in res.items():
    print(f'{k.ljust(8)} : {v:.4f}')

moments:
mean     : 5.7067
std      : 3.8583
median   : 5.0000
mode     : 4.0000
skew     : 1.0538
kurtosis : 0.8386
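For reference, the two higher moments can be stated in their standard population form (a supplementary note, not from the book's text). Pandas' skew() and kurtosis() return bias-corrected sample estimates of these quantities, with kurtosis() reporting excess kurtosis, so a normal distribution yields 0:

skew = E\left[ \left( \frac{X - \mu}{\sigma} \right)^3 \right]

kurt_{excess} = E\left[ \left( \frac{X - \mu}{\sigma} \right)^4 \right] - 3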

    1.3. Create a table with value counts

In frequency analysis, we count the values in a given column of the DataFrame to gain a more in-depth insight into the properties of the dataset. For that purpose we make use of the value_counts() function, provided by the Pandas DataFrame object, to produce such a table. The function has the following signature:

df['A'].value_counts()

It returns the number of occurrences of each value within the specified data column.

The values for the frequency table are taken over as follows:

countsDf['A'] = df['A'].values

In the following Python code segment, we implement this idea:

# create frequency table
countsDf = pd.DataFrame()
countsDf['A'] = df['A'].values
countsDf['freq'] = df['A'].value_counts()  # Series indexed by value; assignment
                                           # aligns on index, producing NaN gaps
print(type(countsDf))  # DataFrame
print(countsDf)        # sorted by position, contains NaN values

counts = df['A'].value_counts()  # Series, sorted by frequency, no NaN values
print(type(counts))
print('A       count')
print(counts)

<class 'pandas.core.frame.DataFrame'>
       A  freq
0    1.0   NaN
1    1.0   2.0
2    2.0   1.0
3    3.5   NaN
4    4.0   3.0
5    4.0   2.0
6    4.0   NaN
7    5.0   1.0
8    5.0   1.0
9    6.5   NaN
10   7.0   NaN
11   7.4   NaN
12   8.0   NaN
13  13.0   1.0
14  14.2   NaN
<class 'pandas.core.series.Series'>
A       count
4.0     3
1.0     2
5.0     2
2.0     1
3.5     1
6.5     1
7.0     1
7.4     1
8.0     1
13.0    1
14.2    1
Name: A, dtype: int64

    Interpretation

As shown in the above output, we can observe that the value 4 occurs three times, the values 1 and 5 occur twice each, the value 2 occurs once, and so on. This represents the frequency of the observations occurring in the analyzed dataset.

    1.4. Visualize data with a Boxplot

The boxplot is a visualization in which the min, max, and median values, together with the quartiles, are presented in a single combined diagram element. Any outlier observations are shown as points outside the whiskers. A short numeric sketch of these statistics follows below.
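As a minimal numeric sketch of the statistics behind a box plot (assuming the common 1.5 x IQR whisker rule, which is also Matplotlib's default), the quartiles and outlier fences of column B can be computed directly:

# quartiles and outlier fences for column B (1.5*IQR rule, the Matplotlib default)
q1 = df['B'].quantile(0.25)   # lower quartile
q3 = df['B'].quantile(0.75)   # upper quartile
iqr = q3 - q1                 # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(f'Q1 : {q1}, Q3 : {q3}, IQR : {iqr}')
print(f'fences : [{lower}, {upper}]')
# any observation outside the fences would be drawn as an individual point
print('outliers :', df['B'][(df['B'] < lower) | (df['B'] > upper)].tolist())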

By using the boxplot() function, which is available on the Pandas DataFrame object, we can visualize the data. The boxplot() function has the following signature:

boxplot = df.boxplot(column=['A'], grid=False, color='orange',
                     patch_artist=True, vert=True)

where:

column : specifies the data source column

grid : set to True, displays the grid

color : the color of the box lines

patch_artist : (True) solid bar, (False) hollow bar

vert : vertical orientation (True), horizontal (False)

It returns an axes object, and the box plot is displayed in the next output cell.

    To change the style of the produced box plots, we can employ pre-defined style sets. In the following section, we list the first 5 available styles:

import matplotlib.pyplot as plt

# list available styles
plt.style.available[:5]  # first 5 only

['Solarize_Light2',
 '_classic_test_patch',
 '_mpl-gallery',
 '_mpl-gallery-nogrid',
 'bmh']

import matplotlib.pyplot as plt
plt.style.use('bmh')

# plot
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(8, 3))
boxplot1 = df.boxplot(column=['A'], grid=True, color='maroon',
                      patch_artist=False, vert=True, ax=ax1)
boxplot1.set_facecolor('lightblue')
boxplot2 = df.boxplot(column=['B'], grid=False, color='green',
                      patch_artist=False, vert=True, ax=ax2)
boxplot2.set_facecolor('lightyellow')
boxplot3 = df.boxplot(column=['C'], grid=True, color='red',
                      patch_artist=True, vert=True, ax=ax3)
boxplot3.set_facecolor('lightgreen')
plt.suptitle('box plot variations', fontsize=14)
plt.tight_layout()
plt.grid(True)
plt.show()

# inspect
print(df.describe())

[Figure: box plot variations for columns A, B, and C]

               A          B          C
count  15.000000  15.000000  15.000000
mean    5.706667   6.933333   8.333333
std     3.858287   2.658320   2.742956
min     1.000000   2.000000   5.000000
25%     3.750000   5.500000   6.000000
50%     5.000000   7.000000   8.000000
75%     7.200000   9.000000   9.500000
max    14.200000  12.000000  15.000000

    Interpretation

    Column B:

The median value of 7.0 is shown as a horizontal bar within the green box.

The min value of 2.0 is represented by the small horizontal bar at the bottom of the chart.

The max value of 12.0 is represented by the small horizontal bar at the top of the chart.

The 25% percentile value of 5.5 is represented by the lower bound of the green box.

The 75% percentile value of 9.0 is represented by the upper bound of the green box.

No outlier observations are indicated.

    1.5. Visualize data with a Bar chart

The hist() function, provided by the DataFrame object, is used to create a histogram of the specified data column. The function has the following signature.

df.hist(column='A', color='lightgreen', edgecolor='red', ax=ax1)

where

column : the data in column A is to be used

color : the fill color of the bars

edgecolor : the frame color of the bars

ax : set to ax1, the prepared axes object

The histogram is created in the prepared axes object ax1.

In the following Python code segment, we create the chart and set its properties:

import matplotlib.pyplot as plt

# plot
fig, ax1 = plt.subplots(figsize=(5, 3))
df.hist(column='A', color='orange', edgecolor='black', ax=ax1)
plt.grid(True)
plt.ylabel('count')
plt.xlabel('x')
plt.title('Histogram of column A data')
plt.show()

# inspect
print(type(df['A'].value_counts()))  # Series
print('A       count')
print(df['A'].value_counts())

[Figure: histogram of column A]

<class 'pandas.core.series.Series'>
A       count
4.0     3
1.0     2
5.0     2
2.0     1
3.5     1
6.5     1
7.0     1
7.4     1
8.0     1
13.0    1
14.2    1
Name: A, dtype: int64

    1.6. Plot kernel density estimate

The kernel density estimate (KDE) refers to the estimated frequency distribution, computed from the sample data. In contrast, the probability density function (pdf) is derived from the entire population.

import seaborn as sns
import matplotlib.pyplot as plt

# customize
sns.set_style('whitegrid')

# plot
fig, ax1 = plt.subplots(figsize=(5, 3))
sns.kdeplot(df['A'], ax=ax1)  # kde
plt.title('kernel density estimate (KDE)')
plt.show()

# inspect
print(f"mu    : {df['A'].mean():.4f}")
print(f"sigma : {df['A'].std():.4f}")

[Figure: kernel density estimate of column A]

    mu    : 5.7067

    sigma : 3.8583
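As a supplementary sketch (an illustration assuming column A were normally distributed, not part of the book's code), the KDE can be overlaid with a normal pdf fitted to the sample mean and standard deviation, which makes the contrast between the two concepts visible:

import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt

# overlay the sample KDE with a fitted normal pdf
fig, ax1 = plt.subplots(figsize=(5, 3))
sns.kdeplot(df['A'], ax=ax1, label='KDE (sample)')
xs = np.linspace(df['A'].min(), df['A'].max(), 200)
ax1.plot(xs, stats.norm.pdf(xs, df['A'].mean(), df['A'].std()), label='normal pdf')
ax1.legend()
plt.title('KDE vs. fitted normal pdf')
plt.show()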

To contrast the sample-based KDE with a population-level view, the next section uses NumPy to generate normally distributed random values and plots the corresponding pdf.

    1.7. Plot probability density function

    Remember, the probability density function (pdf) is computed from the entire population.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats

# parameters
m = 10    # mean
s = 3     # standard deviation
N = 1000  # sample size

# reproducibility
np.random.seed(1)
x = np.random.normal(m, s, N)
df = pd.DataFrame({'PDF': x})

# sort values
df.sort_values(by=['PDF'], inplace=True)
df_mean = np.mean(df['PDF'])
df_std = np.std(df['PDF'])
pdf = stats.norm.pdf(df['PDF'], df_mean, df_std)

# plot
fig = plt.figure(figsize=(5, 3))
plt.plot(df['PDF'], pdf, c='maroon')
l = f'probability density function ($\\mu$={m}, $\\sigma$={s})'
plt.title(l)
plt.xlabel('x')
plt.ylabel('pdf(x)')
plt.grid(True)
plt.show()

# inspect
print(f'mu    : {m:.4f}')
print(f'sigma : {s:.4f}')

[Figure: probability density function (mu=10, sigma=3)]

    mu    : 10.0000

    sigma : 3.0000

    1.8. Boxplot examples

For demonstration purposes, a random dataset is generated to produce two types of box plots:

rectangular boxplot

notched boxplot

In the following Python code segment, the usage is demonstrated.

import matplotlib.pyplot as plt
import numpy as np

# reproducibility
np.random.seed(1)

# random numbers
all_data = [np.random.normal(0, std, size=100) for std in range(1, 4)]
labels = ['x1', 'x2', 'x3']

# plots
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(8, 3), sharey=True)

# rectangular box plot
bplot1 = ax1.boxplot(all_data, vert=True, patch_artist=True, labels=labels)
ax1.set_title('Rectangular box plot')

# notched box plot
bplot2 = ax2.boxplot(all_data, notch=True, vert=True, patch_artist=True,
                     labels=labels)
ax2.set_title('Notched box plot')

# fill with colors
colors = ['pink', 'lightblue', 'lightgreen']
for bplot in (bplot1, bplot2):
    for patch, color in zip(bplot['boxes'], colors):
        patch.set_facecolor(color)

# add horizontal grid lines
for ax in [ax1, ax2]:
    ax.yaxis.grid(True)  # horizontal grid
    ax.set_xlabel('Three separate samples')
    ax.set_ylabel('Observed values')

plt.show()

[Figure: rectangular and notched box plots of three random samples]

    2. Bivariate Analysis

The purpose of bivariate analysis is to gain an in-depth understanding of the relationship between two variables.

A bivariate analysis studies the relationship between two variables: one independent or regressor variable and a target or response variable. We evaluate the existence and strength of any linear relationship between the regressor and the target variable. Thereby, we quantify the impact of the independent variable on the target variable.

According to Wikipedia [2023], there are three common ways to perform bivariate analysis:

    Scatter plots

    Correlation analysis

    Simple linear regression

    2.1. Analyze Exam scores versus hours studied

The following example demonstrates how to perform each of these types of bivariate analysis. Suppose we want to study the impact of hours studied on the final exam score. For that purpose, a sample dataset is set up in Pandas DataFrame format, containing information about two variables.

    hours : studying hours spent in preparation for an exam

    score : the exam score achieved

    The dataset contains 20 data pairs of observations.

import pandas as pd

# create DataFrame
df = pd.DataFrame({
    'hours': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3,
              3, 4, 4, 5, 5, 6, 6, 6, 7, 8],
    'score': [75, 66, 68, 74, 78, 72, 85, 82, 90, 82,
              80, 88, 85, 90, 92, 94, 94, 88, 91, 96]
})

# inspect
print('dataset:')
print(df.head(10))

dataset:
   hours  score
0      1     75
1      1     66
2      1     68
3      2     74
4      2     78
5      2     72
6      3     85
7      3     82
8      3     90
9      3     82

    2.2. Scatter plots

The first step in data analytics is to visualize the dataset in a scatter plot. For that purpose, the scatter() function from the matplotlib.pyplot library is applied. It has the following signature:

plt.scatter(df.hours, df.score)

where

df.hours : denotes the independent variable, displayed on the x-axis of the chart

df.score : denotes the target variable, displayed on the y-axis of the chart

resulting in the desired scatter diagram.

    In the following Python code segment, this is shown and the result is interpreted.

import matplotlib.pyplot as plt

# plot
fig = plt.figure(figsize=(8, 4))
plt.scatter(df.hours, df.score)
plt.title('Hours studied vs. Exam Score')
plt.xlabel('Hours studied')
plt.ylabel('Exam Score')
plt.grid(True)
plt.show()

[Figure: scatter plot of Hours studied vs. Exam Score]

    Interpretation

On the x-axis, the independent variable hours studied is placed, and on the y-axis, we show the exam score achieved as the target variable. From the plot, we can deduce that there is a positive relationship between the two variables.

As the hours studied increase, the exam score increases as well, as suggested by the positive linear relationship between the regressor and the response variable.

A more quantitative assessment of the direction and strength of this relationship is conducted next.

    2.3. Compute correlation coefficient

The Pearson correlation coefficient is a way to quantify the linear relationship between two variables. To demonstrate this, we use the corr() function, which is available on the Pandas DataFrame object. It has the following signature:

df.corr()

It returns a correlation matrix.

We might want to bring up the help information for the corr() function.

# help(df.corr)

# perform correlation analysis
res = df.corr()

# inspect
print(type(res))
print('correlation:')
print(res)

<class 'pandas.core.frame.DataFrame'>
correlation:
          hours     score
hours  1.000000  0.891306
score  0.891306  1.000000

    Interpretation

The returned correlation matrix reports a Pearson correlation coefficient r of 0.8913 on the off-diagonal elements. This indicates a strong positive relationship between Hours studied and the achieved exam Score.

Note that the elements on the diagonal are reported to be 1.0, which indicates a perfect positive correlation. This is expected, since, for example, Hours is perfectly correlated with itself, and likewise Score.
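As a minimal cross-check (using the standard definition r = cov(x, y) / (\sigma_x \sigma_y), not shown in the book's text), the coefficient can be reproduced by hand:

# verify the Pearson coefficient manually: r = cov(x, y) / (std(x) * std(y))
cov_xy = df['hours'].cov(df['score'])  # sample covariance
r = cov_xy / (df['hours'].std() * df['score'].std())
print(f'r (manual) : {r:.6f}')  # should match the off-diagonal of df.corr()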

    2.4. Simple Linear Regression

    Simple linear regression is a statistical method to quantify the strength of the linear relationship between two variables.

We can use the OLS() class from the statsmodels package to fit a simple linear regression model for hours studied and exam scores received, resulting in this model.

Simple Linear Regression:

y = \beta_0 + \beta_1 \cdot Hours + \epsilon

where

y : dependent variable

\beta_0 : intercept coefficient

\beta_1 : Hours coefficient

Hours : independent variable

\epsilon : residual term

The OLS() class is used with the x and y parameters specified. The fit() method is applied to the specified model to obtain the parameter estimates \beta_i for the regression model. The call then becomes:

model = sm.OLS(y, x).fit()

where

y : the dependent or target variable

x : the regressor or independent variable

results:

model : the model object with the fitted parameter estimates

    The following Python code segment demonstrates this approach.

import statsmodels.api as sm
# help(sm.OLS)

from quantlib import print_img

# specify model
y = df['score']    # target
x = df[['hours']]  # regressor
x = sm.add_constant(x)  # add const

# fit model
model = sm.OLS(y, x).fit()

# results
print_img(model.summary())

[Figure: OLS regression summary for Score vs. Hours]

    Interpretation

    The fitted regression equation turns out to be

\hat{Score} = 69.07 + 3.85 \cdot Hours

This equation indicates that each additional Hour studied is associated with an average increase of 3.85 points in the exam Score. We can also use the fitted regression equation to predict the Score a student receives based on the Hours studied.

When, for example, a student studies a total of 3 Hours for a particular exam, what is the best estimate for the final Score received? Inserting the values into the regression equation gives Score = 69.07 + 3.85 \cdot 3 = 80.62 . Meaning, a student who studies for 3 Hours is predicted to receive a Score of about 80.6 in the final exam.

Beware of the limitations of this linear regression model by considering an example where the student studies for 15 hours, resulting in Score = 69.07 + 3.85 \cdot 15 = 126.82 . This score above 100 is not realistic, since the relationship between Hours studied and the resulting Score is - in reality - asymptotic rather than linear.
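As a minimal sketch (reusing the model object fitted above; the two input values are illustrative), the same predictions can also be obtained from the fitted results object instead of inserting values by hand:

import pandas as pd
import statsmodels.api as sm

# predict scores for 3 and 15 hours of study with the fitted model
new_x = pd.DataFrame({'hours': [3, 15]})
new_x = sm.add_constant(new_x, has_constant='add')  # match the training design matrix
print(model.predict(new_x))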

The coefficient of determination R^2 is reported to be 0.7944, which informs us that the estimated regression equation explains about 80% of the variation in the exam Scores.
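For simple linear regression, R^2 equals the squared Pearson coefficient (0.8913^2 ≈ 0.7944). For reference, the standard definition (not spelled out in the book's text) is:

R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}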

A more thorough analysis of the model's adequacy is required to arrive at a model that can be trusted for making important predictions.

    3. Linear Regression on Spector

A linear regression model assumes independently and identically distributed (IID) errors; see D. Montgomery [1992]. With the error terms exhibiting either heteroskedasticity or autocorrelation, an estimate of the regression parameters can be obtained through:

OLS : ordinary least squares, for i.i.d. errors

WLS : weighted least squares, for heteroskedastic errors

GLS : generalized least squares, for arbitrary covariance types

AR(p) : feasible generalized least squares, for autocorrelated errors

    The statsmodels library provides four different classes.

    GLS : generalized least squares for arbitrary covariance

    OLS : ordinary least squares for i.i.d. errors

    WLS : weighted least squares for heteroskedastic errors

    GLSAR : feasible generalized least squares with autocorrelated AR(p) errors

All regression models define the same methods and follow the same structure, as Greene [2003] points out. In this way, such models can be used interchangeably, although some of the models provide additional model-specific methods and attributes.

    The GLS model is a superclass of the other regression classes except for RecursiveLS, RollingWLS, and RollingOLS. In the following code segments, these options are explored.
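As a minimal sketch of this interchangeable interface (synthetic data, not from the book), OLS and WLS expose the same fit()/params/summary() structure; with unit weights, WLS reduces to OLS:

import numpy as np
import statsmodels.api as sm

# synthetic regression data: y = 1.0 + 2.0 * x + noise
rng = np.random.default_rng(0)
x = sm.add_constant(rng.normal(size=(50, 1)))
y = x @ np.array([1.0, 2.0]) + rng.normal(size=50)

# the two estimators are called identically
ols_res = sm.OLS(y, x).fit()
wls_res = sm.WLS(y, x, weights=np.ones(50)).fit()  # unit weights reduce to OLS
print(ols_res.params)
print(wls_res.params)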

    3.1. Load dataset

We load the spector dataset Statsmodels [2023], which is provided by the statsmodels datasets repository.

It contains experimental data on the effectiveness of the personalized system of instruction (PSI) program. The Spector dataset contains 32 observations and 4 variables.

Variable definitions:

GPA : grade point average

TUCE : score on a test of economic knowledge

PSI : indicator for participation in the PSI program

GRADE : indicator of whether the student's grade improved

Observe how we can access the endog and exog variables:

spector_data.endog.name
spector_data.exog.columns.values

In preparation of the dataset, we have to introduce an additional column of ones. This is required by the regression procedures and can be achieved by using the add_constant() function of the statsmodels library:

spector_data.exog = sm.add_constant(spector_data.exog, prepend=False)

where

spector_data.exog : the exog or regressor variables

prepend : set to False, appends the constant column at the end

    In the following Python code segment, the dataset is retrieved from the online repository and an initial inspection of the data is performed.

import numpy as np
import statsmodels.api as sm

# dataset
spector_data = sm.datasets.spector.load()
print('dataset:')
print('Before:')
print('endog :', spector_data.endog.name)
print('exog  :', spector_data.exog.columns.values)
print('shape :', spector_data.exog.shape)

spector_data.exog = sm.add_constant(spector_data.exog, prepend=False)

# inspect
print()
print('After:')
print('exog  :', spector_data.exog.columns.values)
print('shape :', spector_data.exog.shape)
print()
print('endogenous:')
print(spector_data.endog.head(10))
print()
print('exogenous:')
print(spector_data.exog.head(10))
print()
print('describe:')
print(spector_data.data.describe())
print(spector_data.data.shape)

dataset:
Before:
endog : GRADE
exog  : ['GPA' 'TUCE' 'PSI']
shape : (32, 3)

After:
exog  : ['GPA' 'TUCE' 'PSI' 'const']
shape : (32, 4)

endogenous:
0    0.0
1    0.0
2    0.0
3    0.0
4    1.0
5    0.0
6    0.0
7    0.0
8    0.0
9    1.0
Name: GRADE, dtype: float64

exogenous:
    GPA  TUCE  PSI  const
0  2.66  20.0  0.0    1.0
1  2.89  22.0  0.0    1.0
2  3.28  24.0  0.0    1.0
3  2.92  12.0  0.0    1.0
4  4.00  21.0  0.0    1.0
5  2.86  17.0  0.0    1.0
6  2.76  17.0  0.0    1.0
7  2.87  21.0  0.0    1.0
8  3.03  25.0  0.0    1.0
9  3.92  29.0  0.0    1.0

describe:
             GPA       TUCE        PSI      GRADE
count  32.000000  32.000000  32.000000  32.000000
mean    3.117188  21.937500   0.437500   0.343750
std     0.466713   3.901509   0.504016   0.482559
min     2.060000  12.000000   0.000000   0.000000
25%     2.812500  19.750000   0.000000   0.000000
50%     3.065000  22.500000   0.000000   0.000000
75%     3.515000  25.000000   1.000000   1.000000
max     4.000000  29.000000   1.000000   1.000000
(32, 4)

    Observe the last column const, as it contains the previously added ones.

Additional information on the spector dataset can be obtained with the info() function, which provides a brief description of the dataset and its items.

# inspect
print('info:')
print(spector_data.exog.info())

info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   GPA     32 non-null     float64
 1   TUCE    32 non-null     float64
 2   PSI     32 non-null     float64
 3   const   32 non-null     float64
dtypes: float64(4)
memory usage: 1.1 KB
None

    3.2. Linear regression model specification

To estimate the coefficient parameters, we make use of the OLS() class of the statsmodels library, which has the following signature.

mod = sm.OLS(spector_data.endog, spector_data.exog)
res = mod.fit()

where

spector_data.endog : the target variable

spector_data.exog : the regressor variables

Observe that a model object and a result object are returned.

    The summary() report, which is available from the result object, shows a collection of key performance metrics of the regression analysis.

    In the following Python code, we implement this approach.

from quantlib import print_img

# model
mod = sm.OLS(spector_data.endog, spector_data.exog)
res = mod.fit()

# results
print_img(res.summary())

[Figure: OLS regression summary for the spector dataset]

Observe that the dataset consists of 32 data rows and 4 columns.

# inspect
print('exog :', spector_data.exog_name)
print('endog:', spector_data.endog_name)
print(mod.endog.shape)
print(mod.exog.shape)

    exog: ['GPA', 'TUCE', 'PSI']

    endog: GRADE

    (32,)

    (32, 4)

The regressor variables are GPA, TUCE, and PSI, and the target variable is GRADE.

    3.3. Extract values from the result object

Being able to extract specific values from the model object or the result object facilitates additional computations, analysis, and reporting. The endog_names, for example, can be accessed like so:

res.model.endog_names

By accessing the underlying model object from the result object, we demonstrate how specific values can be extracted. For confirmation purposes, the corresponding portion of summary().tables[0] is provided as well.

from quantlib import print_img

# confirmation
print_img(res.summary().tables[0])

# left-hand side
print(f'Dep. Variable    : {res.model.endog_names:s}')
print('Model            :')
print('Method           :')
print('Date             :')
print('Time             :')
print(f'No. Observations : {res.nobs:.0f}')
print(f'Df Residuals     : {res.df_resid:.0f}')
print(f'Df Model         : {res.df_model:.0f}')
print(f'Covariance Type  : {res.cov_type:s}')
print()

# right-hand side
print(f'R-squared        : {res.rsquared:.6f}')
print(f'adjusted R2      : {res.rsquared_adj:.6f}')
print(f'F-statistic
