ML Unit 2: Simple Linear Regression


Some Concepts
Sampling distribution of a statistic
 The sampling distribution of a statistic is the probability distribution of the possible values of the statistic that result when random samples of size 𝑛 are repeatedly drawn from a population
 Suppose you randomly sampled 10 people from the population of women in Houston, Texas, between the ages of 21 and 35 years and computed the mean height of your sample
 This sample mean would not equal the mean height of all the women in Houston
 It might be somewhat lower or somewhat higher, but it would not equal the population mean exactly
 Similarly, suppose a second sample of 10 people is taken from the same population
 The mean of the second sample need not equal the mean of the first sample
 A critical part of inferential statistics involves determining how far sample statistics are likely to vary from each other and from the population parameter
 Sample statistics could be:
 the mean, the mean absolute deviation from the mean, the standard deviation of the sample, or the variance of the sample
 In the above example, the statistic is the sample mean; the corresponding population parameter is the population mean
 The numerical descriptive measures calculated from samples are known as statistics

Ref:
David Scott, Mikki Hebl, Rudy Guerra, Dan Osherson, and Heidi Zimmer, Introduction to Statistics, Online edition
William Mendenhall, Robert Beaver, Barbara Beaver, Introduction to Probability and Statistics, Cengage, 14th edition
Sampling Distributions and Inferential Statistics
 We collect sample data
 From this data we estimate the parameters of the sampling distribution
 Knowledge of the sampling distribution tells us the degree to which means from different samples would differ from each other and from the population mean
 It gives a sense of how close a particular sample mean is likely to be to the population mean
 This information is directly available from a sampling distribution
 The most common measure of how much sample means differ from each other is the standard deviation of the sampling distribution of the mean
 This standard deviation is called the standard error of the mean
 If all the sample means were very close to the population mean, the standard error of the mean would be small
 On the other hand, if the sample means varied considerably, the standard error of the mean would be large
 Example
 Assume the sample mean is 125 and the estimated standard error of the mean is 5
 If the sampling distribution is normal, the sample mean is likely to be within 10 units of the population mean, since most of a normal distribution lies within two standard deviations of the mean
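To make the "within two standard errors" reasoning concrete, here is a minimal sketch (assuming scipy is available; the values 125 and 5 are the ones from the example above):

```python
from scipy.stats import norm

# Example values from above: sample mean 125, estimated standard error 5.
# "Within 10 units" of the population mean is within 2 standard errors,
# so under a normal sampling distribution:
p_within_2se = norm.cdf(2) - norm.cdf(-2)
print(f"P(sample mean within 2 SE of mu) = {p_within_2se:.3f}")  # ~0.954
```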
Sampling distribution of the mean
 Mean
 The mean of the sampling distribution of the mean is the mean of the population from which the scores were sampled
 If a population has a mean μ, then the mean of the sampling distribution of the mean is also μ
 The symbol $\mu_M$ is used to refer to the mean of the sampling distribution of the mean
 The formula for the mean of the sampling distribution of the mean can be written as $\mu_M = \mu$
Sampling distribution of the mean
 Variance
 The variance of the sampling distribution of the mean is computed as $\sigma_M^2 = \sigma^2/n$
 That is, the variance of the sampling distribution of the mean is the population variance divided by 𝑛, the sample size (the number of observations in each sample)
 Thus, the larger the sample size, the smaller the variance of the sampling distribution of the mean
Sampling distribution of the mean
 Standard Error
 The standard error of the mean is the standard deviation of the sampling distribution of the mean. It is therefore the square root of the variance of the sampling distribution of the mean and can be written as $\sigma_M = \sigma/\sqrt{N}$
 The standard error is represented by a σ because it is a standard deviation
 The subscript M indicates that the standard error in question is the standard error of the mean
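A quick simulation can verify the formula. This is an illustrative sketch (the population parameters mu = 50, sigma = 10 and the sample size n = 25 are made up for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 50.0, 10.0, 25  # illustrative population parameters

# Draw 100,000 samples of size n and record each sample mean.
sample_means = rng.normal(mu, sigma, size=(100_000, n)).mean(axis=1)

print("empirical standard error: ", sample_means.std(ddof=1))
print("theoretical sigma/sqrt(n):", sigma / np.sqrt(n))  # 2.0
```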
Conditions for inference
 A good sample must have the following characteristics
 Representative of the entire population
 Large enough to draw conclusions from (n ≥ 30)
 Randomly picked
 The sampling distribution of the sample mean needs to be approximately normal
 This is true if the parent population is normal
 or if the sample size is reasonably large (n ≥ 30)
 Independent
 Individual observations need to be independent
 If sampling is done without replacement, the sample size should not be more than 10% of the population
Need to know
 If the sampled population is normal, then the sampling distribution will also be normal
 When the sampled population is approximately symmetric, the sampling distribution becomes approximately normal
 When the sampled population is skewed, a sample size of n ≥ 30 should be used so that the sampling distribution becomes approximately normal
Central limit theorem
 The central limit theorem states that: given a population with a finite mean μ and a finite non-zero variance σ², the sampling distribution of the mean approaches a normal distribution with mean μ and variance $\sigma^2/N$ as N, the sample size, increases
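The theorem is easy to see empirically. The sketch below (assuming numpy; the exponential population is an arbitrary choice of a skewed parent distribution) shows the variance of the sample means tracking σ²/N and the skewness shrinking toward 0, i.e. toward normality, as N grows:

```python
import numpy as np

rng = np.random.default_rng(1)

# Skewed parent population: exponential with mean 1 and variance 1.
for N in (2, 30, 200):
    means = rng.exponential(scale=1.0, size=(50_000, N)).mean(axis=1)
    # Sample skewness near 0 indicates approximate normality.
    skew = np.mean((means - means.mean())**3) / means.std()**3
    print(f"N={N:3d}  var={means.var():.4f}  (theory {1/N:.4f})  skew={skew:.2f}")
```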
Analysis of variance (ANOVA)
 Analysis of Variance (ANOVA) is a statistical method used to test differences between two or more means
 ANOVA is used to test general rather than specific differences among means
Analysis of variance (ANOVA) for linear regression
 Divide the total variation in y ("total sum of squares") into two components:
• due to the change in x ("regression sum of squares")
• due to random error ("error sum of squares")
• Data = Fit + Error
• SST = SSR + SSE
• $\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
 If the regression sum of squares is a "large" component of the total sum of squares, it suggests that there is a linear association between the predictor x and the response y (see the sketch below)
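A minimal sketch of the decomposition (the five data points are made up for illustration; np.polyfit is one of several ways to get the least-squares line):

```python
import numpy as np

# Toy data, made up for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1, b0 = np.polyfit(x, y, 1)   # least-squares slope and intercept
y_hat = b0 + b1 * x            # fitted values

sst = np.sum((y - y.mean())**2)      # total sum of squares
ssr = np.sum((y_hat - y.mean())**2)  # regression sum of squares
sse = np.sum((y - y_hat)**2)         # error sum of squares

print(sst, ssr + sse)  # equal up to floating-point rounding
```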
ANOVA for linear regression
 SST = SSR + SSE
 The degrees of freedom associated with each of these sums of squares follow a similar decomposition
 That is
 $df(\text{SST}) = df(\text{SSR}) + df(\text{SSE})$, i.e., $(n-1) = 1 + (n-2)$ for simple linear regression
Parameters of ANOVA
 Mean squares in the analysis of variance table can be used to test the null hypothesis H₀ versus the alternative hypothesis Hₐ
 H₀: μ₁ = μ₂ = ⋯ = μₖ
 Hₐ: at least one of the means is different from the others
Calculating F Test
 SSR and SSE are used to find two mean squares, MSR and MSE respectively
 $MSR = \dfrac{SSR}{df(\text{SSR})} = \dfrac{SSR}{1}$
 $MSE = \dfrac{SSE}{df(\text{SSE})} = \dfrac{SSE}{n-2}$
 The test statistic $F = \dfrac{MSR}{MSE} = \dfrac{SSR/1}{SSE/(n-2)}$
 F is often referred to as the analysis of variance F-test; a computational sketch on toy data follows below
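Continuing the toy data from the earlier sketch, the F statistic and its p-value can be computed as follows (a sketch assuming scipy; f.sf gives the right-tail area, and the data are again made up):

```python
import numpy as np
from scipy.stats import f

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x
ssr = np.sum((y_hat - y.mean())**2)
sse = np.sum((y - y_hat)**2)

msr = ssr / 1        # df of SSR = 1 in simple linear regression
mse = sse / (n - 2)  # df of SSE = n - 2
F = msr / mse
p_value = f.sf(F, 1, n - 2)  # right-tail area under F(1, n-2)
print(F, p_value)
```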
Calculating F Test
 The F test statistic is a ratio of two values
 Each has its own degrees of freedom
 The numerator degrees of freedom is k, where k is the number of independent variables
 The denominator degrees of freedom is n − k − 1
 The F distribution is skewed to the right and is never negative
 The null hypothesis is β₁ = 0
 The alternative hypothesis is β₁ ≠ 0
 We always reject in the upper tail
 So if F(calculated) > F(critical, from the table), reject the null hypothesis
 That is, a linear correlation exists

T-distribution and F-distribution
 The F-distribution curve depends on the degrees of freedom of the numerator and of the denominator
 The F-distribution is positive-sided (defined only for non-negative values)
 Therefore the F-distribution does not have two tails
 The t-distribution is both positive- and negative-sided, and has two tails
[Figure: a t-distribution with tail area p in each of its two tails (significance level α = 2p), alongside an F-distribution with a single right tail of area p (significance level α = p)]
Ex: ANOVA table and the F-test
 Relation between skin cancer mortality and latitude
 There were 49 states in the data set (49 samples)
 DF associated with SSR = 1 for the simple linear regression model
 DF associated with SSTO = n-1 = 49-1 = 48
 DF associated with SSE = n-2 = 49-2 = 47
 Total degrees of freedom: 1 + 47 = 48
ANOVA table and the F-test
 The sums of squares add up: SSTO = SSR + SSE
53637 = 36464 + 17173
 $F = \dfrac{SSR/1}{SSE/(n-2)} = \dfrac{36464 \times 47}{17173} \approx 99.80$
ANOVA table and the F-test
 Assume the confidence level C = 99%
 Therefore α = 1% = 0.01
 For the data, the calculated p-value is 0.000 (to three decimal places), i.e., less than 0.001
 This is less than α = 0.01
 Therefore the null hypothesis can be rejected (the slope is not 0)
Example
 Suppose the ANOVA table is given as follows:

[ANOVA table not reproduced; per the discussion below, it has regression df = 2, error df = 9, and an F statistic of 5.09]

Verify whether the regression model provides a better fit to the data than a model that contains no independent variables.
Use the F-distribution table for α = 0.05, with numerator degrees of freedom 2 (the df for regression) and denominator degrees of freedom 9.
Conclusion from the above example
 We find that the F critical value is 4.2565
 Since our F statistic (5.09) is greater than the F critical value (4.2565),
 we can conclude that the regression model as a whole is statistically significant
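The table lookup can be reproduced in code; a sketch assuming scipy (the df values 2 and 9 and the F statistic 5.09 are the ones from the example above):

```python
from scipy.stats import f

# Critical value for alpha = 0.05 with df1 = 2 and df2 = 9.
f_crit = f.ppf(1 - 0.05, 2, 9)
print(f_crit)          # approximately 4.2565

# Decision rule from the example above:
print(5.09 > f_crit)   # True -> reject the null hypothesis
```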
Example
 Find the F statistic and t statistic for simple linear regression for the given data
 Do this example in your notebooks (a sketch follows below)

Math score xi    Final calculus grade yi
39               65
43               78
21               52
64               82
57               92
47               89
28               73
75               98
34               56
52               75
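One possible way to check the notebook exercise (a sketch assuming scipy; in simple linear regression F = t², so both statistics come from a single fit):

```python
import numpy as np
from scipy.stats import linregress

# Math score (x) and final calculus grade (y) from the table above.
x = np.array([39, 43, 21, 64, 57, 47, 28, 75, 34, 52])
y = np.array([65, 78, 52, 82, 92, 89, 73, 98, 56, 75])

res = linregress(x, y)
t_stat = res.slope / res.stderr  # t statistic for H0: slope = 0
F_stat = t_stat**2               # in SLR, F = t^2
print(f"t = {t_stat:.2f}, F = {F_stat:.1f}, p = {res.pvalue:.5f}")
# For this data the values should come out near t ≈ 4.4 and F ≈ 19.1.
```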
Hypothesis Using ANOVA
 For computation of the F-statistic, the numerator df and denominator df are considered
 In Excel, '=FINV(α, df1, df2)' gives the F* critical value
 Similarly, '=TINV(p, df)' gives the t* critical value
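For readers not using Excel, equivalent lookups exist in scipy (a sketch; the α = 0.05 and df values are illustrative):

```python
from scipy.stats import f, t

alpha = 0.05

# Excel =FINV(alpha, df1, df2): right-tailed inverse of the F distribution.
f_star = f.isf(alpha, 2, 9)

# Excel =TINV(p, df): two-tailed inverse of the t distribution,
# i.e. p is split across both tails.
t_star = t.isf(alpha / 2, 9)

print(f_star, t_star)
```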
Ex 1: SLR Evaluation using ANOVA
 The given data set contains the winning times (in seconds) of the 22 men's 200-meter Olympic sprints held between 1900 and 1996
 Is there a linear relationship between year and winning time?
 Are sprinters getting faster?

Example from: online.stat.psu.edu/stat462/node/108/

Conduct the formal F-test
 Null hypothesis H0: β1 = 0
 Alternative hypothesis Ha: β1 ≠ 0
 Consider the P-value, which is 0.000 (to three decimal places); that is, the P-value is less than 0.001
 Therefore, we reject the null hypothesis H0: β1 = 0 in favor of the alternative hypothesis Ha: β1 ≠ 0
Equivalence of ANOVA F-test and t-test

For the example, the square of the t statistic equals the F statistic: $(-13.33)^2 = 177.7$
Equivalence of ANOVA F-test and t-test
 For a given significance level α, the F-test of β1 = 0 versus β1 ≠ 0 is algebraically equivalent to the two-tailed t-test
 We get the same P-values
 If the F-test rejects H0, then the t-test also rejects it
 and likewise for Ha
 The F-test is appropriate for testing that the slope differs from 0 (β1 ≠ 0)
 Use the t-test to test that the slope is positive (β1 > 0) or negative (β1 < 0)
 The F-test is more useful for the multiple regression model, in which more than one slope parameter is to be tested
Equivalence of ANOVA F-test and t-test
 The P-value associated with the t-test is the same as the P-value associated with the analysis of variance F-test
 This is always true for the simple linear regression model
 Both P-values are 0.000 (to three decimal places)
Limitations of Statistical Model
 A regression model is selected in order to approximate the true population relationship
 The simple linear regression model uses one independent variable and has two parameters, intercept and slope:
$\hat{y} = b_0 + b_1 x$
 The multiple regression model uses more than one independent variable:
$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$
 and tries to fit all data points
Simple/Multiple Linear Regression Model
• Trying to fit all data points may lead to overfitting, where the model captures the random error
• Overfit regression models have too many terms for the number of observations
• which results in coefficients that reflect noise rather than actual relationships

[Figure: an overfit model compared with the actual relationship between the variables]
Over-fitting
 Overfitting is a modeling error that occurs when a function or model fits the training set too closely, producing a drastic difference in fit on the test set
 The statistical model begins to describe the random error in the data rather than the relationships between the variables
 R-squared is a popular measure of quality of fit in regression
 However, it does not offer significant information about how well a given regression model can predict future values
 Overfitting leads to misleading R-squared values, regression coefficients, and p-values
 Overfitting a regression model reduces its generalizability outside the original dataset
Detecting over-fit models: Cross validation
• We can detect overfitting by determining whether the model fits new data as well as it fits the data used to estimate it
• Cross validation is used to estimate behaviour on the larger population from a small part of the data set
• It evaluates machine learning models on a limited data sample
• Split the dataset into k groups
• Called k-fold cross validation
• Randomly split the dataset into k folds/groups of equal size
• The first fold is used as the validation set and the model is fit on the remaining k−1 folds; the process is then repeated with each fold in turn serving as the validation set
K-fold cross validation
 Choosing the right value of k is quite complex
 The behaviour of the model depends on the dataset
 Some ways of choosing the value of k are:
 each train/test group of data should be large enough to be statistically representative
 k = 10, which has been found experimentally to work well
 k = n, where n is the size of the data set, so that each sample is given equal opportunity to serve as the validation set
 This is called Leave-One-Out Cross Validation (LOOCV)
Ex: k-fold cross validation
 Data samples: [1, 2, 3, 4, 5, 6]
 k = 3
 Fold1 = [5, 2], Fold2 = [1, 3], Fold3 = [4, 6]
 Model1: trained on Fold1 + Fold2, tested on Fold3
 Model2: trained on Fold2 + Fold3, tested on Fold1
 Model3: trained on Fold1 + Fold3, tested on Fold2
 The minimum size of a test fold can be 1 (the LOOCV case); a sketch in code follows below
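The same kind of split can be produced with scikit-learn's KFold (a sketch; with shuffling, the exact fold contents depend on the random seed, so they will generally differ from the folds listed above):

```python
from sklearn.model_selection import KFold

data = [1, 2, 3, 4, 5, 6]
kf = KFold(n_splits=3, shuffle=True, random_state=0)

for i, (train_idx, test_idx) in enumerate(kf.split(data), start=1):
    train = [data[j] for j in train_idx]
    test = [data[j] for j in test_idx]
    print(f"Model{i}: trained on {train}, tested on {test}")
```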
Cross validation: The ideal procedure
• Divide the data into three sets: training, validation and test sets
• The parameters of the regression model are calculated from the training data
• and accuracy is measured on new data
• The validation error gives an unbiased estimate of the predictive power of the model
K-fold Cross validation
 Split the data into 5 folds
 Fit a model to the training folds
 Use the test fold to determine the cross-validation metric
 Repeat the process for the next fold (see the sketch below)
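A sketch of the 5-fold procedure with scikit-learn (the synthetic linear data and the default R² scoring metric are assumptions made for the illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(40, 1))
y = 3.0 * X.ravel() + rng.normal(0, 2, size=40)  # synthetic linear data

# 5-fold CV: fit on 4 folds, score (R^2 by default) on the held-out fold.
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(scores, scores.mean())
```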
References
 Machine Learning, IBM
 David Scott, Mikki Hebl, Rudy Guerra, Dan Osherson, and Heidi Zimmer, Introduction to Statistics, Online edition
 William Mendenhall, Robert Beaver, Barbara Beaver, Introduction to Probability and Statistics, Cengage, 14th edition
 https://online.stat.psu.edu/stat462/node/91/
 https://openstax.org/details/books/introductory-business-statistics
