Topic 6B Regression


TMA1111 Mathematical Techniques Faculty of Information Science & Technology

TOPIC 6: HYPOTHESES & REGRESSION

B. REGRESSION & CORRELATION

The Simple Linear Regression Model

Regression analysis is the part of statistics that deals with investigation of the
relationship between two or more variables using probabilistic models.

A regression model is a mathematical equation that describes the relationship between
two or more variables.
The simplest deterministic mathematical relationship between two variables x and y is
a linear relationship using the equation/ model: 𝑦 = 𝛽0 + 𝛽1 𝑥.
However, the relationship between two variables x and y may not be deterministic.

For our discussion, we shall assume that values of the variable x are fixed by the
experimenter. The variable x is the independent (predictor, explanatory) variable.
For a fixed x, the second variable will be a random variable Y with observed value y,
referred to as the dependent (response) variable.

A simple regression model includes only two variables: independent (X) and dependent
(Y) variables.
A regression model that gives a straight-line relationship between two variables is
called a simple linear regression model.

A first step in regression analysis involving two variables is to construct a scatter plot
of the observed data. In such a plot, each (xi, yi) is represented as a point plotted on a
two-dimensional coordinate system.

Scatter Plot
- A scatter plot or scatter diagram is a plot of the paired observations.
- It is a useful summary of a set of bivariate data (two variables), usually drawn before
working out a linear correlation coefficient or fitting a regression line.
- It gives a good visual picture of the relationship between the two variables, and aids
the interpretation of the correlation coefficient or regression model.
- The resulting pattern from the plot indicates the type and strength of the relationship
between the two variables.

TCK/TCP (2020) Page 1 of 13



(from Statistics Glossary by Easton & McColl)

A simple linear regression model describes the linear relationship between dependent
variable Y and a single independent variable x as
𝑌 = 𝛽0 + 𝛽1 𝑥 + 𝜀
where Y is the response (dependent) variable,
x is the explanatory (predictor, independent) variable,
𝛽0 and 𝛽1 are the regression coefficients,
𝜀 is the random error, with E[𝜀] = 0 and Var[𝜀] = 𝜎²,
and 𝛽0, 𝛽1 and 𝜎² are parameters.
NOTE:
1. 𝛽0 indicates the y-intercept only if the scope of the model includes the value x = 0.
2. 𝛽1 indicates the change in the mean response associated with a one-unit increase in
x. (𝛽1 is also the slope of the regression line.)
3. The true (or population) regression line 𝑦 = 𝛽0 + 𝛽1 𝑥 is the line of mean
values; for a particular x value, y is the expected value of Y for that value of x.
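
The model can be illustrated with a small simulation. The sketch below is only an illustration: the parameter values are hypothetical, and the errors are assumed to be normally distributed, which the model above does not require (it only specifies E[𝜀] = 0 and Var[𝜀] = 𝜎²).

```python
import random

def simulate_responses(xs, beta0, beta1, sigma, seed=0):
    """Draw one Y for each fixed x from Y = beta0 + beta1*x + eps.

    Assumption: eps is normal with mean 0 and standard deviation sigma.
    """
    rng = random.Random(seed)
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]

# Hypothetical values chosen only for illustration.
xs = [10, 20, 30, 40, 50]
ys = simulate_responses(xs, beta0=2.0, beta1=0.5, sigma=1.0)

# Points on the true regression line (the mean of Y at each x).
means = [2.0 + 0.5 * x for x in xs]

for x, y, m in zip(xs, ys, means):
    print(f"x={x:3d}  mean of Y={m:5.1f}  observed y={y:6.2f}")
```

Each observed y scatters around the corresponding mean value on the line; setting sigma to 0 recovers the deterministic relationship.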

Figure 1. Points corresponding to observations from the simple linear regression model


6.6 Estimated Regression Model


- Many straight lines (regression models) can be drawn through the scatter plot.
- Each line has different values of the slope (𝛽1) and y-intercept (𝛽0).
- To find the line that best fits the points in the scatter plot, we use the least squares
method to obtain a best-fit line called the least squares regression line.
- The estimated least squares regression line (the regression of y on x; also called
the estimated regression model or best-fit line) may be written as y = 𝛽̂0 + 𝛽̂1 𝑥.

Least Squares Method


Consider a given sample data set {(x1, y1), (x2, y2), …, (xi, yi), …, (xn, yn)}. Let yi be the
observed value of a random variable Yi, where 𝑌𝑖 = 𝛽0 + 𝛽1 𝑥𝑖 + 𝜀𝑖 . The errors 𝜀𝑖 are
independent random variables.

If the line 𝑦 = 𝛽0 + 𝛽1 𝑥 is used to fit the model, the fitted values 𝑦̂𝑖 are obtained via
𝑦̂𝑖 = 𝛽0 + 𝛽1 𝑥𝑖 . The residual 𝑒𝑖 = 𝑦𝑖 − 𝑦̂𝑖 = 𝑦𝑖 − (𝛽0 + 𝛽1 𝑥𝑖) is the vertical
deviation of the point (xi, yi) from the fitted line y = 𝛽0 + 𝛽1 𝑥.

The error sum of squares, denoted by SSE, is:


$$\mathrm{SSE} = \sum_i \varepsilon_i^2 = \sum_i (Y_i - \hat{Y}_i)^2 = \sum_i (Y_i - \beta_0 - \beta_1 X_i)^2$$
It is used as the measure of goodness of fit.
Using the principle of least squares, we minimize this sum of squares to obtain the
estimated regression line or least squares line.

A line provides a good fit to the data if the vertical distances (deviations) from the
observed points to the line are “small” (see Figure 2).

Figure 2. Deviations of observed data from line y = 𝛽0 + 𝛽1 𝑥.
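
To make the least squares principle concrete, the sketch below computes SSE for two candidate lines through the same points. The data are hypothetical toy values, not taken from these notes; for these four points the least squares line works out to y = 1.9x.

```python
def sse(xs, ys, b0, b1):
    """Sum of squared vertical deviations of the points from y = b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical toy data, for illustration only.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]

sse_best = sse(xs, ys, b0=0.0, b1=1.9)   # least squares line for these points
sse_other = sse(xs, ys, b0=1.0, b1=1.0)  # an arbitrary alternative line

print(f"SSE for y = 1.9x   : {sse_best:.2f}")
print(f"SSE for y = 1 + x  : {sse_other:.2f}")
```

The line with the smaller SSE lies closer to the points in the vertical sense, which is exactly the criterion the least squares method minimizes.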


With such a regression model, one may be able to predict Y at values of X that were
not observed, using the trend between X and Y.

Example 6.11
Suppose that in a certain chemical process the reaction time Y (hour) is related to the
temperature X (°F) in the chamber in which the reaction takes place according to the
simple linear regression model with equation Y = 5.00 − 0.01X and 𝜎 = 0.075.
a. What is the expected change in reaction time for a 1 °F increase in temperature? For
a 10 °F increase in temperature?

b. What is the expected reaction time when temperature is 200 °F? When temperature
is 250 °F?

Solution:
a. The slope is 𝛽1 = −0.01, so for a 1 °F increase the expected change is
𝛽1(1) = −0.01(1) = −0.01#
For a 10 °F increase, the expected change is
𝛽1(10) = −0.01(10) = −0.1#
b. When X = 200 °F, expected Y = 5.00 − 0.01(200) = 3.00#
When X = 250 °F, expected Y = 5.00 − 0.01(250) = 2.50#
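
The arithmetic in Example 6.11 can be checked with a few lines of Python. This is a minimal sketch; the function names are illustrative and not part of the notes.

```python
# Coefficients of the true regression line given in Example 6.11.
BETA0 = 5.00
BETA1 = -0.01

def expected_time(temp_f):
    """Expected reaction time (hours) at temperature temp_f (degrees F)."""
    return BETA0 + BETA1 * temp_f

def expected_change(delta_f):
    """Expected change in reaction time (hours) for an increase of delta_f degrees F."""
    return BETA1 * delta_f

print("Change for +1 F :", round(expected_change(1), 4))
print("Change for +10 F:", round(expected_change(10), 4))
print("Time at 200 F   :", round(expected_time(200), 4))
print("Time at 250 F   :", round(expected_time(250), 4))
```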

Least Squares Regression Line


The estimated least-squares (regression) line for the data is given by
y = 𝛽̂0 + 𝛽̂1 𝑥.
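Here the least squares estimate of the slope is
$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}$$
where
$$S_{xy} = \sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n} \quad\text{and}\quad S_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n}$$
(S stands for “sum of squares”)

and the least squares estimate of the intercept is
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
where $\bar{y} = \frac{\sum y}{n}$ is the mean of y, $\bar{x} = \frac{\sum x}{n}$ is the mean of x, and n is the sample size.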
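
The formulas for the slope and intercept estimates translate directly into code. The following is a minimal sketch using only the sums defined above; the function name and the toy data are hypothetical.

```python
def fit_line(xs, ys):
    """Return (b0, b1), the least squares intercept and slope, via Sxy and Sxx."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    b1 = s_xy / s_xx                    # slope estimate
    b0 = sum_y / n - b1 * sum_x / n     # intercept estimate: y_bar - b1 * x_bar
    return b0, b1

# Hypothetical toy data, for illustration only.
b0, b1 = fit_line([1, 2, 3, 4], [2, 4, 5, 8])
print(f"fitted line: y = {b0:.4f} + {b1:.4f} x")
```

Note that the code keeps full floating-point precision in the slope before computing the intercept, which avoids the rounding problem that arises in hand calculation.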


Minimizing SSE with respect to the linear regression parameters (𝛽0, 𝛽1) gives the
estimates 𝛽̂0 and 𝛽̂1 above.
The least squares estimators of 𝛽0 and 𝛽1 are unbiased and have minimum
variance among all linear unbiased estimators.
In computing 𝛽̂0 , keep extra digits (at least 4 decimal places) in 𝛽̂1 because, if 𝑥̅ is large
in magnitude, rounding will affect the final answer.
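
The minimization can be sketched as follows. This is the standard calculus argument, written in the notation of these notes; it is offered as a summary, not as the derivation originally shown at this point. Setting both partial derivatives of SSE to zero gives the normal equations, whose solution is the pair of estimates quoted earlier.

```latex
\begin{align*}
\mathrm{SSE}(\beta_0,\beta_1) &= \sum_i (y_i - \beta_0 - \beta_1 x_i)^2 \\
\frac{\partial\,\mathrm{SSE}}{\partial \beta_0} &= -2\sum_i (y_i - \beta_0 - \beta_1 x_i) = 0
  \;\Longrightarrow\; n\beta_0 + \beta_1 \sum_i x_i = \sum_i y_i \\
\frac{\partial\,\mathrm{SSE}}{\partial \beta_1} &= -2\sum_i x_i\,(y_i - \beta_0 - \beta_1 x_i) = 0
  \;\Longrightarrow\; \beta_0 \sum_i x_i + \beta_1 \sum_i x_i^2 = \sum_i x_i y_i \\
\text{First equation:}\quad \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} \\
\text{Substituting into the second:}\quad
\hat{\beta}_1 &= \frac{\sum_i x_i y_i - \frac{(\sum_i x_i)(\sum_i y_i)}{n}}
                      {\sum_i x_i^2 - \frac{(\sum_i x_i)^2}{n}}
               = \frac{S_{xy}}{S_{xx}}
\end{align*}
```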

NOTE:
It must be emphasized that before 𝛽̂0 and 𝛽̂1 are computed, a scatter plot should be
examined to see whether a linear probabilistic model is plausible.
If the points do not tend to cluster about a straight line with roughly the same degree
of spread for all x, other models should be investigated.
In practice, plots and regression calculations are usually done by using a statistical
computer package.

Estimating 𝜎² and 𝜎
The variance parameter, 𝜎², determines the amount of variability inherent in the
regression model.
After a regression model has been fitted, the fitted values 𝑦̂𝑖 are obtained via
𝑦̂𝑖 = 𝛽̂0 + 𝛽̂1 𝑥𝑖 with residuals 𝑒𝑖 = 𝑦𝑖 − 𝑦̂𝑖 .

The residuals can be used to give an estimate of 𝜎².

An unbiased estimate of 𝜎² is given by
$$\hat{\sigma}^2 = s^2 = \frac{\mathrm{SSE}}{n-2}$$
where SSE is the error sum of squares:
$$\mathrm{SSE} = S_{yy} - \left(\frac{S_{xy}}{S_{xx}}\right) S_{xy} = S_{yy} - \hat{\beta}_1 S_{xy}$$

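The shortcut formula for SSE can be checked against the definition (the sum of squared residuals). The sketch below does both on hypothetical toy data and then divides by n − 2; the function name is illustrative only.

```python
def error_variance(xs, ys):
    """Return (sse_direct, sse_shortcut, s2) for the least squares line."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    b1 = s_xy / s_xx
    b0 = sum_y / n - b1 * sum_x / n
    # Definition: sum of squared residuals about the fitted line.
    sse_direct = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    # Shortcut: SSE = Syy - b1 * Sxy.
    sse_shortcut = s_yy - b1 * s_xy
    s2 = sse_shortcut / (n - 2)  # unbiased estimate of sigma^2
    return sse_direct, sse_shortcut, s2

# Hypothetical toy data, for illustration only.
sse_direct, sse_shortcut, s2 = error_variance([1, 2, 3, 4], [2, 4, 5, 8])
print(f"SSE (residuals) = {sse_direct:.4f}")
print(f"SSE (shortcut)  = {sse_shortcut:.4f}")
print(f"s^2             = {s2:.4f}")
```

The two SSE values agree up to floating-point rounding, which is why the shortcut is convenient for hand calculation.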

Steps to Solve Simple Linear Regression Problem


The simple regression line 𝑦 = 𝛽0 + 𝛽1 𝑥 is estimated by the least squares
regression line y = 𝛽̂0 + 𝛽̂1 𝑥.

Step 1: Draw the scatter plot of the (X,Y) data for visual inspection of the relationship
that may exist between X and Y.
{Note: This step can be skipped if the scatter diagram is not required in the question.}

Step 2: Construct the following table to facilitate computation.


k     X      Y      X²      Y²      XY
1     x1     y1     x1²     y1²     x1 y1
2     x2     y2     x2²     y2²     x2 y2
⋮     ⋮      ⋮      ⋮       ⋮       ⋮
n     xn     yn     xn²     yn²     xn yn
Sum   Σ xi   Σ yi   Σ xi²   Σ yi²   Σ xi yi

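Step 3: Calculate the linear regression parameters (𝛽0, 𝛽1) using the formulas below:
$$S_{xy} = \sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n} \quad\text{and}\quad S_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n}$$
Hence,
$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} \quad\text{and}\quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
where $\bar{y} = \frac{\sum y}{n}$ is the mean of y, $\bar{x} = \frac{\sum x}{n}$ is the mean of x, and n is the sample size.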

Step 4: The linear regression model of the data is given by
y = 𝛽̂0 + 𝛽̂1 𝑥, obtained by substituting the values of 𝛽̂0 and 𝛽̂1 .

Additionally (if asked by the question), we can compute 𝑆𝑆𝐸 = 𝑆𝑦𝑦 − 𝛽̂1 𝑆𝑥𝑦 and
hence an unbiased estimate of 𝜎²,
$$\hat{\sigma}^2 = s^2 = \frac{\mathrm{SSE}}{n-2}, \quad\text{where}\quad S_{yy} = \sum y^2 - \frac{\left(\sum y\right)^2}{n}.$$
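
Step 2 is the part of the procedure that is most tedious by hand. As a sketch, the computation table and its column sums can be built as follows; the function name and the toy data are hypothetical, and Step 3 is then applied to the sums.

```python
def computation_table(xs, ys):
    """Build rows (k, x, y, x^2, y^2, xy) and the five column sums."""
    rows = [(k, x, y, x * x, y * y, x * y)
            for k, (x, y) in enumerate(zip(xs, ys), start=1)]
    sums = tuple(sum(row[col] for row in rows) for col in range(1, 6))
    return rows, sums

# Hypothetical toy data, for illustration only.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]
rows, sums = computation_table(xs, ys)

# Step 2: print the table.
print(f"{'k':>3} {'x':>4} {'y':>4} {'x^2':>5} {'y^2':>5} {'xy':>5}")
for row in rows:
    print("{:>3} {:>4} {:>4} {:>5} {:>5} {:>5}".format(*row))
print("Sum {:>4} {:>4} {:>5} {:>5} {:>5}".format(*sums))

# Step 3: estimates from the column sums.
sum_x, sum_y, sum_x2, sum_y2, sum_xy = sums
n = len(xs)
s_xy = sum_xy - sum_x * sum_y / n
s_xx = sum_x2 - sum_x ** 2 / n
b1 = s_xy / s_xx
b0 = sum_y / n - b1 * sum_x / n

# Step 4: the fitted line.
print(f"y = {b0:.4f} + {b1:.4f} x")
```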


Example 6.12
A cloth manufacturer wants to determine the relationship between the thickness of a
synthetic fiber and its tensile strength. Researchers took measurements at various pre-
selected, known levels of fiber thickness, and the following data was collected.
Fiber thickness    40  31  34  44  49  36  41  50  39  45
Tensile strength   83  74  72  70  75  73  70  76  79  72
If the fiber thickness is 45, what would be the predicted strength?
In addition, give an estimate of the standard deviation of the model error.

Solution:
Step 1: Draw the scatter plot of the (X,Y) data for visual inspection of the
relationship that may exist between X and Y.
{Note: This step can be skipped}
[Scatter plot: tensile strength Y (axis from 70 to 85) against fiber thickness X (axis from 30 to 50)]

Step 2: Construct the following table to facilitate computation.


k     x      y      x²      y²      xy
1     40     83     1600    6889    3320
2     31     74     961     5476    2294
3     34     72     ⋮       ⋮       ⋮
⋮     ⋮      ⋮      ⋮       ⋮       ⋮
9     39     79
10    45     72
Sum   Σx = 409   Σy = 744   Σx² = 17077   Σy² = 55504   Σxy = 30436


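Step 3: Calculate the linear regression parameters (𝛽0, 𝛽1):

Using the table above with n = 10, we determine
$$\bar{x} = \frac{1}{n}\sum_i x_i = 40.9, \qquad \bar{y} = \frac{1}{n}\sum_i y_i = 74.4$$
$$S_{xy} = 30436 - \frac{(409)(744)}{10} = 6.4, \qquad S_{xx} = 17077 - \frac{409^2}{10} = 348.9$$
and
$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{6.4}{348.9} = 0.01834$$
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 74.4 - (0.01834)(40.9) = 73.6499$$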

Step 4: The linear regression model of the data is given by
y = 73.6499 + 0.0183x

When thickness x = 45, the model predicts tensile strength y to be
y = 73.6499 + 0.0183(45)
= 74.4734#

For SSE and the estimate of 𝜎²:
$$S_{yy} = \sum_i y_i^2 - \frac{\left(\sum_i y_i\right)^2}{n} = 55504 - \frac{744^2}{10} = 150.4$$
$$\mathrm{SSE} = S_{yy} - \hat{\beta}_1 S_{xy} = 150.4 - (0.01834)(6.4) = 150.28$$
$$\hat{\sigma}^2 = s^2 = \frac{150.28}{8} = 18.785$$
An estimate of the standard deviation (𝜎) of the model error is √18.785 = 4.33
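
The hand calculation in Example 6.12 can be reproduced with the short script below, which uses the data from the example and the formulas from Steps 2 to 4. Because it keeps full precision in the slope, the prediction comes out at about 74.475 rather than the 74.4734 obtained above with the slope rounded to 0.0183; the variable names are illustrative only.

```python
import math

# Data from Example 6.12.
thickness = [40, 31, 34, 44, 49, 36, 41, 50, 39, 45]
strength = [83, 74, 72, 70, 75, 73, 70, 76, 79, 72]

n = len(thickness)
sum_x, sum_y = sum(thickness), sum(strength)
s_xx = sum(x * x for x in thickness) - sum_x ** 2 / n
s_yy = sum(y * y for y in strength) - sum_y ** 2 / n
s_xy = sum(x * y for x, y in zip(thickness, strength)) - sum_x * sum_y / n

b1 = s_xy / s_xx                      # slope estimate
b0 = sum_y / n - b1 * sum_x / n       # intercept estimate
y_at_45 = b0 + b1 * 45                # predicted strength at thickness 45

sse = s_yy - b1 * s_xy                # error sum of squares (shortcut formula)
s = math.sqrt(sse / (n - 2))          # estimated standard deviation of the error

print(f"Sxy = {s_xy:.1f}, Sxx = {s_xx:.1f}, Syy = {s_yy:.1f}")
print(f"b1 = {b1:.5f}, b0 = {b0:.4f}")
print(f"prediction at x = 45: {y_at_45:.3f}")
print(f"SSE = {sse:.2f}, s = {s:.2f}")
```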

NOTE: For the above example,
1) The intercept 𝛽0 has no practical interpretation, since the range of the sample data
does not include x = 0.
2) Within the scope of the model, we have a linear relationship between x and y.
3) We should not make inferences about the relationship between x and y for values
outside the range of the sample data.


Exercise 1
A chemical engineer is investigating the effect of process operating temperature on
product yield. The study results in the following data:

Temperature, °C   100  110  120  130  140  150  160  170  180  190
Yield, %           45   51   54   61   66   70   74   78   85   89

a) What variable should be the independent variable and dependent variable?


b) The data is plotted in Fig. 14-1 (scatter diagram). What can you observe from the
scatter diagram? What model is reasonable to use?
c) Find the regression line equation that represents this set of data.

Solution:


6.7 The Coefficient of Determination & Correlation

6.7.1 The Coefficient of Determination

The sample coefficient of determination, r², represents the proportion of the total
variation of the variable Y that can be explained by a linear relationship with the values
of X.
It is widely used to determine how well a regression fits, in other words, how close the
points are to the regression line.

A quantitative measure of the total amount of variation in observed y values is given
by the total sum of squares, SST.

SSE = the sum of squared deviations about the least squares line y = 𝛽̂0 + 𝛽̂1 𝑥,
SST = the sum of squared deviations about the horizontal line at height 𝑦̅,
SSE/SST = the proportion of total variation that cannot be explained by the simple
linear regression model,
1 − SSE/SST = the proportion of observed y variation explained by the model.

Thus, r² = 1 − SSE/SST

The higher the value of r², the more successful the simple linear regression model is in
explaining y variation.
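
As a sketch of the definition r² = 1 − SSE/SST, the function below fits the least squares line and compares the two sums of squares. It uses sums of deviations from the means, which are algebraically equivalent to the Sxy and Sxx formulas; the function name and toy data are hypothetical.

```python
def r_squared(xs, ys):
    """Coefficient of determination, computed as 1 - SSE/SST."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    # Deviations about the fitted line.
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    # Deviations about the horizontal line at height y_bar.
    sst = sum((y - y_bar) ** 2 for y in ys)
    return 1 - sse / sst

# Hypothetical toy data, for illustration only.
r2 = r_squared([1, 2, 3, 4], [2, 4, 5, 8])
print(f"r^2 = {r2:.4f}")
```

A value close to 1 means that nearly all of the variation in y is accounted for by the fitted line.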


6.7.2 Correlation

Correlation analysis is used to measure the strength of linear relation between X and
Y by means of a single number called a correlation coefficient.

The population correlation coefficient 𝜌 is defined as
$$\rho = \frac{\sigma_{XY}}{\sqrt{\sigma_{XX}\,\sigma_{YY}}}, \quad\text{with } -1 \le \rho \le +1.$$

Some useful indications of the correlation coefficient:
- 𝜌 = ±1 only occurs when there is a perfect linear relationship between the two
variables,
- 𝜌 = +1 implies a perfect linear relationship with a positive slope (𝛽1 > 0),
- 𝜌 = −1 implies a perfect linear relationship with a negative slope (𝛽1 < 0).

Thus, if a sample’s correlation coefficient is close to unity in magnitude, this implies
a good correlation or linear association between X and Y, whereas values near
zero indicate little or no correlation.

Sample estimate of the correlation coefficient

The sample estimate of the correlation coefficient, r, is defined as
$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} \quad\text{or}\quad r = \pm\sqrt{r^2}$$
where, in the second form, r takes the sign of the slope 𝛽̂1.
The value of r (−1 ≤ r ≤ 1) measures how strong the linear relationship
between X and Y is.
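
The definition of r can be sketched in code as follows. The function name and the small data sets are hypothetical; the two perfectly linear examples illustrate the limiting values +1 and −1.

```python
import math

def sample_correlation(xs, ys):
    """Sample correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data sets, for illustration only.
r_up = sample_correlation([1, 2, 3], [2, 4, 6])       # perfect line, positive slope
r_down = sample_correlation([1, 2, 3], [6, 4, 2])     # perfect line, negative slope
r_toy = sample_correlation([1, 2, 3, 4], [2, 4, 5, 8])

print(f"r (increasing line) = {r_up:.4f}")
print(f"r (decreasing line) = {r_down:.4f}")
print(f"r (toy data)        = {r_toy:.4f}")
```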


From the scatter plot or scatter diagram, we may roughly estimate the correlation
coefficient, r.

[Figure: four scatter plots of Y against X illustrating different values of r]

A value of r near 0 is not evidence of the lack of a strong relationship, but only of the
absence of a linear relationship or correlation.
The relationship is considered weak if |r| lies between 0 and 0.5, strong if |r| lies
between 0.8 and 1, and moderate otherwise. Refer to the table below for a summary
of r values:

Range of r        Relationship
−1 to −0.8        Strong negative
−0.8 to −0.5      Moderate negative
−0.5 to 0         Weak negative
0 to 0.5          Weak positive
0.5 to 0.8        Moderate positive
0.8 to 1          Strong positive
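
The classification above can be written as a small helper. This is a sketch under stated assumptions: the notes do not say which category the boundary values belong to, so here |r| = 0.5 is treated as weak and |r| = 0.8 as strong, and r = 0 is reported as no linear relationship. The function name is hypothetical.

```python
def describe_correlation(r):
    """Label r as weak, moderate or strong, with its direction."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    if r == 0:
        # Assumption: exactly zero is reported separately.
        return "no linear relationship"
    size = abs(r)
    if size >= 0.8:        # assumption: 0.8 itself counts as strong
        strength = "strong"
    elif size > 0.5:       # assumption: 0.5 itself counts as weak
        strength = "moderate"
    else:
        strength = "weak"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"

for value in (-0.95, -0.6, 0.3, 0.9111):
    print(f"r = {value:7.4f}: {describe_correlation(value)}")
```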


Example 6.13

Compute the correlation coefficient between X (test grade) and Y (number of years) if
SXX = 10.5, SYY = 1504.1, SXY = 114.5.
What does this value indicate?

Solution:
$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{114.5}{\sqrt{(10.5)(1504.1)}}$$
= 0.9111#

The r value of 0.9111 shows a strong positive correlation between test grade and the
number of years.
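
The value in Example 6.13 can be verified directly from the three given sums of squares. This is a minimal check; the variable names are illustrative.

```python
import math

# Sums of squares given in Example 6.13.
s_xx = 10.5
s_yy = 1504.1
s_xy = 114.5

r = s_xy / math.sqrt(s_xx * s_yy)   # sample correlation coefficient
r2 = r ** 2                         # coefficient of determination

print(f"r   = {r:.4f}")
print(f"r^2 = {r2:.4f}")
```

Squaring r gives the proportion of the variation in Y that is explained by the linear relationship with X.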

Exercise 2
Evaluate the correlation coefficient of the data in Exercise 1.
State any conclusion you may draw from the answer.
