Topic 6B Regression
Regression analysis is the part of statistics that deals with investigation of the
relationship between two or more variables using probabilistic models.
For our discussion, we shall assume that values of the variable x are fixed by the
experimenter. The variable x is the independent (predictor, explanatory) variable.
For a fixed x, the second variable will be a random variable Y with observed value y,
referred to as the dependent (response) variable.
A simple regression model includes only two variables: independent (X) and dependent
(Y) variables.
A regression model that gives a straight-line relationship between two variables is
called a simple linear regression model.
A first step in regression analysis involving two variables is to construct a scatter plot
of the observed data. In such a plot, each (xi, yi) is represented as a point plotted on a
two-dimensional coordinate system.
Scatter Plot
- A scatter plot or scatter diagram is a plot of the paired observations.
- It is a useful summary of a set of bivariate data (two variables), usually drawn before
working out a linear correlation coefficient or fitting a regression line.
- It gives a good visual picture of the relationship between the two variables, and aids
the interpretation of the correlation coefficient or regression model.
- The resulting pattern from the plot indicates the type and strength of the relationship
between the two variables.
A simple linear regression model describes the linear relationship between dependent
variable Y and a single independent variable x as
𝑌 = 𝛽0 + 𝛽1 𝑥 + 𝜀
where Y is the response variable/dependent variable
x is the explanatory variable/predictor/ independent variable
𝛽0 and 𝛽1 are the regression coefficients
𝜀 is the random error, with E[𝜀] = 0 and Var[𝜀] = 𝜎²
𝛽0 , 𝛽1 and 𝜎² are parameters.
NOTE:
1. 𝛽0 is the y-intercept; it is meaningful only if the scope of the model includes
the value x = 0.
2. 𝛽1 is the change in the mean response associated with a one-unit increase in
x. (𝛽1 is also the slope of the regression line.)
3. The true (or population) regression line 𝑦 = 𝛽0 + 𝛽1 𝑥 is the line of mean
values; for a particular x value, its height y is the expected value of Y for that value of x.
Figure 1. Points corresponding to observations from the simple linear regression model
If the line 𝑦 = 𝛽0 + 𝛽1 𝑥 is used to fit the model, the fitted values 𝑦̂𝑖 are obtained via
𝑦̂𝑖 = 𝛽0 + 𝛽1 𝑥𝑖 . The residual 𝑒𝑖 = 𝑦𝑖 − 𝑦̂𝑖 = 𝑦𝑖 − (𝛽0 + 𝛽1 𝑥𝑖) is the vertical
deviation of the point (xi, yi) from the fitted line y = 𝛽0 + 𝛽1 𝑥.
A line provides a good fit to the data if the vertical distances (deviations) from the
observed points to the line are “small” (see Figure 2).
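As an illustration of the residual definition above, the following minimal Python sketch computes the vertical deviations of a few points from a candidate line. The data points and the coefficients `b0` and `b1` are made up for illustration and do not come from these notes.

```python
# Hypothetical observations (x_i, y_i) and a hypothetical candidate line.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
b0, b1 = 0.0, 2.0  # assumed intercept and slope, for illustration only


def residuals(xs, ys, b0, b1):
    """Return e_i = y_i - (b0 + b1 * x_i) for every observation."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


res = residuals(xs, ys, b0, b1)
# The sum of squared residuals is one way to quantify how "small" they are.
sum_sq = sum(e * e for e in res)
print(res, sum_sq)
```

A line with smaller squared deviations than another fits the same points more closely in this sense.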
With such a regression model, one can predict Y at x values that were not observed,
using the trend between X and Y.
Example 6.11
Suppose that in a certain chemical process the reaction time Y (hours) is related to the
temperature X (°F) in the chamber in which the reaction takes place according to the
simple linear regression model with equation Y = 5.00 - 0.01X and 𝜎 = 0.075.
a. What is the expected change in reaction time for a 1 °F increase in temperature? For
a 10 °F increase in temperature?
b. What is the expected reaction time when the temperature is 200 °F? When the
temperature is 250 °F?
Solution:
a. For a 1 °F increase in temperature, the expected change in reaction time is
𝛽1 × 1 = -0.01 × 1 = -0.01 hour#
For a 10 °F increase in temperature, the expected change in reaction time is
𝛽1 × 10 = -0.01 × 10 = -0.1 hour#
b. When X = 200 °F, the expected reaction time is 5.00 - 0.01(200) = 3.00 hours#
When X = 250 °F, the expected reaction time is 5.00 - 0.01(250) = 2.50 hours#
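The arithmetic in Example 6.11 can be checked with a short Python sketch. The function names are chosen here for illustration; the coefficients 5.00 and -0.01 are those given in the example.

```python
def expected_reaction_time(temp_f, b0=5.00, b1=-0.01):
    """Mean reaction time (hours) at temperature temp_f (deg F)."""
    return b0 + b1 * temp_f


def expected_change(delta_f, b1=-0.01):
    """Expected change in reaction time for a delta_f degree increase."""
    return b1 * delta_f


for delta in (1, 10):
    print(f"Change for +{delta} deg F: {expected_change(delta):.2f} hours")
for temp in (200, 250):
    print(f"Expected time at {temp} deg F: {expected_reaction_time(temp):.2f} hours")
```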
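The least squares estimates of 𝛽0 and 𝛽1 are the values that minimize the sum of squared
residuals, SSE, with respect to the linear regression parameters (𝛽0, 𝛽1). They are

$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

where $S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n}$ and $S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$
(S stands for “sum of squares”)
and where $\bar{y} = \frac{\sum y}{n}$ is the mean of y, $\bar{x} = \frac{\sum x}{n}$ is the mean of x,
and n is the sample size.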
Least squares estimators of 𝛽0 and 𝛽1 given above are unbiased and have minimum
variance among all linear unbiased estimators.
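For readers who want the intermediate steps, the minimization of SSE can be sketched as follows. This is the standard textbook derivation, written here under the assumption that SSE denotes the sum of squared vertical deviations from the line; it is not reproduced from the original notes.

```latex
\begin{align*}
SSE(\beta_0, \beta_1) &= \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2 \\
\frac{\partial SSE}{\partial \beta_0} &= -2 \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right) = 0 \\
\frac{\partial SSE}{\partial \beta_1} &= -2 \sum_{i=1}^{n} x_i \left( y_i - \beta_0 - \beta_1 x_i \right) = 0 \\
\intertext{which give the normal equations}
n \beta_0 + \beta_1 \sum x_i &= \sum y_i \\
\beta_0 \sum x_i + \beta_1 \sum x_i^2 &= \sum x_i y_i \\
\intertext{with solution}
\hat{\beta}_1 &= \frac{\sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}}{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}
              = \frac{S_{xy}}{S_{xx}}, \qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
\end{align*}
```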
In computing 𝛽̂0 , keep extra digits (at least 4 decimal places) in 𝛽̂1 , because if 𝑥̅ is large
in magnitude, rounding 𝛽̂1 will noticeably affect the final answer.
NOTE:
It must be emphasized that before 𝛽̂0 and 𝛽̂1 are computed, a scatter plot should be
examined to see whether a linear probabilistic model is plausible.
If the points do not tend to cluster about a straight line with roughly the same degree
of spread for all x, other models should be investigated.
In practice, plots and regression calculations are usually done by using a statistical
computer package.
Estimating 𝜎² and 𝜎
The variance parameter, 𝜎², determines the amount of variability inherent in the
regression model.
After a regression model has been fitted, the fitted values 𝑦̂𝑖 are obtained via
𝑦̂𝑖 = 𝛽̂0 + 𝛽̂1 𝑥𝑖 with residuals 𝑒𝑖 = 𝑦𝑖 − 𝑦̂𝑖 .
Step 1: Draw the scatter plot of the (X,Y) data for visual inspection of the relationship
that may exist between X and Y.
{Note: This step can be skipped if the scatter diagram is not required in the question.}
Step 2: Tabulate the data and compute the column sums:
  n      $x_n$        $y_n$        $x_n^2$        $y_n^2$        $x_n y_n$
  Sum    $\sum x_i$   $\sum y_i$   $\sum x_i^2$   $\sum y_i^2$   $\sum x_i y_i$
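Step 3: Calculate the linear regression parameters (𝛽0, 𝛽1) using the formulas below:
$S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n}$ and $S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$
Hence, $\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
where $\bar{y} = \frac{\sum y}{n}$ is the mean of y, $\bar{x} = \frac{\sum x}{n}$ is the mean of x,
and n is the sample size.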
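Additionally (if asked by the question), we can compute $SSE = S_{yy} - \hat{\beta}_1 S_{xy}$ and
$\hat{\sigma}^2 = s^2 = \frac{SSE}{n - 2}$, where $S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n}$.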
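Steps 2 and 3, together with the optional SSE and variance estimate, can be sketched in Python as below. The function name `fit_simple_linear` and the small data set are hypothetical and serve only to show the computational formulas in action; this is one possible arrangement, not a prescribed implementation.

```python
from math import sqrt


def fit_simple_linear(xs, ys):
    """Least squares fit via the S_xx, S_xy, S_yy computational formulas."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need at least 3 paired observations")
    # Step 2: column sums.
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    # Step 3: sums of squares and the estimates.
    sxx = sum_x2 - sum_x ** 2 / n
    sxy = sum_xy - sum_x * sum_y / n
    syy = sum_y2 - sum_y ** 2 / n
    if sxx == 0:
        raise ValueError("all x values are equal; slope is undefined")
    b1 = sxy / sxx
    b0 = sum_y / n - b1 * sum_x / n
    # Optional: SSE and the variance estimate with n - 2 degrees of freedom.
    sse = max(syy - b1 * sxy, 0.0)  # guard against tiny negative round-off
    s2 = sse / (n - 2)
    return {"b0": b0, "b1": b1, "sse": sse, "s2": s2, "s": sqrt(s2)}


# Hypothetical data lying exactly on y = 2 + 3x, so SSE should be zero.
fit = fit_simple_linear([1, 2, 3, 4], [5, 8, 11, 14])
print(fit)
```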
Example 6.12
A cloth manufacturer wants to determine the relationship between the thickness of a
synthetic fiber and its tensile strength. Researchers took measurements at various pre-
selected, known levels of fiber thickness, and the following data was collected.
Fiber thickness 40 31 34 44 49 36 41 50 39 45
Tensile strength 83 74 72 70 75 73 70 76 79 72
If the fiber thickness was 45, what would be the predicted tensile strength?
In addition, give an estimate of the standard deviation of the model error.
Solution:
Step 1: Draw the scatter plot of the (X,Y) data for visual inspection of the
relationship that may exist between X and Y.
{Note: This step can be skipped}
[Scatter plot of tensile strength (Y, axis from 70 to 85) against fiber thickness (X, axis from 30 to 50)]
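With $S_{XY} = 6.4$ and $S_{XX} = 348.9$,
$\hat{\beta}_1 = \frac{S_{XY}}{S_{XX}} = \frac{6.4}{348.9} = 0.01834$
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 74.4 - (0.01834)(40.9) = 73.6499$
For a fiber thickness of 45, the predicted strength is
$\hat{y} = 73.6499 + 0.01834(45) \approx 74.48$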
$\hat{\sigma}^2 = s^2 = \frac{SSE}{n - 2} = \frac{150.28}{8} = 18.785$
An estimate of the standard deviation (𝜎) of the model error is $\hat{\sigma} = \sqrt{18.785} = 4.33$
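The figures in Example 6.12 can be reproduced with the following Python sketch, which applies the computational formulas directly to the tabulated data. Variable names are chosen here for illustration. Because the code keeps the slope at full precision, the results may differ from the hand computation in the last displayed digit.

```python
from math import sqrt

# Data from Example 6.12.
thickness = [40, 31, 34, 44, 49, 36, 41, 50, 39, 45]
strength = [83, 74, 72, 70, 75, 73, 70, 76, 79, 72]

n = len(thickness)
x_bar = sum(thickness) / n
y_bar = sum(strength) / n

sxx = sum(x * x for x in thickness) - sum(thickness) ** 2 / n
syy = sum(y * y for y in strength) - sum(strength) ** 2 / n
sxy = sum(x * y for x, y in zip(thickness, strength)) - sum(thickness) * sum(strength) / n

b1 = sxy / sxx                 # slope estimate
b0 = y_bar - b1 * x_bar        # intercept estimate
sse = syy - b1 * sxy           # error sum of squares
s2 = sse / (n - 2)             # estimate of the error variance
s = sqrt(s2)                   # estimate of the error standard deviation
y_hat_45 = b0 + b1 * 45        # predicted strength at thickness 45

print(f"Sxx={sxx:.1f}  Sxy={sxy:.1f}  Syy={syy:.1f}")
print(f"b1={b1:.5f}  b0={b0:.4f}")
print(f"SSE={sse:.2f}  s^2={s2:.3f}  s={s:.2f}")
print(f"Predicted strength at thickness 45: {y_hat_45:.2f}")
```

The very small slope relative to the spread of the strength values suggests that thickness explains little of the variation in strength for these data, which is consistent with the scattered appearance of the plot.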
Exercise 1
A chemical engineer is investigating the effect of process operating temperature on
product yield. The study results in the following data:
Temperature, °C 100 110 120 130 140 150 160 170 180 190
Yield, % 45 51 54 61 66 70 74 78 85 89
Solution:
The sample coefficient of determination, r², represents the proportion of the total
variation of the variable Y that can be explained by a linear relationship with the values
of X.
It is widely used to determine how well a regression fits, in other words, how close the
points are to the regression line.
SSE = the sum of squared deviations about the least squares line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$,
SST = the sum of squared deviations about the horizontal line at height $\bar{y}$.
SSE/SST = the proportion of total variation that cannot be explained by the simple
linear regression model,
1 – SSE/SST = the proportion of observed y variation explained by the model.
Thus, r² = 1 – SSE/SST
The higher the value of r², the more successful is the simple linear regression model in
explaining y variation.
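To make the definitions of SSE and SST concrete, the sketch below computes r² directly from the deviations rather than from the shortcut formulas. The function name `r_squared` is an illustrative choice, and the data reused here are the fiber measurements from Example 6.12.

```python
def r_squared(xs, ys):
    """Coefficient of determination, r^2 = 1 - SSE/SST, for a least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # SSE: squared deviations about the fitted line.
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    # SST: squared deviations about the horizontal line at height y_bar.
    sst = sum((y - y_bar) ** 2 for y in ys)
    return 1 - sse / sst


thickness = [40, 31, 34, 44, 49, 36, 41, 50, 39, 45]
strength = [83, 74, 72, 70, 75, 73, 70, 76, 79, 72]
r2_fiber = r_squared(thickness, strength)
print(f"r^2 for the fiber data: {r2_fiber:.5f}")
```

For the fiber data the value is very close to 0, meaning the fitted line explains almost none of the variation in tensile strength.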
6.7.2 Correlation
Correlation analysis is used to measure the strength of linear relation between X and
Y by means of a single number called a correlation coefficient.
From the scatter plot or scatter diagram, we may roughly estimate the correlation
coefficient, r.
[Figure: four scatter plots of Y against X illustrating different values of r]
A value of r near 0 is not evidence that X and Y are unrelated; it indicates only the
absence of a linear relation.
The correlation is considered weak if |r| falls between 0 and 0.5, strong if |r| is
between 0.8 and 1, and moderate otherwise. Refer to the diagram below for a summary
of r values:
Example 6.13
Compute the correlation coefficient between X (test grade) and Y (number of years) if
SXX = 10.5, SYY = 1504.1, SXY = 114.5
What does this value indicate?
Solution:
$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{114.5}{\sqrt{(10.5)(1504.1)}} = 0.9111$#
The r value of 0.9111 shows a strong correlation between test grade and the number of
years.
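The calculation in Example 6.13 can be sketched in Python from the summary statistics alone. The helper `strength_label` encodes the weak, moderate and strong ranges quoted earlier in this section, applied to the absolute value of r; both function names are illustrative assumptions.

```python
from math import sqrt


def correlation(sxx, syy, sxy):
    """Sample correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    return sxy / sqrt(sxx * syy)


def strength_label(r):
    """Classify |r| using the ranges given in these notes."""
    size = abs(r)
    if size >= 0.8:
        return "strong"
    if size <= 0.5:
        return "weak"
    return "moderate"


r = correlation(sxx=10.5, syy=1504.1, sxy=114.5)
print(f"r = {r:.4f} ({strength_label(r)})")
```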
Exercise 2
Evaluate the correlation coefficient of the data in Exercise 1.
State any conclusion you may draw from the answer.