Chapter IV: Correlation and Linear Regression

• Scatter Diagrams and Correlations
o Scatter Diagram
o Correlation
• Linear Regression
o Least-squares criterion
o Calculate slope and intercept for the least-squares line
o Plot the least-squares line
o Use the least-squares line for prediction
o Interpretation of the least-squares line (regression)
o Coefficient of Determination

Scatter Diagram (Plot)

A graph in which data pairs (x, y) are plotted as points on a grid with horizontal axis x and vertical
axis y.

x = explanatory variable
y = response variable

A scatter plot helps us determine if there is a relationship between the x and y values.

Sample Correlation Coefficient

A numerical measurement (value), denoted r, that assesses the strength of a linear relationship
between two variables x and y.
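
As a rough sketch of how r can be computed by hand or by machine, the snippet below uses the standard computation formula r = (nΣxy - ΣxΣy) / (sqrt(nΣx² - (Σx)²) · sqrt(nΣy² - (Σy)²)). The function name `sample_correlation` and the data are made up for illustration and do not come from the course materials.

```python
import math

def sample_correlation(xs, ys):
    """Sample correlation coefficient r for paired data (computation formula)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    # The denominator is zero if all x-values (or all y-values) are identical;
    # r is undefined in that case and this sketch does not handle it.
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Made-up illustrative data (not from the text).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = sample_correlation(xs, ys)
print(round(r, 3))
```

The value of r always lies between -1 and 1. Values near -1 or 1 indicate a strong linear relationship, and values near 0 indicate a weak one.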

How to use Table 4-6 to test ρ, the population correlation coefficient

1. First compute r from a random sample of n data pairs (x, y).

2. Find the table entry in the row headed by n and the column headed by your choice of α. Your
choice of α is the risk you are willing to take of mistakenly concluding that ρ ≠ 0 when, in fact, ρ
= 0.

3. Compare |r| to the table entry.

a) If |r| ≥ table entry, then there is sufficient evidence to conclude that ρ ≠ 0, and we say that
r is significant. In other words, we conclude that there is some population correlation
between the two variables x and y.
b) If |r| < table entry, then the evidence is insufficient to conclude that ρ ≠ 0, and we say that
r is not significant. We do not have enough evidence to conclude that there is any
correlation between the two variables x and y.

Example conclusion:

There is sufficient evidence, at the 99% confidence level, to conclude that a linear relationship
exists between the inlet phosphorus level (100 mg/L) and the outlet phosphorus level (100 mg/L)
in the population of California wetlands biotreatment facilities, based on a simple random sample
(SRS) of n = 8.

Notice how just about anyone can read your conclusion and have a general understanding of what it means.
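
The comparison in step 3 can be sketched in a few lines. This assumes the comparison uses the absolute value of r, so that negative correlations are handled the same way as positive ones. The function name `is_significant` and both numbers are hypothetical; a real table entry must be read from the critical-value table for your n and α.

```python
def is_significant(r, table_entry):
    """Step 3: compare |r| with the critical value read from the table."""
    return abs(r) >= table_entry

# Hypothetical numbers for illustration only (not taken from the table).
r = -0.91
table_entry = 0.83

if is_significant(r, table_entry):
    conclusion = "r is significant: sufficient evidence that rho != 0"
else:
    conclusion = "r is not significant: insufficient evidence that rho != 0"
print(conclusion)
```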

Correlation can be thought of as a measure of how well a linear model (line) fits the data points
on a scatter diagram.

Correlation does not mean Causation (x does not necessarily cause y to change)

1) The scatter diagram and r are from a sample (not the entire population)
2) Lurking variables
3) Range of samples
The correlation between variables computed from averages is usually higher than r for the raw data.
Do not use averages to compute correlation: doing so may falsely inflate r.
Lurking variable: a variable not included as an explanatory (or response) variable that may be
responsible for
o changes in x, or
o changes in y, or
o changes in both x and y
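
The caution about averages can be checked numerically. The sketch below uses a small made-up data set (not from the text) with two y-values at each x, and compares r for the raw pairs with r for the per-x averages. The helper name `correlation` is an assumption for illustration.

```python
import math
from collections import defaultdict

def correlation(xs, ys):
    """Sample correlation coefficient r, computed from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up raw data: two observations of y at each value of x.
raw = [(1, 0), (1, 2), (2, 1), (2, 3), (3, 2), (3, 5)]
r_raw = correlation([x for x, _ in raw], [y for _, y in raw])

# Replace the raw y-values at each x by their average.
groups = defaultdict(list)
for x, y in raw:
    groups[x].append(y)
avg_x = sorted(groups)
avg_y = [sum(groups[x]) / len(groups[x]) for x in avg_x]
r_avg = correlation(avg_x, avg_y)

print(round(r_raw, 3), round(r_avg, 3))
```

Averaging removes the scatter within each group, so the averaged points sit much closer to a line than the raw points do, and r comes out larger.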

Linear Regression

Key points

1) When does it make sense to perform linear regression?

o If r is statistically significant
2) What can we do with linear regression?
o Determine the equation that explains the linear relationship between x and y
(assuming it is significant)
o Predict a value of y for any value of x that is within the range of x-values in the
sample
o Determine how well the linear equation (model) fits the data.

How do we find the best-fitting equation for the data?

o Least-squares criterion: we minimize the sum of the squares of the vertical distances
(the sum of squared errors, or sum of squared residuals) from all data points to the line.


The residual is the difference between the observed and predicted values for y:

residual = observed y - predicted y

residual = y - ŷ, where ŷ is the y-value predicted by the line
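
To make the least-squares criterion concrete, the sketch below computes the residuals and their sum of squares for two candidate lines over the same made-up data. It assumes the line is written as predicted y = intercept + slope · x. The function names and data are illustrative only.

```python
def residuals(xs, ys, intercept, slope):
    """residual = observed y - predicted y, for each data pair."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

def sum_squared_residuals(xs, ys, intercept, slope):
    """The quantity the least-squares criterion minimizes."""
    return sum(e ** 2 for e in residuals(xs, ys, intercept, slope))

# Made-up illustrative data (not from the text).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Candidate A is the least-squares line for these data; candidate B is an
# arbitrary alternative chosen for comparison.
sse_a = sum_squared_residuals(xs, ys, intercept=2.2, slope=0.6)
sse_b = sum_squared_residuals(xs, ys, intercept=1.0, slope=1.0)
print(round(sse_a, 2), round(sse_b, 2))
```

No other straight line gives a smaller sum of squared residuals for these data than candidate A; that is what makes it the least-squares line.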

Least-Squares Line (Regression Line)

ŷ = a + bx, where a is the intercept and b is the slope
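
A minimal sketch of the slope and intercept calculation, assuming the line is written ŷ = a + bx and using the standard formulas b = (nΣxy - ΣxΣy) / (nΣx² - (Σx)²) and a = ȳ - b·x̄. The function name `least_squares_line` and the data are made up for illustration.

```python
def least_squares_line(xs, ys):
    """Return (a, b): the intercept a and slope b of the least-squares line."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Slope first, then the intercept from the means of x and y.
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * (sum_x / n)
    return a, b

# Made-up illustrative data (not from the text).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = least_squares_line(xs, ys)
print(f"y-hat = {a:.2f} + {b:.2f}x")
```

Because a = ȳ - b·x̄, the least-squares line always passes through the point (x̄, ȳ).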

Interpretation of Slope b
o The slope tells us how fast the value of y changes when x changes.
o To make the interpretation standard, we use the following sentence

Interpretation of Intercept a

When not to interpret the intercept

1. We do not have any values of x at or near 0.
2. It does not make practical sense for x to be zero.

Influential point: a point (x, y) is influential if removing it will substantially change the
intercept or slope of the regression line. (Usually these are points near the minimum or maximum
value of x, with y far away from the remainder of the points.)
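
One way to check whether a point is influential is to fit the line with and without it and compare the results. The sketch below does this for a made-up data set in which the last point has the largest x and a y far from the pattern of the others. The function name and data are assumptions for illustration.

```python
def least_squares_line(points):
    """Return (intercept, slope) of the least-squares line through (x, y) points."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    intercept = sum_y / n - slope * (sum_x / n)
    return intercept, slope

# Made-up data: the first four points lie on y = x; the last one does not.
points = [(1, 1), (2, 2), (3, 3), (4, 4), (10, 2)]

a_all, b_all = least_squares_line(points)
a_without, b_without = least_squares_line(points[:-1])

print(f"with the point:    intercept={a_all:.2f}, slope={b_all:.2f}")
print(f"without the point: intercept={a_without:.2f}, slope={b_without:.2f}")
```

Here a single point pulls the slope from 1 down to nearly 0, which is the kind of substantial change the definition describes.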

Prediction: we are only allowed to predict values of y using values of x that are within the range
of the x-values in our sample (interpolation).

Note: We are not allowed to predict values of y using values of x that are outside the range of
the x-values in our sample (extrapolation).
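
The interpolation rule can be built into a prediction helper so that out-of-range requests are refused rather than silently answered. This is a sketch under the assumption that the line is ŷ = intercept + slope · x. The function name `predict`, the data, and the coefficients are illustrative only.

```python
def predict(x_new, sample_xs, intercept, slope):
    """Predict y for x_new, but only within the range of the sampled x-values."""
    x_min, x_max = min(sample_xs), max(sample_xs)
    if not (x_min <= x_new <= x_max):
        # Outside the sampled range this would be extrapolation.
        raise ValueError(
            f"x = {x_new} is outside the sample range [{x_min}, {x_max}]"
        )
    return intercept + slope * x_new

# Made-up sample x-values and line coefficients (not from the text).
sample_xs = [1, 2, 3, 4, 5]
intercept, slope = 2.2, 0.6

y_hat = predict(3.5, sample_xs, intercept, slope)   # interpolation: allowed
print(round(y_hat, 2))

try:
    predict(10, sample_xs, intercept, slope)        # extrapolation: refused
except ValueError as err:
    print("refused:", err)
```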

