Scatterplots and Regression
[Figure: four scatterplots arranged in a 2×2 grid; both axes of each plot run from 0 to 20]
Even though scatterplots can look like a mess, sometimes we’re able to
see trends in the data. For example, the two graphs on the left both
appear to roughly follow a line: the one on top looks like it follows a
line with a positive slope; the bottom one looks like it follows a line with a
negative slope.
The graph in the upper right looks like it might be following a positively-
sloped line, but if it is, the trend is not as clear as either of the graphs on
the left.
And the graph in the lower right doesn’t look like it’s following any trend at
all.
When we say that the data in a scatterplot appears to follow a trend, what
we’re really saying is that it appears to follow some line, or maybe some
other kind of curve, such as an exponential or sinusoidal curve. No matter
the shape of the curve that the data follows, we call it the
approximating curve, and the process of finding the equation of the
approximating curve is called curve fitting.
Regression line
It was intuitive for us to start looking for trends in the scatterplots as soon
as we saw the plotted points. And, in fact, spotting trends is probably
what we spend most of our time doing when we work with scatterplots.
The plot alone isn’t super helpful, but if we can use the plot to observe
some kind of a trend in the data, then we might be able to use that trend
to draw conclusions or make predictions about the data.
The most common way that we’ll do this is with a regression line. It’s the
line that best shows the trend in the data given in a scatterplot. A
regression line is also called the best-fit line, line of best fit, or least-
squares line.
The regression line is a trend line we use to model a linear trend that we
see in a scatterplot, but realize that some data will show a relationship that
isn’t necessarily linear. For example, the relationship might follow the
curve of a parabola, in which case the regression curve would be parabolic
in nature. For the rest of this lesson we’ll focus mostly on linear regression.
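For a set of n data points, the slope b and the y-intercept a of the
regression line are

$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$$

$$a = \frac{\sum y - b\sum x}{n}$$

where
n is the number of data points,
$\sum xy$ is the sum of all the products of the x- and y-values,
$\sum x$ is the sum of all the x-values,
$\sum y$ is the sum of all the y-values, and
$\sum x^2$ is the sum of the squares of the x-values.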
Once we find the equation of the regression line, we denote it with ŷ
(pronounced “y-hat”), to indicate that it’s a regression line, and remind us
that it’s an approximation for the data set. So the equation of the
regression line is
ŷ = a + bx
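As a minimal sketch of the two formulas above, the following Python function computes a and b directly from the sums. The function name fit_line and the sample points are illustrative assumptions, not part of the lesson.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    # The denominator is zero only when every x-value is the same,
    # in which case no slope can be computed.
    denom = n * sum_x2 - sum_x ** 2
    if denom == 0:
        raise ValueError("all x-values are equal; the slope is undefined")

    b = (n * sum_xy - sum_x * sum_y) / denom  # slope
    a = (sum_y - b * sum_x) / n               # y-intercept
    return a, b


# Hypothetical points that lie exactly on the line y = 1 + 2x
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # 1.0 2.0
```

Because the sample points sit exactly on a line, the function recovers that line's intercept and slope; with real data the result is only the best-fitting approximation.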
Example
x y
0 0.8
2 1.0
4 0.2
6 0.2
8 2.0
10 0.8
12 0.6
We’ll start by calculating the slope, b. There are 7 data points in this set,
so n = 7. It can be helpful to calculate xy and x² for each data point, plus
find the sum of the x-values and the sum of the y-values, and add all of
these into the data table, since we’ll be using them in our calculations. Our
new table that includes this extra information will be
x      y      xy     x²
0      0.8    0      0
2      1.0    2      4
4      0.2    0.8    16
6      0.2    1.2    36
8      2.0    16     64
10     0.8    8      100
12     0.6    7.2    144
Sum:
42     5.6    35.2   364
Let’s plug what we’ve found into the formula for slope.
$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$$

$$b = \frac{7(35.2) - (42)(5.6)}{7(364) - (42)^2}$$

$$b = \frac{246.4 - 235.2}{2{,}548 - 1{,}764}$$

$$b = \frac{11.2}{784}$$

$$b \approx 0.0143$$
Now let’s plug what we’ve found into the formula for the y-intercept.
$$a = \frac{\sum y - b\sum x}{n}$$

$$a = \frac{5.6 - \frac{11.2}{784}(42)}{7}$$

$$a = \frac{5.6 - 0.6}{7}$$

$$a = \frac{5}{7}$$

$$a \approx 0.7143$$
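Notice that we used the exact fraction 11.2/784 for b, rather than the
rounded value 0.0143, when calculating a. This keeps rounding error
from occurring in the estimated values, and giving us an inaccurate
regression line. Therefore, we can say that the regression line is given
approximately by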
ŷ = a + bx
ŷ = 0.0143x + 0.7143
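The arithmetic in this example can be double-checked with a short Python sketch. It recomputes the column sums and then applies the slope and intercept formulas; the variable names are our own choice.

```python
# Data table from the example
xs = [0, 2, 4, 6, 8, 10, 12]
ys = [0.8, 1.0, 0.2, 0.2, 2.0, 0.8, 0.6]

n = len(xs)
sum_x = sum(xs)
sum_y = sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

# Slope and y-intercept from the least-squares formulas
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

print(f"sum x = {sum_x}, sum y = {sum_y:.1f}, "
      f"sum xy = {sum_xy:.1f}, sum x^2 = {sum_x2}")
print(f"b = {b:.4f}, a = {a:.4f}")
# sum x = 42, sum y = 5.6, sum xy = 35.2, sum x^2 = 364
# b = 0.0143, a = 0.7143
```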
Let’s plot the data points on a scatterplot and then add in the regression
line we found to double-check ourselves.
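[Figure: scatterplot of the seven data points with the regression line drawn through them; the x-axis runs from 0 to 12 and the y-axis from 0 to 2.0]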
The regression line looks like it runs roughly through the data, indicating
the trend.
With this last example, we might notice that the data actually wasn’t super
linear. If we look at the scatterplot we made, we might even say it has
more of a sinusoidal shape, and we can see that the point around x = 8
looks like an outlier.
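One way to back up the visual impression is to measure how far each point sits above or below the regression line. The sketch below does that for the example data, using the exact values a = 5/7 and b = 11.2/784 found earlier. The term "residual" for this vertical distance is standard, but it is our addition here rather than something the lesson has defined.

```python
xs = [0, 2, 4, 6, 8, 10, 12]
ys = [0.8, 1.0, 0.2, 0.2, 2.0, 0.8, 0.6]

# Exact intercept and slope from the worked example
a = 5 / 7
b = 11.2 / 784

# Residual = observed y minus the y-value the line predicts at that x
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

for x, r in zip(xs, residuals):
    print(f"x = {x:2d}  residual = {r:+.4f}")

# The point with the largest absolute residual is the one furthest from the line
worst_x = max(zip(xs, residuals), key=lambda pair: abs(pair[1]))[0]
print("furthest from the line at x =", worst_x)  # 8
```

The point at x = 8 sits about 1.17 above the line, roughly twice as far away as any other point, which agrees with reading it as an outlier.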
Form
If the data roughly follows a linear trend line, we can say the relationship is
linear. If the data more closely follows a parabolic curve, we would say the
relationship is parabolic. If the scatterplot just looks like one big blob, and
we can’t really see any relationship in the data, then we would say there’s
no relationship or correlation at all.
Linear correlation:
[Figure: scatterplot showing a linear correlation]
Parabolic correlation:
[Figure: scatterplot showing a parabolic correlation]
No correlation:
[Figure: scatterplot showing no correlation]
Direction
If the regression line has a positive slope, the data has a positive linear
relationship; if the regression line of the data has a negative slope, the
data has a negative linear relationship.
[Figure: two scatterplots illustrating positive and negative linear relationships]
Strength
If the data is clustered tightly around its regression line, we might say it
shows a strong linear relationship. If the data is loosely clustered, we
might say it shows a moderate linear relationship. A weak linear
relationship would be data that is spread out but still noticeably in the
form of a trend line or curve.
Strong linear relationship:
[Figure: two scatterplots comparing the strength of a linear relationship]
Outliers
Whether the data has a strong or weak relationship of any kind can also
be affected by the existence of outliers, or lack thereof. Remember that an
outlier is a data point that lies far away from the trend line.
[Figure: scatterplot with one point labeled “Outlier”]
If all of the data points are very tightly clustered, then there are no
outliers, which means the data shows a strong relationship. But if there are
some or many outliers away from the majority, then the data shows a
moderate relationship.
The more outliers there are, and the further away they are, the weaker the
relationship. The fewer outliers there are, and the more tightly clustered
the data, the stronger the relationship.
Example
Describe any trend in the data, in terms of form, direction, strength, and
outliers.
[Figure: scatterplot of the data]
Let’s look at each part one at a time: form, direction, strength, and
outliers.
[Figure: scatterplot of the data]
If we take out a few points that are further from the regression line, like
(6,18), (11,4), (14,8), and (17,20),
[Figure: scatterplot of the remaining data]
we can see that the new, adjusted regression line fits the remaining data a
little bit better:
[Figure: two scatterplots comparing the regression lines]
and we can see the effect that some of these outliers have on the
regression line.
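The same effect can be sketched numerically. The coordinates for this example are not all listed in the text, so as an assumption the sketch below reuses the seven-point table from the first example instead, and removes the point (8, 2.0) that looked like an outlier there. The helper name fit_line is hypothetical.

```python
def fit_line(points):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


# Assumption: data borrowed from the first example in this section
data = [(0, 0.8), (2, 1.0), (4, 0.2), (6, 0.2), (8, 2.0), (10, 0.8), (12, 0.6)]

# Same data with the suspected outlier at x = 8 removed
trimmed = [p for p in data if p != (8, 2.0)]

a_all, b_all = fit_line(data)
a_trim, b_trim = fit_line(trimmed)

print(f"all points:      y-hat = {a_all:.4f} + ({b_all:+.4f})x")
print(f"outlier removed: y-hat = {a_trim:.4f} + ({b_trim:+.4f})x")
```

For this borrowed data set, dropping the single point changes the slope from slightly positive to slightly negative, which shows how much one outlier can pull on a regression line when there are only a few data points.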
The purpose of regression
So what’s the purpose of curve fitting in general, or finding the equation of
the regression line specifically? Well, the main purpose for finding the
approximating curve, whether it’s a regression line or a regression curve
with some other shape, is to come up with an equation that we can use to
make predictions.
In the first example from this section, we were given a data table:
x y
0 0.8
2 1.0
4 0.2
6 0.2
8 2.0
10 0.8
12 0.6
And that’s the purpose of regression. Technically, regression is just the
process of estimating the value of the dependent variable from a given
value of the independent variable.
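As a small illustration of estimating the dependent variable from the independent variable, the sketch below evaluates the regression line from the first example at a chosen x. The function name predict and the input x = 5 are arbitrary choices made for illustration; x = 5 was picked because it falls inside the range of the data (0 to 12).

```python
def predict(x, a=5 / 7, b=11.2 / 784):
    """Estimate y from x using the regression line y-hat = a + b*x.

    The defaults are the exact intercept and slope from the first example.
    """
    return a + b * x


# Estimated y-value at x = 5, a value that is not in the data table
print(round(predict(5), 4))  # 0.7857
```

Estimates for x-values far outside the range of the original data should be treated with more caution, since nothing in the data confirms that the trend continues there.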