BUAD 284 Linear Regression Primer
The Basics
At a very basic level, regression analysis is simply a statistical modeling approach
to ascertain the relationship among certain variables of interest. This is best
understood by example.
Suppose we have a simple linear demand curve that takes the form of P = 20 – 4Q.
Note this is just a form of the equation Y = mX + b, which is the equation of a line.
Graphically, this equation represents the demand curve from our supply and
demand analysis. For our analysis, we are going to rewrite this equation by solving
for Q (economists call P = 20 – 4Q the inverse demand curve; solving for Q gives the
demand function): Q = 5 – 0.25P, which is graphed accordingly.
The interpretation is that a 1 dollar increase in the price causes quantity demanded
to decrease by 0.25 units, or a 4 dollar increase in price causes the quantity
demanded to decrease by 1 unit. This interpretation is nothing more than
describing the slope coefficient (the m in Y = mX + b), which represents
rise/run, or ΔY/ΔX. In this example, ΔY/ΔX is actually ΔQ/ΔP. This is just the
application and understanding of the equation for a line.
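The slope interpretation can be checked numerically. Here is a quick Python sketch (the function name is ours, purely for illustration):

```python
def quantity_demanded(price):
    """Demand function Q = 5 - 0.25P from the example above."""
    return 5 - 0.25 * price

# A $1 price increase lowers quantity demanded by 0.25 units...
assert quantity_demanded(9) - quantity_demanded(8) == -0.25

# ...so a $4 price increase lowers it by a full unit.
assert quantity_demanded(12) - quantity_demanded(8) == -1.0
```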
So what does this have to do with regression analysis? While the above theoretical
example of a demand curve is nice and clean, how does this notion apply in the real
world with actual data? We don’t have perfect linear demand curves in the real
world. In fact, all we have, usually, is a bunch of data that represents the
combination of price and quantity points for a particular good/service over time.
For example, let’s use the “Where’s the Beef?” data set to plot out the monthly data
for the price ($/lb.) and quantity (index 2001=100) of ground chuck roast from
January 2001 through July 2005, with quantity on the vertical axis and price on the
horizontal. In the real world, this quantity and price relationship is represented by
this collection of data points. How do we reconcile this data with the nice, clean
theoretical graph from the previous page? If we want to estimate a demand curve
for ground chuck roast, we need to find a line that best represents this data. In fact,
we want to find a “line of best fit”. The goal of regression analysis (at a very basic
level) is to estimate that line!
So, how do we fit this line? We want to minimize the distance between the
observations (points) and the estimated regression line across the entire sample.
The distance between an observation and the regression line is the error or the
residual. The “line of best fit” is the line that minimizes the sum of these squared
errors.1
At the end of the day, all we are trying to do is estimate a line to represent this
data. In this case, the “line of best fit” is represented by the following equation:
𝑄̂ = 248 − 56.74𝑃
We can interpret the slope to say something about the relationship between P and
Q. Specifically, on average, a $1 increase in the price of ground chuck roast (per
pound) leads to a 56.74 index point decrease in the quantity demanded of ground
chuck roast. This interpretation corresponds directly to the theoretical
interpretation from the previous example, only now we have derived the relationship
from actual data. That is regression analysis.

1 Why squared? We want to treat positive and negative errors the same when summing, and squaring makes every error positive.
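As a sketch of how such a fit is produced in practice, here is a short Python example using numpy. The price and quantity numbers below are made up for illustration; they are not the actual "Where's the Beef?" data, so the estimates will not match the equation above exactly.

```python
import numpy as np

# Hypothetical monthly observations: price ($/lb.) and quantity (index).
# Invented numbers with a similar negative relationship, NOT the real data.
price    = np.array([1.50, 1.75, 2.00, 2.25, 2.50, 2.75])
quantity = np.array([160.0, 150.0, 135.0, 122.0, 108.0, 95.0])

# np.polyfit with degree 1 fits a line by ordinary least squares,
# returning (slope, intercept).
slope, intercept = np.polyfit(price, quantity, 1)

print(f"Q-hat = {intercept:.1f} - {abs(slope):.1f} * P")
```

Each additional dollar per pound is associated with a drop of roughly |slope| index points, mirroring the interpretation of the fitted equation above.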
Diving Deeper
Let’s formalize the concepts from above. Managerial decisions are typically based on
the relationship between two or more variables. Regression analysis can be used to
derive an equation describing how said variables are related. The variable being
predicted is called the dependent variable.2 There is only one dependent variable in
the model. The variable(s) used to explain this predicted value is referred to as the
independent variable(s).3 A simple linear regression model that contains only one
independent variable can be expressed as:
𝑦𝑖 = 𝛽0 + 𝛽1 𝑥𝑖 + 𝜀𝑖
The subscript i denotes an individual observation from your sample of size n, where
i = 1, 2, 3, … n. Note that 𝑦 represents the dependent variable, 𝑥 represents the
independent variable, 𝛽0 and 𝛽1 are parameters, 𝛽0 being the intercept and 𝛽1 the
slope, and 𝜀 is an error term that accounts for variability in y that cannot be
explained by x. In general, we can have any number of independent variables, I’ll
represent that with k:
𝑦𝑖 = 𝛽0 + 𝛽1 𝑥𝑖1 + 𝛽2 𝑥𝑖2 + ⋯ + 𝛽𝑘 𝑥𝑖𝑘 + 𝜀𝑖
We estimate the above model using Ordinary Least Squares (OLS) and obtain:
𝑦̂𝑖 = 𝑏0 + 𝑏1 𝑥𝑖1 + 𝑏2 𝑥𝑖2 + ⋯ + 𝑏𝑘 𝑥𝑖𝑘
Note that 𝑦̂𝑖 represents the predicted value of 𝑦𝑖 based upon 𝑏0 , 𝑏1 , 𝑏2 , 𝑒𝑡𝑐. which are
estimates of 𝛽0 , 𝛽1 , 𝛽2, and so on. In a nutshell, we fit a line (surface in the case of 2
or more independent variables) such that the squared error between actual 𝑦 and
predicted 𝑦 is as small as possible:
min ∑ᵢ₌₁ⁿ (𝑦ᵢ − 𝑦̂𝑖)²
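This minimization has a well-known closed-form solution. A minimal numpy sketch using simulated data (all numbers below are made up; in practice you would use a statistics package that also reports p-values and R2):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate n observations of k = 2 independent variables, with known
# parameters beta0 = 3, beta1 = 2, beta2 = -1 plus random noise.
n = 200
X = rng.normal(size=(n, 2))
eps = rng.normal(scale=0.1, size=n)
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + eps

# Prepend a column of ones so the intercept b0 is estimated too.
X1 = np.column_stack([np.ones(n), X])

# Solve min sum (y_i - y_hat_i)^2; lstsq is numerically safer
# than inverting X'X by hand.
b, *_ = np.linalg.lstsq(X1, y, rcond=None)

print(b)  # b0, b1, b2 -- close to 3, 2, -1
```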
The above minimization problem leads to what is referred to as Ordinary Least
Squares (OLS) regression. This is the standard technique and true workhorse of
linear regression modeling. In general, we are interested in the following
information from the output of our regression:
1) Sign and magnitude of our estimated coefficients 𝑏0 , 𝑏1 , 𝑏2 , 𝑒𝑡𝑐.
2) Statistical significance of our estimated coefficients
2 The dependent variable is usually denoted using the letter y. It is also sometimes called the left-hand side variable.
3 The independent variable(s) is usually denoted using the letter x. It is also sometimes called the right-hand side variable(s)
or explanatory variable(s).
a. Are the estimated coefficients statistically different from zero? Look for
P-values ≤ 0.05, as those coefficients are statistically different from zero
with 95% confidence.
3) Goodness of fit of our model
a. Check adjusted R2. Like R2, it tells us the share of the variability in y
explained by the model (R2 ranges between 0 and 1), but adjusted R2 also
penalizes the model for adding more independent variables.
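For concreteness, R2 and adjusted R2 can be computed directly from the residuals. A small Python sketch (the function names are ours, not from any particular package):

```python
import numpy as np

def r_squared(y, y_hat):
    """Share of the variability in y explained by the model."""
    ss_res = np.sum((y - y_hat) ** 2)       # unexplained (residual) variation
    ss_tot = np.sum((y - np.mean(y)) ** 2)  # total variation in y
    return 1 - ss_res / ss_tot

def adjusted_r_squared(y, y_hat, k):
    """R^2 penalized for the number of independent variables k."""
    n = len(y)
    return 1 - (1 - r_squared(y, y_hat)) * (n - 1) / (n - k - 1)

y     = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([3.1, 4.9, 7.2, 8.8])  # predictions from some fitted model

print(r_squared(y, y_hat))               # 0.995
print(adjusted_r_squared(y, y_hat, k=1))
```

A perfect fit gives R2 = 1; predicting nothing better than the mean of y gives R2 = 0.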