Regression Equation: Independent (Predictor, Explanatory) Variable and Dependent (Response) Variable
Definition
Regression Equation
The regression equation expresses a relationship between x (called the independent variable, predictor variable, or explanatory variable) and y (called the dependent variable or response variable).
Simple Linear Regression
Simple Regression
Regression Equation
Given a collection of paired data, the regression equation is
ŷ = b0 + b1x
Regression Line
The graph of the regression equation is called the
regression line (or line of best fit, or least squares line).
Notation for
Regression Equation
Population parameter: β0 (y-intercept), β1 (slope)
Sample statistic: b0, b1
How can β1 be interpreted? β1 is the slope: it is the change in the mean value of y for each one-unit increase in x.
Estimated Linear Regression Equation
If we knew the values of β0 and β1, we could plug them into the equation and find the mean value E(y):
E(y) = β0 + β1x
We estimate β0 by b0 and β1 by b1.
Estimated Linear Regression Equation
Estimate of the simple linear regression equation:
ŷ = b0 + b1x
The Estimating Process in Simple Linear Regression
Regression model: y = β0 + β1x + ε, with β0 and β1 unknown
Regression equation: E(y) = β0 + β1x
Sample data (x1, y1), (x2, y2), …, (xn, yn) are used to compute b0 and b1, giving the estimated regression equation ŷ = b0 + b1x.
The Least Square Method
b1 = (Σxiyi − (Σxi)(Σyi)/n) / (Σxi² − (Σxi)²/n)   (slope)
b0 = ȳ − b1x̄   (y-intercept)
If you find b1 first, then
b0 = ȳ − b1x̄
The regression line fits
the sample points best.
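For illustration (not part of the original slides), a minimal Python sketch of the least-squares formulas above, assuming x and y are equal-length lists of paired data:

def least_squares_fit(x, y):
    # Returns (b0, b1) for the fitted line y-hat = b0 + b1*x
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)  # slope
    b0 = sum_y / n - b1 * sum_x / n                                # y-intercept
    return b0, b1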
Calculating the Regression Equation
Data:
x: 1 1 3 5
y: 2 8 6 4
n = 4, Σx = 10, Σy = 20, Σx² = 36, Σy² = 120, Σxy = 48
b1 = (4(48) − (10)(20)) / (4(36) − (10)²) = −8/44 = −0.181818
b0 = ȳ − b1x̄ = 5 − (−0.181818)(2.5) = 5.45
The estimated regression equation is therefore ŷ = 5.45 − 0.182x.
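As a quick check of the worked example (a sketch only, assuming numpy is available), a library fit reproduces the same coefficients:

import numpy as np
x = np.array([1, 1, 3, 5])
y = np.array([2, 8, 6, 4])
b1, b0 = np.polyfit(x, y, deg=1)   # degree-1 fit returns (slope, intercept)
print(round(b0, 2), round(b1, 4))  # 5.45 -0.1818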
Percentage of children immunized against DPT and under-five mortality rate (per 1,000 live births) for 20 countries, 1992. Each line gives: country, % immunized, mortality rate.
• Bolivia 77 118
• Brazil 69 65
• Cambodia 32 184
• Canada 85 8
• China 94 43
• Czech Republic 99 12
• Egypt 89 55
• Ethiopia 13 208
• Finland 95 7
• France 95 9
• Greece 54 9
• India 89 124
• Italy 95 10
• Japan 87 6
• Mexico 91 33
• Poland 98 16
• Russian Federation 73 32
• Senegal 47 145
• Turkey 76 87
• United Kingdom 90 9
• Source: United Nations Children’s Fund, The State of the World’s Children 1994, New York: Oxford University Press.
Example
The result of fitting a simple linear regression to under-five mortality rate (y) and percentage of children immunized against DPT (x) is:
ŷ = 219 − 2.1x
The equation tells us that countries with no immunization at all (percentage of children immunized against DPT = 0) have an under-five mortality rate of 219 per 1,000 live births on average, and that for every additional percentage point of immunization, the under-five mortality rate decreases by 2.1 per 1,000 live births.
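For illustration (this prediction is not on the original slide), plugging a specific immunization level into the fitted equation, say x = 77 (Bolivia's value in the data above), gives a predicted under-five mortality rate of ŷ = 219 − 2.1(77) = 219 − 161.7 ≈ 57 per 1,000 live births.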
Predictions
In predicting a value of y based on some given value of x:
1. If there is not a significant linear correlation, the best predicted y-value is ȳ.
2. If there is a significant linear correlation, use the regression equation ŷ = b0 + b1x to make the prediction.
Predicting the Value of a Variable
Guidelines for Using The
Regression Equation
1. If there is no significant linear correlation,
don’t use the regression equation to make
predictions.
2. When using the regression equation for
predictions, stay within the scope of the
available sample data.
3. A regression equation based on old data is
not necessarily valid now.
4. Don’t make predictions about a population
that is different from the population from
which the sample data was drawn.
Coefficient of Determination
Question: How well does the estimated regression line fit the data?
Definition
Coefficient of determination:
the amount of the variation in y that is explained by the regression line.
r² = explained variation / total variation
or, simply, the square of the linear correlation coefficient r.
SSE, SST and SSR
SST: a measure of how well the observations cluster around ȳ (the total variation)
SSE: a measure of how well the observations cluster around ŷ (the unexplained variation)
SSR = SST − SSE: the variation explained by the regression line
Coefficient of determination: r² = SSR/SST
SST = Σy² − (Σy)²/n
SSR = (Σxy − (Σx)(Σy)/n)² / (Σx² − (Σx)²/n)
0 ≤ r² ≤ 1
Example: r² = 14200/15730 = 0.9027
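A minimal Python sketch of these formulas (illustrative only, not from the slides), reusing the same summary quantities as the least-squares fit:

def r_squared(x, y):
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)
    sst = sum_y2 - sum_y ** 2 / n                                        # total variation
    ssr = (sum_xy - sum_x * sum_y / n) ** 2 / (sum_x2 - sum_x ** 2 / n)  # explained variation
    return ssr / sst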
Hypothesis testing
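A standard formulation (a general statistical fact, not taken from the slide content): the usual test of H0: β1 = 0 uses the statistic t = b1 / SE(b1), referred to a t distribution with n − 2 degrees of freedom.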
Estimation of β1 (confidence interval)
Simple linear regression
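A standard form of this interval (a general statistical fact, not taken from the slide content) is b1 ± t(n−2; 1−α/2) · SE(b1); for a 95% confidence interval the multiplier is the upper 2.5% point of the t distribution with n − 2 degrees of freedom.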
Multiple Regression
Multiple regression
• The simple linear regression model is easily extended to the case of two or
more explanatory variables. Such a model is called a multiple regression
model, and has the form:
y = a + b1x1 + b2x2 + … + bnxn.
• For example, birth weight depends on maternal age, sex, gestational week and parity; a model which includes these variables might better explain the variation in birth weight. We could fit a multiple regression of the form:
birth weight = a + b1(maternal age) + b2(sex) + b3(gestational week) + b4(parity)
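An illustrative sketch only (the variable names follow the birth weight example, but the data below are synthetic placeholders, not study data): a multiple regression of this form can be fitted by least squares with numpy.

import numpy as np

rng = np.random.default_rng(0)
n = 200
maternal_age = rng.uniform(18, 40, n)
sex = rng.integers(0, 2, n)                 # 0 = female, 1 = male
gestational_week = rng.uniform(34, 42, n)
parity = rng.integers(0, 4, n)
# synthetic placeholder outcome, in grams
birth_weight = (-2500 + 10 * maternal_age + 100 * sex + 140 * gestational_week
                + 30 * parity + rng.normal(0, 300, n))

# design matrix with a column of ones for the intercept a
X = np.column_stack([np.ones(n), maternal_age, sex, gestational_week, parity])
coef, *_ = np.linalg.lstsq(X, birth_weight, rcond=None)
print(coef)  # [a, b1, b2, b3, b4]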
Interpretation
• In a multiple regression model, we say that the effect of an
independent variable xi on the dependent variable y has been
adjusted for the other explanatory variables in the model.
• Adjusted estimates are less affected by confounding between
the factors.
• In the birth weight example, after adjusting for sex, gestational
age and parity, the effect of maternal age on birth weight is to
change birth weight by b1 grams for every additional one year
of maternal age.
Table 10. Birth weight in relation to some attributes, multiple regression analysis.
Logistic Regression
• In linear regression, the response variable Y is continuous, and we are interested in identifying a set of explanatory variables that predict its mean value while explaining the variability of its values.
• In logistic regression, the response variable Y is dichotomous; the interest is still to identify a set of explanatory variables that predict the mean of the response variable.
• The mean of the dichotomous random variable, ‘p’, is the proportion of times that it takes the value 1, i.e.
p = Pr(success) = Pr(Y = 1)
• Our interest is therefore to estimate the probability ‘p’ and identify explanatory variables that influence its value. This is possible using a statistical method called logistic regression.
The Model
• Consider one explanatory variable x1. A straight-line model for the probability, p = α + β1x1, could give fitted values of p below 0 or above 1, which is impossible for a probability.
The Model…
• To solve the above problem we may fit a model of the form
p = exp(α + β1x1)
• This equation guarantees that the estimate of p is positive, but it can still result in a value greater than 1.
• The logistic regression model can be used to express the relationship between a dichotomous response variable and explanatory variables in any of the following ways:
– In terms of the probability of the event, p
– In terms of the odds of the event, p/(1 − p)
– In terms of the log odds, or logit, of the event, ln(p/(1 − p))
Steps
• Step 1. Observe the data
– Y represents a dichotomous response and X represents an explanatory variable (categorical or continuous)
– All observations (measurements) must be independent
– The underlying population from which the sample is selected
must be normal.
Steps…
• Step 3. Estimate the coefficients. The fitted probability is
p = 1 / (1 + e^−(a + b1x1))
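A minimal sketch of this formula (illustrative only; the coefficients a and b1 would come from fitting the model, and the values used below are hypothetical):

import math

def predicted_probability(x1, a, b1):
    # p = 1 / (1 + e^-(a + b1*x1)), always between 0 and 1
    return 1.0 / (1.0 + math.exp(-(a + b1 * x1)))

print(predicted_probability(x1=50, a=-4.0, b1=0.1))  # about 0.73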
Steps…
• Step 5. Interpretation of results
• The logistic regression model can provide estimates of the odds ratio, given by OR = exp(b)
• The odds ratio obtained from logistic regression with one dichotomous explanatory variable is the same as that obtained from a 2×2 table. Hence its interpretation is similar to that of 2×2 tables.
• The odds ratio obtained from logistic regression with one non-dichotomous categorical explanatory variable is the same as that obtained from an r×c table.
• For a continuous explanatory variable, the odds ratio obtained from the logistic regression model corresponds to a one-unit increase in the variable.
Multiple logistic regression
• Multiple logistic regression is an extension of the above, used to investigate more complicated relationships between one dichotomous response variable and many explanatory variables:
ln(p/(1 − p)) = a + b1x1 + b2x2 + … + bqxq
Odds ratio
• The odds ratio obtained from multiple logistic regression is an adjusted odds ratio
• OR = e^(bi)
• 95% CI of OR = e^(bi ± 1.96·SE(bi))
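A short sketch of this conversion (the coefficient and standard error below are hypothetical, for illustration only):

import math

def odds_ratio_ci(b, se, z=1.96):
    # OR = e^b with 95% CI (e^(b - 1.96*SE), e^(b + 1.96*SE))
    return math.exp(b), math.exp(b - z * se), math.exp(b + z * se)

print(odds_ratio_ci(0.693, 0.25))  # OR about 2.0, 95% CI about (1.2, 3.3)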
SPSS OUTPUT Logistic regression
Variables in the Equation
Exp(B) is equal to the OR, and the corresponding lower and upper 95% CI values are its limits.
If the 95% CI does not include unity (‘1’), this is a sign of significance.
An OR below 1 indicates a preventive (protective) factor; an OR above 1 indicates a risk factor.
Non-Parametric Tests
• For all statistical tests that we have mentioned up to this point, the population from which the data were sampled was assumed to be either normally distributed or approximately so.
• In fact, this property is necessary for the tests to be valid. If the data do not conform to the assumptions made by such traditional techniques, or for small sample sizes, statistical methods known as nonparametric methods should be used instead.
• Nonparametric tests of hypotheses follow the same general procedure as parametric tests.
• We give a summary table of equivalent nonparametric tests used in place of the parametric tests that we have already studied.
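As one concrete pairing (illustrative only; scipy is assumed to be available, and the two small samples below are made-up numbers): the two-sample t-test and its nonparametric counterpart, the Mann-Whitney U (Wilcoxon rank-sum) test.

from scipy import stats

group_a = [3.1, 2.8, 3.4, 3.0, 2.9]
group_b = [3.6, 3.8, 3.5, 3.9, 3.7]

print(stats.ttest_ind(group_a, group_b))     # parametric: assumes (approximate) normality
print(stats.mannwhitneyu(group_a, group_b))  # nonparametric alternative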