
Regression

Definition
 Regression Equation
The regression equation expresses a relationship between x
(called the independent variable, predictor variable, or
explanatory variable) and y (called the dependent variable or
response variable).

The typical equation of a straight line, y = mx + b, is expressed
in the form ŷ = b0 + b1x, where b0 is the y-intercept and b1 is
the slope.

04/02/24 1
Simple Linear Regression

• Fitting a simple linear regression model to the data allows us
to explain or predict the values of one variable (the dependent,
outcome, or response variable, y) given the values of a second
variable (the independent, exposure, or explanatory variable, x).

• For example, if we are interested in predicting under-five
mortality rate from the percentage of children immunized against
DPT, we would treat immunization as the independent
variable and mortality rate as the dependent variable.

• The basic idea of simple linear regression is to find the
straight line that best fits the data.
Assumptions
1. We are investigating only linear
relationships.

2. For each x-value, y is a random variable
having a normal (bell-shaped) distribution.

Simple Regression

 Regression Equation
Given a collection of paired data, the regression equation

ŷ = b0 + b1x

algebraically describes the relationship between the two
variables.

 Regression Line
The graph of the regression equation is called the
regression line (or line of best fit, or least squares line).

Notation for
Regression Equation

                                        Population      Sample
                                        Parameter       Statistic

y-intercept of regression equation      β0              b0

Slope of regression equation            β1              b1

Equation of the regression line         y = β0 + β1x    ŷ = b0 + b1x
How can 1 interpreted

• 1 > 0 then as x increase the expected value


of y =0 + 1 x increase
• 1 < 0 as x increases the expected value of y
will decrease
• 1 = 0 No relation ship between X and Y

Estimated Linear Regression Equation
If we knew the values of β0 and β1, we could plug them into the equation and find the
mean value E(y):

E(y) = β0 + β1x

But we do not know the values of β0 and β1.

We have to estimate them using the least squares method.

We estimate them using sample data:

we estimate β0 by b0 and β1 by b1.

Estimated Linear Regression Equation

Simple linear equation

E(y) = β0 + β1x

Estimate of the simple linear equation

ŷ = b0 + b1x

The Estimating Process in Simple Linear Regression

Regression model:      y = β0 + β1x + ε
Regression equation:   E(y) = β0 + β1x
Unknown parameters:    β0 and β1

Sample data: (x1, y1), (x2, y2), …, (xn, yn)

Sample statistics b0 and b1 provide estimates for β0 and β1,
giving the estimated regression equation ŷ = b0 + b1x.

The Least Square Method

• Slope for the estimated regression equation

b1 = [Σxiyi − (Σxi)(Σyi)/n] / [Σxi² − (Σxi)²/n]

• y-intercept for the estimated regression equation

b0 = ȳ − b1x̄

xi = value of the independent variable for the i-th observation
yi = value of the dependent variable for the i-th observation
x̄ = mean value of the independent variable
ȳ = mean value of the dependent variable
n = total number of observations
Formula for b0 and b1

b1 = [n(Σxy) − (Σx)(Σy)] / [n(Σx²) − (Σx)²]   (slope)

b0 = ȳ − b1x̄   (y-intercept)

Calculators or computers can
compute these values.
If you find b1 first, then

b0 = ȳ − b1x̄,

where ȳ is the mean of the y-values and x̄ is the
mean of the x-values.

The regression line fits
the sample points best.

Calculating the
Regression Equation
Data
x 1 1 3 5
y 2 8 6 4

Use this sample to find the regression equation.

Calculating the
Regression Equation
Data
x 1 1 3 5
y 2 8 6 4

n = 4          b1 = [n(Σxy) − (Σx)(Σy)] / [n(Σx²) − (Σx)²]
Σx = 10
Σy = 20        b1 = [4(48) − (10)(20)] / [4(36) − (10)²]
Σx² = 36
Σy² = 120      b1 = −8/44 = −0.181818
Σxy = 48
Calculating the
Regression Equation
Data
x 1 1 3 5
y 2 8 6 4

n = 4
Σx = 10        b0 = ȳ − b1x̄
Σy = 20        b0 = 5 − (−0.181818)(2.5) = 5.45
Σx² = 36
Σy² = 120
Σxy = 48

Calculating the
Regression Equation
Data
x 1 1 3 5
y 2 8 6 4

n = 4          The estimated equation of the regression line is:
Σx = 10
Σy = 20        ŷ = 5.45 − 0.182x
Σx² = 36
Σy² = 120
Σxy = 48
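The calculation above can be checked with a short script. This is a minimal sketch of the least squares formulas in plain Python, using the same four data pairs; the variable names are ours, not from the slides.

```python
# Least squares fit for the sample data x = (1, 1, 3, 5), y = (2, 8, 6, 4).
x = [1, 1, 3, 5]
y = [2, 8, 6, 4]
n = len(x)

sum_x = sum(x)                             # Σx  = 10
sum_y = sum(y)                             # Σy  = 20
sum_xy = sum(a * b for a, b in zip(x, y))  # Σxy = 48
sum_x2 = sum(a * a for a in x)             # Σx² = 36

# b1 = [n(Σxy) − (Σx)(Σy)] / [n(Σx²) − (Σx)²]
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
# b0 = ȳ − b1·x̄
b0 = sum_y / n - b1 * sum_x / n

print(round(b0, 2), round(b1, 3))  # → 5.45 -0.182
```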

Percentage of children immunized against DPT and under-five mortality rate for
20 countries, 1992

• Nation Percentage mortality Rate


Immunized Per 1000 Live Births

• Bolivia 77 118
• Brazil 69 65
• Cambodia 32 184
• Canada 85 8
• China 94 43
• Czech Republic 99 12
• Egypt 89 55
• Ethiopia 13 208
• Finland 95 7
• France 95 9
• Greece 54 9
• India 89 124
• Italy 95 10
• Japan 87 6
• Mexico 91 33
• Poland 98 16
• Russian Federation 73 32
• Senegal 47 145
• Turkey 76 87
• United Kingdom 90 9
• Source : United Nations Children’s Fund, The State of the World’s Children 1994, New York; Oxford University Press.

Example
 The result of fitting a simple linear regression to under-five mortality
rate (y) and percentage of children immunized against DPT (x) is:
ŷ = 219 − 2.1x

 The intercept is 219 and the slope is −2.1.

 The equation tells us that countries with no immunization at all
(percentage of children immunized against DPT = 0) have an under-five
mortality rate of 219 per 1000 live births on average, and for every
additional percent of immunization, the under-five mortality rate decreases
by 2.1 per 1000 live births.

 Testing the null hypothesis H0: β1 = 0, we found p < 0.001. This tells us
that immunization against DPT is a significant predictor of under-five
mortality.
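The reported coefficients can be recomputed from the 20-country table with the same least squares formulas. A sketch in plain Python follows; the result comes out near the slide's 219 and −2.1 (small differences are presumably due to rounding or minor differences in the data used for the slide).

```python
# Percentage immunized (x) and under-five mortality per 1000 (y),
# 20 countries, from the table above.
x = [77, 69, 32, 85, 94, 99, 89, 13, 95, 95,
     54, 89, 95, 87, 91, 98, 73, 47, 76, 90]
y = [118, 65, 184, 8, 43, 12, 55, 208, 7, 9,
     9, 124, 10, 6, 33, 16, 32, 145, 87, 9]
n = len(x)

# Least squares slope and intercept.
b1 = (n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)) \
     / (n * sum(a * a for a in x) - sum(x) ** 2)
b0 = sum(y) / n - b1 * sum(x) / n

print(round(b0, 1), round(b1, 2))  # close to the slide's 219 and -2.1
```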

Predictions
In predicting a value of y based on some
given value of x:
1. If there is not a significant linear
correlation, the best predicted y-value is ȳ.

2. If there is a significant linear correlation, the
best predicted y-value is found by substituting the
x-value into the regression equation.

Predicting the Value of a Variable

Start: calculate the value of r and test the hypothesis that ρ = 0.

• If there is a significant linear correlation: use the regression
equation to make predictions; substitute the given value into the
regression equation.

• If there is no significant linear correlation: given any value of one
variable, the best predicted value of the other variable
is its sample mean.
Example:

Given the sample data, we found that the regression equation is
ŷ = −113 + 2.27x. Given that x = 85, find the best predicted value of ŷ,
the number of events.
We must consider whether there is a linear correlation that justifies
the use of that equation.
Assume we do have a significant linear correlation (with r = 0.922).

ŷ = −113 + 2.27x
ŷ = −113 + 2.27(85) = 80.0

The predicted number of events is 80.0.
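The prediction step is just a substitution into the fitted line; a one-line sketch using the coefficients from this example (the function name is ours):

```python
# Predicted value from the fitted line ŷ = -113 + 2.27x.
def predict(x, b0=-113.0, b1=2.27):
    """Return the predicted y for a given x under the fitted line."""
    return b0 + b1 * x

print(f"{predict(85):.2f}")  # → 79.95, i.e. about 80 events
```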

Guidelines for Using The
Regression Equation
1. If there is no significant linear correlation,
don’t use the regression equation to make
predictions.
2. When using the regression equation for
predictions, stay within the scope of the
available sample data.
3. A regression equation based on old data is
not necessarily valid now.
4. Don’t make predictions about a population
that is different from the population from
which the sample data was drawn.

Coefficient of Determination
Question: how well does the estimated regression line fit the data?

The coefficient of determination is a measure of goodness of fit:
how well the estimated regression line fits the data.

Given an observation with values yi and xi,
we put xi into the equation and get
ŷi = b0 + b1xi.
The quantity (yi − ŷi) is called the residual.

It is the error in using ŷi to estimate yi.

SSE = Σ(yi − ŷi)²

Definition
Coefficient of determination:
the amount of the variation in y that is
explained by the regression line,

r² = explained variation / total variation

or, simply, the square of r.

SSE, SST and SSR
SST: a measure of how far the observations are from ȳ (total variation)
SSE: a measure of how far the observations are from ŷ (unexplained variation)

SST = SSR + SSE

SSR: sum of squares due to regression

SSR is the explained portion of SST;
SSE is the unexplained portion of SST.

r² = SSR/SST = 0 ⇒ the worst fit

r² = SSR/SST = 1 ⇒ the best fit
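The decomposition SST = SSR + SSE can be verified numerically on the small data set used earlier (x = 1, 1, 3, 5; y = 2, 8, 6, 4); a minimal sketch in plain Python:

```python
# Verify SST = SSR + SSE for the least squares line on the example data.
x = [1, 1, 3, 5]
y = [2, 8, 6, 4]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

b1 = (n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)) \
     / (n * sum(a * a for a in x) - sum(x) ** 2)
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]        # fitted values ŷi

sst = sum((yi - ybar) ** 2 for yi in y)                 # total variation
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))    # unexplained
ssr = sum((yh - ybar) ** 2 for yh in yhat)              # explained

print(round(sst, 3), round(ssr + sse, 3))  # → 20.0 20.0
```

The identity holds only for the least squares fit; for any other line the cross-term in the decomposition does not vanish.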

Coefficient of determination: r² = SSR/SST

• SST = Σy² − (Σy)²/n

• SSE = Σ(yi − ŷi)²

• SSR = [Σxy − (Σx)(Σy)/n]² / [Σx² − (Σx)²/n]

• Consider the following example.
Coefficient of Determination
For example, if we calculate
SST = 15730
SSE = 1530
then SSR = 15730 − 1530 = 14200.

r² = SSR/SST: coefficient of determination, 0 ≤ r² ≤ 1

r² = 14200/15730 = 0.9027

In other words, about 90% of the variation in y can be explained by the regression line.
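The arithmetic above is a one-liner; a sketch using the example sums of squares:

```python
# Coefficient of determination from the example sums of squares.
sst = 15730.0    # total sum of squares
sse = 1530.0     # error (unexplained) sum of squares
ssr = sst - sse  # regression (explained) sum of squares

r2 = ssr / sst
print(round(r2, 4))  # → 0.9027
```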

Hypothesis testing

Estimation of β1 (confidence interval)
Simple linear regression

Multiple Regression

Multiple regression

• The simple linear regression model is easily extended to the case of two or
more explanatory variables. Such a model is called a multiple regression
model, and has the form:
y = a + b1x1 + b2x2 + … + bnxn

• For example, birth weight depends on maternal age, sex, gestational week
and parity; a model which includes these variables might better explain
the variation in birth weight. We could fit a multiple regression of the
form:

• Birth weight = a + b1·Maternal age + b2·Sex + b3·Gestational age + b4·Parity

• After fitting a multiple regression model, we obtain a point estimate
for each coefficient 'b' and for the intercept 'a'.

• Interpretation of the b coefficients is the same as for the slope in simple
linear regression: a change in xi of one unit produces a change
in y of bi units, holding the other variables fixed. We can also test the null hypothesis H0: bi = 0 for each
coefficient.

Interpretation
• In a multiple regression model, we say that the effect of an
independent variable xi on the dependent variable y has been
adjusted for the other explanatory variables in the model.
• Adjusted estimates are less affected by confounding between
the factors.
• In the birth weight example, after adjusting for sex, gestational
age and parity, the effect of maternal age on birth weight is to
change birth weight by b1 grams for every additional year
of maternal age.

• In other words, if we take two newborns who have the same
sex, gestational age and parity, but the mother of one is one
year older than the mother of the other, then the newborn of
the older mother will have a birth weight b1 grams heavier
(for positive b1) or lighter (for negative b1) than
the other newborn.

y = a + b1x1 + b2x2 + … + bnxn.

Table 10. Birth weight in relation to some attributes, multiple
regression analysis.

Characteristic            β-coefficient    P-value

Maternal age              4.6              <0.05
Gestational age           92.0             <0.001
Period (years)
  1976-1979
  1980-1989               15.4             >0.1
  1990-1996               -81.4            <0.001
Sex of the baby
  Males
  Females                 -88.9            <0.001

Logistic Regression
• In linear regression, the response variable Y is continuous, and we were interested
in identifying a set of explanatory variables that predict its mean value while
explaining the variability of the values.

• Much research in health, however, is concerned with a dichotomous response
variable Y (Y assumes only two possible values: 1 = "success" and 0 = "failure").

• The interest is still to identify a set of explanatory variables that predict the mean of
the response variable.

• The mean of the dichotomous random variable, 'p', is the proportion of times
that it takes the value 1, i.e.
p = Pr(success) = Pr(Y = 1)

• Our interest is therefore to estimate the probability 'p' and identify explanatory
variables that influence its value. This is possible using a statistical method called
logistic regression.

The Model
• Consider one explanatory variable x1.

• Our first strategy might be to fit a model of the form

p = α + β1x1

• This is the standard linear regression in which y is replaced by p.

• However, p is a probability, which is restricted to values between
0 and 1, and the term α + β1x1 can yield values outside this range;
hence this model is not feasible.

The Model…
• To solve the above problem we may fit a model of the form
p = exp(α + β1x1)

• This equation guarantees that the estimate of p is positive, but it can still result in a value greater
than 1.

• To accommodate this final constraint, we fit a model of the form

p = e^(α+β1x1) / (1 + e^(α+β1x1)) = 1 / (1 + e^−(α+β1x1))

• The logistic regression model can be used to express the relationship between a dichotomous
response variable and explanatory variables in any of the following ways:
– In terms of the probability of the event, p
– In terms of the odds of the event, p/(1−p)
– In terms of the log odds or logit of the event, ln(p/(1−p))
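The three equivalent forms can be illustrated numerically; a minimal sketch with illustrative, made-up coefficients α = −1 and β1 = 0.5 (not estimated from any real data):

```python
import math

# Illustrative coefficients, not estimated from any real data set.
alpha, beta1 = -1.0, 0.5

def prob(x1):
    """Probability form: p = 1 / (1 + exp(-(alpha + beta1*x1)))."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta1 * x1)))

def odds(x1):
    """Odds form: p / (1 - p) = exp(alpha + beta1*x1)."""
    p = prob(x1)
    return p / (1.0 - p)

def logit(x1):
    """Log-odds form: ln(p / (1 - p)) = alpha + beta1*x1."""
    return math.log(odds(x1))

# At x1 = 2 the linear predictor is -1 + 0.5*2 = 0, so p = 0.5,
# the odds are 1, and the logit is 0; p always stays within (0, 1).
print(prob(2), odds(2), logit(2))  # → 0.5 1.0 0.0
```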

Steps
• Step 1. Observe the data
– Y represents a dichotomous response and X represents an
explanatory variable (categorical or continuous)
– All observations (measurements) must be independent
– Unlike linear regression, no normality assumption is made
about the distribution of the response.

• Step 2. Define the model:

p = 1 / (1 + e^−(α+β1x1))

Steps…
• Step 3. Estimate the coefficients:

p = 1 / (1 + e^−(a+b1x1))

• Step 4. State the hypotheses

H0: β1 = 0
HA: β1 ≠ 0 (or β1 > 0, or β1 < 0)
z = b/SE(b) is compared with the standard normal distribution
when H0 is true (the Wald test).
A confidence interval for β1 is calculated as b ± z(α/2)·SE(b).

Steps…
• Step 5. Interpretation of results

• The logistic regression model provides estimates of the odds ratio, given by OR =
exp(b).

• The odds ratio obtained from logistic regression with one dichotomous explanatory
variable is the same as that obtained from a 2×2 table; hence interpretation is similar to
that of 2×2 tables.

• The odds ratio obtained from logistic regression with one non-dichotomous categorical
explanatory variable is the same as that obtained from an r×c table;

• hence interpretation is similar to that of r×c tables.

• For a continuous explanatory variable, the odds ratio obtained from the logistic regression
model corresponds to a one-unit increase in the variable.
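The equivalence with the 2×2 table can be seen through the cross-product formula for the table odds ratio; a sketch with hypothetical counts (the numbers below are made up for illustration):

```python
# Hypothetical 2x2 table (made-up counts):
#              outcome=1   outcome=0
# exposed          30          70
# unexposed        15          85
a, b, c, d = 30, 70, 15, 85

# Cross-product odds ratio from the table: (a*d) / (b*c).
or_table = (a * d) / (b * c)
print(round(or_table, 3))  # → 2.429

# A logistic regression of outcome on the exposure indicator fitted to
# these same data would give a slope b1 with exp(b1) equal to this OR.
```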

56
Multiple logistic regression
• Multiple logistic regression is an extension of the above,
used to investigate a more complicated relationship between
one dichotomous response variable and many
explanatory variables.

• The multiple logistic model is given by

p = 1 / (1 + e^−(α+β1x1+β2x2+…+βqxq))

• ln(p/(1−p)) = a + b1x1 + b2x2 + … + bqxq

Odds ratio
• The odds ratio obtained from multiple logistic
regression is an adjusted odds ratio:

• OR = e^bi

• 95% CI of OR = e^(bi ± 1.96·SE(bi))

SPSS OUTPUT Logistic regression
Variables in the Equation

                                                       95.0% C.I. for EXP(B)
             B      S.E.   Wald     df  Sig.  Exp(B)   Lower    Upper
Step 1(a)
 SEXNO(1)    -.637  .116   30.202   1   .000  .529     .421     .664
 Constant    -.328  .069   22.396   1   .000  .720
a. Variable(s) entered on step 1: SEXNO.

Most crucial to the interpretation of logistic regression is the
value of Exp(B), which is the change in odds resulting from a unit
change in the predictor.

Exp(B) is equal to the OR, and the corresponding lower and upper
values are the limits of its 95% CI.

If the 95% CI does not include unity ('1'), this is a sign of
significance: an OR below 1 indicates a preventive factor, and an
OR above 1 indicates a risk factor.
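The Exp(B) column and its confidence limits can be reproduced from B and S.E.; a sketch using the SEXNO row of the output above:

```python
import math

# Coefficient and standard error from the SPSS output (SEXNO row).
b, se = -0.637, 0.116

or_ = math.exp(b)                # Exp(B), the odds ratio
lower = math.exp(b - 1.96 * se)  # lower 95% confidence limit
upper = math.exp(b + 1.96 * se)  # upper 95% confidence limit

print(round(or_, 3), round(lower, 3), round(upper, 3))  # → 0.529 0.421 0.664
```

Since the whole interval lies below 1, sex is a significant (preventive) factor here, matching the Sig. value of .000.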
Non-Parametric Tests
• For all statistical tests that we have mentioned up to this point, the population from
which the data were sampled was assumed to be either normally distributed or
approximately so.
• In fact, this property is necessary for the tests to be valid. If the data do not conform to the
assumptions made by such traditional techniques, or for small sample sizes, statistical
methods known as nonparametric methods should be used instead.
• Nonparametric tests of hypotheses follow the same general procedure as parametric
tests.
• Below is a summary table of equivalent nonparametric tests for the parametric tests
that we have already studied.

Hypothesis                 Parametric test          Nonparametric equivalent test

H0: μ1 = μ2                t-test (unpaired)        Mann-Whitney U test
H0: d = 0                  t-test (paired)          Sign test / Wilcoxon signed-rank test
H0: μ1 = μ2 = μ3 = …       ANOVA                    Kruskal-Wallis test
H0: r = 0                  Pearson's correlation    Spearman's rank / Kendall's correlation
