Quantitative-Methods Summary-Qm-Notes


lOMoARcPSD|7690248

Quantitative Methods 1 - Summary - QM - NOTES

Quantitative Methods 1 (University of Melbourne)

StuDocu is not sponsored or endorsed by any college or university


Downloaded by Shannan Richards ([email protected])

Quantitative Methods 1
Key definitions
A population consists of all the members of a group about which you want to draw a conclusion.
A sample is the portion of the population selected for analysis.
A parameter is a numerical measure that describes a characteristic of a population.
A statistic is a numerical measure that describes a characteristic of a sample.

Descriptive statistics is collecting, summarising, and presenting data.


Inferential statistics is drawing conclusions about a population based on sample data, i.e. estimating
a parameter based on a statistic.

Inferential Statistics
Estimation: Estimate the population mean income (parameter) using the sample mean income
(statistic).
Hypothesis testing: Test the claim that the population mean income is $80,000.

Defining Data

Categorical (Qualitative)
 Simply classifies data into categories e.g. marital status, hair colour, gender.
Numerical (Discrete)
 Counted items (finite number of items) e.g. number of children, number of people who have
type O blood.
Numerical (Continuous)
 Measured characteristics (infinite number of items) e.g. weight, height.

Graphical Techniques
What is a frequency distribution?
A frequency distribution is a summary table in which data are arranged in numerically ordered
classes or intervals.
The number of observations in each ordered class or interval becomes the corresponding frequency
of that class or interval.

Why use a frequency distribution?


It is a way to summarise numerical data.


It condenses the raw data (i.e. large datasets) into a more useful form.
It allows for a quick visual interpretation of the data and first inspection of the shape of the data.

Class Intervals and Class Boundaries


Each data value belongs to one, and only one, class.
Each class grouping has the same width.
Determine the width of each interval by:
$$\text{Width of interval} = \frac{\text{Range}}{\text{Number of desired class groupings}}$$
General guidelines:
 Usually at least 5 but no more than 15 groupings
 Class boundaries must be mutually exclusive
 Classes must be collectively exhaustive
 Round up the interval width to get desirable endpoints

Graphing Numerical Data: The Histogram


A graph of the data in a frequency distribution is called a histogram.
The class boundaries (or class midpoints) are shown on the horizontal axis.
The vertical axis is either frequency, relative frequency or percentage.
Bars of the appropriate heights are used to represent the frequencies (number of observations)
within each class or the relative frequencies (percentage) of that class.
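
The class-width rule and the histogram description above can be sketched in Python. This is a minimal illustration, not a prescribed method: the function name `frequency_distribution`, the sample `ages` data and the choice to round the width up to a whole number are all assumptions made for the example.

```python
import math

def frequency_distribution(data, num_classes):
    """Return (lower, upper, frequency) for equal-width classes."""
    low, high = min(data), max(data)
    # Width of interval = range / number of desired class groupings,
    # rounded up to a whole number to get convenient endpoints.
    width = math.ceil((high - low) / num_classes)
    table = []
    for k in range(num_classes):
        lower = low + k * width
        upper = lower + width
        last = k == num_classes - 1
        # Classes are [lower, upper) so each value belongs to exactly one
        # class; the last class also includes its upper boundary.
        count = sum(1 for x in data if lower <= x < upper or (last and x == upper))
        table.append((lower, upper, count))
    return table

# Hypothetical sample data (not from the notes).
ages = [12, 15, 17, 21, 24, 24, 27, 32, 35, 38, 41, 43, 46, 53, 58]
for lower, upper, count in frequency_distribution(ages, 5):
    # One asterisk per observation gives a rough text histogram.
    print(f"{lower:>3} to under {upper:<3} {'*' * count}")
```

Each bar length is the class frequency; dividing each count by `len(ages)` would give the relative frequency instead.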

Scatter Diagrams
Scatter diagrams are used to examine possible relationships between two numerical variables.
In a scatter diagram one variable is measured on the vertical axis (Y) and the other variable is
measured on the horizontal axis (X).

Numerical Descriptive Measures


Describing Data
Graphs are a useful way to organise, present and summarise data.
Numerical measures complement them: they summarise features of the data that cannot be easily visualised and make data sets easier to compare.


Measures of Central Tendency

Arithmetic mean: for a sample of size n, the sample mean, denoted X̄, is calculated as:
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$
Where Σ means to sum or add up.
The mean is affected by extreme values.

In an ordered array, the median is the ‘middle’ number: 50% of the data are above it and 50% are below.
Its main advantage over the arithmetic mean is that it is not affected by extreme values.
The location of the median in the ordered array is found by:
$$L = \frac{n+1}{2}$$
This formula does not give the value of the median but the position of the median.
Rule 1: if the number of values in the data set is odd, the median is the middle ranked value.
Rule 2: if the number of values in the data set is even, the median is the mean (average) of the two
middle ranked values.

The mode is the value that occurs most often (most frequently) in the data set. It is not affected by extreme values and, unlike the mean and median, there may be no unique (single) mode for a given data set.
It can be used for either numerical or categorical (nominal) data.

Quartiles
Quartiles split the ranked data into four segments with an equal number of values per segment.

The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger.
The second quartile, Q2, is the same as the median for which 50% of the observations are smaller
and 50% are larger.
Only 25% of the observations are greater than the third quartile, Q3.


Similar to the median, we find a quartile by determining the value in the appropriate position in the ranked data:
First quartile position: $$L_{Q_1} = \frac{n+1}{4}$$
Second quartile position: $$L_{Q_2} = \frac{n+1}{2}$$ (same as the median)
Third quartile position: $$L_{Q_3} = \frac{3(n+1)}{4}$$
Where n is the number of observed values (sample size).

Rule 1: If the position is a whole number, Q is the value at that ranked position.
Rule 2: If the position is a fractional half (e.g. 2.5, 3.5), Q is the mean of the two ranked values on either side of it.
Rule 3: If the position is neither a whole number nor a fractional half (e.g. 2.75, 3.25), round the position to the nearest whole number and Q is the value at that ranked position.
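
The quartile position formulas and the three rules above can be sketched as follows. The function name `quartile` and the sample data are hypothetical, and the sketch assumes at least two observations. Note that library routines often use other interpolation conventions, so their results may differ slightly from these rules.

```python
def quartile(data, q):
    """Return quartile q (1, 2 or 3) using position q(n + 1)/4 and the three rules."""
    ranked = sorted(data)
    n = len(ranked)
    position = q * (n + 1) / 4          # 1-based position in the ranked data
    whole = int(position)
    fraction = position - whole
    if fraction == 0:
        # Rule 1: whole-number position, take that ranked value.
        return ranked[whole - 1]
    if fraction == 0.5:
        # Rule 2: fractional half, average the two neighbouring ranked values.
        return (ranked[whole - 1] + ranked[whole]) / 2
    # Rule 3: otherwise round the position to the nearest ranked value.
    return ranked[round(position) - 1]

# Hypothetical sample of n = 9 values.
sample = [11, 12, 13, 16, 16, 17, 18, 21, 22]
q1, q2, q3 = (quartile(sample, q) for q in (1, 2, 3))
print(q1, q2, q3)        # positions 2.5, 5 and 7.5
print("IQR =", q3 - q1)
```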

Measures of Variation

Measures of variation give information on the spread or variability of the data values.

The range is the simplest measure of variation. It is the difference between the largest and the
smallest values in a set of data.

The interquartile range (IQR) is the difference between the third and first quartiles:
$$\text{IQR} = Q_3 - Q_1$$
Like the median, Q1 and Q3, the IQR is a resistant summary measure (resistant to the presence of extreme values). Because the highest- and lowest-valued observations are excluded from its calculation, using the interquartile range avoids outlier problems.
The sample variance, s², measures the average scatter around the mean.


The sample standard deviation, s, is the most commonly used measure of variation and has the
same units as the original data. This shows the variation about the mean.

The graph below shows the types of variation.

Variance and Standard Deviation


Advantages:
 Each value in the data set is used in the calculation
 Values far from the mean are given extra weight as deviations from the mean are squared

Disadvantages:
 Sensitive to extreme values (outliers)
 Measures of absolute variation, not relative variation i.e. we cannot compare between data
sets with different units or widely different means.

The Z Score
A z-score is a measure of relative standing that takes into account both the mean and the standard deviation. We can compute a z-score for each observation in the data set and use it to identify whether that observation is an outlier. The z-score is the difference between a given observation and the mean, divided by the standard deviation:
$$Z = \frac{X - \bar{X}}{S}$$
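
The z-score definition above can be sketched with Python's standard `statistics` module. The function name `z_scores`, the sample values and the cut-off of |Z| > 3 for flagging outliers are assumptions for illustration; the notes do not state a specific threshold.

```python
import statistics

def z_scores(sample):
    """Return the z-score of every observation in a sample."""
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)   # sample standard deviation (n - 1 divisor)
    return [(x - mean) / s for x in sample]

# Hypothetical sample.
values = [2, 4, 4, 4, 5, 5, 7, 9]
for x, z in zip(values, z_scores(values)):
    # Assumed rule of thumb: flag observations more than 3 standard deviations away.
    flag = "possible outlier" if abs(z) > 3 else ""
    print(f"{x:>3}  z = {z:6.2f}  {flag}")
```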
Shape of a Distribution
The shape of a distribution describes how the data are distributed. Measures of shape:
 Symmetric or skewed

Numerical Measures for a Population


The population summary measures are called parameters. The population mean is the sum of the
values in the population divided by the population size, N.

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + \cdots + X_N}{N}$$

Population variance is the average of the squared deviations of values from the mean.
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$
Population standard deviation shows the variation about the mean and is the square root of the population variance. It has the same units as the original data.
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$$


The Empirical Rule


If the data distribution is approximately bell-shaped, then
The interval μ ±1 σ contains about 68% of the values in the population.
The interval μ ±2 σ contains about 95% of the values in the population
The interval μ ±3 σ contains about 99.7% of the values in the population

Exploratory Data Analysis


Box-and-whisker plot: a graphical display of data using the five-number summary: smallest value, Q1, median (Q2), Q3 and largest value.

The box-and-whisker plots correspond to the distributions as follows:

The sample covariance measures the strength of the linear relationship between two numerical variables.
$$\text{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$


A positive covariance means that there is a positive linear relationship and a negative covariance
means there is a negative linear relationship.
The covariance only indicates the direction of the relationship; no causal effect is implied. It is not a measure of relative strength, and it is affected by the units of measurement.

Correlation measures the relative strength of the linear relationship between two variables.

$$r = \frac{\text{cov}(X, Y)}{S_X S_Y}$$
where
$$S_X = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} \quad \text{and} \quad S_Y = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}}$$
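
The sample covariance and correlation formulas above translate directly into code. This is a minimal sketch; the function names and the paired data are hypothetical.

```python
import math

def sample_covariance(xs, ys):
    """cov(X, Y) = sum((Xi - Xbar)(Yi - Ybar)) / (n - 1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """r = cov(X, Y) / (S_X * S_Y)."""
    # The covariance of a variable with itself is its sample variance.
    s_x = math.sqrt(sample_covariance(xs, xs))
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)

# Hypothetical paired observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print("cov =", sample_covariance(xs, ys))
print("r   =", round(correlation(xs, ys), 4))
```

Rescaling `xs` (for example from metres to centimetres) changes the covariance but leaves r unchanged, which is why r is the measure of relative strength.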

A feature of correlation coefficient r is that it ranges between -1 and 1 where:


 The closer to -1, the stronger the negative linear relationship
 The closer to 1, the stronger the positive linear relationship
 The closer to 0, the weaker the linear relationship

Basic Probability Theory


Probability is a numerical value that represents the chance, likelihood or possibility that an event will occur (always between 0 and 1).
An event is each possible outcome of a variable.

Events
Simple event (denoted A)
 An outcome from a sample space with one characteristic
Complement of an event A (denoted A’)
 All outcomes that are not part of event A.
Joint event (denoted A∩B, pronounced A intersect B)
 Involves two or more characteristics simultaneously
Mutually exclusive events
 Events that cannot occur together
Collectively exhaustive events
 One of the events must occur. The set of events covers the entire sample space.

Probability
The probability of any event must be between 0 and 1, inclusively.
0 ≤ P(A) ≤ 1, for any event A
The sum of the probabilities of all mutually exclusive and collectively exhaustive events is 1.
P(A) + P(B) = 1, if A and B are mutually exclusive and collectively exhaustive.

Computing Joint and Marginal Probabilities


The probability of a joint event, A and B:
$$P(A \text{ and } B) = \frac{\text{number of outcomes satisfying } A \text{ and } B}{\text{total number of elementary outcomes}}$$
Computing a marginal (or simple) probability:
$$P(A) = P(A \text{ and } B_1) + P(A \text{ and } B_2)$$
Where B1 and B2 are mutually exclusive and collectively exhaustive events.


General addition rule


P(A or B) = P(A) + P(B) – P(A and B)

If A and B are mutually exclusive, then P(A and B) = 0, so the rule simplifies to:

P(A or B) = P(A) + P(B)

Computing Conditional Probabilities


A conditional probability is the probability of one event, given that another event has occurred:

$$P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}$$
Where P(A and B) = joint probability of A and B
P(A) = marginal probability of A
P(B) = marginal probability of B

Statistical Independence
Two events are independent if, and only if:
P(A|B) = P(A)
Events A and B are independent when the probability of one event is not affected by the other
event.

Multiplication Rules
Multiplication rule for two events A and B:
P(A and B) = P(A|B) x P(B)

Note: If A and B are independent then


P(A|B) = P(A) and the multiplication rule simplifies to P(A and B) = P(A) x P(B)

Marginal Probability
Marginal probability for event A:

P(A) = P(A and B1) + P(A and B2) = P(A|B1) x P(B1) + P(A|B2) x P(B2)


Where B1 and B2 are mutually exclusive and collectively exhaustive events.
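
The joint, marginal, conditional, addition and multiplication rules above can be checked numerically on a small table of counts. The counts, the event labels and the function names below are hypothetical and exist only for illustration.

```python
# Hypothetical counts of outcomes classified by two characteristics.
counts = {
    ("A", "B1"): 30, ("A", "B2"): 20,
    ("A'", "B1"): 10, ("A'", "B2"): 40,
}
total = sum(counts.values())

def joint(a, b):
    """P(a and b) = outcomes satisfying both / total outcomes."""
    return counts[(a, b)] / total

def marginal_a(a):
    """P(a) = P(a and B1) + P(a and B2)."""
    return joint(a, "B1") + joint(a, "B2")

def marginal_b(b):
    """P(b) = P(A and b) + P(A' and b)."""
    return joint("A", b) + joint("A'", b)

def conditional(a, b):
    """P(a | b) = P(a and b) / P(b)."""
    return joint(a, b) / marginal_b(b)

p_a = marginal_a("A")
p_a_or_b1 = p_a + marginal_b("B1") - joint("A", "B1")       # general addition rule
p_a_total = (conditional("A", "B1") * marginal_b("B1")
             + conditional("A", "B2") * marginal_b("B2"))   # marginal via conditionals
independent = abs(conditional("A", "B1") - p_a) < 1e-12      # P(A|B1) = P(A)?

print(p_a, p_a_or_b1, p_a_total, independent)
```

In this made-up table P(A|B1) = 0.75 while P(A) = 0.5, so A and B1 are not independent.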

Discrete Probability Theory


A random variable represents a possible numerical value from an uncertain event. Discrete random
variables can only assume a countable number of values.

What is a Probability Distribution?


A probability distribution provides the possible values of the random variable and their
corresponding probabilities. A probability distribution can be in the form of a table, graph, or
mathematical formula.
Requirements of a discrete probability distribution:
$$\sum P(X = x) = 1$$
$$0 \le P(X = x) \le 1$$
Discrete Random Variable Summary Measures
Expected value (or mean) of a discrete random variable (weighted average):
$$\mu = E(X) = \sum_{i=1}^{N} X_i P(X_i)$$
Variance of a discrete random variable – definition formula:
$$\sigma^2 = \sum_{i=1}^{N} [X_i - E(X)]^2 P(X_i)$$
Variance of a discrete random variable – alternative calculation formula:
$$\sigma^2 = \sum_{i=1}^{N} X_i^2 P(X_i) - [E(X)]^2$$
Where E(X) = expected value of the discrete random variable X,
Xi = the ith outcome of the discrete random variable X,
P(Xi) = probability of the ith occurrence of X
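
The expected value and both variance formulas above can be sketched as follows. The function names and the probability distribution used are hypothetical; the point is that the definition formula and the calculation formula give the same variance.

```python
def expected_value(values, probs):
    """E(X) = sum of Xi * P(Xi)."""
    return sum(x * p for x, p in zip(values, probs))

def variance_definition(values, probs):
    """sigma^2 = sum of [Xi - E(X)]^2 * P(Xi)."""
    mu = expected_value(values, probs)
    return sum((x - mu) ** 2 * p for x, p in zip(values, probs))

def variance_calculation(values, probs):
    """sigma^2 = sum of Xi^2 * P(Xi) minus [E(X)]^2."""
    mu = expected_value(values, probs)
    return sum(x ** 2 * p for x, p in zip(values, probs)) - mu ** 2

# Hypothetical discrete distribution; the probabilities sum to 1.
values = [0, 1, 2, 3]
probs = [0.1, 0.3, 0.4, 0.2]
print("E(X)     =", round(expected_value(values, probs), 4))
print("Var(X)   =", round(variance_definition(values, probs), 4))
print("Var(X) 2 =", round(variance_calculation(values, probs), 4))
```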

The Covariance
The covariance measures the direction of a linear relationship between two variables.

Definition formula for covariance:
$$\sigma_{XY} = \sum_{i=1}^{N} [X_i - E(X)][Y_i - E(Y)] P(X_i Y_i)$$
Calculation formula for covariance:
$$\sigma_{XY} = \sum_{i=1}^{N} X_i Y_i P(X_i Y_i) - E(X) E(Y)$$
Where Xi and Yi are the ith outcomes of the discrete random variables X and Y respectively,
P(XiYi) = probability of the ith joint occurrence of X and Y

The Sum of Two Random Variables


Expected value of the sum of two random variables
E( X +Y )=E( X )+E(Y )
Variance of the sum of two random variables
$$\text{Var}(X + Y) = \sigma^2_{X+Y} = \sigma^2_X + \sigma^2_Y + 2\sigma_{XY}$$
Standard deviation of the sum of two random variables
$$\sigma_{X+Y} = \sqrt{\sigma^2_{X+Y}}$$

Combinations
The number of combinations of selecting X objects out of n objects is
$$\binom{n}{X} = {}_nC_X = \frac{n!}{X!(n - X)!}$$
Where:
n! = n(n-1)(n-2)…(2)(1)
X! = X(X-1)(X-2)…(2)(1)
0! = 1 (by definition)

The Binomial Distribution Formula

$$P(X) = \frac{n!}{X!(n - X)!} p^X (1 - p)^{n - X}$$
Where:
P(X) = probability of X successes in n trials, with probability of success p on each trial
X = number of ‘successes’ in sample
n = sample size (number of trials or observations)
p = probability of ‘success’
1 – p = probability of failure

Characteristics of the Binomial Distribution


Mean
$$\mu = E(X) = np$$
Variance and standard deviation
$$\sigma^2 = np(1 - p)$$
$$\sigma = \sqrt{np(1 - p)}$$
Where:
n = sample size
p = probability of success
(1 – p) = probability of failure
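
The binomial formula and its mean and variance can be sketched with the standard library (`math.comb` needs Python 3.8 or later). The function name `binomial_probability` and the values n = 5, p = 0.1 are hypothetical.

```python
import math

def binomial_probability(x, n, p):
    """P(X) = n! / (X!(n - X)!) * p^X * (1 - p)^(n - X)."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

# Hypothetical example: n = 5 trials, probability of success p = 0.1.
n, p = 5, 0.1
for x in range(n + 1):
    print(f"P(X = {x}) = {binomial_probability(x, n, p):.5f}")

mean = n * p                      # mu = np
variance = n * p * (1 - p)        # sigma^2 = np(1 - p)
std_dev = math.sqrt(variance)     # sigma = sqrt(np(1 - p))
print(mean, variance, round(std_dev, 4))
```

The probabilities over X = 0, 1, …, n sum to 1, as required of any discrete probability distribution.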

Continuous Probability Distributions


A continuous random variable is a variable that can assume any value on a continuum (can assume
an infinite number of values).


The Normal Distribution

Features of a normal distribution:
 Bell-shaped
 Symmetrical
 Mean, median and mode are equal
 Central location is determined by the mean, μ
 Spread is determined by the standard deviation, σ
 The random variable X has an infinite theoretical range: −∞ to +∞

Many Normal Distributions


By varying the parameters μ and σ, we obtain different normal distributions: changing μ shifts the distribution left or right, and changing σ increases or decreases the spread.

The Normal Probability Density Function


The formula for the normal probability density function is:
$$f(X) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{1}{2}\left[\frac{X - \mu}{\sigma}\right]^2}$$
Where:
e = the mathematical constant approximated by 2.71828
π = the mathematical constant approximated by 3.14159
μ = the population mean
σ = the population standard deviation
X = any value of the continuous variable

Finding Normal Probabilities


Probability is measured by the area under the curve.

Note that the probability of any individual value is zero, since X is continuous and can take infinitely many values over its theoretical range of −∞ to +∞.

Probability as Area Under the Curve


The total area under the curve is 1, and the curve is symmetric so half is above the mean and half is
below.

Empirical Rules
What can we say about the distribution of values around the mean?
There are some general rules:
μ ± 1σ covers approx 68.26% of observations.
μ ± 2σ covers approx 95.44% of observations.
μ ± 3σ covers approx 99.73% of observations.

Translation to the Standardised Normal Distribution


Any normal distribution (with any mean and standard deviation combination) can be transformed into the standardised normal distribution (Z):
$$Z = \frac{X - \mu}{\sigma}$$
The Standardised Normal Distribution
The standardised normal distribution is also known as the Z distribution. It has a mean of 0 and a standard deviation of 1.


Values above the mean have positive Z-values and values below the mean have negative Z-values.

The Standardised Normal Table

General Procedure for Finding Probabilities


To find P(a < X < b) when X is normally distributed:
1. Draw the normal curve for the problem in terms of X.
2. Translate X-values to Z-values and put Z values on your diagram.
3. Use the Standardised Normal Table.

Finding the X Value for a Known Probability


1. Draw a normal curve placing all known values on it, such as mean of X and Z.
2. Shade in area of interest and find cumulative probability.
3. Find the Z value for the known probability.
4. Convert to X units using the formula.
X=μ+Zσ
This is the Z formula rearranged in terms of X.
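
Both procedures above can be sketched with `statistics.NormalDist` from the Python standard library (Python 3.8 or later), which stands in here for the Standardised Normal Table. The function names and the example values μ = 8, σ = 5 are hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()   # mean 0, standard deviation 1

def normal_probability_between(a, b, mu, sigma):
    """P(a < X < b): translate to Z-values, then use cumulative areas."""
    z_a = (a - mu) / sigma
    z_b = (b - mu) / sigma
    return STANDARD_NORMAL.cdf(z_b) - STANDARD_NORMAL.cdf(z_a)

def x_for_cumulative_probability(prob, mu, sigma):
    """Find X such that P(X < x) = prob, using X = mu + Z * sigma."""
    z = STANDARD_NORMAL.inv_cdf(prob)
    return mu + z * sigma

# Hypothetical normal variable with mu = 8 and sigma = 5.
print(round(normal_probability_between(8, 8.6, 8, 5), 4))
print(round(x_for_cumulative_probability(0.20, 8, 5), 2))
```

`cdf` returns the cumulative area to the left of a Z-value, which is what a cumulative Z table lists; `inv_cdf` performs the reverse lookup.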

Sampling Distributions and Data Sampling


Recall that a sample is a set of objects or people selected out of a population. Presumably, it is taken
in a process that is representative of the population that you care about. Yet, not all samples are
equal.

A sampling distribution is a distribution of all of the possible values of a statistic for a given sample
size selected from a population.


If the Population is Normal


If a population is normal with mean μ and standard deviation σ, the sampling distribution of X̄ is also normally distributed with:
$$\mu_{\bar{X}} = \mu \quad \text{and} \quad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$
Sampling Distribution Properties


If the Population is NOT Normal


We can apply the Central Limit Theorem, which states that regardless of the shape of the population distribution, as long as the sample size is large enough (generally n ≥ 30), the sampling distribution of X̄ will be approximately normally distributed with:
$$\mu_{\bar{X}} = \mu \quad \text{and} \quad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$
The Central Limit Theorem

Z Formula for Sampling Distribution


If the population is normal OR the Central Limit Theorem is applicable, we can use the normal distribution and the Z table to find probabilities for the sample mean. The relevant formula is:
$$Z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$
Where:
X̄ = sample mean
μ = population mean
σ = population standard deviation
n = sample size
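
The Z formula for the sample mean can be sketched as below. The function name and the example (μ = 8, σ = 3, n = 36) are hypothetical, and `statistics.NormalDist` again replaces the Z table.

```python
import math
from statistics import NormalDist

def sample_mean_probability_between(a, b, mu, sigma, n):
    """P(a < Xbar < b) using Z = (Xbar - mu) / (sigma / sqrt(n))."""
    standard_error = sigma / math.sqrt(n)   # standard deviation of the sample mean
    z_a = (a - mu) / standard_error
    z_b = (b - mu) / standard_error
    std_normal = NormalDist()
    return std_normal.cdf(z_b) - std_normal.cdf(z_a)

# Hypothetical population with mu = 8 and sigma = 3, samples of size n = 36.
print(round(sample_mean_probability_between(7.8, 8.2, 8, 3, 36), 4))
```

Because the standard error shrinks as n grows, the same interval around μ captures more of the sampling distribution for larger samples.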

Sampling Distribution of the Proportion

π is the proportion of items in the population with a characteristic of interest.
p is the sample proportion and provides an estimate of π:
$$p = \frac{X}{n} = \frac{\text{number of items in the sample having the characteristic of interest}}{\text{sample size}}$$

If we select all possible samples of a given size, the distribution of the resulting sample proportions is the sampling distribution of the proportion.

Confidence Intervals
Point and Interval Estimates
A point estimate is the value of a single sample statistic.
A confidence interval provides a range of values constructed around the point estimate.


Point Estimates

Confidence Interval Estimate


An interval gives a range of values:
 Takes into consideration variation in sample statistics from sample to sample
 Based on observations from 1 sample
 Gives information about closeness to unknown population parameters
 Stated in terms of level of confidence
 Can never be 100% confident

Confidence Interval
The general formula for all confidence intervals is:
Point Estimate ± (Critical Value)*(Standard Error)
The confidence level represents the confidence that the interval will contain the unknown population parameter.
Common confidence levels = 90%, 95% or 99%:
• Also written (1 − α) = 0.90, 0.95 or 0.99
A relative frequency interpretation:
• In the long run, 90%, 95% or 99% of all the confidence intervals that can be constructed (in
repeated samples) will contain the unknown true parameter.
A specific interval will either contain or will not contain the true parameter.
• No probability involved in a specific interval.


Confidence Interval for μ (σ Known)


Assumptions:
 Population standard deviation σ is known
 Population is normally distributed
 If population is not normal, use Central Limit Theorem

Confidence interval estimate:
$$\bar{X} \pm Z \frac{\sigma}{\sqrt{n}}$$
Where X̄ is the point estimate,
Z is the normal distribution critical value for a probability of α/2 in each tail,
σ/√n is the standard error.
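
The interval X̄ ± Z σ/√n can be sketched as follows. The function name and the inputs (X̄ = 2.20, σ = 0.35, n = 11) are hypothetical; the critical value is computed with `statistics.NormalDist` rather than read from a table.

```python
import math
from statistics import NormalDist

def confidence_interval_sigma_known(x_bar, sigma, n, confidence=0.95):
    """Point estimate plus or minus (critical value) * (standard error)."""
    alpha = 1 - confidence
    # Critical Z value leaving an area of alpha / 2 in each tail.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sigma / math.sqrt(n)
    return x_bar - margin, x_bar + margin

# Hypothetical sample: mean 2.20 from n = 11 observations, known sigma = 0.35.
lower, upper = confidence_interval_sigma_known(2.20, 0.35, 11)
print(round(lower, 4), round(upper, 4))
```

Raising `confidence` to 0.99 widens the interval, because a larger critical value is needed to leave less area in the tails.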

Finding the Critical Z Value


Consider a 95% confidence interval: 1 − α = 0.95, so α = 0.05 and α/2 = 0.025 in each tail, which gives critical values of Z = ±1.96.


Confidence Intervals

Confidence Interval for μ (σ Unknown)


If the population standard deviation σ is unknown, we can substitute the sample standard
deviation, s.
This introduces extra uncertainty, since s varies from sample to sample. So we use the Student t distribution instead of the normal distribution:
 The t value depends on the degrees of freedom (d.f.), which equal the sample size minus 1 (n − 1)
 The d.f. are the number of observations that are free to vary after the sample mean has been calculated

The confidence interval estimate when σ is unknown is:
$$\bar{X} \pm t_{n-1} \frac{S}{\sqrt{n}}$$
Where t is the critical value of the t distribution with n – 1 degrees of freedom and an area of α/2 in each tail.

Student’s t Distribution

Hypothesis Testing
A hypothesis is a statement (assumption) about a population parameter.


The Null Hypothesis, H0


States the belief or assumption in the current situation (status quo).
Begin with the assumption that the null hypothesis is true. Similar to the notion of innocent until
proven guilty.
Always contains an ‘=’, ‘≤’ or ‘≥’ sign.
It is always about a population parameter.

The Alternative Hypothesis, H1


The alternative hypothesis is the opposite of the null hypothesis and challenges the status quo. It can only contain a ‘<’, ‘>’ or ‘≠’ sign.

Reason for Rejecting the Null Hypothesis

The Level of Significance, α

This defines the unlikely values of the sample statistic if the null hypothesis is true.
 Defines the rejection region of the sampling distribution
 Designated by α (level of significance)
It is selected by the researcher at the beginning and provides the critical value(s) of the test.

Level of Significance and the Rejection Region


Errors in Making Decisions


Type I error
 Rejected a true null hypothesis
 Considered a serious type of error
Type II error
 Failed to reject a false null hypothesis (or accept a null hypothesis when it is false)
The probability of Type I error is α
 Called level of significance of the test
 Set by the researcher in advance
The probability of Type II error is β

Outcome and Probabilities

Z Test of Hypothesis for the Mean (σ Known)


Critical Value Approach to Testing


For a two-tailed test for the mean, σ known:
1. Convert the sample statistic (X̄) to the test statistic (Z statistic).
2. Determine the critical Z values for a specified level of significance α from a table.
3. Decision rule: if the test statistic falls in the rejection region, reject H0; otherwise do not reject H0.

Two-tail Tests

Six Steps in Hypothesis Testing

1. State the null hypothesis, H0, and the alternative hypothesis, H1.
2. Choose the level of significance, α, and the sample size, n.
3. Determine the appropriate test statistic and sampling distribution.
4. Determine the critical values that divide the rejection and non-rejection regions.
5. Collect data and compute the value of the test statistic.
6. Make the statistical decision and state the managerial conclusion.
o If the test statistic falls into the non-rejection region, do not reject the null hypothesis H0. If the test statistic falls into the rejection region, reject the null hypothesis.
o Express the managerial conclusion in the context of the real-world problem.
The P-value approach to testing

P-value: the probability of obtaining a test statistic equal to or more extreme (≤ or ≥) than the observed sample value, given H0 is true.
 Also called the observed level of significance
 Smallest value of α for which H0 can be rejected
 Obtain the P-value from a table

If p-value < α, reject H0

If p-value ≥ α, do not reject H0
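
The p-value decision rule can be sketched for a two-tailed Z test of the mean with σ known. The function name and the numbers (X̄ = 2.84, hypothesised mean 3, σ = 0.8, n = 100) are hypothetical, and `statistics.NormalDist` is used in place of a table.

```python
import math
from statistics import NormalDist

def z_test_two_tailed(x_bar, mu_0, sigma, n, alpha=0.05):
    """Return (Z statistic, p-value, reject H0?) for H0: mu = mu_0 vs H1: mu != mu_0."""
    z = (x_bar - mu_0) / (sigma / math.sqrt(n))
    # Two-tailed p-value: area beyond |Z| in both tails of the standard normal.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha

# Hypothetical test of H0: mu = 3.
z, p_value, reject = z_test_two_tailed(2.84, 3, 0.8, 100)
print(round(z, 2), round(p_value, 4), "reject H0" if reject else "do not reject H0")
```

The critical value approach leads to the same decision: here |Z| = 2.0 exceeds the two-tailed critical value of 1.96 at α = 0.05.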

One-tail Tests
In many cases, the alternative hypothesis focuses on a particular direction

Lower-tail Tests

Upper-tail Tests


t Test of Hypothesis for the mean (σ Unknown)

Simple Linear Regression


Introduction to Regression Analysis
Regression analysis is used to:
 Predict the value of a dependent variable (Y) based on the value of at least one independent
variable (X)
 Explain the impact of changes in an independent variable on the dependent variable

Dependent variable (Y): the variable we wish to predict or explain (response variable)
Independent variable (X): the variable used to explain the dependent variable (explanatory variable)
Simple linear regression:
 Only one independent variable, x
 Relationship between X and Y described by a linear function
 Changes in Y are assumed to be caused by changes in X

Simple Linear Regression Model


Simple Linear Regression Equation (Prediction Line)


The simple linear regression equation provides an estimate of the population regression line.

The individual random error terms ei have a mean of zero.

Least Squares Method


b0 and b1 are obtained by finding the values that minimise the sum of the squared differences between the actual values (Y) and the predicted values (Ŷ):
$$\min \sum (Y_i - \hat{Y}_i)^2 = \min \sum (Y_i - (b_0 + b_1 X_i))^2$$

b0 is the estimated average value of Y when the value of X is zero.


b1 is the estimated change in the average value of Y as a result of a one-unit change in X.
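
The least squares coefficients can be computed in closed form. The notes state only the minimisation problem; the formulas used below, b1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)² and b0 = Ȳ − b1 X̄, are the standard solution to it and are brought in here as an assumption. The function name and the data are hypothetical.

```python
def least_squares(xs, ys):
    """Return (b0, b1) minimising the sum of squared differences Yi - (b0 + b1 * Xi)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ssxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ssx = sum((x - x_bar) ** 2 for x in xs)
    b1 = ssxy / ssx            # estimated change in average Y per one-unit change in X
    b0 = y_bar - b1 * x_bar    # estimated average Y when X is zero
    return b0, b1

# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1 = least_squares(xs, ys)
print(f"Y-hat = {b0:.2f} + {b1:.2f} X")
```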

Interpolation vs. Extrapolation


When using a regression model for prediction, only predict within the relevant range of data.


Measures of Variation
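Total variation is made up of two parts: SST = SSR + SSE, where SST is the total sum of squares, SSR is the regression sum of squares (explained variation) and SSE is the error sum of squares (unexplained variation).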

Coefficient of Determination, r2
The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable.
The coefficient of determination is also called r-squared and is denoted r².
$$r^2 = \frac{SSR}{SST} = \frac{\text{regression sum of squares}}{\text{total sum of squares}}$$
Note: 0 ≤ r² ≤ 1

Standard Error of the Estimate


The standard deviation of the variation of observations around the regression line is estimated by:
$$S_{YX} = \sqrt{\frac{SSE}{n - 2}} = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n - 2}}$$
Where SSE = error sum of squares
n = sample size
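
The measures of variation, r² and the standard error of the estimate can be sketched together. The function name and data are hypothetical; the predicted values are assumed to come from a least squares line with an intercept (here Ŷ = 2.2 + 0.6X), which is what makes SSR = SST − SSE valid.

```python
import math

def variation_measures(ys, y_hats):
    """Return (SST, SSR, SSE, r^2, S_YX) from actual and predicted Y values."""
    n = len(ys)
    y_bar = sum(ys) / n
    sst = sum((y - y_bar) ** 2 for y in ys)                    # total variation
    sse = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats))  # unexplained variation
    ssr = sst - sse     # explained variation; assumes a least squares fit with intercept
    r_squared = ssr / sst
    s_yx = math.sqrt(sse / (n - 2))   # standard error of the estimate
    return sst, ssr, sse, r_squared, s_yx

# Hypothetical data with predictions from the assumed line Y-hat = 2.2 + 0.6 X.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
y_hats = [2.2 + 0.6 * x for x in xs]
sst, ssr, sse, r_squared, s_yx = variation_measures(ys, y_hats)
print(round(sst, 2), round(ssr, 2), round(sse, 2), round(r_squared, 2), round(s_yx, 4))
```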

Comparing Standard Errors


SYX is a measure of the variation of observed Y values from the regression line.

The magnitude of SYX should always be judged relative to the size of the Y values in the sample data.

Assumptions of Regression
Use the acronym LINE


Linearity – The underlying relationship between X and Y is linear


Independence of errors – Error values are statistically independent
Normality of error – Error values (ε) are normally distributed for any given value of X
Equal variance – The probability distribution of the errors has constant variance

Residual Analysis
The residual for observation i, ei, is the difference between its observed and predicted value.
$$e_i = Y_i - \hat{Y}_i$$
Check the assumptions of regression by examining the residuals:
 Examine the linearity assumption
 Evaluate independence assumption
 Evaluate normal distribution assumption
 Examine for constant variance for all levels of X
Graphical Analysis of Residuals:
 Can plot residuals vs X.

Residual Analysis for Independence

Residual Analysis for Normality


A normal probability plot of the residuals can be used to check for normality.


Residual Analysis for Equal Variance

Inferences about the Slope


The standard error of the regression slope coefficient (b1) is estimated by:
$$S_{b_1} = \frac{S_{YX}}{\sqrt{SSX}} = \frac{S_{YX}}{\sqrt{\sum (X_i - \bar{X})^2}}$$
Where Sb1 = estimate of the standard error of the least squares slope
$$S_{YX} = \sqrt{\frac{SSE}{n - 2}}$$ = standard error of the estimate

Inference about the Slope: t Test


t test for a population slope: is there a linear relationship between X and Y?

Null and alternative hypotheses


H0: β1 = 0 (no linear relationship)
H1: β1 ≠ 0 (linear relationship does exist)

Test statistic, with d.f. = n – 2:
$$t = \frac{b_1 - \beta_1}{S_{b_1}}$$
Where b1 = regression slope coefficient
β1 = hypothesized slope
Sb1 = standard error of the slope

F Test for Significance


F test statistic:
$$F = \frac{MSR}{MSE}, \quad \text{where } MSR = \frac{SSR}{k} \text{ and } MSE = \frac{SSE}{n - k - 1}$$

F follows an F distribution with k numerator and (n – k – 1) denominator degrees of freedom.


k = the number of independent (explanatory) variables in the regression model.

Multiple Linear Regression


The Multiple Regression Model

Multiple Regression Equation


The coefficients of the multiple regression model are estimated using sample data.
Multiple regression equation with k independent variables:
$$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \cdots + b_k X_{ki}$$


Are Individual Variables Significant?


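Use t tests of the individual variable slopes to show whether there is a linear relationship between the variable Xj and Y.
Hypotheses:
H0: βj = 0 (no linear relationship)
H1: βj ≠ 0 (linear relationship does exist between Xj and Y)

Test statistic (d.f. = n – k – 1):
$$t = \frac{b_j - 0}{S_{b_j}}$$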

Coefficient of Multiple Determination


Reports the proportion of total variation in Y explained by all X variables taken together.
$$r^2 = \frac{SSR}{SST} = \frac{\text{regression sum of squares}}{\text{total sum of squares}}$$
Adjusted r2
r2 never decreases when a new X variable is added to the model.
 This can be a disadvantage when comparing models.
What is the net effect of adding a new variable?
 We lose a degree of freedom when a new X variable is added
 Did the new X variable add enough explanatory power to offset the loss of one degree of
freedom?
Adjusted r² shows the proportion of variation in Y explained by all X variables, adjusted for the number of X variables used:
$$r^2_{adj} = 1 - \left[ (1 - r^2) \left( \frac{n - 1}{n - k - 1} \right) \right]$$
Where n = sample size, k = number of independent variables
 Penalises excessive use of unimportant independent variables
 Smaller than r2
 Useful in comparing among models


Is the Model Significant?


F Test for Overall Significance of the Model:
 Shows if there is a linear relationship between all of the X variables considered together and
Y.
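Hypotheses:
H0: β1 = β2 = … = βk = 0 (no linear relationship)
H1: at least one βj ≠ 0 (at least one independent variable affects Y)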

F Test for Overall Significance


Test statistic:
$$F = \frac{MSR}{MSE} = \frac{SSR / k}{SSE / (n - k - 1)}$$
Where F has:
k numerator degrees of freedom
n – (k + 1) = (n – k – 1) denominator degrees of freedom
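
The overall F statistic and the adjusted r² both follow from SSR, SSE, n and k, so they can be sketched together. The function names and the input values are hypothetical. The critical F value still has to come from an F table, which this sketch does not reproduce.

```python
def overall_f_statistic(ssr, sse, n, k):
    """F = MSR / MSE with k and (n - k - 1) degrees of freedom."""
    msr = ssr / k                # mean square regression
    mse = sse / (n - k - 1)      # mean square error
    return msr / mse

def adjusted_r_squared(ssr, sse, n, k):
    """r^2_adj = 1 - (1 - r^2)(n - 1)/(n - k - 1), with r^2 = SSR / SST."""
    r_squared = ssr / (ssr + sse)    # SST = SSR + SSE
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# Hypothetical sums of squares from a model with k = 1 variable and n = 5 observations.
print(round(overall_f_statistic(3.6, 2.4, 5, 1), 4))
print(round(adjusted_r_squared(3.6, 2.4, 5, 1), 4))
```

With these inputs r² is 0.6 while the adjusted value is lower, which shows the penalty for the degree of freedom used by the X variable.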

Residuals in Multiple Regression


Multiple Regression Assumptions


Errors (residuals): ei = Yi − Ŷi

Assumptions:
 The errors are normally distributed
 Errors have a constant variance
 The model errors are independent

Residual Plots Used in Multiple Regression


These residual plots are used in multiple regression:
• Residuals vs. Ŷi (predicted values, e.g. predicted sales)
• Residuals vs. X1i
• Residuals vs. X2i
• Residuals vs. time (if time series data)

Interaction Effects
Sometimes our model will predict that, in addition to individual variables influencing our dependent variable, some combination of these variables will differentially affect the dependent variable.
