Quantitative Methods Summary Notes
Quantitative Methods 1
Key definitions
A population consists of all the members of a group about which you want to draw a conclusion.
A sample is the portion of the population selected for analysis.
A parameter is a numerical measure that describes a characteristic of a population.
A statistic is a numerical measure that describes a characteristic of a sample.
Inferential Statistics
Estimation: Estimate the population mean income (parameter) using the sample mean income
(statistic).
Hypothesis testing: Test the claim that the population mean income is $80,000.
Defining Data
Categorical (Qualitative)
Simply classifies data into categories, e.g. marital status, hair colour, gender.
Numerical (Discrete)
Counted items (finite number of values), e.g. number of children, number of people who have
type O blood.
Numerical (Continuous)
Measured characteristics (infinite number of items) e.g. weight, height.
Graphical Techniques
What is a frequency distribution?
A frequency distribution is a summary table in which data are arranged in numerically ordered
classes or intervals.
The number of observations in each ordered class or interval becomes the corresponding frequency
of that class or interval.
It condenses the raw data (i.e. large datasets) into a more useful form.
It allows for a quick visual interpretation of the data and first inspection of the shape of the data.
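As a minimal sketch (assuming Python and hypothetical data), a frequency distribution can be built by assigning each raw value to a class interval and counting the observations per interval:

from collections import Counter

# Hypothetical raw data: ages of 12 survey respondents
ages = [23, 35, 41, 28, 52, 37, 45, 31, 29, 48, 36, 44]

# Assign each value to a class interval of width 10 (20-29, 30-39, ...)
width = 10
freq = Counter((age // width) * width for age in ages)

for start in sorted(freq):
    print(f"{start}-{start + width - 1}: frequency {freq[start]}")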
Scatter Diagrams
Scatter diagrams are used to examine possible relationships between two numerical variables.
In a scatter diagram one variable is measured on the vertical axis (Y) and the other variable is
measured on the horizontal axis (X).
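A minimal sketch of a scatter diagram (assuming Python with matplotlib and hypothetical data):

import matplotlib.pyplot as plt

# Hypothetical data: advertising spend (X) vs sales (Y)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

plt.scatter(x, y)            # X on the horizontal axis, Y on the vertical axis
plt.xlabel("advertising spend (X)")
plt.ylabel("sales (Y)")
plt.show()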
Arithmetic Mean
For a sample of size n, the sample mean, denoted \bar{X}, is calculated as:
\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}
Where Σ means to sum or add up.
This formula is affected by extreme values.
The median is the ‘middle’ number of an ordered array, for which 50% of the data lie
above and 50% of the data lie below.
Its main advantage over the arithmetic mean is that it is not affected by extreme values.
The location of the median is found by: L = \frac{n+1}{2}
This formula does not give the value of the median but the position of the median.
Rule 1: if the number of values in the data set is odd, the median is the middle ranked value.
Rule 2: if the number of values in the data set is even, the median is the mean (average) of the two
middle ranked values.
The mode is the measure of central tendency given by the value that occurs most often (most
frequently) in the data set. It is not affected by extreme values and, unlike the mean and median,
there may be no unique (single) mode for a given data set.
It is used for either numerical or categorical (nominal) data.
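Python's standard library implements all three measures; a small sketch with hypothetical data (note that statistics.mode raised an error on ties before Python 3.8 and returns the first mode from 3.8 onwards):

import statistics

data = [3, 1, 4, 1, 5, 9, 2, 6, 1]   # hypothetical sample, n = 9 (odd)

print(statistics.mean(data))    # arithmetic mean = sum / n
print(statistics.median(data))  # middle value of the ordered array (Rule 1, since n is odd)
print(statistics.mode(data))    # most frequent value: 1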
Quartiles
Quartiles split the ranked data into four segments with an equal number of values per segment.
The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger.
The second quartile, Q2, is the same as the median for which 50% of the observations are smaller
and 50% are larger.
Only 25% of the observations are greater than the third quartile, Q3.
Similar to the median, we find a quartile by determining the value in the appropriate position in the
ranked data:
First quartile position: L_{Q_1} = \frac{n+1}{4}
Second quartile position: L_{Q_2} = \frac{2(n+1)}{4} = \frac{n+1}{2} (same as the median)
Third quartile position: L_{Q_3} = \frac{3(n+1)}{4}
Where n is the number of observed values (sample size).
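A sketch of the position rule above, assuming linear interpolation when the position is not a whole number (a convention the notes do not specify):

def quartile(sorted_data, q):
    """Value at position L = q(n + 1)/4 in the ranked data (q = 1, 2, 3)."""
    n = len(sorted_data)
    pos = q * (n + 1) / 4        # 1-indexed position in the ordered data
    lo = int(pos)
    frac = pos - lo              # fractional part: 0 means an exact rank
    if frac == 0:
        return sorted_data[lo - 1]
    # Interpolate between the two neighbouring ranked values (assumed convention)
    return sorted_data[lo - 1] + frac * (sorted_data[lo] - sorted_data[lo - 1])

data = sorted([11, 12, 13, 16, 16, 17, 18, 21, 22])   # n = 9
print(quartile(data, 1))   # position 2.5 -> 12.5
print(quartile(data, 2))   # position 5   -> 16 (the median)
print(quartile(data, 3))   # position 7.5 -> 19.5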
Measures of Variation
Measures of variation give information on the spread or variability of the data values.
The range is the simplest measure of variation. It is the difference between the largest and the
smallest values in a set of data.
Like the median, Q1 and Q3, the interquartile range (IQR) is a resistant summary measure
(resistant to the presence of extreme values).
We can eliminate outlier problems by using the interquartile range, since the highest- and
lowest-valued observations are excluded from the calculation.
IQR = Q_3 - Q_1
The sample variance s2 measures the average scatter around the mean.
The sample standard deviation, s, is the most commonly used measure of variation and has the
same units as the original data. This shows the variation about the mean.
Disadvantages:
• Sensitive to extreme values (outliers)
• Measures of absolute variation, not relative variation, i.e. we cannot compare data
sets with different units or widely different means.
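A sketch of these measures with hypothetical data (statistics.variance and statistics.stdev use the sample forms, dividing by n − 1):

import statistics

data = [11, 12, 13, 16, 16, 17, 18, 21, 22]

data_range = max(data) - min(data)     # range = largest - smallest = 11
s2 = statistics.variance(data)         # sample variance s^2 (divides by n - 1)
s = statistics.stdev(data)             # sample standard deviation s = sqrt(s^2)
# IQR = Q3 - Q1, e.g. via the quartile() sketch above
print(data_range, s2, s)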
The Z Score
A z-score is a measure of relative standing that takes into account both the mean and the standard
deviation. For each observation in the dataset we can compute a z-score, on the basis of which we
can identify whether the observation is an outlier. It is the difference between a given observation
and the mean, divided by the standard deviation:
Z = \frac{X - \bar{X}}{S}
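A sketch of z-score screening on hypothetical data; the |z| > 2 cut-off is one common rule of thumb (some texts use 3):

import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9, 30]    # hypothetical sample; 30 looks extreme
x_bar = statistics.mean(data)
s = statistics.stdev(data)

for x in data:
    z = (x - x_bar) / s                # Z = (X - X̄) / S
    if abs(z) > 2:                     # flag observations far from the mean
        print(f"{x} is a potential outlier (z = {z:.2f})")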
Shape of a Distribution
This describes how data are distributed; measures of shape indicate whether a distribution is
symmetric or skewed.
Population Measures
The population mean is the sum of the values in the population divided by the population size N:
\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + \cdots + X_N}{N}
Population variance is the average of the squared deviations of values from the mean.
\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}
Population Standard Deviation shows the variation about the mean and is the square root of the
population variance. It has the same units as the original data.
\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}
The sample covariance measures the strength of the linear relationship between two numerical
variables.
cov(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}
A positive covariance means that there is a positive linear relationship and a negative covariance
means there is a negative linear relationship.
The covariance indicates only the direction of the relationship; no causal effect is implied.
It is not a measure of relative strength and is affected by the units of measurement.
Correlation measures the relative strength of the linear relationship between two variables.
r = \frac{cov(X, Y)}{S_X S_Y}, where S_X = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} and S_Y = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}}.
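A direct transcription of both formulas into Python with hypothetical data (Python 3.10+ also ships statistics.covariance and statistics.correlation):

import statistics

def sample_cov(x, y):
    """cov(X, Y) = sum((Xi - X̄)(Yi - Ȳ)) / (n - 1)"""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

cov_xy = sample_cov(x, y)                                  # 1.5 (positive direction)
r = cov_xy / (statistics.stdev(x) * statistics.stdev(y))   # ≈ 0.77 (relative strength)
print(cov_xy, r)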
Events
Simple event (denoted A)
An outcome from a sample space with one characteristic
Complement of an event A (denoted A’)
All outcomes that are not part of event A.
Joint event (denoted A∩B, pronounced A intersect B)
Involves two or more characteristics simultaneously
Mutually exclusive events
Events that cannot occur together
Collectively exhaustive events
One of the events must occur. The set of events covers the entire sample space.
Probability
The probability of any event must be between 0 and 1, inclusively.
0 ≤ P(A) ≤ 1, for any event A
The sum of the probabilities of all mutually exclusive and collectively exhaustive events is 1.
P(A) + P(B) = 1, if A and B are mutually exclusive and collectively exhaustive.
Conditional Probability
The conditional probability of A given B:
P(A|B) = \frac{P(A \text{ and } B)}{P(B)}
Where P(A and B) = joint probability of A and B
P(A) = marginal probability of A
P(B) = marginal probability of B
Statistical Independence
Two events are independent if, and only if:
P(A|B) = P(A)
Events A and B are independent when the probability of one event is not affected by the other
event.
Multiplication Rules
Multiplication rule for two events A and B:
P(A and B) = P(A|B) x P(B)
Marginal Probability
Marginal probability for event A, where B_1, B_2, \ldots, B_k are mutually exclusive and
collectively exhaustive events:
P(A) = \sum_{i=1}^{k} P(A \text{ and } B_i)
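A sketch with hypothetical joint probabilities, tying together the marginal, conditional and independence rules above:

# Hypothetical joint probabilities for events A and B
p_a_and_b = 0.20       # P(A and B)
p_a_and_not_b = 0.30   # P(A and B')
p_b = 0.40             # P(B)

# Marginal probability: P(A) = P(A and B) + P(A and B')
p_a = p_a_and_b + p_a_and_not_b            # 0.50

# Conditional probability: P(A|B) = P(A and B) / P(B)
p_a_given_b = p_a_and_b / p_b              # 0.50

# A and B are independent here, since P(A|B) = P(A)
print(p_a, p_a_given_b, p_a_given_b == p_a)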
Variance of a discrete random variable – alternative calculation formula
\sigma^2 = \sum_{i=1}^{N} X_i^2 \, P(X_i) - [E(X)]^2
Where:
E(X) = expected value of the discrete random variable X
X_i = the i-th outcome of the discrete random variable X
P(X_i) = probability of the i-th outcome of X
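A sketch of the formula using a hypothetical distribution (number of heads in two fair coin tosses):

outcomes = [0, 1, 2]          # possible values of X
probs = [0.25, 0.50, 0.25]    # P(X = x) for each outcome

e_x = sum(x * p for x, p in zip(outcomes, probs))                 # E(X) = 1.0
var_x = sum(x**2 * p for x, p in zip(outcomes, probs)) - e_x**2   # E(X²) - E(X)² = 0.5
print(e_x, var_x)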
The Covariance
The covariance measures the direction of a linear relationship between two variables.
The variance of the sum of two random variables is \sigma^2_{X+Y} = \sigma^2_X + \sigma^2_Y + 2\,cov(X, Y), and its standard deviation is \sigma_{X+Y} = \sqrt{\sigma^2_{X+Y}}.
Combinations
The number of combinations of selecting X objects out of n objects is
\binom{n}{X} = {}_{n}C_{X} = \frac{n!}{X!\,(n - X)!}
Where:
n! = n(n-1)(n-2)…(2)(1)
x! = x(x-1)(x-2)…(2)(1)
0! = 1 (by definition)
Binomial Distribution
The probability of X successes in n trials is:
P(X) = \frac{n!}{X!\,(n - X)!} \, p^X (1 - p)^{n - X}
Where:
P(X) = probability of X successes in n trials, with probability of success p on each trial
X = number of ‘successes’ in sample
n = sample size (number of trials or observations)
p = probability of ‘success’
1 – p = probability of failure
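The formula translates directly into Python using math.comb for the binomial coefficient; the numbers below are hypothetical:

from math import comb

def binomial_pmf(x, n, p):
    """P(X) = nCx * p^x * (1 - p)^(n - x)"""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Hypothetical: probability of exactly 3 successes in 10 trials with p = 0.5
print(binomial_pmf(3, 10, 0.5))   # 120 / 1024 ≈ 0.1172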
Continuous Probability Distributions
Note that the probability of any individual value is zero, since the X axis has an infinite
theoretical range: -\infty to +\infty.
Empirical Rules
What can we say about the distribution of values around the mean?
There are some general rules for bell-shaped (normal) distributions:
• μ ± 1σ covers about 68% of values
• μ ± 2σ covers about 95% of values
• μ ± 3σ covers about 99.7% of values
Any value X from a normal distribution can be standardised:
Z = \frac{X - \mu}{\sigma}
The Standardised Normal Distribution
Is also known as the Z distribution which has a mean of 0 and the standard deviation of 1.
Values above the mean have positive z-values and values below the mean have negative z-values.
A sampling distribution is a distribution of all of the possible values of a statistic for a given sample
size selected from a population.
Z-value for the sampling distribution of the sample mean:
Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}
Where:
\bar{X} = sample mean
\mu = population mean
\sigma = population standard deviation
n = sample size
The sample proportion of successes: p = \frac{X}{n}
Selecting all possible samples of a certain size, the distribution of all possible sample proportions is
the sampling distribution of the proportion.
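A sketch using the Z formula for the sample mean (hypothetical population values; statistics.NormalDist with no arguments is the standard normal):

from math import sqrt
from statistics import NormalDist

# Hypothetical population: mu = 8, sigma = 3; sample of n = 36 with X̄ = 7.8
mu, sigma, n = 8, 3, 36
x_bar = 7.8

z = (x_bar - mu) / (sigma / sqrt(n))   # standard error = sigma / sqrt(n) = 0.5
print(z, NormalDist().cdf(z))          # z = -0.4, P(X̄ < 7.8) ≈ 0.3446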
Confidence Intervals
Point and Interval Estimates
A point estimate is the value of a single sample statistic.
A confidence interval provides a range of values constructed around the point estimate.
Confidence Interval
The general formula for all confidence intervals is:
Point Estimate ± (Critical Value)*(Standard Error)
Confidence Level
Represents the confidence that the interval will contain the unknown population parameter.
Common confidence levels = 90%, 95% or 99%:
• Also written (1 − α) = 0.90, 0.95 or 0.99
A relative frequency interpretation:
• In the long run, 90%, 95% or 99% of all the confidence intervals that can be constructed (in
repeated samples) will contain the unknown true parameter.
A specific interval will either contain or will not contain the true parameter.
• No probability involved in a specific interval.
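A sketch of a 95% interval for the mean, assuming a known population σ so the Z critical value applies (with σ unknown, the t distribution of the next section is used instead; the numbers are hypothetical):

from math import sqrt
from statistics import NormalDist

n, x_bar, sigma = 25, 50, 10          # hypothetical sample and known sigma
confidence = 0.95

z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # ≈ 1.96

margin = z_crit * sigma / sqrt(n)     # (critical value) * (standard error)
print(x_bar - margin, x_bar + margin) # ≈ 46.08 to 53.92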
Student’s t Distribution
Hypothesis Testing
A hypothesis is a statement (assumption) about a population parameter.
Two-tail Tests
One-tail Tests
In many cases, the alternative hypothesis focuses on a particular direction:
Lower-tail Tests
Upper-tail Tests
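As a minimal sketch (hypothetical numbers, known σ, reusing the $80,000 claim from the estimation example above), a two-tail Z test could be run as:

from math import sqrt
from statistics import NormalDist

# H0: mu = 80,000 vs H1: mu != 80,000 (two-tail)
mu0, sigma, n, x_bar = 80_000, 12_000, 100, 77_500

z = (x_bar - mu0) / (sigma / sqrt(n))     # test statistic ≈ -2.08
p_value = 2 * NormalDist().cdf(-abs(z))   # two-tail p-value ≈ 0.037

print(z, p_value)   # p < 0.05 -> reject H0 at the 5% significance level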
Dependent variable (Y): the variable we wish to predict or explain (response variable)
Independent variable (X): the variable used to explain the dependent variable (explanatory variable)
Simple linear regression:
Only one independent variable, X
Relationship between X and Y described by a linear function
Changes in Y are assumed to be caused by changes in X
The least squares method finds b_0 and b_1 by minimising the sum of squared differences between
the observed Y_i and the predicted \hat{Y}_i:
\min \sum (Y_i - \hat{Y}_i)^2 = \min \sum (Y_i - (b_0 + b_1 X_i))^2
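A closed-form sketch of the least squares fit with hypothetical data (the b1 formula is the covariance-style sum divided by the sum of squared X deviations):

def fit_line(x, y):
    """Least squares: b1 = Σ(Xi - X̄)(Yi - Ȳ) / Σ(Xi - X̄)², b0 = Ȳ - b1·X̄."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
          / sum((xi - mx) ** 2 for xi in x))
    b0 = my - b1 * mx
    return b0, b1

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = fit_line(x, y)
print(b0, b1)   # Ŷ = 2.2 + 0.6·X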
Measures of Variation
Total variation is made up of two parts: SST = SSR + SSE (total sum of squares =
regression sum of squares + error sum of squares).
Coefficient of Determination, r2
The coefficient of determination is the portion of the total variation in the dependent variable that
is explained by variation in the independent variable.
The coefficient of determination is also called r-squared and is denoted r2.
The standard error of the estimate measures the variability of the observed Y values around the
regression line:
S_{YX} = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n - 2}} = \sqrt{\frac{SSE}{n - 2}}
Where SSE = error sum of squares and n = sample size.
The magnitude of SYX should always be judged relative to the size of the Y values in the sample data.
Assumptions of Regression
Use the acronym LINE: Linearity, Independence of errors, Normality of errors, Equal variance.
Residual Analysis
The residual for observation i, ei, is the difference between its observed and predicted value.
e i=Y i −Y^ i
Check the assumptions of regression by examining the residuals:
Examine the linearity assumption
Evaluate independence assumption
Evaluate normal distribution assumption
Examine for constant variance for all levels of X
Graphical Analysis of Residuals:
Can plot residuals vs X.
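Continuing the regression sketch above, the residuals and the residual-vs-X plot (assuming matplotlib is available) would be:

import matplotlib.pyplot as plt

# e_i = Y_i - Ŷ_i, using b0 and b1 from the fit_line() sketch above
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(residuals)              # should sum to ~0 and show no pattern in X

plt.scatter(x, residuals)     # look for curvature or fanning (LINE violations)
plt.axhline(0)
plt.xlabel("X")
plt.ylabel("residual")
plt.show()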
t test for the slope (with n − 2 degrees of freedom):
t = \frac{b_1 - \beta_1}{S_{b_1}}
Where:
b_1 = sample regression slope
\beta_1 = hypothesised slope
S_{b_1} = estimate of the standard error of the least squares slope
S_{YX} = \sqrt{\frac{SSE}{n - 2}} = standard error of the estimate
r^2_{adj} = 1 - \left[ (1 - r^2) \left( \frac{n - 1}{n - k - 1} \right) \right]
Where n = sample size, k = number of independent variables
Penalises excessive use of unimportant independent variables
Smaller than r2
Useful in comparing among models
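The adjusted r² formula as a one-line function, with hypothetical values showing the penalty:

def adjusted_r2(r2, n, k):
    """r²_adj = 1 - (1 - r²)(n - 1)/(n - k - 1)"""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical: r² = 0.80 from n = 30 observations and k = 4 predictors
print(adjusted_r2(0.80, 30, 4))   # ≈ 0.768, smaller than r²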
Assumptions:
The errors are normally distributed
Errors have a constant variance
The model errors are independent
Interaction Effects
Sometimes our model will predict that in addition to individual variables influencing our dependent
variable, some combination of these variables will differentially effect the dependent variable.
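A minimal sketch of building an interaction term (hypothetical columns): the combined effect enters the model as the product of the two variables.

x1 = [1, 2, 3, 4]                         # hypothetical predictor 1
x2 = [0, 1, 0, 1]                         # hypothetical predictor 2 (dummy variable)
x1_x2 = [a * b for a, b in zip(x1, x2)]   # interaction column X1·X2
# The model becomes Y = b0 + b1·X1 + b2·X2 + b3·(X1·X2) + error
print(x1_x2)   # [0, 2, 0, 4]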