
ECONS303

Applied Quantitative Research Methods


Lecture 1: Review of Statistics

Lecturer: Dr. Yong Soo Keong (Email: syong@waikato.ac.nz)


Tutor: Xu Shuo (Email: Shuo.Xu@msci.com); Wan Yuan (Email: 13636313396@139.com)
Introduction to Econometrics
Fourth Edition, Global Edition

Chapters 1, 2 and 3
The statistical analysis of economic (and related) data
Brief Overview of the Course
• Economics suggests important relationships, often with policy
implications, but virtually never suggests quantitative
magnitudes of causal effects.
– What is the quantitative effect of reducing class size on student
achievement?
– How does another year of education change earnings?
– What is the price elasticity of cigarettes?
– What is the effect on output growth of a 1 percentage point increase in
interest rates by the Fed?
This course is about using data to measure
causal effects.
• Ideally, we would like an experiment
– What would be an experiment to estimate the effect of class size on
standardized test scores?
• But almost always we only have observational
(nonexperimental) data.
– returns to education
– cigarette prices
– monetary policy
• Most of the course deals with difficulties arising from using
observational data to estimate causal effects
– confounding effects (omitted factors)
– simultaneous causality
– “correlation does not imply causation”
In this course you will:
• Learn methods for estimating causal effects using observational
data
• Learn methods for prediction – for which knowing causal effects
is not necessary – including forecasting using time series data;
• Focus on applications – theory is used only as needed to
understand the whys of the methods;
• Learn to evaluate the regression analysis of others – this means
you will be able to read/understand empirical economics papers.
• Get some hands-on experience with regression analysis in your
problem sets.
Review of Probability and Statistics
(SW Chapters 2, 3)
• Empirical problem: Class size and educational output
– Policy question: What is the effect on test scores (or some other outcome
measure) of reducing class size by one student per class? by 8
students/class? (class size measured by student-teacher ratio (STR))
– We must use data to find out (is there any way to answer this without
data?)
What comes next…
• The mechanics of estimation, hypothesis testing, and confidence
intervals should be familiar
• These concepts extend directly to regression and its variants.
• Before turning to regression, however, we will review some of
the underlying theory of estimation, hypothesis testing, and
confidence intervals.
Review of Statistical Theory
1. The probability framework for statistical inference
2. Estimation
3. Testing
4. Confidence Intervals

The probability framework for statistical inference


a) Population, random variable, and distribution
b) Moments of a distribution (mean, variance, standard deviation,
covariance, correlation)
c) Conditional distributions and conditional means
d) Distribution of a sample of data drawn randomly from a
population: Y1,…, Yn
(a) Population, random variable, and
distribution
Population
• The group or collection of all possible entities of interest (school
districts)
• We will think of populations as infinitely large (∞ is an
approximation to “very big”)

Random variable Y
• Numerical summary of a random outcome (e.g., district average
test score, district STR)
Population distribution of Y
• The probabilities of different values of Y that occur in the
population, for ex. Pr[Y = 650] (when Y is discrete)
• or: The probabilities of sets of these values, for ex. Pr[640 ≤ Y ≤
660] (when Y is continuous).
(b) Moments of a population distribution: mean,
variance, standard deviation, covariance, correlation
(1 of 3)

mean = expected value (expectation) of Y


= E(Y )
= μY
= long-run average value of Y over repeated realizations of Y
variance = E[(Y – μY)²] = σY²
= measure of the squared spread of the distribution
standard deviation = √variance = σY
(b) Moments of a population distribution: mean,
variance, standard deviation, covariance, correlation
(2 of 3)
E Y  Y  
 3

skewness   
3
Y
= measure of asymmetry of a distribution
• skewness = 0: distribution is symmetric
• skewness > (<) 0: distribution has long right (left) tail

E Y  Y  
 4

kurtosis   
4
Y
= measure of mass in tails
= measure of probability of large extreme values (outliers)
• kurtosis = 3: normal distribution
• skewness > 3: heavy tails (“leptokurtotic”)
(b) Moments of a population distribution: mean,
variance, standard deviation, covariance, correlation
(3 of 3)
2 random variables: joint distributions and
covariance (1 of 2)
• Random variables X and Z have a joint distribution
• The covariance between X and Z is
cov(X,Z) = E[(X – μX)(Z – μZ)] = σXZ
• The covariance is a measure of the linear association between X
and Z; its units are units of X × units of Z
• cov(X,Z) > 0 means a positive relation between X and Z
• If X and Z are independently distributed, then cov(X,Z) = 0
• The covariance of a RV with itself is its variance:

cov( X , X )  E[( X   X )( X   X )]  E[( X   X ) 2 ]   X2


The correlation coefficient is defined in
terms of the covariance:
cov( X , Z )  XZ
corr( X , Z )    rXZ
var( X ) var( Z )  X  Z

• –1 ≤ corr(X,Z) ≤ 1
• corr(X,Z) = 1 means perfect positive linear association
• corr(X,Z) = –1 means perfect negative linear association
• corr(X,Z) = 0 means no linear association
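To make the covariance and correlation formulas concrete, here is a minimal Python sketch (not part of the textbook material) that computes both for hypothetical simulated data on X and Z; the data-generating process is invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: Z is a noisy linear function of X, so 0 < corr(X, Z) < 1
x = rng.normal(size=1_000)
z = 2.0 * x + rng.normal(size=1_000)

# Covariance: average of (X - mean(X)) * (Z - mean(Z))
cov_xz = np.mean((x - x.mean()) * (z - z.mean()))

# Correlation: covariance divided by the product of standard deviations
corr_xz = cov_xz / (x.std() * z.std())

print(cov_xz, corr_xz)            # manual calculation
print(np.corrcoef(x, z)[0, 1])    # numpy's built-in correlation as a check
```

Because Z depends on X linearly but with noise, the correlation is positive and well below 1.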
2 random variables: joint distributions and
covariance (2 of 2)
The covariance between Test Score and STR is negative: districts with larger classes tend to have lower test scores.
(c) Conditional distributions and
conditional means (1 of 3)
Conditional distributions
• The distribution of Y, given value(s) of some other random variable, X
• Ex: the distribution of test scores, given that Student-Teacher Ratio (STR) < 20

Conditional expectations and conditional moments


• conditional mean = mean of conditional distribution
= E(Y | X = x) (important concept and notation)

• conditional variance = variance of conditional distribution


• Example: E(Test score|STR < 20) = the mean of test scores among districts with small
class sizes
(c) Conditional distributions and
conditional means (2 of 3)
The difference in means is the difference between the means of
two conditional distributions

Δ = E(Test score|STR < 20) – E(Test score|STR ≥ 20)

This measures the difference in test scores associated with differences in the
class size.

• If E(X | Z) = const, then corr(X,Z) = 0 (not necessarily vice versa however)

The conditional mean is a (possibly new) term for the familiar idea of the group mean.
(c) Conditional distributions and
conditional means (3 of 3)
The conditional mean plays a key role in prediction:
• Suppose you want to predict a value of Y, and you are given the
value of a random variable X that is related to Y. That is, you
want to predict Y given the value of X.
– For example, you want to predict someone’s income, given their years of
education.
• A common measure of the quality of a prediction m of Y is the
mean squared prediction error (MSPE), given X, E[(Y – m)²|X]
• Of all possible predictions m that depend on X, the conditional
mean E(Y|X) has the smallest MSPE (optional proof is in
Appendix 2.2).
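The claim that the conditional mean minimizes the MSPE can be checked by simulation. The sketch below uses hypothetical data where E(Y|X) = 3 + 2X by construction and compares the MSPE of the conditional mean against two alternative predictors; the numbers and functional form are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: Y = 3 + 2*X + noise, so the conditional mean is E(Y|X) = 3 + 2*X
n = 100_000
x = rng.normal(size=n)
y = 3 + 2 * x + rng.normal(size=n)

def mspe(pred):
    """Mean squared prediction error of a vector of predictions for Y."""
    return np.mean((y - pred) ** 2)

print(mspe(3 + 2 * x))              # conditional mean: MSPE near 1 (the noise variance)
print(mspe(np.full(n, y.mean())))   # unconditional mean of Y: MSPE near 5
print(mspe(3 + 1.5 * x))            # some other function of X: MSPE near 1.25, still worse
```

No predictor based on X does better than E(Y|X), which is exactly the point of the slide.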
(d) Distribution of a sample of data drawn
randomly from a population: Y1,…, Yn
We will assume simple random sampling
• Choose an individual (district, entity) at random from the
population.
• Modern research commonly uses the computer to generate random
numbers to choose entities (e.g., household units in survey) for
analysis.
Distribution of Y1,…, Yn under simple random sampling
Randomness and data
The data set is (Y1, Y2,…, Yn), where Yi = value of Y for the ith individual (such as
district, entity) sampled.
• Under simple random sampling, the value of Y1 provides no information
about Y2 => the conditional distribution of Y2 given Y1, is the same as the
marginal distribution of Y2.
– Hence, Y1 and Y2 are independently distributed

• Because Y1 and Y2 are randomly drawn from the same population (i.e., come
from the same distribution) this implies that Y1, Y2 are identically distributed
– That is, under simple random sampling, Y1 and Y2 are independently and
identically distributed (i.i.d.).
Hence, we said the data set {Yi}, i = 1,…, n, are i.i.d.
This framework allows rigorous statistical inferences about
moments of population distributions using a sample of data
from that population…
1. The probability framework for statistical inference
2. Estimation
3. Testing
4. Confidence Intervals
Estimation
Ῡ is the natural estimator of the mean. But:
a) What are the properties of Ῡ ?
b) Why should we use Ῡ rather than some other estimator?
– Y1 (the first observation)
– maybe unequal weights – not simple average
– median(Y1,…, Yn)
The starting point is the sampling distribution of Ῡ …
(a) The sampling distribution of Ῡ (1 of 3)

Ῡ is a random variable, and its properties are determined by the sampling distribution of Ῡ
– The individuals in the sample are drawn at random (e.g., households).
– Thus the values of (Y1, …, Yn) are random (e.g., Y - household income)
– Thus functions of (Y1, …, Yn), such as Ῡ , are random: had a different
sample been drawn, they would have taken on a different value
– The distribution of Ῡ over different possible samples of size n is called
the sampling distribution of Ῡ .
– The mean and variance of Ῡ are the mean and variance of its sampling
distribution, denoted E(Ῡ ) and var(Ῡ ).
– The concept of the sampling distribution underpins all of econometrics.
(a) The sampling distribution of Ῡ (2 of 3)

Example: Suppose Y takes on 0 or 1 (a Bernoulli random variable) with the probability distribution,
Pr[Y = 0] = .22, Pr(Y = 1) = .78
Then
E(Y) = μY = p × 1 + (1 – p) × 0 = p = .78
σY² = E[Y – E(Y)]² = p(1 – p) = .78 × (1 – .78) = 0.1716

The sampling distribution of Ῡ depends on n.

Consider n = 2. The sampling distribution of Ῡ is,
– Pr(Ῡ = 0) = .22² = .0484
– Pr(Ῡ = 1/2) = 2 × .22 × .78 = .3432
– Pr(Ῡ = 1) = .78² = .6084
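A quick way to see this sampling distribution is to simulate it. The following sketch (an illustration, not part of the lecture's formal material) draws many samples of size n = 2 from the Bernoulli(p = .78) population and tabulates the simulated frequencies of Ῡ, which should be close to the exact probabilities above.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n, reps = 0.78, 2, 100_000

# Draw many samples of size n = 2 and compute the sample mean of each
samples = rng.binomial(1, p, size=(reps, n))
ybar = samples.mean(axis=1)

# Simulated frequencies of Ybar = 0, 1/2, 1 (should be near .0484, .3432, .6084)
for value in (0.0, 0.5, 1.0):
    print(value, np.mean(ybar == value))
```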
(a) The sampling distribution of Ῡ (3 of 3)

The sampling distribution of Ῡ when Y is Bernoulli ( p = .78):


Things we want to know about the sampling
distribution:
• What is the mean of sample mean Ῡ ?
– If E(Ῡ ) = true population mean μY, then Ῡ is an unbiased estimator of μY

• What is the variance of sample mean Ῡ ?


– How does var(Ῡ ) depend on n?
– Does Ῡ become close to μY when n is large?
– Law of large numbers: Ῡ is a consistent estimator of μY
– Ῡ is approximately normally distributed for n large (Central Limit
Theorem)
Things we want to know about the sampling
distribution:
Desirable properties of an estimator such as the sample mean Ῡ as
an estimator of the population mean μY:
– Unbiasedness
– Consistency
The mean and variance of the sampling
distribution of Ῡ (3 of 3)
E(Ῡ) = μY
var(Ῡ) = σY²/n

Implications:
1. Ῡ is an unbiased estimator of μY (that is, E(Ῡ ) = μY)
2. var(Ῡ ) is inversely proportional to n
1. the spread of the sampling distribution is proportional to 1/√n
2. Thus the sampling uncertainty associated with Ῡ is proportional
to 1/√n (larger samples, less uncertainty, but square-root law); see the sketch below
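A short simulation, again using the Bernoulli(p = .78) population purely for illustration, shows the square-root law at work: quadrupling n roughly halves the standard deviation of Ῡ.

```python
import numpy as np

rng = np.random.default_rng(3)
p, reps = 0.78, 50_000

# Standard deviation of Ybar across repeated samples, for several sample sizes
for n in (25, 100, 400):
    ybars = rng.binomial(1, p, size=(reps, n)).mean(axis=1)
    print(n, ybars.std(), np.sqrt(p * (1 - p) / n))  # simulated vs. sigma_Y / sqrt(n)
```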
The sampling distribution of Ῡ when n is
large
For small sample sizes, the distribution of Ῡ is complicated, but
if n is large, the sampling distribution is simple!
1. As n increases, the distribution of Ῡ becomes more tightly centered
around μY (the Law of Large Numbers)
2. Moreover, the distribution of Ῡ becomes well approximated by a
normal distribution (the Central Limit Theorem)
The Law of Large Numbers:
An estimator is consistent if the probability that it falls within an
interval of the true population value tends to one as the sample
size increases.
If (Y1, …, Yn) are i.i.d. and σY² < ∞, then Ῡ is a consistent estimator
of μY, that is,
Pr[|Ῡ – μY| < ε] → 1 as n → ∞
which can be written, Ῡ →ᵖ μY
(“Ῡ →ᵖ μY” means “Ῡ converges in probability to μY”).
(the math: as n → ∞, var(Ῡ) = σY²/n → 0, which implies that
Pr[|Ῡ – μY| < ε] → 1.)
The Central Limit Theorem (CLT) (1 of 3)
If (Y1, …, Yn) are i.i.d. and 0 < σY² < ∞, then when n is large, the
distribution of Ῡ is well approximated by a normal distribution.
• Ῡ is approximately distributed N(μY, σY²/n) (“normal distribution with
mean μY and variance σY²/n”)
• √n(Ῡ – μY)/σY is approximately distributed N(0, 1) (standard normal)
• That is, the “standardized” Ῡ, ZῩ = (Ῡ – E(Ῡ))/√var(Ῡ) = (Ῡ – μY)/(σY/√n),
is approximately distributed as N(0, 1)
– The larger is n, the better is the approximation.
The Central Limit Theorem (CLT) (3 of 3)
Y  E (Y )
Sampling distribution of ZY = :
var(Y )
Summary: The Sampling Distribution of Ῡ
For Y1, …, Yn i.i.d. with 0 < σY² < ∞,
• The exact (finite sample) sampling distribution of Ῡ has mean μY
(“Ῡ is an unbiased estimator of μY”) and variance σY²/n
• Other than its mean and variance, the exact distribution of Ῡ is
complicated and depends on the distribution of Y (the population
distribution)
• When n is large, the sampling distribution simplifies:
– Ῡ →ᵖ μY (Law of large numbers)
– (Ῡ – E(Ῡ))/√var(Ῡ) is approximately N(0, 1) (CLT)
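The CLT can also be checked numerically. The sketch below standardizes Ῡ using the true mean and standard deviation of its sampling distribution for a Bernoulli population (the choice of p and n is arbitrary and purely illustrative) and verifies that the standardized mean behaves approximately like an N(0,1) variable.

```python
import numpy as np

rng = np.random.default_rng(4)
p, n, reps = 0.78, 200, 100_000

# Sampling distribution of Ybar for a Bernoulli(p) population, then standardized
ybars = rng.binomial(1, p, size=(reps, n)).mean(axis=1)
z = (ybars - p) / np.sqrt(p * (1 - p) / n)

# If the normal approximation is good, these match the N(0,1) benchmarks
print(np.mean(np.abs(z) <= 1.96))   # close to 0.95
print(z.mean(), z.std())            # close to 0 and 1
```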
1. The probability framework for statistical inference
2. Estimation
3. Hypothesis Testing
4. Confidence intervals

Hypothesis Testing
The hypothesis testing problem (for the mean): make a
provisional decision based on the evidence at hand whether a null
hypothesis is true, or instead that some alternative hypothesis is
true. That is, test
– H0: E(Y ) = μY,0 vs. H1: E(Y ) > μY,0 (1-sided, >)
– H0: E(Y ) = μY,0 vs. H1: E(Y ) < μY,0 (1-sided, <)
– H0: E(Y ) = μY,0 vs. H1: E(Y ) ≠ μY,0 (2-sided)
Some terminology for testing statistical
hypotheses (1 of 2)
p-value = probability of drawing a statistic (e.g.Ῡ ) at least as
adverse to the null as the value actually computed with your data,
assuming that the null hypothesis is true.
The significance level of a test is a pre-specified probability of
incorrectly rejecting the null, when the null is true.
Note
A p-value is a measure of the probability that an observed difference (or simply
an observed value that differs from the hypothesized value) could have occurred
just by random chance.

As a corollary, significance in statistics implies that some observed difference in
values is unlikely to have occurred by random chance. It does NOT imply importance in
the everyday sense of the word.
Calculating the p-value with σY known:

• For large n, p-value = the probability that a N(0,1) random
variable falls outside |(Ῡact – μY,0)/σῩ|
• In practice, σῩ is unknown – it must be estimated
Some terminology for testing statistical
hypotheses (2 of 2)
• To compute the p-value, you need to know the sampling
distribution of Ῡ, which is complicated if n is small.
• If n is large, you can use the normal approximation (CLT):
Y  Y ,0 Y act  Y ,0
p -value  PrH 0 [| || |]
Y Y
 Y act  Y ,0 
 2 -| |
 Y 
where  is the standard normal cumulative distribution function and Ῡ act is
the value of Ῡ actually observed (nonrandom)

The p-value is the area in the two tails of the standard normal distribution outside
±|(Ῡact – μY,0)/σῩ|.
If the population std. deviation σY is unknown, then it must be
estimated using the sample variance of Y
Estimator of the variance of Y:
sY² = (1/(n – 1)) Σᵢ₌₁ⁿ (Yᵢ – Ῡ)² = “sample variance of Y”

Fact:
If (Y1,…,Yn) are i.i.d. and the fourth moment E(Y⁴) < ∞ (finite kurtosis),
then sY² →ᵖ σY²

• For proof, see Appendix 3.3

• Technical note: we assume E(Y⁴) < ∞ because here the average is
not of Yi, but of its squared deviation; see App. 3.3
Computing the p-value with σY² estimated by sY²:

p-value = PrH0[ |Ῡ – μY,0| > |Ῡact – μY,0| ]
= PrH0[ |(Ῡ – μY,0)/(σY/√n)| > |(Ῡact – μY,0)/(σY/√n)| ]   (using σY)
≅ PrH0[ |(Ῡ – μY,0)/(sY/√n)| > |(Ῡact – μY,0)/(sY/√n)| ]   (using sY, for large n)

so

p-value ≅ PrH0[ |t| > |tact| ]   (σY² estimated by sY²)
≅ probability under the tails of a standard normal distribution outside |tact|

where t = (Ῡ – μY,0)/(sY/√n)   (the usual t-statistic)
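Putting the pieces together, the sketch below implements this large-n calculation: it computes the t-statistic with σY replaced by sY and obtains the two-sided p-value from the standard normal CDF (using SciPy). The illustrative data are simulated, and the function name t_and_pvalue is just a placeholder.

```python
import numpy as np
from scipy import stats

def t_and_pvalue(y, mu_0):
    """Large-n t-statistic and two-sided p-value for H0: E(Y) = mu_0."""
    n = len(y)
    ybar = np.mean(y)
    s_y = np.std(y, ddof=1)                # sample std dev, dividing by n - 1
    t = (ybar - mu_0) / (s_y / np.sqrt(n))
    p_value = 2 * stats.norm.cdf(-abs(t))  # area in both tails of N(0,1)
    return t, p_value

# Illustrative data: 100 draws from a population with mean 5
rng = np.random.default_rng(5)
y = rng.normal(loc=5.0, scale=2.0, size=100)
print(t_and_pvalue(y, mu_0=5.0))   # large p-value: do not reject H0
print(t_and_pvalue(y, mu_0=4.0))   # small p-value: reject H0 at the 5% level
```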
What is the link between the p-value and the
significance level?
• The significance level is pre-specified. For example, if the pre-
specified significance level is 5%,

– you reject the null hypothesis if |t| ≥ 1.96.


– Equivalently, you reject if p ≤ 0.05.
– The p-value is sometimes called the marginal significance level.
– Often, it is better to communicate the p-value than simply whether a test
rejects or not – the p-value contains more information than the “yes/no”
statement about whether the test rejects.
t-table and the degrees of freedom?

Digression: the Student t distribution

If Yi , i  1, , n is i.i.d. N ( Y ,  Y2 ), then the t -statistic has the


Student t -distribution with n  1 degrees of freedom.
The critical values of the Student t-distribution is tabulated in the
back of all statistics books. Remember the recipe?
1. Compute the t-statistic
2. Compute the degrees of freedom, which is n – 1
3. Look up the 5% critical value
4. If the t-statistic exceeds (in absolute value) this critical value, reject the
null hypothesis.
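If you prefer software to a t-table, the same critical values can be looked up directly, for example with SciPy as in the sketch below (the sample size shown is arbitrary).

```python
from scipy import stats

# 5% two-sided critical value of the Student t-distribution with n - 1 df
n = 11
critical_value = stats.t.ppf(0.975, df=n - 1)   # 97.5th percentile => 2.5% in each tail
print(critical_value)                            # about 2.23 for 10 degrees of freedom

# The decision rule: reject H0 if |t| exceeds the critical value
t_stat = 2.5
print(abs(t_stat) > critical_value)
```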
Comments on this recipe and the Student
t-distribution (1 of 5)
1. The theory of the t-distribution was one of the early triumphs
of mathematical statistics. It is astounding, really: if Y is i.i.d.
normal, then you can know the exact, finite-sample distribution
of the t-statistic – it is the Student t.
So, you can construct confidence intervals (using the Student t
critical value) that have exactly the right coverage rate – right in
the sense that the interval contains the true value with the stated
probability – no matter what the sample size.
This result was really useful in times when “computer” was a job
title, data collection was expensive, and the number of
observations was perhaps a dozen (small sample size).
Comments on this recipe and the Student
t-distribution (2 of 5)
2. If the sample size is moderate (several dozen) or large
(hundreds or more), the difference between the t-distribution
and N(0,1) critical values is negligible.
Here are some 5% critical values for 2-sided tests:
degrees of freedom (n – 1)    5% t-distribution critical value
10                            2.23
20                            2.09
30                            2.04
60                            2.00
∞                             1.96
Comments on this recipe and the Student
t-distribution (3 of 5)
3. So, the Student-t distribution is only relevant when the sample size is very
small and the population distribution of Y is normal. In economic data,
normal distributions are the exception.
Comments on this recipe and the Student
t-distribution (3 of 5)
However, even if the data are not normally distributed, the normal
approximation of the t-statistic is valid if the sample size is large.
Hence, inferences – hypothesis testing and confidence intervals –
about the mean of a distribution should be based on the large-
sample normal approximation.
Comments on this recipe and the Student
t-distribution (5 of 5)
4. You might not know this. Consider the t-statistic testing the
hypothesis that two means (groups s, l) are equal:

Ys  Yl Ys  Yl
t 
ss2
 sl2 SE (Ys  Yl )
ns nl

Even if the population distribution of Y in the two groups is normal,
this statistic doesn’t have a Student t distribution!
There is a statistic testing this hypothesis that does have a Student t
distribution, the “pooled variance” t-statistic – see SW (Section
3.6) – however the pooled variance t-statistic is only valid if the
variances of the normal distributions are the same in the two
groups.
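For reference, here is a small sketch of the two-group t-statistic written exactly as in the formula above, with the p-value taken from the large-n normal approximation discussed earlier. The two groups of “test scores” are simulated, and all the numbers are made up for illustration.

```python
import numpy as np
from scipy import stats

def two_group_t(y_s, y_l):
    """t-statistic for H0: E(Y_s) = E(Y_l), using the unpooled SE from the slide."""
    se = np.sqrt(np.var(y_s, ddof=1) / len(y_s) + np.var(y_l, ddof=1) / len(y_l))
    t = (np.mean(y_s) - np.mean(y_l)) / se
    p_value = 2 * stats.norm.cdf(-abs(t))   # large-n normal approximation
    return t, p_value

# Illustrative (made-up) data: scores in small-STR vs. large-STR districts
rng = np.random.default_rng(6)
scores_small = rng.normal(657, 19, size=238)
scores_large = rng.normal(650, 18, size=182)
print(two_group_t(scores_small, scores_large))
```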
The Student-t distribution – Summary
• The assumption that Y is distributed N(μY, σY²) is rarely plausible
in practice (Income? Number of children?)
• For n > 30, the t-distribution and N(0,1) are very close (as n grows large,
the tn–1 distribution converges to N(0,1)).
• The t-distribution was particularly useful in the days when sample sizes were
small and “computers” were people.
• For historical reasons, statistical software typically uses the t-distribution
to compute p-values – but this is irrelevant when the sample size is
moderate or large.
• For these reasons, in this class we will focus on the large-n approximation
given by the CLT.
1. The probability framework for statistical inference
2. Estimation
3. Testing
4. Confidence intervals

Confidence Intervals (1 of 2)
• A 95% confidence interval for μY is an interval or range that contains
the true value of μY in 95% of repeated samples.
• Digression: What is random here? The values of Y1,...,Yn and thus any
functions of them – including the confidence interval. The confidence
interval will differ from one sample to the next. The population
parameter, μY, is not random (we just don’t know it and it needs to be
estimated)
Confidence Intervals (2 of 2)
A 95% confidence interval can always be constructed as the
set of values of μY not rejected by a hypothesis test with a 5%
significance level.

Y  Y Y  Y
{Y :  1.96}  {Y : 1.96   1.96}
sY / n sY / n
sY s
 {Y : 1.96  Y  Y  1.96 Y }
n n
sY sY
 {Y  (Y  1.96 , Y  1.96 )}
n n

This confidence interval relies on the large-n results that Y is


p
approximately normally distributed and s   Y2 .2
Y
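The construction above translates directly into a few lines of code. The sketch below computes the large-n 95% confidence interval and then checks its coverage by simulation, using an arbitrary non-normal population with a known mean of 5.0 (all choices are illustrative assumptions).

```python
import numpy as np

def ci_95(y):
    """Large-n 95% confidence interval for the population mean of Y."""
    ybar = np.mean(y)
    se = np.std(y, ddof=1) / np.sqrt(len(y))
    return ybar - 1.96 * se, ybar + 1.96 * se

# Coverage check: the interval should contain the true mean (here 5.0)
# in roughly 95% of repeated samples, even though the population is not normal.
rng = np.random.default_rng(7)
covered = 0
for _ in range(10_000):
    y = rng.exponential(scale=5.0, size=200)   # true mean = 5.0
    lo, hi = ci_95(y)
    covered += (lo <= 5.0 <= hi)
print(covered / 10_000)   # close to 0.95
```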
Summary:
From the two assumptions of:
1. simple random sampling of a population, that is, {Yi, i = 1,…,n} are i.i.d.
2. 0 < E(Y⁴) < ∞ (finite kurtosis => no extreme outliers)

we developed, for large samples (large n):


– Theory of estimation (sampling distribution of Ῡ )
– Theory of hypothesis testing (large-n distribution of t-statistic and
computation of the p-value)
– Theory of confidence intervals

Are assumptions (1) & (2) plausible in practice? Yes


APPENDIX
The mean and variance of the sampling distribution
of Ῡ (1 of 3)
• General case – that is, for Yi i.i.d. from any distribution, not just
Bernoulli:
• mean: E(Ῡ) = E((1/n) Σᵢ₌₁ⁿ Yᵢ) = (1/n) Σᵢ₌₁ⁿ E(Yᵢ) = (1/n) Σᵢ₌₁ⁿ μY = μY
• variance: var(Ῡ) = E[Ῡ – E(Ῡ)]²
= E[Ῡ – μY]²
= E[((1/n) Σᵢ₌₁ⁿ Yᵢ) – μY]²
= E[(1/n) Σᵢ₌₁ⁿ (Yᵢ – μY)]²
The mean and variance of the sampling
distribution of Ῡ (2 of 3)
so var(Ῡ) = E[(1/n) Σᵢ₌₁ⁿ (Yᵢ – μY)]²
= E[((1/n) Σᵢ₌₁ⁿ (Yᵢ – μY)) × ((1/n) Σⱼ₌₁ⁿ (Yⱼ – μY))]
= (1/n²) Σᵢ₌₁ⁿ Σⱼ₌₁ⁿ E[(Yᵢ – μY)(Yⱼ – μY)]
= (1/n²) Σᵢ₌₁ⁿ Σⱼ₌₁ⁿ cov(Yᵢ, Yⱼ)
= (1/n²) Σᵢ₌₁ⁿ σY²
= σY²/n
(For i ≠ j, cov(Yᵢ, Yⱼ) = 0 because Yᵢ and Yⱼ are independently distributed under
simple random sampling, so only the n variance terms remain.)
