Unit 8. Data Analysis


Data analysis

Feyera Senbeta (PhD)


The Meaning of Statistics
Several Meanings
 Collections of numerical data
 Summary measures calculated from a collection of data
 Activity of using and interpreting a collection of numerical data
A Meaningful Statistic (Significant)?
 Statistics, descriptive or inferential, are NOT a
substitute for good judgment
 Decide what level or value of a statistic is
meaningful
 State judgment before gathering and analyzing
data
 Examples:
 Score on performance test of 80% is passing
 Pre/post rules instruction reduces incidents by
50%
Interpretation of Meaning
 Population Measure (statistic)
 There is no sampling error
 The number you have is “real”
 Judge against pre-set standard

 Inferential Measure (statistic)
 Tells you how sure (confident) you can be that the number you have is real
 Judge against pre-set standard and state how certain the measure is
Statistics

Descriptive Statistics
 Gives numerical and graphic procedures to summarize a collection of data in a clear and understandable way

Inferential Statistics
 Provides procedures to draw inferences about a population from a sample
Descriptive and Inferential
Statistics
 Descriptive statistics: Mathematical methods (such as
mean, median, standard deviation) that summarize and
interpret some of the properties of a set of data (sample)
but do not infer the properties of the population from
which the sample was drawn.

 Inferential statistics: Mathematical methods (such as hypothesis
testing) that employ probability theory for deducing
(inferring) the properties of a population from the
analysis of the properties of a set of data (sample) drawn
from it.

Did it happen by chance?
 How do you know if something caused or
correlates with something else?
 The appropriate Statistic will tell you:
 If there is a difference from some expected value
 If the difference is statistically significant or merely due to random chance

1/24/2013 7
Descriptive Statistics

Summarize or describe the important characteristics of a
known set of population data

Descriptive Statistics

Design                        Descriptive Statistics
Survey studies                Percentages, measures of central tendency and variation
Causal comparative studies    Measures of central tendency & variation, percentages, standard scores
Experimental                  Measures of central tendency & variation, percentages, standard scores, effect sizes
Types of descriptive statistics
 Statistic is a quantitative index that describes performance of a sample or samples
 Parameter is a quantitative index describing the performance of a population
 Measures of central tendency are used to determine the typical or average value among a group of values
 Measures of variability indicate how spread out the values are

Descriptive Statistics (Vocabulary)
 Central tendency
 Mode
 Median
 Mean
 Variation
 Range
 Standard deviation
 Normal distribution
 Standard score
 Correlation
 Regression
Descriptive Measures
 Central Tendency measures. They are computed to give a “center” around which the measurements in the data are distributed.
 Variation or Variability measures. They describe “data spread” or how far away the measurements are from the center.
 Relative Standing measures. They describe the relative position of specific measurements in the data.
Measures of Central Tendency

 Mean:
Sum of all measurements divided by the number
of measurements.

 Median:
A number such that at most half of the
measurements are below it and at most half of the
measurements are above it.

 Mode:
The most frequent measurement in the data.
Example of Mean

Measurements (x)    Deviation (x - mean)
3                   -1
5                    1
5                    1
1                   -3
7                    3
2                   -2
6                    2
7                    3
0                   -4
4                    0
Sum: 40             Sum: 0

 MEAN = 40/10 = 4
 Notice that the sum of the “deviations” is 0.
 Notice that every single observation enters into the computation of the mean.
Example of Median

Measurements (x)    Measurements Ranked
3                   0
5                   1
5                   2
1                   3
7                   4
2                   5
6                   5
7                   6
0                   7
4                   7
Sum: 40             Sum: 40

 Median: (4+5)/2 = 4.5
 Notice that only the two central values are used in the computation.
 The median is not sensitive to extreme values.
Example of Mode

Measurements (x): 3, 5, 5, 1, 7, 2, 6, 7, 0, 4

 In this case the data have two modes: 5 and 7
 Both measurements occur twice
Example of Mode

Measurements (x): 3, 5, 1, 1, 4, 7, 3, 8, 3

 Mode: 3
 Notice that it is possible for a data set not to have any mode.
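
The mean, median, and mode examples can be checked with a few lines of Python. This is a minimal sketch that uses only the standard `statistics` module. The variable names are illustrative, and the second data set is entered exactly as the values appear on the slide.

```python
from statistics import mean, median, multimode

# Measurements from the mean, median, and first mode examples.
x = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]

x_mean = mean(x)                       # sum of values / number of values
deviations = [v - x_mean for v in x]   # distance of each value from the mean
x_median = median(x)                   # average of the two central ranked values
x_modes = multimode(x)                 # every value tied for most frequent

# Second mode example (values as listed on the slide).
y = [3, 5, 1, 1, 4, 7, 3, 8, 3]
y_modes = multimode(y)

print("mean:", x_mean)
print("sum of deviations:", sum(deviations))
print("median:", x_median)
print("modes of x:", x_modes)
print("modes of y:", y_modes)
```

`multimode` returns a list, so it handles the two-mode case and the single-mode case with the same call.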
Graphing data
 Provides a quick view of what your data are telling you.
 There are various types of graphs used in statistics, including bar graphs, histograms, scatter plots, pie charts, frequency polygons, etc.

Example group of test scores

Frequency Polygon and Pie Chart

Common mistakes
 Presenting the same dataset as both a graph and a table
 Presenting the same dataset as both a frequency histogram and a percentage histogram
 Choosing which graph to use: histogram, pie chart, or line graph
 The importance of making a graph
Sample Bar Graph

Sample Histogram

Sample Scatter Plot

Frequency Distributions
Frequency distributions are like frequency
polygons; however, instead of straight lines,
a frequency distribution uses a smooth
curve to connect the points and, similar to a
graph, is plotted on two axes.

J Shaped Curve

Bimodal Curve with Two Peaks

Positively Skewed Bell Curve

Negatively Skewed Bell Curve

Symmetric Bell Curve/Normal Distribution

What is the Normal Distribution?
• Where did it come from and why is it so special?
• Many quantities you measure turn out to be normally distributed, at least approximately.
• That is, most of the observations usually cluster around the mean, with progressively fewer observations out towards the extremes.

Sample Histogram

Just about any histogram can be converted into a line graph

Which can be used to plot a normal distribution

But how do we get from the normal to the standard normal?

Measures of variability
 Range – Difference between the highest and lowest values (high value - low value = range)
 Variance: $S^2$
 Standard Deviation: $S$
 Variation of values about the mean

Measures of variation – range
 Range = highest value - lowest value

Bank waiting time values:

With values of 4, 7, 7, the range is 7 - 4 = 3

With values of 1, 3, 14, the range is 14 - 1 = 13

Other key measures of variation
 $S^2$ = Variance
 $S$ = Standard Deviation

Measures of variation –
standard deviation

x
6
6
6
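
The measures of variation can be compared on the three small data sets from these slides. This is a rough sketch, assuming $S^2$ means the sample variance with an n - 1 denominator. The slides do not state the denominator, so treat that choice as an assumption.

```python
from statistics import mean, stdev, variance

# Waiting-time values from the range example, plus the constant sample (6, 6, 6).
samples = {
    "4, 7, 7": [4, 7, 7],
    "1, 3, 14": [1, 3, 14],
    "6, 6, 6": [6, 6, 6],
}

def spread(values):
    """Return (range, sample variance, sample standard deviation)."""
    value_range = max(values) - min(values)
    s2 = variance(values)  # sum of squared deviations / (n - 1): an assumption
    s = stdev(values)      # square root of the sample variance
    return value_range, s2, s

results = {label: spread(values) for label, values in samples.items()}
for label, (r, s2, s) in results.items():
    print(f"{label}: mean={mean(samples[label])}, range={r}, S^2={s2}, S={s:.3f}")
```

All three data sets share the same mean, so the differences in range, variance, and standard deviation come from spread alone.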

The Z statistic will allow you to
standardize a normal
distribution
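
As a hedged illustration of standardizing, the sketch below assumes the usual definition z = (x - mean) / standard deviation, which the slide does not spell out. It also treats the data as a whole population, so it uses the population standard deviation. The function name `z_scores` is hypothetical.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize values: subtract the mean, then divide by the standard deviation."""
    m = mean(values)
    sd = pstdev(values)  # population standard deviation (assumption)
    return [(v - m) / sd for v in values]

x = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]
z = z_scores(x)
print([round(v, 2) for v in z])
```

After standardizing, the values have mean 0 and standard deviation 1, which is what makes different data sets comparable on one scale.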

Inferential Statistics
 To generalize or predict how a large
group will behave based upon
information taken from a part of the
group is called INFERENCE
 Techniques which tell us how much
confidence we can have when we
GENERALIZE from a sample to a
population
Inferential Statistics (Vocabulary)
 Hypothesis
 Null hypothesis
 Alternative hypothesis

 ANOVA
 Level of significance
 Type I error
 Type II error
Collecting a random sample
 Goal: to understand characteristics about a population

 Examples:
 What’s the average household income of 09 Kebele residents?
 What proportion of people living in Dire Dawa Town have had malaria?
Estimating the mean
 One of the most common goals of statistical
inference is estimating a population mean
with a sample mean
Central Limit Theorem
 When we have n independent, identically distributed random variables ($X_1, \dots, X_n$), the mean of those random variables approaches a normal distribution with mean = $\mu$ and variance = $\sigma^2 / n$, as n gets large.
 Independence of random variables means that the value of one observation has no effect on the value of another observation.
 Identical distribution of random variables means that each random variable comes from the same population (e.g., roll of a die, coin flip).
Simple random sampling
 Each observation drawn does not depend on others
drawn
 Thus observations are independent

 Each observation (i.e., each random variable) is identically distributed
 The population has a distribution that doesn’t change (each observation is randomly drawn from an identical distribution – the distribution of the population).
 So the Central Limit Theorem applies! (when n is large)
What does this mean?
Suppose we take a sample of n = 50 observations from a population that has this distribution:

[Figure: frequency distribution of the population, horizontal axis from 0 to 30]

Mean ($\mu$) = 20
Variance ($\sigma^2$) = 100
Std. dev ($\sigma$) = 10

We then find the mean of this sample (suppose this mean = 19). Take
another sample of 50 observations and find the mean (suppose it’s 24).
Do this many times, and we’ll come up with a distribution of means. The
Central Limit Theorem tells us this distribution will always look like the
next slide (as long as n is “large”, and 50 is large enough):
The normal curve

[Figure: normal curve of the sample mean $\bar{x}$, horizontal axis from 16 to 24]

Mean ($\mu$) = 20, sample size (n) = 50, variance of sample mean = $\sigma^2 / n = 100 / 50 = 2$
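
The repeated-sampling idea can be simulated. This is a sketch under stated assumptions: the population shape on the slide is not recoverable, so a uniform population with mean 20 and standard deviation 10 is substituted, and the seed and number of repetitions are arbitrary choices.

```python
import random
from statistics import mean, variance

random.seed(0)  # fixed seed so the sketch is repeatable

MU, SIGMA, N, REPS = 20.0, 10.0, 50, 5000

# Assumption: a uniform population chosen to have mean 20 and std. dev. 10
# (a uniform on [a, b] has std. dev. (b - a) / sqrt(12)).
half_width = SIGMA * 12 ** 0.5 / 2
a, b = MU - half_width, MU + half_width

def sample_mean(n):
    """Draw n independent observations and return their mean."""
    return sum(random.uniform(a, b) for _ in range(n)) / n

# Take many samples of size N and keep the mean of each one.
means = [sample_mean(N) for _ in range(REPS)]

print(f"mean of sample means:     {mean(means):.2f}  (theory: {MU})")
print(f"variance of sample means: {variance(means):.2f}  (theory: {SIGMA**2 / N})")
```

Even though the assumed population is flat rather than bell-shaped, the sample means cluster around 20 with a variance close to 2.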
Symbols
 Population Parameter: $\mu$
 Estimate: $\bar{x}$
 Expected: E
Basic Types of Inference
 Point Inference
 The value of a population parameter µ is estimated using a single value $\bar{x}$
 Examples: mean, standard deviation, etc.
 Interval Inference
 Attaching a probability to an estimate (i.e., making a confidence interval)
 Example: we are 95% confident that µ is between 10 and 20


Judging the Quality of the
Estimator
 Bias – the difference between $E(\hat{\Theta})$ and $\Theta$ (i.e., Bias = $E(\hat{\Theta}) - \Theta$)
 Bias may be positive or negative (e.g., a positively biased estimator would indicate the population parameter is higher than it actually is)
 Efficiency – how clustered the distribution of $\hat{\Theta}$ is (i.e., how “peaked” is its distribution)
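
Bias can be made concrete with a small simulation. The example below is an illustration chosen here, not taken from the slides: it compares two variance estimators on samples from a standard normal population, one dividing by n and one dividing by n - 1. The expectation $E(\hat{\Theta})$ is approximated by averaging over many samples.

```python
import random
from statistics import pvariance, variance

random.seed(1)  # fixed seed so the sketch is repeatable

TRUE_VARIANCE = 1.0   # variance of the standard normal population used here
N, REPS = 5, 10000    # small samples make the bias easy to see

divide_by_n = []          # estimates from the estimator that divides by n
divide_by_n_minus_1 = []  # estimates from the estimator that divides by n - 1
for _ in range(REPS):
    sample = [random.gauss(0.0, 1.0) for _ in range(N)]
    divide_by_n.append(pvariance(sample))
    divide_by_n_minus_1.append(variance(sample))

# Bias = E(estimator) - true parameter, with E(.) approximated by the average.
bias_n = sum(divide_by_n) / REPS - TRUE_VARIANCE
bias_n_minus_1 = sum(divide_by_n_minus_1) / REPS - TRUE_VARIANCE

print(f"bias when dividing by n:     {bias_n:+.3f}")
print(f"bias when dividing by n - 1: {bias_n_minus_1:+.3f}")
```

In this setup the first estimator is negatively biased (it tends to understate the population variance), while the second has bias close to zero.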
Point Estimates (inferring population
parameters from samples)
 Population Mean: $\hat{\mu} = \bar{x}$
 Population Proportion: $\hat{\pi} = p = x/n$
 Population Variance: $\hat{\sigma}^2 = s^2$
 Population Standard Deviation: $\hat{\sigma} = s$


Confidence Intervals
 The degree of confidence we have in our estimates, defined by a percentage
 Common examples: 90, 95, or 99% confident
 The confidence level is defined with the α symbol
 In confidence intervals, alpha (α) is the proportion of the time your confidence interval is wrong
 The typical usage is: $z_{\alpha/2}$
 Why do we divide by 2?
Confidence Interval Example
 What is the 95% confidence interval for a normally distributed variable?
 α = 1 - desired confidence level
 α = 1 - 0.95 = 0.05
 Remember that we divide α by 2 since we have uncertainty both above and below the mean (i.e., 2 tails)
 Therefore we use $z_{0.025}$ for the 95% confidence interval
 From the z-table we find that $z_{0.025} = 1.96$
 What does this mean?
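
The z-table lookup can be reproduced in code. This is a minimal sketch using `NormalDist.inv_cdf` from Python's standard library. The helper name `z_critical` is hypothetical.

```python
from statistics import NormalDist

def z_critical(confidence):
    """Return z_(alpha/2) for a two-tailed interval at the given confidence level."""
    alpha = 1 - confidence
    # inv_cdf returns the z value with the requested area to its left,
    # so 1 - alpha/2 leaves alpha/2 in the upper tail.
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_critical(level):.3f}")
```

Splitting α across the two tails is why the lookup uses 1 - α/2 rather than 1 - α.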


Interval Estimation (making confidence
intervals for population parameters estimated
from samples)
 Case #1: estimating an interval for µ when X is normally distributed and we know σ
 This is the simplest case because normality allows us to use the z-table
 This is also unlikely since it requires knowing the distribution and the σ (which implies knowing µ already)
Example #1: Create a confidence
interval for µ
 A town is considering building a new bridge over a river. The primary goal is to reduce workers’ commute times from a particular community. A random sample of workers in that community is asked to estimate their reduction in commute time if the bridge were built.
 Our goal is to estimate the mean reduction in commute time for the whole community if the bridge were built. Create a 95% confidence interval for this mean.
Example #1 Data
 n = 100 workers are sampled
 $\bar{x}$ = 17 minutes
 σ = 30 minutes
 What is the 95% confidence interval for
the mean?
Constructing a confidence interval
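 Construct a 95% confidence interval around the sample mean

$$P\left(\bar{X} - 1.96\,\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + 1.96\,\frac{\sigma}{\sqrt{n}}\right) = 0.95$$

$$P\left(17 - 1.96\,\frac{30}{\sqrt{100}} \le \mu \le 17 + 1.96\,\frac{30}{\sqrt{100}}\right) = 0.95$$

$$P(17 - 1.96 \times 3 \le \mu \le 17 + 1.96 \times 3) = 0.95$$

$$P(17 - 5.88 \le \mu \le 17 + 5.88) = 0.95$$

 So we can say that the 95% C.I. is 17 ± 5.88, or (11.12, 22.88)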
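
The same interval can be computed programmatically. This is a sketch for the known-σ case. The function name `z_interval` and its parameters are illustrative, and the critical value comes from `NormalDist` rather than a printed z-table.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(x_bar, sigma, n, confidence=0.95):
    """Confidence interval for the mean when sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # z_(alpha/2)
    margin = z * sigma / sqrt(n)                        # half-width of the interval
    return x_bar - margin, x_bar + margin

# Example #1: n = 100 workers, sample mean 17 minutes, sigma 30 minutes.
low, high = z_interval(x_bar=17, sigma=30, n=100)
print(f"95% C.I.: ({low:.2f}, {high:.2f})")
```

Calling the function with a different `confidence` or `n` shows how the width of the interval responds to each.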
Example #1 Questions
 What would happen to our interval if we used a 99% confidence interval instead?
 What would happen to our confidence interval if we sampled 200 people instead of 100 people?
Interval Estimation (making confidence
intervals for population parameters estimated
from samples)
 Case #2: estimating an interval for µ when X is not normally distributed and we know σ
 In this case the n matters a lot, why?
 This is also unlikely since it requires knowing the distribution and the σ (which implies knowing µ already)
Interval Estimation (making confidence
intervals for population parameters estimated
from samples)
 Case #3: estimating an interval for µ when σ and the distribution are unknown
 What should we use instead of σ?
 Can we use the z-table in this case?
 This case is what we see most commonly


t-distribution vs. z-distribution
 When we only have s (and not σ) we use the t-distribution rather than the z-distribution
 To do so we use the t-table
 How are they different?
 The t-distribution changes depending on the degrees of freedom (n-1)
 This is reflected in the table and in the symbol $t_{\alpha/2,\,n-1}$
 The t-distribution accounts for more uncertainty (i.e., wider confidence intervals) since s is just an estimate for σ
t-distribution vs. z-distribution
 As n approaches infinity t and z become equal

 This means that even when we have s instead of σ we can use the z-distribution if n is large
 Central Limit Theorem: “…as n gets large.”
 What is “large”?
 Rule of thumb: 30
 For n less than 30, the distribution of $\bar{x}$ does not follow the normal distribution accurately enough.
 But the distribution of $\bar{x}$ does closely follow a t-distribution for sample sizes of less than 30.
 For this class use the t-distribution any time you have s instead of σ
Example #2

 n = 16
 $\bar{x}$ = 30
 $s^2$ = 1600
 What is the 95% C.I. for the mean?
Example #2
 s = 40
 Degrees of freedom = n – 1 = 15
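 $t_{\alpha/2,\,n-1} = t_{0.05/2,\,16-1} = t_{0.025,\,15} = 2.131$ (from the t-table)

$$P\left(\bar{X} - 2.131\,\frac{s}{\sqrt{n}} \le \mu \le \bar{X} + 2.131\,\frac{s}{\sqrt{n}}\right) = 0.95$$

$$P\left(30 - 2.131\,\frac{40}{\sqrt{16}} \le \mu \le 30 + 2.131\,\frac{40}{\sqrt{16}}\right) = 0.95$$

$$P(30 - 2.131 \times 10 \le \mu \le 30 + 2.131 \times 10) = 0.95$$

$$P(30 - 21.31 \le \mu \le 30 + 21.31) = 0.95$$

 The 95% confidence interval for the mean is (8.69, 51.31)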
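
The arithmetic in Example #2 can be verified with a short sketch. Python's standard library has no t-distribution quantile function, so the critical value 2.131 is passed in as read from the t-table. The function name `t_interval` is hypothetical.

```python
from math import sqrt

def t_interval(x_bar, s2, n, t_crit):
    """Confidence interval for the mean when only the sample variance s2 is known."""
    s = sqrt(s2)                   # sample standard deviation
    margin = t_crit * s / sqrt(n)  # half-width of the interval
    return x_bar - margin, x_bar + margin

# t_crit = t_(0.025, 15) = 2.131 comes from the t-table, as in the example.
low, high = t_interval(x_bar=30, s2=1600, n=16, t_crit=2.131)
print(f"95% C.I.: ({low:.2f}, {high:.2f})")
```

Using the z value 1.96 in place of 2.131 would give a narrower interval, which is the extra uncertainty the t-distribution accounts for.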


Interval Estimation (making confidence intervals
for population parameters estimated from
samples)
 Case #4: estimating an interval for a proportion π based on a sample proportion p
 Remember that p = x/n
 In other words, p = the number of “successes” divided by the number of observations in the sample
 For example: the proportion of people over 6 ft tall
 In this case we don’t need s or σ, but we do need the standard deviation of p:

$$\sigma_p = \sqrt{\frac{\pi(1 - \pi)}{n}}$$

 Which we estimate as:

$$s_p = \sqrt{\frac{p(1 - p)}{n}}$$
Interval Estimation (making confidence intervals
for population parameters estimated from
samples)
 Case #4 continued
 Equation:

$$p - z_{\alpha/2}\sqrt{\frac{p(1 - p)}{n}} \le \pi \le p + z_{\alpha/2}\sqrt{\frac{p(1 - p)}{n}}$$

 We use the z-distribution for estimating an interval for a proportion π based on a sample proportion p
 This also limits us to using only large samples (in this case n > 100)
 For smaller samples, we calculate the entire distribution using the binomial mass function (i.e., solve for all x values):

$$P(x) = C^{n}_{x}\,\pi^{x}(1 - \pi)^{n - x}$$
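
For the small-sample route, the binomial mass function can be evaluated for every x. This is a minimal sketch. The sample size n = 10 and proportion π = 0.4 are hypothetical values chosen only for illustration.

```python
from math import comb

def binomial_pmf(x, n, pi):
    """P(x) = C(n, x) * pi**x * (1 - pi)**(n - x)."""
    return comb(n, x) * pi**x * (1 - pi) ** (n - x)

# Hypothetical small sample: n = 10 observations, true proportion pi = 0.4.
n, pi = 10, 0.4
distribution = [binomial_pmf(x, n, pi) for x in range(n + 1)]  # all x values

for x, prob in enumerate(distribution):
    print(f"P({x}) = {prob:.4f}")
```

The probabilities over all x from 0 to n sum to 1, which is a quick sanity check on the calculation.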
Example #3
 n = 150 people at a convention
 63 people sampled were over 6 feet tall
 What is the 99% C.I. for the true
proportion of all people ≥6 ft tall at the
convention?
Example #3
 p = 63/150 = 0.42
 99% C.I. -> $z_{\alpha/2} = z_{0.005} = 2.58$ (from the z-table)

$$p - z_{\alpha/2}\sqrt{\frac{p(1 - p)}{n}} \le \pi \le p + z_{\alpha/2}\sqrt{\frac{p(1 - p)}{n}}$$

$$0.42 - 2.58\sqrt{\frac{0.42 \times 0.58}{150}} \le \pi \le 0.42 + 2.58\sqrt{\frac{0.42 \times 0.58}{150}}$$

$$0.42 - 2.58 \times 0.0403 \le \pi \le 0.42 + 2.58 \times 0.0403$$

$$0.42 - 0.104 \le \pi \le 0.42 + 0.104$$

 The 99% confidence interval for π, given p = 0.42, is (0.316, 0.524)
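
Example #3 can be checked the same way. This sketch assumes the large-sample z approach described in Case #4 and uses the z-table value 2.58 from the example. The function name `proportion_interval` is illustrative.

```python
from math import sqrt

def proportion_interval(successes, n, z):
    """Large-sample confidence interval for a proportion."""
    p = successes / n
    s_p = sqrt(p * (1 - p) / n)  # estimated standard deviation of p
    margin = z * s_p
    return p - margin, p + margin

# z = 2.58 is the z-table value used in Example #3 for 99% confidence.
low, high = proportion_interval(successes=63, n=150, z=2.58)
print(f"99% C.I.: ({low:.3f}, {high:.3f})")
```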
