Unit 8. Data Analysis
Descriptive Statistics
Gives numerical and graphic procedures to summarize a collection
of data in a clear and understandable way
Inferential Statistics
Provides procedures to draw inferences about a population
from a sample
Descriptive and Inferential
Statistics
Descriptive statistics: Mathematical methods (such as
mean, median, standard deviation) that summarize and
interpret some of the properties of a set of data (sample)
but do not infer the properties of the population from
which the sample was drawn.
Did it happen by chance?
How do you know if something caused or
correlates with something else?
The appropriate statistic will tell you:
Whether there is a difference from some expected value
1/24/2013 7
Descriptive Statistics
Descriptive Statistics (Vocabulary)
Central tendency
Mode
Median
Mean
Variation
Range
Standard deviation
Normal distribution
Standard score
Correlation
Regression
Descriptive Measures
Central Tendency measures. They are
computed to give a “center” around which the
measurements in the data are distributed.
Mean:
Sum of all measurements divided by the number
of measurements.
Median:
A number such that at most half of the
measurements are below it and at most half of the
measurements are above it.
Mode:
The most frequent measurement in the data.
Example of Mean
Measurements   Deviation
x              x - mean
3              -1
5               1
5               1
1              -3
7               3
2              -2
6               2
7               3
0              -4
4               0
Sum: 40        Sum: 0
MEAN = 40/10 = 4
Notice that the sum of the “deviations” is 0.
Notice that every single observation enters into the computation of the mean.
Example of Median
Measurements   Measurements Ranked
x              x
3              0
5              1
5              2
1              3
7              4
2              5
6              5
7              6
0              7
4              7
Sum: 40        Sum: 40
Median: (4+5)/2 = 4.5
Notice that only the two central values are used in the computation.
The median is not sensitive to extreme values.
Example of Mode
Measurements
x
3
5
5
1
7
2
6
7
0
4
In this case the data have two modes: 5 and 7.
Both measurements occur twice.
Example of Mode
Measurements
x
3
5
1
1
4
7
3
8
3
Mode: 3
Notice that it is possible for a data set not to have any mode.
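The mean, median, and mode examples above can be checked with a short sketch using Python's standard `statistics` module. The data are the measurements from the slides; the variable names are illustrative only.

```python
from statistics import mean, median, multimode

# Measurements from the mean, median, and first mode examples
data = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]

data_mean = mean(data)                       # sum of measurements / number of measurements
deviations = [x - data_mean for x in data]   # deviation of each measurement from the mean
data_median = median(data)                   # average of the two central ranked values
data_modes = multimode(data)                 # every value tied for most frequent

# Measurements from the second mode example
second = [3, 5, 1, 1, 4, 7, 3, 8, 3]
second_modes = multimode(second)

print(data_mean, sum(deviations), data_median, data_modes, second_modes)
```

Note that `multimode` returns every value tied for the highest count, so for data in which no value repeats it returns all of the values; that is the case the slide describes as having no mode.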
Graphing data
Provides a quick view of what your data is telling
you.
Example group of test scores
Frequency Polygon and Pie Chart
Common mistakes
Presenting the same dataset as both a graph and a table
Presenting the same dataset as both a frequency histogram and a
percentage histogram
Choosing the wrong type of graph (histogram, pie chart,
line graph)
Overlooking the importance of making a graph
Sample Bar Graph
Sample Histogram
Sample Scatter Plot
Frequency Distributions
Frequency distributions are like frequency
polygons; however, instead of straight lines,
a frequency distribution uses a smooth
curve to connect the points and, similar to a
graph, is plotted on two axes.
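Before a frequency polygon or a smooth frequency curve can be drawn, the data have to be tallied into frequencies. A minimal sketch, reusing the ten measurements from the mean example; the text bars are only a stand-in for a real plot, and the names are illustrative.

```python
from collections import Counter

# Measurements from the mean example
data = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]

counts = Counter(data)  # value -> number of times it occurs

# One row per possible value, so values that never occur show a frequency of 0
table = [(value, counts[value]) for value in range(min(data), max(data) + 1)]

# Horizontal axis: measurement value; bar length: frequency
for value, freq in table:
    print(f"{value:>2} | {'*' * freq} ({freq})")
```

The two axes described above correspond to the first column (the measurement) and the bar length (its frequency).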
J Shaped Curve
Bimodal Curve with Two Peaks
Positively Skewed Bell Curve
Negatively Skewed Bell Curve
Symmetric Bell Curve/Normal Distribution
What is the Normal Distribution?
• Where did it come from and why is it so special?
Sample Histogram
Just about any histogram can be
converted into a line graph
Which can be used to plot a
normal distribution
But how do we get from the
normal to the standard normal?
Measures of variability
Range – Difference between the highest and
lowest values (high value - low value = range)
Variance (S²)
Standard Deviation (S)
variation of values about the mean
Measures of variation – range
Range = highest value - lowest value
Other key measures of variation
S² = Variance
S = Standard Deviation
Measures of variation –
standard deviation
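A minimal sketch of the three measures of variation, using the ten measurements from the mean example. The slides' own formula did not survive extraction, so this assumes the usual sample definitions behind the symbols S² and S: the sum of squared deviations from the mean divided by n - 1, and its square root.

```python
from statistics import variance, stdev

# Measurements from the mean example (mean = 4)
data = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]

data_range = max(data) - min(data)   # highest value - lowest value
s2 = variance(data)                  # sample variance: squared deviations summed, divided by n - 1
s = stdev(data)                      # sample standard deviation: square root of the variance

print(data_range, s2, round(s, 3))
```

If the data were a whole population rather than a sample, `pvariance` and `pstdev` from the same module divide by n instead.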
The Z statistic will allow you to
standardize a normal
distribution
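The usual form of the standard score is z = (x - mean) / standard deviation. The sketch below applies it to the ten measurements from the mean example; using the sample standard deviation is an assumption, since the slide does not show the formula.

```python
from statistics import mean, stdev

# Measurements from the mean example
data = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]

m = mean(data)    # center of the data
s = stdev(data)   # sample standard deviation (assumed here)

# Each z score says how many standard deviations a value lies from the mean
z_scores = [(x - m) / s for x in data]

for x, z in zip(data, z_scores):
    print(f"x = {x}  z = {z:+.2f}")
```

After standardizing, the scores have mean 0 and standard deviation 1, which is what allows different normal distributions to be compared on one scale.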
Inferential Statistics
To generalize or predict how a large
group will behave based upon
information taken from a part of the
group is called INFERENCE
Techniques which tell us how much
confidence we can have when we
GENERALIZE from a sample to a
population
Inferential Statistics (Vocabulary)
Hypothesis
Null hypothesis
Alternative hypothesis
ANOVA
Level of significance
Type I error
Type II error
Collecting a random sample
Goal: to understand characteristics about a population
Examples:
What’s the average household income of the 09 Kebele
residents?
[Figure: distribution plotted on a horizontal axis from 0 to 30]
Mean (µ) = 20
Variance (σ²) = 100
Std. dev (σ) = 10
We then find the mean of this sample (suppose this mean = 19). Take
another sample of 50 observations and find the mean (suppose it’s 24).
Do this many times, and we’ll come up with a distribution of means. The
Central Limit Theorem tells us this distribution will be approximately normal, as on the
next slide (as long as n is “large”, and 50 is large enough):
The normal curve
[Figure: normal curve of the sample mean x̄ on a horizontal axis from 16 to 24]
Mean (µ) = 20   Sample size (n) = 50   Variance of sample mean = σ²/n = 100/50 = 2
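The repeated-sampling idea can be tried out with a small simulation. This is only a sketch: the slides give the population mean (20) and variance (100) but not its shape, so a uniform population with those two values is assumed here, and the number of repeated samples (2000) and the seed are arbitrary choices.

```python
import random
from statistics import mean, pvariance

random.seed(0)  # fixed seed so the run is repeatable

MU, SIGMA, N = 20, 10, 50        # population mean, std. dev., and sample size from the slides
HALF_WIDTH = SIGMA * 3 ** 0.5    # a uniform on [MU - h, MU + h] has std. dev. h / sqrt(3)

def sample_mean():
    """Mean of one random sample of N observations from the assumed uniform population."""
    draws = [random.uniform(MU - HALF_WIDTH, MU + HALF_WIDTH) for _ in range(N)]
    return sum(draws) / N

# Take many samples and keep the mean of each one
means = [sample_mean() for _ in range(2000)]

mean_of_means = mean(means)      # should land close to MU
var_of_means = pvariance(means)  # should land close to SIGMA**2 / N

print(round(mean_of_means, 2), round(var_of_means, 2))
```

Even though the assumed population is flat rather than bell-shaped, the sample means cluster around 20 with a variance near 100/50 = 2.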
Symbols
Population Parameter: µ
Estimate: x̄
Expected: E
Basic Types of Inference
Point Inference
The value of a population parameter µ is estimated using a
single value x̄
Interval Inference
Attaching a probability to an estimate (i.e., making a
confidence interval)
Population Proportions: π is estimated by p = X/n
Population Variance: σ² is estimated by s²
α = 1 – 0.95 = 0.05
$$P\left(17 - 1.96\,\frac{30}{\sqrt{100}} \le \mu \le 17 + 1.96\,\frac{30}{\sqrt{100}}\right) = 0.95$$
So we can say that the 95% C.I. is 17 +/- 5.88, or (11.12, 22.88)
Example #1 Questions
What would happen to our interval if we
used a 99% confidence interval instead?
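The interval arithmetic for Example #1 can be reproduced with a short sketch, reading the numbers in the expression as a sample mean of 17, a standard deviation of 30, and n = 100. The helper `z_interval` is a hypothetical name introduced for illustration; the critical value comes from the standard library's `NormalDist`.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(x_bar, sigma, n, confidence):
    """Confidence interval for the mean based on the z-distribution."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # critical value, about 1.96 for 95%
    margin = z * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin

ci95 = z_interval(17, 30, 100, 0.95)
ci99 = z_interval(17, 30, 100, 0.99)

print([round(v, 2) for v in ci95])
print([round(v, 2) for v in ci99])
```

Raising the confidence level increases the critical value, so the 99% interval comes out wider than the 95% one.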
This means that even when we have s instead of σ we can use the z-
distribution if n is large
Central Limit Theorem: “…as n gets large.”
What is “large”?
Rule of thumb: 30
For n less than 30, the distribution of x̄ does not follow the normal
distribution accurately enough.
For this class use the t-distribution any time you have s instead of σ
Example #2
n = 16
x̄ = 30
s² = 1600
What is the 95% C.I. for the mean?
Example #2
s = 40
Degrees of freedom = n – 1 = 15
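$$t_{\alpha/2,\,n-1} = t_{0.05/2,\,16-1} = t_{0.025,\,15} = 2.131 \text{ (from the t-table)}$$
$$P\left(\bar{X} - 2.131\,\frac{s}{\sqrt{n}} \le \mu \le \bar{X} + 2.131\,\frac{s}{\sqrt{n}}\right) = 0.95$$
$$P\left(30 - 2.131\,\frac{40}{\sqrt{16}} \le \mu \le 30 + 2.131\,\frac{40}{\sqrt{16}}\right) = 0.95$$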
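Carrying the arithmetic of Example #2 through to the interval endpoints, as a sketch: the critical value 2.131 is taken from the t-table as on the slide, because the Python standard library has no t-distribution quantile function.

```python
from math import sqrt

n = 16        # sample size
x_bar = 30    # sample mean
s2 = 1600     # sample variance

s = sqrt(s2)      # sample standard deviation = 40
t_crit = 2.131    # t(0.025, 15) from the t-table, as given on the slide

margin = t_crit * s / sqrt(n)          # 2.131 * 40 / 4
ci = (x_bar - margin, x_bar + margin)  # 95% confidence interval for the mean

print(round(margin, 2), [round(v, 2) for v in ci])
```

The margin works out to 21.31, so the interval is roughly 8.69 to 51.31, which is wide because the sample is small and s is large.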
This also limits us to using only large samples (in this case n >
100)
$$p - z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} \le \pi \le p + z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}$$
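A minimal sketch of the proportion interval above. The counts are hypothetical and not from the slides (60 "successes" in a sample of 200, chosen so that n > 100); `proportion_interval` is likewise an illustrative name.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(successes, n, confidence=0.95):
    """Large-sample confidence interval for a population proportion."""
    p = successes / n                                    # sample proportion p = X / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # z critical value for alpha / 2
    margin = z * sqrt(p * (1 - p) / n)
    return p - margin, p + margin

# Hypothetical data, not from the slides: X = 60, n = 200, so p = 0.30
low, high = proportion_interval(60, 200)

print(round(low, 3), round(high, 3))
```

Because the standard error uses p(1 - p)/n, the interval narrows as n grows, which is one reason the method is restricted to large samples.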