4 Numerical Methods For Describing Data


Chapter 4

Numerical Methods for Describing Data
Population characteristic -
• Fixed value about a population
• Typically unknown
Is this a value that is known? Can we find it out?
Statistic -
• Value calculated from a sample
Measures of Central Tendency
• Mode – the observation that occurs most often
– Can be more than one mode
– If all values occur only once, there is no mode
– Not used as often as the mean & median
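The three mode rules can be sketched in Python. This is a minimal illustration, not part of the original slides: the function name `modes` and the first data list are hypothetical, and the function assumes a non-empty list.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs most often, or [] if no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())  # raises ValueError on an empty list
    if highest == 1:
        return []  # every value occurs only once: there is no mode
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([3, 4, 4, 5, 5, 8]))  # [4, 5]  (more than one mode)
print(modes([3, 4, 5, 8, 10]))    # []      (no mode)
```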


Measures of Central Tendency
Median - the middle value of the data; it divides the observations in half

To find: list the observations in numerical order

$$\text{sample median} = \begin{cases} \text{the single middle value} & \text{if } n \text{ is odd} \\ \text{the average of the two middle values} & \text{if } n \text{ is even} \end{cases}$$

where n = sample size


Suppose we catch a sample of 5 fish from the
lake. The lengths of the fish (in inches) are
listed below. Find the median length of fish.

3 4 5 8 10
Suppose we caught a sample of 6 fish from the lake. The median length is …

3 4 5 6 8 10

5.5 (the average of the two middle values, 5 and 6)
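The odd and even cases can be checked with a short Python sketch applied to the two fish samples. The function name `sample_median` is a hypothetical helper, not something from the slides.

```python
def sample_median(values):
    """Median: the middle value if n is odd, the average of the two middle values if n is even."""
    ordered = sorted(values)  # list the observations in numerical order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(sample_median([3, 4, 5, 8, 10]))     # 5
print(sample_median([3, 4, 5, 6, 8, 10]))  # 5.5
```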
Measures of Central Tendency
Mean is the arithmetic average.

– Use $\mu$ to represent a population mean
– Use $\bar{x}$ to represent a sample mean

Formula:

$$\bar{x} = \frac{\sum x}{n}$$
Suppose we caught a sample of 6 fish from the lake. Find the mean length of the fish.

3 4 5 6 8 10

$$\bar{x} = \frac{3 + 4 + 5 + 6 + 8 + 10}{6} = 6$$
Now find how each observation deviates from the mean.

$x$      $x - \bar{x}$
3      -3    (3 - 6)
4      -2
5      -1
6       0
8       2
10      4
Sum     0
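The mean and the deviations table can be reproduced with a few lines of Python. This is a sketch for checking the arithmetic; `sample_mean` is a hypothetical helper name.

```python
def sample_mean(values):
    """Arithmetic average: the sum of the observations divided by n."""
    return sum(values) / len(values)

fish = [3, 4, 5, 6, 8, 10]
x_bar = sample_mean(fish)
deviations = [x - x_bar for x in fish]

print(x_bar)            # 6.0
print(deviations)       # [-3.0, -2.0, -1.0, 0.0, 2.0, 4.0]
print(sum(deviations))  # 0.0
```

The deviations from the mean always sum to zero (up to floating-point rounding), which is why the mean acts as a balance point.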
Imagine a ruler with pennies placed at
3”, 4”, 5”, 6”, 8” and 10”.

To balance the
ruler on your
finger, you would
need to place your
finger at the mean
of 6.
The mean is the
balance point of a
distribution
What happens to the median & mean if the length of 10 inches was 15 inches?

3 4 5 6 8 15

The median is . . . 5.5

The mean is . . . 6.833

$$\bar{x} = \frac{3 + 4 + 5 + 6 + 8 + 15}{6} \approx 6.833$$
What happens to the median & mean if the 15 inches was 20?

3 4 5 6 8 20

The median is . . . 5.5

The mean is . . . 7.667

$$\bar{x} = \frac{3 + 4 + 5 + 6 + 8 + 20}{6} \approx 7.667$$
Some statistics are resistant: they are not affected by extreme values . . .

Is the median affected by extreme values?
NO

Is the mean affected by extreme values?
YES
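A short Python sketch makes the contrast concrete by replacing the largest fish length with 10, 15, and 20 inches in turn. The helper name `center_summary` is hypothetical; `median` comes from Python's standard `statistics` module.

```python
from statistics import median

def center_summary(sample):
    """Return (median, mean rounded to 3 decimal places) for a sample."""
    return median(sample), round(sum(sample) / len(sample), 3)

base = [3, 4, 5, 6, 8]  # the five fish whose lengths never change
results = {largest: center_summary(base + [largest]) for largest in (10, 15, 20)}

for largest, (med, avg) in results.items():
    print(f"largest = {largest}: median = {med}, mean = {avg}")
```

The median stays at 5.5 in all three cases, while the mean moves from 6 to 6.833 to 7.667.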
Suppose we caught a sample of 20 fish with
the following lengths. Create a histogram
for the lengths of fish. (Use a class width of 1.)

Mean = 6.5
Median = 6.5

3 5 6 10 6 7 7 8 4 5
6 4 7 5 9 9 8 7 6 8
Suppose we caught a sample of 20 fish with
the following lengths. Create a histogram
for the lengths of fish. (Use a class width of 1.)

Mean = 6.8
Median = 5.5

3 5 6 10 15 7 3 3 4 5
6 4 12 5 3 4 8 13 11 9
Suppose we caught a sample of 20 fish with
the following lengths. Create a histogram
for the lengths of fish. (Use a class width of 1.)

Mean = 7.75
Median = 8.5

3 5 6 10 10 7 10 8 9 5
6 4 9 10 9 9 10 7 10 8
Recap:
• In a symmetrical distribution, the mean and median are equal.
• In a skewed distribution, the mean is pulled in the direction of the skewness.

• In a symmetrical distribution, you should report the mean!
• In a skewed distribution, the median should be reported as the measure of center!
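The recap can be checked against the three 20-fish samples. In this sketch the labels "sample 1" to "sample 3" and the function `compare_center` are hypothetical names, and comparing floating-point means for exact equality only works here because the values are exactly representable.

```python
from statistics import median

samples = {
    "sample 1": [3, 5, 6, 10, 6, 7, 7, 8, 4, 5, 6, 4, 7, 5, 9, 9, 8, 7, 6, 8],
    "sample 2": [3, 5, 6, 10, 15, 7, 3, 3, 4, 5, 6, 4, 12, 5, 3, 4, 8, 13, 11, 9],
    "sample 3": [3, 5, 6, 10, 10, 7, 10, 8, 9, 5, 6, 4, 9, 10, 9, 9, 10, 7, 10, 8],
}

def compare_center(values):
    """Return (mean, median, relation), where relation says which one is larger."""
    avg = sum(values) / len(values)
    med = median(values)
    if avg > med:
        relation = "mean > median"  # mean pulled toward a long upper tail
    elif avg < med:
        relation = "mean < median"  # mean pulled toward a long lower tail
    else:
        relation = "mean = median"
    return avg, med, relation

for name, values in samples.items():
    print(name, compare_center(values))
```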
Trimmed mean:
Purpose is to keep outliers from distorting the mean by removing them from the data set

To calculate a trimmed mean:

• Multiply the percent to trim by n
• Remove that many observations from BOTH ends of the distribution (when listed in order)
• Calculate the mean with the shortened data set
Find the mean of the following set of data.

12 14 19 20 22 24 25 26 26 50

Mean = 23.8

Trimming one observation from each end (10% of n = 10) gives the trimmed mean:

$$\bar{x}_T = \frac{14 + 19 + 20 + 22 + 24 + 25 + 26 + 26}{8} = 22$$
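The trimming steps can be sketched in Python as follows. The function name `trimmed_mean` is hypothetical, and two details are assumptions the slides do not state: the percent is passed as a whole number, and the number of observations to drop is rounded down.

```python
def trimmed_mean(values, trim_percent):
    """Mean after dropping trim_percent% of the observations from EACH end."""
    ordered = sorted(values)
    n = len(ordered)
    k = (trim_percent * n) // 100  # observations to drop per end (rounded down: an assumption)
    kept = ordered[k:n - k]
    return sum(kept) / len(kept)   # fails if the trimming removes every observation

data = [12, 14, 19, 20, 22, 24, 25, 26, 26, 50]
print(trimmed_mean(data, 0))   # 23.8 (ordinary mean)
print(trimmed_mean(data, 10))  # 22.0 (drops 12 and 50)
```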
What values are used to describe
categorical data?
Suppose that each person in a sample of 15 cell
phone users is asked if he or she is satisfied
with the cell phone service.

Here are the responses:


Y N Y Y Y N N Y Y
N Y Y Y N N
$$\hat{p} = \frac{\text{number of successes}}{n} = \frac{9}{15} = 0.6$$

60% of the sample was satisfied with their cell phone service.
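The sample proportion can be recomputed directly from the 15 responses. This is a minimal sketch; treating "Y" as the success is the same convention the slide uses, and the variable names are illustrative.

```python
responses = "Y N Y Y Y N N Y Y N Y Y Y N N".split()

successes = responses.count("Y")  # number of satisfied users
n = len(responses)                # sample size
p_hat = successes / n

print(successes, n, p_hat)  # 9 15 0.6
```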
Why is the study of variability
important?
• There is variability in virtually everything
• Allows us to distinguish between usual & unusual values
• Reporting only a measure of center doesn’t provide a complete picture of the distribution.
[Figure: three dotplots drawn on a common scale from 20 to 70]

Notice that these three data sets all have the same mean and median (at 45), but they have very different amounts of variability.
Measures of Variability
The simplest numeric measure of variability
is range.

Range = largest observation – smallest observation

[Figure: the three dotplots again, on a common scale from 20 to 70]

The first two data sets have a range of 50 (70 - 20), but the third data set has a much smaller range of 10.
Measures of Variability
Another measure of the variability in a data set uses the deviations from the mean, $(x - \bar{x})$.

The estimated average of the squared deviations is called the variance.

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
When calculating sample variance, we use degrees of freedom (n – 1) in the denominator instead of n because this tends to produce better estimates of the population variance.

Degrees of freedom will be revisited in Chapter 8.
Remember the sample of 6 fish that we caught from the lake . . .
Find the variance of the length of fish.

$x$      $x - \bar{x}$    $(x - \bar{x})^2$
3      -3           9
4      -2           4
5      -1           1
6       0           0
8       2           4
10      4          16
Sum     0          34

$$s^2 = \frac{34}{6 - 1} = 6.8$$
Measures of Variability
The square root of variance is called standard deviation.

A typical deviation from the mean is the standard deviation.

$s^2 = 6.8$ inches$^2$, so $s = \sqrt{6.8} \approx 2.608$ inches

The fish in our sample typically deviate from the mean of 6 inches by about 2.608 inches.
Calculation of standard deviation of a sample

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
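The sample variance and standard deviation formulas can be applied to the six fish lengths in Python. This is a sketch; `sample_variance` is a hypothetical helper that divides by n - 1, as the slides do.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

fish = [3, 4, 5, 6, 8, 10]
s2 = sample_variance(fish)
s = math.sqrt(s2)

print(s2)           # 6.8
print(round(s, 3))  # 2.608
```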
Measures of Variability
Interquartile range (iqr) is the range of the middle half of the data.

Lower quartile (Q1) is the median of the lower half of the data
Upper quartile (Q3) is the median of the upper half of the data

iqr = Q3 – Q1
The Chronicle of Higher Education (2009-2010
issue) published the accompanying data on the
percentage of the population with a bachelor’s or
higher degree in 2007 for each of the 50 states
and the District of Columbia.

21 27 26 19 30 35 35 26 47 26
27 30 24 29 22 24 29 20 20 27
35 38 25 31 19 24 27 27 23 34
25 32 26 24 22 28 26 30 23 25
22 25 29 33 34 30 17 25 23 34
26

Find the interquartile range for this set of data.

Listed in order, the 51 observations are:

17 19 19 20 20 21 22 22 22 23
23 23 24 24 24 24 25 25 25 25
25 26 26 26 26 26 26 27 27 27
27 27 28 29 29 29 30 30 30 30
31 32 33 34 34 34 35 35 35 38
47

Q1 = 24 (median of the lower half) and Q3 = 30 (median of the upper half)

iqr = 30 – 24 = 6
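The quartile calculation can be sketched in Python. The slides do not say how the halves are formed when n is odd, so this sketch assumes the overall median is left out of both halves. For this data set, including it gives the same quartiles. The function name `quartiles` is hypothetical.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves of the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Assumption: when n is odd, the middle observation belongs to neither half.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(upper)

degrees = [
    21, 27, 26, 19, 30, 35, 35, 26, 47, 26,
    27, 30, 24, 29, 22, 24, 29, 20, 20, 27,
    35, 38, 25, 31, 19, 24, 27, 27, 23, 34,
    25, 32, 26, 24, 22, 28, 26, 30, 23, 25,
    22, 25, 29, 33, 34, 30, 17, 25, 23, 34,
    26,
]

q1, q3 = quartiles(degrees)
print(q1, q3, q3 - q1)  # 24 30 6
```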
Another graph: Boxplots
What are some advantages of boxplots?

• ease of construction
• convenient handling of outliers
• construction is not subjective (unlike histograms)
• used with medium or large size data sets (n > 10)
• useful for comparative displays
Boxplots
When to use: univariate numerical data

How to construct a skeleton boxplot:
– Calculate the five-number summary (smallest observation, Q1, median, Q3, largest observation)
– Draw a horizontal (or vertical) scale
– Construct a rectangular box from the lower quartile (Q1) to the upper quartile (Q3)
– Draw lines from the lower quartile to the smallest observation and from the upper quartile to the largest observation

To describe:
– comment on the center, spread, and shape of the distribution and whether there are any unusual features
Remember the data on the percentage of the
population with a bachelor’s or higher degree in
2007 for each of the 50 states and the District of
Columbia.

17 19 19 20 20 21 22 22 22 23
23 23 24 24 24 24 25 25 25 25
25 26 26 26 26 26 26 27 27 27
27 27 28 29 29 29 30 30 30 30
31 32 33 34 34 34 35 35 35 38
47

[Figure: skeleton boxplot of the data on a horizontal axis from 10 to 50, labeled "Percentages"]
Modified boxplots
To display outliers:
• Identify mild & extreme outliers
An observation is an outlier if it is more than 1.5(iqr) away from the nearest quartile, that is, if it lies below Q1 - 1.5(iqr) or above Q3 + 1.5(iqr).
An outlier is extreme if it is more than 3(iqr) away from the nearest quartile, that is, if it lies below Q1 - 3(iqr) or above Q3 + 3(iqr).

• Whiskers extend to the largest (or smallest) observation that is not an outlier
Remember the data on the percentage of the
population with a bachelor’s or higher degree in
2007 for each of the 50 states and the District of
Columbia.

17 19 19 20 20 21 22 22 22 23
23 23 24 24 24 24 25 25 25 25
25 26 26 26 26 26 26 27 27 27
27 27 28 29 29 29 30 30 30 30
31 32 33 34 34 34 35 35 35 38
47
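
Outlier boundaries:
24 - 1.5(6) = 15
30 + 1.5(6) = 39
30 + 3(6) = 48

The observation 47 lies between 39 and 48, so it is a mild outlier.

[Figure: modified boxplot of the data on a horizontal axis from 10 to 50, labeled "Percentages"]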
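The outlier boundaries can be applied to the whole data set with a short Python sketch. The function name `classify_outliers` is hypothetical, and the quartiles 24 and 30 are taken from the slides rather than recomputed.

```python
def classify_outliers(values, q1, q3):
    """Split outliers into (mild, extreme) using the 1.5(iqr) and 3(iqr) rules."""
    iqr = q3 - q1
    mild_low, mild_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    extreme_low, extreme_high = q1 - 3 * iqr, q3 + 3 * iqr
    mild, extreme = [], []
    for x in values:
        if x < extreme_low or x > extreme_high:
            extreme.append(x)
        elif x < mild_low or x > mild_high:
            mild.append(x)
    return mild, extreme

percentages = [
    17, 19, 19, 20, 20, 21, 22, 22, 22, 23,
    23, 23, 24, 24, 24, 24, 25, 25, 25, 25,
    25, 26, 26, 26, 26, 26, 26, 27, 27, 27,
    27, 27, 28, 29, 29, 29, 30, 30, 30, 30,
    31, 32, 33, 34, 34, 34, 35, 35, 35, 38,
    47,
]

mild, extreme = classify_outliers(percentages, q1=24, q3=30)
print(mild, extreme)  # [47] []
```

With 47 flagged as an outlier, the upper whisker of the modified boxplot would stop at 38, the largest observation that is not an outlier.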
[Figure: example boxplots labeled "Symmetrical boxplots", "Approximately symmetrical boxplot", and "Skewed boxplot"]
The 2009-2010 salaries of NBA players
published on the web site hoopshype.com were
used to construct the comparative boxplot of
salary data for five teams.
Interpreting Center & Variability
Chebyshev’s Rule –

The percentage of observations that are within k standard deviations of the mean is at least

$$100\left(1 - \frac{1}{k^2}\right)\%$$

where k > 1

If k = 2, then $100\left(1 - \frac{1}{2^2}\right)\% = 75\%$, so at least 75% of the observations are within 2 standard deviations of the mean.
For a sample of families with one preschool child,
it was reported that the mean child care time per
week was approximately 36 hours with a standard
deviation of approximately 12 hours.

Using Chebyshev’s rule, at least 75% of the sample observations must be between 12 and 60 hours (within 2 standard deviations of the mean).

At most, what percent of the observations are greater than 72 hours?
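One way to approach the 72-hour question is to note that 72 is 3 standard deviations above the mean of 36 and apply the rule with k = 3. The Python sketch below does this, with `chebyshev_lower_bound` as a hypothetical helper name. Chebyshev's bound covers both tails together, so the percentage outside the interval is only an upper limit for the upper tail.

```python
def chebyshev_lower_bound(k):
    """Minimum percent of observations within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's rule requires k > 1")
    return 100 * (1 - 1 / k**2)

mean_hours, sd_hours = 36, 12

for k in (2, 3):
    low, high = mean_hours - k * sd_hours, mean_hours + k * sd_hours
    inside = chebyshev_lower_bound(k)
    print(f"k = {k}: at least {inside:.1f}% between {low} and {high} hours, "
          f"at most {100 - inside:.1f}% outside")
```

For k = 3 the interval is 0 to 72 hours and at least about 88.9% of observations fall inside it, so at most about 11.1% can be greater than 72 hours.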
What’s my area?
Input the following command into a graphing calculator
in order to graph a normal curve with a mean of 20 and
standard deviation of 3.

Y1 = normalpdf(X,20,3) (Window x: [10,30] y: [0,0.2])

Use the command 2nd trace, 7 to find the area under the curve for the following limits. (Round to 3 decimal places.)

Lower limit: 17 Upper limit: 23 Area: ________
Lower limit: 14 Upper limit: 26 Area: ________
Lower limit: 11 Upper limit: 29 Area: ________
What’s my area?
Graph a normal curve with a mean of 50 and standard
deviation of 5.

Y1 = normalpdf(X,50,5) (x: [30,70] y: [0,0.1])

Find the area under the curve for the following:

Lower limit: 45 Upper limit: 55 Area: ________
Lower limit: 40 Upper limit: 60 Area: ________
Lower limit: 35 Upper limit: 65 Area: ________
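For readers without a graphing calculator, the same areas can be computed in Python from the normal cumulative distribution function, written here with the standard library's `math.erf`. The helper names `normal_cdf` and `normal_area` are hypothetical. This is an alternative to the calculator steps, not the method the activity prescribes.

```python
import math

def normal_cdf(x, mean, sd):
    """Area under the normal curve to the left of x."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

def normal_area(lower, upper, mean, sd):
    """Area under the normal curve between lower and upper."""
    return normal_cdf(upper, mean, sd) - normal_cdf(lower, mean, sd)

for mean, sd in ((20, 3), (50, 5)):
    for k in (1, 2, 3):
        lower, upper = mean - k * sd, mean + k * sd
        area = normal_area(lower, upper, mean, sd)
        print(f"mean {mean}, sd {sd}: {lower} to {upper} -> {area:.3f}")
```

Both curves give the same three areas, because each pair of limits sits 1, 2, or 3 standard deviations either side of the mean.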
Interpreting Center & Variability
Empirical Rule-

If the distribution is approximately normal:

• Approximately 68% of the observations are within 1 standard deviation of the mean
• Approximately 95% of the observations are within 2 standard deviations of the mean
• Approximately 99.7% of the observations are within 3 standard deviations of the mean

[Figure: normal curve with the central 68%, 95%, and 99.7% regions marked]
The height of male students at PWSH is
approximately normally distributed with a
mean of 71 inches and standard deviation of
2.5 inches.

a) What percent of the male students are shorter than 66 inches?
About 2.5%
b) Taller than 73.5 inches?
About 16%
c) Between 66 & 73.5 inches?
About 81.5%
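The three answers follow from the Empirical Rule once each cutoff is written as a number of standard deviations from the mean. A worked version, assuming the normal curve is symmetric about its mean so that the area outside each central region splits evenly between the two tails:

```latex
\begin{align*}
66 &= 71 - 2(2.5) &&\Rightarrow\quad P(X < 66) \approx \frac{100\% - 95\%}{2} = 2.5\% \\
73.5 &= 71 + 1(2.5) &&\Rightarrow\quad P(X > 73.5) \approx \frac{100\% - 68\%}{2} = 16\% \\
& &&\phantom{\Rightarrow}\quad P(66 < X < 73.5) \approx 100\% - 2.5\% - 16\% = 81.5\%
\end{align*}
```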
Measures of Relative Standing
Z-score

A z-score tells us how many standard deviations the value is from the mean.

$$z\text{-score} = \frac{\text{value} - \text{mean}}{\text{standard deviation}}$$
What do these z-scores mean?

-2.3: 2.3 standard deviations below the mean
1.8: 1.8 standard deviations above the mean
-4.3: 4.3 standard deviations below the mean


Sally is taking two different math achievement
tests with different means and standard
deviations. The mean score on test A was 56
with a standard deviation of 3.5, while the mean
score on test B was 65 with a standard deviation
of 2.8. Sally scored a 62 on test A and a 69 on
test B. On which test did Sally score the best?

Z-score on test A: $z = \frac{62 - 56}{3.5} \approx 1.714$
Z-score on test B: $z = \frac{69 - 65}{2.8} \approx 1.429$

She did better on test A, where her score is more standard deviations above the mean.
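The comparison can be reproduced with a small Python sketch; `z_score` is a hypothetical helper name.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / sd

z_a = z_score(62, 56, 3.5)  # test A
z_b = z_score(69, 65, 2.8)  # test B

print(round(z_a, 3), round(z_b, 3))  # 1.714 1.429
print("A" if z_a > z_b else "B")     # A
```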


Measures of Relative Standing
Percentiles

The rth percentile is a value such that r percent of the observations fall AT or BELOW that value
In addition to weight and length, head
circumference is another measure of health in
newborn babies. The National Center for Health
Statistics reports the following summary values
for head circumference (in cm) at birth for boys.
Head circumference (cm)   32.2   33.2   34.5   35.8   37.0   38.2   38.6
Percentile                5      10     25     50     75     90     95

What percent of newborn boys had head circumferences greater than 37.0 cm? 25%

10% of newborn boys have head circumferences bigger than what value? 38.2 cm
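Both answers come from reading the table as "r percent at or below, 100 - r percent above". A minimal Python sketch of that lookup follows. The dictionary and function names are hypothetical, and only values listed in the table can be looked up.

```python
# Percentile table from the slide: percentile -> head circumference (cm)
head_circumference = {5: 32.2, 10: 33.2, 25: 34.5, 50: 35.8, 75: 37.0, 90: 38.2, 95: 38.6}

def percent_above(value_cm):
    """Percent of observations above a tabulated value: 100 minus its percentile."""
    for percentile, cm in head_circumference.items():
        if cm == value_cm:
            return 100 - percentile
    raise KeyError(f"{value_cm} cm is not a tabulated value")

def value_with_percent_above(percent):
    """Tabulated value that has the given percent of observations above it."""
    return head_circumference[100 - percent]

print(percent_above(37.0))           # 25
print(value_with_percent_above(10))  # 38.2
```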
