Chapter Four: Numerical Descriptive Techniques
4.1
Introduction…
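Recall Chapter 2, where we used graphical techniques to
describe data. This chapter describes data numerically:
Measures of Central Location
Mean, Median, Mode
Measures of Variability
Range, Standard Deviation, Variance, Coefficient of Variation
Measures of Relative Standing
Percentiles, Quartiles
Measures of Linear Relationship
Covariance, Coefficient of Correlation, Least Squares Line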
4.3
Measures of Central Location…
The arithmetic mean, a.k.a. average, shortened to mean, is
the most popular & useful measure of central location.
4.4
Notation…
When referring to the number of observations in a
population, we use the uppercase letter N; for a sample, we
use the lowercase letter n.
            Population    Sample
Size        N             n
Mean        $\mu$         $\bar{x}$
4.6
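Arithmetic Mean…
Population mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
Sample mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$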
4.7
Statistics is a pattern language…
            Population    Sample
Size        N             n
Mean        $\mu$         $\bar{x}$
4.8
The Arithmetic Mean…
…is appropriate for describing measurement data, e.g.
heights of people, marks of student papers, etc.
4.9
Measures of Central Location…
The median is calculated by placing all the observations in
order; the observation that falls in the middle is the median.
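As a minimal sketch of the two measures so far, the snippet below computes the mean and the median with Python's standard `statistics` module. The ten observations are the Example 4.1 values quoted later in this chapter; the variable names are illustrative only.

```python
import statistics

# Example 4.1 observations (listed later in this chapter)
observations = [0, 0, 5, 7, 8, 9, 12, 14, 22, 33]

mean_value = statistics.mean(observations)      # sum of observations / n
median_value = statistics.median(observations)  # middle of the sorted data

# With an even number of observations there is no single middle value,
# so the median is the average of the two middle values (here 8 and 9).
print(mean_value, median_value)
```

The mean (11) sits above the median (8.5) here because the two largest observations pull the mean upward.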
The mode of a set of observations is the value that occurs
most frequently. A set of data may have one mode (or modal
class), or two or more modes.
The mode is useful for all data types, though it is mainly used
for nominal data.
For large data sets the modal class is much more relevant
than a single-value mode.
[Figure: frequency plotted against the variable, with a modal class marked]
4.12
=MODE(range) in Excel…
Note: if you are using Excel for your data analysis and your
data is multi-modal (i.e. there is more than one mode), Excel
only calculates the smallest one.
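Outside Excel, it may help to list every mode rather than a single one. The sketch below uses `statistics.multimode` from Python's standard library (Python 3.8 or later); the data values are hypothetical and chosen so that two values tie.

```python
import statistics

# Hypothetical multi-modal data: 2 and 5 each occur three times
data = [5, 2, 5, 2, 7, 5, 2, 9]

# multimode returns every mode, in order of first appearance
all_modes = statistics.multimode(data)

# The single value described in the note above (the smallest mode)
smallest_mode = min(all_modes)

print(all_modes, smallest_mode)
```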
4.13
Mean, Median, Mode…
If a distribution is symmetrical,
the mean, median and mode may coincide…
[Figure: symmetrical distribution with the mean, median and mode at the same point]
4.14
Mean, Median, Mode…
If a distribution is asymmetrical, say skewed to the left or to
the right, the three measures may differ. E.g.:
[Figure: skewed distribution with the mode, median and mean at different points]
4.15
Mean, Median, Mode…
If data are symmetric, the mean, median, and mode will be
approximately the same.
4.16
Mean, Median, & Modes for Ordinal & Nominal Data…
4.17
Geometric Mean…
The geometric mean is used when the variable is a growth
rate or rate of change, such as the value of an investment
over periods of time.
4.18
Finance Example…
Suppose a 2-year investment of $1,000 grows by 100% to $2,000 in
the first year, but loses 50% from $2,000 back to the original $1,000 in
the second year. What is your average return?
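The arithmetic mean of the two returns, (100% + (-50%)) / 2 = 25%, is
misleading because the investment ends where it started.
Solving for the geometric mean, $(1 + R_g)^2 = (1 + 1)(1 - 0.5) = 1$, yields a rate of 0%, which is correct.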
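The finance example can be checked with a few lines of Python. This sketch assumes the usual definition of the geometric mean return, $R_g = \sqrt[n]{(1+R_1)(1+R_2)\cdots(1+R_n)} - 1$; the function name is illustrative, and `math.prod` needs Python 3.8 or later.

```python
import math

def geometric_mean_return(returns):
    """Average per-period growth rate for a list of per-period returns."""
    growth = math.prod(1 + r for r in returns)  # total growth factor
    return growth ** (1 / len(returns)) - 1

returns = [1.00, -0.50]  # +100% in year 1, -50% in year 2

arithmetic = sum(returns) / len(returns)
geometric = geometric_mean_return(returns)

print(arithmetic, geometric)
```

The arithmetic mean suggests a 25% average return, while the geometric mean reports 0%, which matches the investment ending at its starting value of $1,000.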
4.20
Measures of Variability…
Measures of central location fail to tell the whole story about
the distribution; that is, how much are the observations
spread out around the mean value?
4.21
Range…
The range is the simplest measure of variability, calculated
as:
Range = Largest observation - Smallest observation
E.g.
Data: {4, 4, 4, 4, 50} Range = 46
Data: {4, 8, 15, 24, 39, 50} Range = 46
The range is the same in both cases,
but the data sets have very different distributions…
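A short sketch of this comparison: both data sets give the same range, while another measure of spread (the sample standard deviation, covered later in this chapter) tells them apart. The helper name `data_range` is illustrative.

```python
import statistics

def data_range(values):
    """Largest observation minus smallest observation."""
    return max(values) - min(values)

set_a = [4, 4, 4, 4, 50]
set_b = [4, 8, 15, 24, 39, 50]

range_a = data_range(set_a)
range_b = data_range(set_b)

# The range uses only the two extreme observations, so it cannot
# distinguish these sets; the standard deviation uses every observation.
stdev_a = statistics.stdev(set_a)
stdev_b = statistics.stdev(set_b)

print(range_a, range_b)
print(stdev_a, stdev_b)
```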
4.22
Range…
Its major advantage is the ease with which it can be
computed.
4.23
Variance…
Variance and its related measure, standard deviation, are
arguably the most important statistics. Used to measure
variability, they also play a vital role in almost all statistical
inference procedures.
4.24
Statistics is a pattern language…
            Population    Sample
Size        N             n
Mean        $\mu$         $\bar{x}$
Variance    $\sigma^2$    $s^2$
4.25
Variance…
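The variance of a population is:
$\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}$, where $\mu$ is the population mean and $N$ is the population size.
The variance of a sample is:
$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}$, where $\bar{x}$ is the sample mean and the denominator is the sample size minus one.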
4.26
Variance…
As you can see, you have to calculate the sample mean (x-
bar) in order to calculate the sample variance.
4.27
Application…
Example 4.7. The following sample consists of the number
of jobs six students applied for: 17, 15, 23, 7, 9, 13.
Find its mean and variance.
…as opposed to $\mu$ or $\sigma^2$, since this is a sample.
4.28
Sample Mean & Variance…
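Sample Mean
$\bar{x} = \frac{17 + 15 + 23 + 7 + 9 + 13}{6} = \frac{84}{6} = 14$ jobs
Sample Variance
$s^2 = \frac{(17-14)^2 + (15-14)^2 + (23-14)^2 + (7-14)^2 + (9-14)^2 + (13-14)^2}{6 - 1} = \frac{166}{5} = 33.2$ jobs$^2$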
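The Example 4.7 calculation can be reproduced in Python. This is a minimal sketch that computes the sample mean and the sample variance (with the n - 1 denominator) by hand and then cross-checks against the standard library; the variable names are illustrative.

```python
import statistics

jobs = [17, 15, 23, 7, 9, 13]  # Example 4.7 sample

n = len(jobs)
x_bar = sum(jobs) / n  # sample mean

# Sample variance: squared deviations from x_bar, divided by n - 1
s_squared = sum((x - x_bar) ** 2 for x in jobs) / (n - 1)

# statistics.variance also uses the n - 1 denominator
library_variance = statistics.variance(jobs)

print(x_bar, s_squared, library_variance)
```

Note that `statistics.pvariance` would divide by n instead, which is the population formula.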
4.29
Standard Deviation…
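The standard deviation is simply the square root of the
variance, thus:
Population standard deviation: $\sigma = \sqrt{\sigma^2}$
Sample standard deviation: $s = \sqrt{s^2}$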
4.30
Statistics is a pattern language…
                      Population    Sample
Size                  N             n
Mean                  $\mu$         $\bar{x}$
Variance              $\sigma^2$    $s^2$
Standard Deviation    $\sigma$      $s$
4.31
Standard Deviation…
Consider Example 4.8 where a golf club manufacturer has
designed a new club and wants to determine if it is hit more
consistently (i.e. with less variability) than with an old club.
Using Tools > Data Analysis… > Descriptive Statistics in Excel,
we produce the following tables for interpretation…
4.32
Interpreting Standard Deviation…
The standard deviation can be used to compare the variability of
several distributions and make a statement about the general shape
of a distribution. If the histogram is bell shaped, we can use the
Empirical Rule, which states:
4.33
The Empirical Rule…
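Approximately 68% of all observations fall
within one standard deviation of the mean.
Approximately 95% of all observations fall
within two standard deviations of the mean.
Approximately 99.7% of all observations fall
within three standard deviations of the mean.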
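The rule can be checked by simulation. The sketch below draws a bell-shaped sample with `random.gauss` and counts the share of observations within k standard deviations of the mean; the mean of 100, the standard deviation of 15, the sample size and the seed are all arbitrary assumptions made for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable
sample = [random.gauss(100, 15) for _ in range(100_000)]

x_bar = statistics.mean(sample)
s = statistics.stdev(sample)

def share_within(k):
    """Proportion of observations within k standard deviations of the mean."""
    inside = sum(1 for x in sample if abs(x - x_bar) <= k * s)
    return inside / len(sample)

# One standard deviation is the case stated above; the wider bands
# (two and three standard deviations) are printed for comparison.
for k in (1, 2, 3):
    print(k, share_within(k))
```

The rule applies only to bell-shaped histograms, so a skewed sample would not be expected to give these proportions.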
4.34
Coefficient of Variation…
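The coefficient of variation of a set of observations is the
standard deviation of the observations divided by their mean,
that is:
Population coefficient of variation: $CV = \frac{\sigma}{\mu}$
Sample coefficient of variation: $cv = \frac{s}{\bar{x}}$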
4.35
Statistics is a pattern language…
                            Population    Sample
Size                        N             n
Mean                        $\mu$         $\bar{x}$
Variance                    $\sigma^2$    $s^2$
Standard Deviation          $\sigma$      $s$
Coefficient of Variation    CV            cv
4.36
Coefficient of Variation…
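This coefficient provides a
proportionate measure of variation, e.g.
a standard deviation of 10 is large relative to a mean of 100 (cv = 0.10)
but small relative to a mean of 500 (cv = 0.02).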
4.37
Measures of Variability…
If data are symmetric, with no serious outliers, use range and
standard deviation.
4.38
Measures of Relative Standing & Box Plots
Measures of relative standing are designed to provide
information about the position of particular values relative to
the entire data set.
4.39
Quartiles…
We have special names for the 25th, 50th, and 75th percentiles,
namely quartiles.
We can also convert percentiles into quintiles (fifths) and deciles (tenths).
4.40
Commonly Used Percentiles…
First (lower) decile = 10th percentile
First (lower) quartile, Q1, = 25th percentile
Second (middle) quartile, Q2, = 50th percentile
Third quartile, Q3, = 75th percentile
Ninth (upper) decile = 90th percentile
4.41
Location of Percentiles…
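The following formula allows us to approximate the location
of any percentile:
$L_P = (n + 1)\frac{P}{100}$
where $L_P$ is the location of the $P$th percentile and $n$ is the number of observations.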
4.42
Location of Percentiles…
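Recall the data from example 4.1:
0 0 5 7 8 9 12 14 22 33
The 25th percentile is located at $L_{25} = (10 + 1)\frac{25}{100} = 2.75$, that is,
three-quarters of the distance between the second and the third
observations, which are 0 and 5, respectively: 0 + (.75)(5 - 0) = 3.75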
4.43
Location of Percentiles…
What about the upper quartile? Its location is $L_{75} = (10 + 1)\frac{75}{100} = 8.25$.
0 0 5 7 8 9 12 14 22 33
It is located one-quarter of the distance between the eighth and the ninth
observations, which are 14 and 22, respectively. One-quarter of the distance
is: (.25)(22 - 14) = 2, which means the 75th percentile is at: 14 + 2 = 16
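The interpolation above can be sketched as a small function. It assumes the location formula $L_P = (n + 1)P/100$, which reproduces the position 8.25 used for the upper quartile; the function name and the clamping of locations that fall outside the data are assumptions added for illustration.

```python
def percentile(values, p):
    """Approximate the p-th percentile via the location L_P = (n + 1) * p / 100."""
    ordered = sorted(values)
    n = len(ordered)
    location = (n + 1) * p / 100         # 1-based position, may be fractional
    location = min(max(location, 1), n)  # assumption: clamp to the ends of the data
    whole = int(location)                # observation just below the location
    fraction = location - whole          # how far to move toward the next one
    if whole == n:
        return ordered[-1]
    lower, upper = ordered[whole - 1], ordered[whole]
    return lower + fraction * (upper - lower)

data = [0, 0, 5, 7, 8, 9, 12, 14, 22, 33]

lower_quartile = percentile(data, 25)
upper_quartile = percentile(data, 75)

print(lower_quartile, upper_quartile)
```

Other software may use a different location rule, so quartiles reported elsewhere can differ slightly from these values.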
4.44
Location of Percentiles…
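Please remember…
0 0 | 5 7 8 9 12 14 | 22 33
Position 2.75 (between the 2nd and 3rd observations): 25th percentile = 3.75
Position 8.25 (between the 8th and 9th observations): 75th percentile = 16
$L_P$ determines the position in the data set where the percentile value lies,
not the value of the percentile itself.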
4.45
Interquartile Range…
The quartiles can be used to create another measure of
variability, the interquartile range, which is defined as
follows:
Interquartile Range = Q3 – Q1
4.46
Box Plots…
The box plot is a technique that graphs five statistics:
• the minimum and maximum observations, and
• the first, second, and third quartiles.
[Figure: box plot; each whisker extends at most 1.5*(Q3–Q1) from the box]
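The five statistics and the whisker limit can be computed as follows. This is a sketch using `statistics.quantiles` (Python 3.8 or later) on the Example 4.1 observations; treating points beyond 1.5*(Q3–Q1) from the box as outliers is a common convention assumed here, not something stated above.

```python
import statistics

data = [0, 0, 5, 7, 8, 9, 12, 14, 22, 33]  # Example 4.1 observations

# method="exclusive" places the quartiles at position (n + 1) * P / 100
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")
iqr = q3 - q1  # interquartile range

# Whiskers reach at most 1.5 * IQR beyond the box
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_limit or x > upper_limit]

five_numbers = (min(data), q1, q2, q3, max(data))
print(five_numbers, iqr, outliers)
```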
4.48
Measures of Linear Relationship…
We now present two numerical measures of linear
relationship that provide information as to the strength &
direction of a linear relationship between two variables (if
one exists).
4.49
Covariance…
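Population covariance: $\sigma_{xy} = \frac{\sum_{i=1}^{N}(x_i - \mu_x)(y_i - \mu_y)}{N}$,
where $\mu_x$ and $\mu_y$ are the population means of variable X and variable Y.
Sample covariance: $s_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n - 1}$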
4.51
Statistics is a pattern language…
                            Population       Sample
Size                        N                n
Mean                        $\mu$            $\bar{x}$
Variance                    $\sigma^2$       $s^2$
Standard Deviation          $\sigma$         $s$
Coefficient of Variation    CV               cv
Covariance                  $\sigma_{xy}$    $s_{xy}$
4.52
Covariance Illustrated…
Consider the following three sets of data (textbook §4.5)…
In each set, the values of X are the same, and the values of Y are the same;
the only thing that changes is the order of the Y’s.
In set #1, as X increases so does Y; Sxy is large & positive.
In set #2, as X increases, Y decreases; Sxy is large & negative.
In set #3, as X increases, Y doesn’t move in any particular way; Sxy is “small”.
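The effect of reordering the Y values can be sketched in Python. The numbers below are illustrative values chosen for this sketch, not necessarily the textbook's data, and `sample_covariance` is a hypothetical helper that uses the n - 1 denominator.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum of cross-deviations divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

# Illustrative values only: same X and same Y values, different Y order
x = [2, 6, 7]
y_set_1 = [13, 20, 27]  # Y rises with X
y_set_2 = [27, 20, 13]  # Y falls as X rises
y_set_3 = [20, 27, 13]  # no particular pattern

cov_1 = sample_covariance(x, y_set_1)
cov_2 = sample_covariance(x, y_set_2)
cov_3 = sample_covariance(x, y_set_3)

print(cov_1, cov_2, cov_3)
```

The sign of each result gives the direction of the relationship; the magnitude depends on the units of X and Y, which makes it hard to judge on its own.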
4.53
Covariance… (Generally speaking)
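When two variables move in the same direction (both increase or both
decrease), the covariance will be a large positive number.
When two variables move in opposite directions, the covariance will be a
large negative number.
When there is no particular pattern, the covariance will be a small number.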
4.54
Coefficient of Correlation…
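The coefficient of correlation is defined as the covariance
divided by the standard deviations of the variables:
Population coefficient of correlation: $\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$ ($\rho$ is the Greek letter “rho”)
Sample coefficient of correlation: $r = \frac{s_{xy}}{s_x s_y}$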
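                              Population       Sample
Size                          N                n
Mean                          $\mu$            $\bar{x}$
Variance                      $\sigma^2$       $s^2$
Standard Deviation            $\sigma$         $s$
Coefficient of Variation      CV               cv
Covariance                    $\sigma_{xy}$    $s_{xy}$
Coefficient of Correlation    $\rho$           $r$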
4.56
Coefficient of Correlation…
The advantage of the coefficient of correlation over covariance is that
it has a fixed range from -1 to +1, thus:
If the two variables are very strongly positively related, the coefficient
value is close to +1 (strong positive linear relationship).
If the two variables are very strongly negatively related, the coefficient
value is close to -1 (strong negative linear relationship).
4.57
Coefficient of Correlation…
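$\rho$ or $r$ = +1: strong positive linear relationship
$\rho$ or $r$ = 0: no linear relationship
$\rho$ or $r$ = -1: strong negative linear relationship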
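A sketch of the sample coefficient of correlation, computed as the sample covariance divided by the product of the two sample standard deviations. The data are hypothetical and the helper name is illustrative.

```python
import math

def sample_correlation(xs, ys):
    """Sample covariance divided by the product of the sample standard deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return s_xy / (s_x * s_y)

# Hypothetical data
x = [1, 2, 3, 4, 5]
y_rising = [2, 4, 5, 4, 6]          # tends to increase with x
y_falling = [-v for v in y_rising]  # mirror image: tends to decrease
y_on_line = [3, 5, 7, 9, 11]        # exactly y = 1 + 2x

r_rising = sample_correlation(x, y_rising)
r_falling = sample_correlation(x, y_falling)
r_on_line = sample_correlation(x, y_on_line)

print(r_rising, r_falling, r_on_line)
```

Unlike the covariance, these values stay between -1 and +1 whatever the units of the variables, so they can be compared across data sets.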
4.58
Coefficient of Correlation (Application)
Consider Example 4.16, where MBA grade point averages
are compared with GMAT scores. Is the GMAT score a good
predictor of MBA success?
Excel:
Tools > Data Analysis… > Covariance
Tools > Data Analysis… > Correlation
4.59
GMAT & GPA Interpretation…
4.60
Least Squares Method…
Recall, the slope-intercept equation for a line is expressed in these
terms:
y = mx + b
Where:
m is the slope of the line
b is the y-intercept.
4.61
The Least Squares Method…
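…produces a straight line drawn through the points so that
the sum of squared deviations between the points and the
line is minimized. This line is represented by the equation:
$\hat{y} = b_0 + b_1 x$
where $b_0$ is the y-intercept, $b_1$ is the slope, and $\hat{y}$ is the value of y determined by the line.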
4.62
The Least Squares Method…
The coefficients b0 and b1 are given by:
$b_1 = \frac{s_{xy}}{s_x^2}$
$b_0 = \bar{y} - b_1 \bar{x}$
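A minimal sketch of the calculation, assuming the standard least squares formulas $b_1 = s_{xy} / s_x^2$ and $b_0 = \bar{y} - b_1 \bar{x}$. The data are hypothetical points that lie exactly on a line, so the result is easy to verify; the function name is illustrative.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1) for the line y-hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s_x2 = sum((x - x_bar) ** 2 for x in xs) / (n - 1)  # sample variance of x
    b1 = s_xy / s_x2         # slope
    b0 = y_bar - b1 * x_bar  # y-intercept
    return b0, b1

# Hypothetical points lying exactly on y = 1 + 2x
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]

b0, b1 = least_squares_line(x, y)
print(b0, b1)
```

With real data the points will not fall exactly on the line, and b0 and b1 describe the line that minimizes the sum of squared vertical deviations.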
4.63
Least Squares Line…
Example 4.17 :: find the least squares line for the previous
example (i.e. MBA GPA vs. GMAT score).
4.64
Least Squares Line…
4.65