Instructions For Chapter 3 Prepared by Dr. Guru-Gharana: Terminology and Conventions
The three main measures of Central Tendency are the Mean, the Median, and the Mode. The Mean is by far the
most popular and the most widely used measure, followed by the Median.
The Mean
The Mean is also known as the Arithmetic Mean or the Average or the Expected Value (in
Advanced Statistics). Sometimes, the name may be quite different depending on the context. For
example, the Per Capita Income of a country or city is in fact the Mean income.
The (general) formula for the Mean is very simple: it is the sum of all values divided by the total
number of observations. It is written as μ (the Greek letter mu) for the population mean,
or X̄ (X with a bar over it) for the sample mean. Thus for the
sample,

X̄ = (Σ Xi)/n, where the sum runs over i = 1, ..., n.
If you have problems typing X with a bar over it, just type X-bar in your answers or copy it from
my Instructions.
Example 1: Suppose a variable X has 12 values for the twelve months of a year:
2, 5, 6, 8, 9, 10, 12, 13, 10, 7, 5, and 3. Then the sample mean (or the annual average) is:
X̄ = (2+5+6+8+9+10+12+13+10+7+5+3)/12 = 90/12 = 7.5
Note that the mean is not equal to any of the actual values in the data. This is perhaps one of the drawbacks of the Mean.
Another drawback is that the Mean is very sensitive to extreme values (or outliers). For example,
if the value in only the sixth month (somehow) jumped to 100 instead of being 10, then the
overall annual average or the Mean would jump to 15. Thus one extreme value could drive the
average above all the other values in the whole series! The same thing happens with outliers at
the lower end. The average household or per capita income of a large city where a few super
billionaires happen to live could be quite misleading.
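To see this sensitivity concretely, here is a small Python sketch (illustrative only, not part of the original examples) that recomputes the mean of Example 1 after the sixth value jumps from 10 to 100:

```python
# Monthly values from Example 1.
values = [2, 5, 6, 8, 9, 10, 12, 13, 10, 7, 5, 3]

def mean(xs):
    """Sum of all values divided by the number of values."""
    return sum(xs) / len(xs)

print(mean(values))            # 7.5

# Replace only the sixth month's value (10) with the outlier 100.
shocked = values.copy()
shocked[5] = 100
print(mean(shocked))           # 15.0 -- one extreme value doubles the mean
```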
So, the Mean has problems as a measure of the central tendency. Why is it so popular then?
Before you lose total respect for this measure, let me discuss some of the nice properties it does
have.
1. The mean takes into account all the values in the data. Change in any value, other things
remaining the same, will change the mean. That is, the mean does not ignore any value (or
information contained in the data). This is not true of other measures of Central tendency such as
Median and Mode which we will discuss below.
2. The mean is unique. One data set has only one mean. This is not true of Mode.
3. The mean indicates the balancing point or center of gravity. If equal weights were placed at
the distances indicated by the values in the above series, then the whole structure could be exactly
balanced by putting a finger (or other support) under the point corresponding to the Mean, 7.5. So
it is the point which balances the weights on the two sides.
4. A very useful property (used later in advanced statistical applications) is that the sum of
deviations around the mean is exactly zero. Let us try this for the above series.
X       X - X̄
2       -5.5
5       -2.5
6       -1.5
8        0.5
9        1.5
10       2.5
12       4.5
13       5.5
10       2.5
7       -0.5
5       -2.5
3       -4.5
Sum: 90  Sum: 0
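You can verify this property in a couple of lines of Python (a quick check using the Example 1 data):

```python
values = [2, 5, 6, 8, 9, 10, 12, 13, 10, 7, 5, 3]
x_bar = sum(values) / len(values)            # 7.5
deviations = [x - x_bar for x in values]     # -5.5, -2.5, ..., -4.5
print(sum(deviations))                       # 0.0 -- deviations around the mean cancel exactly
```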
Note that the formula given above has to be modified for grouped data. For grouped data apply
the weighted mean formula:

μ = (Σ fi Xi)/(Σ fi), with both sums running over i = 1, ..., n,

where fi is the frequency corresponding to the value Xi and the denominator is the total frequency
(or the total number of observations). (Replace μ by X̄ to obtain the equivalent formula for the
sample.)
Example 2: Suppose we have grouped data as shown in the first two columns below. Then the
third column is calculated as the product of the first two to derive the mean.

X      Frequency   X*Frequency
5      2           10
7      4           28
8      7           56
10     5           50
12     2           24
Total  20          168

Therefore,

X̄ = 168/20 = 8.4
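The same calculation in Python (a minimal sketch of the weighted mean formula, using the Example 2 data):

```python
# Grouped data from Example 2: values and their frequencies.
xs = [5, 7, 8, 10, 12]
freqs = [2, 4, 7, 5, 2]

numerator = sum(f * x for f, x in zip(freqs, xs))   # sum of fi * Xi = 168
total_freq = sum(freqs)                             # sum of fi = 20
x_bar = numerator / total_freq
print(x_bar)                                        # 8.4
```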
If we have grouped data with classes or intervals instead of single X values with frequencies,
then we first find the class midpoints and then the Mean by treating the class midpoints like the
X column above.
For example, the mean for the grouped data given below for 30 items is calculated after finding
the class midpoints first:
Example 3

Class   Class Midpt (Mi)   Frequency   Mi*Frequency
3-7     5                  3           15
7-11    9                  10          90
11-15   13                 12          156
15-19   17                 5           85
Total   N/A                30          346

Therefore,

X̄ = 346/30 = 11.53
Note that in the case of grouped data the deviations from the mean will add up to zero only after
multiplying by the respective frequencies. The above examples are examples of the frequency-weighted
mean. But the weights could be anything else instead of frequency: you just replace
the frequencies by the weights in the above formula to get the weighted average or mean. For
example, the weights could be the relative frequencies or probabilities associated with the individual
values. If the weights are relative frequencies in decimals, or probabilities, then the only new thing
to remember is that the denominator would be exactly 1. (Can you guess why? You guessed it right:
the sum of all relative frequencies or probabilities has to be 1.) Or the weights could be the credit
hours for each course you take, to derive the weighted average score.
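To see why the denominator becomes exactly 1 with probability weights, here is a short sketch (illustrative only) that converts the Example 2 frequencies to relative frequencies and recomputes the mean:

```python
xs = [5, 7, 8, 10, 12]
freqs = [2, 4, 7, 5, 2]

total = sum(freqs)
rel = [f / total for f in freqs]            # relative frequencies: 0.10, 0.20, 0.35, 0.25, 0.10
print(round(sum(rel), 10))                  # 1.0 -- the denominator of the weighted mean

mean_from_probs = sum(p * x for p, x in zip(rel, xs))
print(round(mean_from_probs, 10))           # 8.4 -- same answer as the frequency-weighted mean
```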
Later (in chapters 6 and 7) we will discuss discrete and continuous random variables and
probability distributions. The above formula applies only to discrete variables. For continuous
variables (such as the normal random variable) we need to replace the summation sign by the
integral sign (Calculus! Yikes!). But don't get scared: you will not be asked to use integrals in
this course. You just have to be aware of it.
The Median
Next to the Mean, the Median is also popular in some applications, such as the comparison of
cities based on Median Household Income. But I will be brief on this topic because of its limited
use in most cases (especially in the advanced applications of Statistics).
Interestingly, the Median is indeed the central value in that it divides the whole population (or
sample) into two halves. It is such a value that 50 percent of the values are less than or equal to it
and 50 percent are greater than or equal to it. To calculate the Median, order the series of
values in increasing or decreasing order. If the number of observations is odd, the Median is the
middlemost value (the ((N + 1)/2)-th value). If the number of observations is even, then the
Median is the average of the two middlemost values [the (N/2)-th and the ((N/2) + 1)-th values].
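The rule above translates directly into code; here is a small Python sketch (using 0-based list indexing, so the 1-based formulas shift by one):

```python
def median(xs):
    """Middle value for odd n; average of the two middle values for even n."""
    s = sorted(xs)          # the values must be ordered first
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                     # the ((n + 1)/2)-th value, 1-based
    return (s[mid - 1] + s[mid]) / 2      # average of the (n/2)-th and ((n/2) + 1)-th values

print(median([2, 5, 6, 8, 9, 10, 12, 13, 10, 7, 5, 3]))   # 7.5 (even n: average of 7 and 8)
print(median([1, 3, 100]))                                 # 3 (odd n; the outlier 100 has no effect)
```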
[Figures: a left-skewed distribution and a right-skewed distribution.]
The normal distribution is the most widely known example of a symmetric distribution, as shown
below:
[Figure: a normal curve centered at 0, with Mean = Median = Mode at the center.]
Simply knowing the Central Tendency is not enough. For example, there are some
underdeveloped small countries whose per capita income levels are similar to those of the USA,
but only a few dozen families in those countries own almost everything and the rest of the
country is living in poverty. Similarly in stocks we need to know not only the expected (or
average) return but also the variability of the return, which measures the risk. Looking only at
return and not the risk would be financially disastrous (as also shown by the recent financial
crisis). So, measures of variation or spread are also important.
Among the various measures of Variation, the Variance (or its square root: the Standard
Deviation) is by far the most widely used measure in most applications. And it is based on the
Mean! So I will mention other measures only briefly and focus more on the Variance and its
derivatives.
The Range is simply the difference between the largest and the smallest values. It gives some
idea of the spread but only considers the two extreme values and ignores the rest: a highly
undesirable property.
To understand the Interquartile Range we need to know percentiles and quartiles, which are
straightforward and need no further explanation here. The Interquartile Range estimates the
interval which contains the middle 50 percent of the values. There is very little use for such a
measure, which ignores most of the information contained in the data. So I will now move to the
most important measure of Variation.
The sample variance is s² = Σ(Xi - X̄)²/(n - 1).
The reason we divide by n - 1 in the case of samples instead of n (the number of observations)
is to obtain an unbiased estimator of the Variance. We lose one degree of freedom in the
calculation of the sample variance because we have to use the data once to calculate the Mean
before we can calculate the variance; as if using the data depreciates it! You don't need to fully
understand this statistical jargon. Simply learn the rule that in the sample variance formula
we divide by n - 1 instead of n.
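In code, the rule looks like this (a minimal sketch; note the n - 1 in the denominator):

```python
def sample_variance(xs):
    """s^2 = sum of (x - x_bar)^2, divided by n - 1 (not n)."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)

values = [2, 5, 6, 8, 9, 10, 12, 13, 10, 7, 5, 3]
print(round(sample_variance(values), 2))   # 11.91, i.e. 131/11
```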
Let us calculate the variance for Example 1, for which we have already calculated the Mean.
Xi     Xi - X̄            (Xi - X̄)²
2      2 - 7.5 = -5.5    30.25
5      5 - 7.5 = -2.5    6.25
6      6 - 7.5 = -1.5    2.25
8      8 - 7.5 = 0.5     0.25
9      9 - 7.5 = 1.5     2.25
10     10 - 7.5 = 2.5    6.25
12     12 - 7.5 = 4.5    20.25
13     13 - 7.5 = 5.5    30.25
10     10 - 7.5 = 2.5    6.25
7      7 - 7.5 = -0.5    0.25
5      5 - 7.5 = -2.5    6.25
3      3 - 7.5 = -4.5    20.25
90     0                 131

Therefore, s² = Σ(Xi - X̄)²/(n - 1) = 131/(12 - 1) = 11.91
Let us calculate the sample variance for the grouped data in Example 2 above:

X̄ = 168/20 = 8.4

X      Freq (f)   X*f    X - X̄    (X - X̄)²   f(X - X̄)²
5      2          10     -3.4     11.56       23.12
7      4          28     -1.4     1.96        7.84
8      7          56     -0.4     0.16        1.12
10     5          50     1.6      2.56        12.80
12     2          24     3.6      12.96       25.92
Total  20         168                         70.80

s² = Σ f(X - X̄)²/(n - 1) = 70.8/(20 - 1) = 3.726
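The same grouped-data variance in Python (a quick sketch mirroring the table for Example 2):

```python
# Grouped data from Example 2.
xs = [5, 7, 8, 10, 12]
freqs = [2, 4, 7, 5, 2]

n = sum(freqs)                                               # total frequency = 20
x_bar = sum(f * x for f, x in zip(freqs, xs)) / n            # 8.4
ss = sum(f * (x - x_bar) ** 2 for f, x in zip(freqs, xs))    # frequency-weighted squared deviations = 70.8
s2 = ss / (n - 1)
print(round(s2, 3))                                          # 3.726
```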
If you have grouped data with classes, then first find the class midpoints and then the Mean and
then the variance by treating the class midpoints like the X column above, and use the formula
given above. This is done for Example 3 below.
Class   Class Midpt (Mi)   Freq   Mi*Freq   Mi - X̄   (Mi - X̄)²   fi(Mi - X̄)²
3-7     5                  3      15        -6.53     42.64        127.92
7-11    9                  10     90        -2.53     6.40         64.00
11-15   13                 12     156       1.47      2.16         25.92
15-19   17                 5      85        5.47      29.92        149.60
Total   N/A                30     346                              367.44

X̄ = 346/30 = 11.53
s² = 367.44/(30 - 1) = 12.67
Summary of the measures for the data of Example 1 (variable X):

Number of observations (n)     12
Mean                           7.50
Sample Variance                11.91
Standard Deviation             3.45
Minimum                        2
Maximum                        13
Range                          11
Sum                            90.00
Sum of Squared Deviations      131.00
This gives all the important measures we have discussed so far. I want you to also learn it the
hard way (that is, using a calculator and the formulas)!
However, for grouped data this is a little tricky. You have to first convert the grouped data into data
of individual values and then follow the above procedure. For example, in the case of the grouped
data of the second example you would type the value X = 5 two times (for frequency 2)
and the value 7 four times, and so on. I find using a calculator and the formula easier when the data is
grouped in frequencies.
For a normal distribution, knowing only two parameters (the mean and the standard deviation or
variance) is sufficient to derive the whole distribution. For any normal distribution with mean μ and
standard deviation σ (or variance σ²) the following is approximately true:

i. The interval around the mean between μ - σ and μ + σ contains about 68.27 percent
(more than two-thirds) of all the values. That is, one standard deviation around the mean
includes 68.27 percent of the values.
ii. Two standard deviations around the mean include 95.45 percent of the values, and
iii. Three standard deviations around the mean include 99.73 percent (that is, nearly all) of the
values.

The three intervals above are also called the corresponding Confidence or Tolerance intervals.
For example, the 95.45% confidence interval would be constructed by subtracting from and adding to
the mean two times the standard deviation. As an illustration, if the mean is 40 and the standard
deviation is estimated to be 3, then the range 40 ± 3 is expected to contain 68.27 percent of the
items. In other words, 68.27 percent of items are expected to fall between 37 and 43. Similarly,
95.45 percent of items are expected to fall between 34 and 46, and so on. These simple facts are
also called the Empirical rules and can be safely applied to most distributions which may not be
exactly normal but are not very skewed.
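The illustration with mean 40 and standard deviation 3 can be generated mechanically (a small sketch of the three empirical-rule intervals):

```python
mu, sigma = 40, 3   # the mean and standard deviation from the illustration above

for k, pct in [(1, 68.27), (2, 95.45), (3, 99.73)]:
    lo, hi = mu - k * sigma, mu + k * sigma
    print(f"about {pct}% of items expected between {lo} and {hi}")
# about 68.27% of items expected between 37 and 43
# about 95.45% of items expected between 34 and 46
# about 99.73% of items expected between 31 and 49
```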
Let X be a random variable with expected value μ and finite variance σ². Then for any real
number k > 0, Chebyshev's inequality guarantees that Pr(|X - μ| ≥ kσ) ≤ 1/k².
In simple words, for any distribution, the proportion of the values that lie within k
standard deviations of the mean is at least 1 - 1/k², where k is any constant greater than 1. Only
the case k > 1 provides useful information (when k ≤ 1 the right-hand side is greater than or
equal to one, so the inequality gives no information, because a probability cannot be greater than 1). As
an illustration, using k = 2 shows that at least 75% of the values lie in the interval (μ - 2σ, μ +
2σ). The empirical rule says around 95.45 percent lie in this interval, which is a more precise
statement than Chebyshev's rule.
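A one-function sketch of the Chebyshev bound (only informative for k > 1):

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound 1 - 1/k^2 is not informative for k <= 1")
    return 1 - 1 / k ** 2

print(chebyshev_lower_bound(2))            # 0.75 -- at least 75% within 2 standard deviations
print(round(chebyshev_lower_bound(3), 4))  # 0.8889 -- at least ~88.9% within 3
```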
Similarly, if we knew (or could safely assume) that the distribution is normal, then we could say
(by looking at the Normal Table, as explained in a later chapter) that at least 75% of items lie in
the interval (μ - 1.16σ, μ + 1.16σ). This is a comparatively much tighter (or more precise) interval than
what Chebyshev's rule can give. Therefore, the Empirical rule based on the Normal Distribution
(discussed above) is much more popular than Chebyshev's rule.
The Z-Score: Subtracting the mean from a given value of a random variable and dividing by
the standard deviation is also called standardization, and the resulting value is called the Z-score.
If we do this for a normal random variable we obtain the Standard Normal distribution or the
Z-Distribution. The Z-scores tell us how far a given value is from the mean (to the left or right) in
multiples of the standard deviation.
Example: If the mean is 40 and the standard deviation is 4, then the Z-score of 45 is (45 - 40)/4
= 1.25, and the Z-score of 38 is (38 - 40)/4 = -0.5
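The Z-score calculation from the example, as a tiny Python function:

```python
def z_score(x, mu, sigma):
    """How many standard deviations x lies from the mean (negative means below it)."""
    return (x - mu) / sigma

print(z_score(45, 40, 4))   # 1.25
print(z_score(38, 40, 4))   # -0.5
```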
Coefficient of Variation in percentage = (Standard Deviation/Mean)*100
Example: If the standard deviation is 10 and the mean is 40, then the Coefficient of Variation is
(10/40)*100 = 25%. If a second distribution has a standard deviation of 50 and a mean equal to 400,
then its Coefficient of Variation is (50/400)*100 = 12.5%.
Thus the second distribution has less relative variability compared to the first
distribution notwithstanding its larger standard deviation.
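The comparison above in code (a minimal sketch of the Coefficient of Variation):

```python
def coefficient_of_variation(std_dev, mean):
    """Relative variability in percent: (standard deviation / mean) * 100."""
    return (std_dev / mean) * 100

print(coefficient_of_variation(10, 40))    # 25.0
print(coefficient_of_variation(50, 400))   # 12.5
```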