Brief History of Statistics
Statistics - first applied to the political science concerned with the facts of a state or community
XVIII; all derived immediately from German statistisch adj., statistik sb.; whence statistician XIX.
Statistics is concerned with exploring, summarising, and making inferences about the state of
complex systems. As summarised in Table 1.1, the development of statistics in Europe was
strongly motivated by the need to make sense of the large amount of data collected by
population surveys in the emerging nation states. At the same time, the mathematical
foundations for statistics advanced significantly due to breakthroughs in probability theory
inspired by games of chance (gambling). For more information about the history of statistics
refer to the books by Johnson and Kotz (1998) and Kotz and Johnson (1993).
Table 1.1: Summary of some key events in the development of statistics in Europe. For more historical
details, refer to Johnson and Kotz (1998).
Year Event Person
1532 First weekly data on deaths in London Sir W. Petty
1900 Development of the χ² test K. Pearson
1901 Publ. of the first issue of Biometrika F. Galton et al.
1903 Development of Principal Component Analysis K. Pearson
1908 Publ. of The probable error of a mean ``Student''
1910 Publ. of An introduction to the theory of statistics G.U. Yule
1933 Publ. of On the empirical determination of a distribution A.N. Kolmogorov
1935 Publ. of The Design of Experiments R.A. Fisher
1936 Publ. of Relations between two sets of variables H. Hotelling
1972 Publ. of Regression models and life tables D.R. Cox
(1) Statistics helps in providing a better understanding and exact description of a phenomenon
of nature.
(2) Statistics helps in proper and efficient planning of a statistical inquiry in any field of study.
(4) Statistics helps in presenting complex data in a suitable tabular, diagrammatic and graphic
form for an easy and clear comprehension of the data.
(5) Statistics helps in understanding the nature and pattern of variability of a phenomenon
through quantitative observations.
(6) Statistics helps in drawing valid inferences, along with a measure of their reliability, about the
population parameters from the sample data.
Definitions of Basic Statistical Terms
The Three Ms
Mean
The average result: the sum of all the values divided by the number of values.
Example:
Heights of five people: 5 feet 6 inches, 5 feet 7 inches, 5 feet 10 inches, 5 feet 8 inches, 5 feet 8
inches.
The sum is: 339 inches.
Divide 339 by 5 people = 67.8 inches or 5 feet 7.8 inches.
The mean (average) is 5 feet 7.8 inches.
Median
The score that divides the results in half - the middle value.
Examples:
Odd amount of numbers: Find the median of 5 feet 6 inches, 5 feet 7 inches, 5 feet 10 inches, 5
feet 8 inches, 5 feet 8 inches.
Line up your numbers from smallest to largest: 5 feet 6 inches, 5 feet 7 inches, 5 feet 8 inches, 5
feet 8 inches, 5 feet 10 inches.
The median is: 5 feet 8 inches (the number in the middle).
Even amount of numbers: Find the median of 7, 2, 43, 16, 11, 5
Line up your numbers in order: 2, 5, 7, 11, 16, 43
Add the 2 middle numbers and divide by 2: 7 + 11 = 18, and 18 ÷ 2 = 9
The median is 9.
Mode
The most common result (the most frequent value) of a test, survey, or
experiment.
Example:
Find the mode of 5 feet 6 inches, 5 feet 7 inches, 5 feet 10 inches, 5 feet 8 inches, 5 feet 8
inches.
Put the numbers in order to make it easier to visualize: 5 feet 6 inches, 5 feet 7 inches, 5 feet 8
inches, 5 feet 8 inches, 5 feet 10 inches.
The mode is 5 feet 8 inches (it occurs the most - two times).
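The three worked examples above can be checked with a short Python sketch. It assumes the heights are first converted to inches (66, 67, 70, 68, 68) and uses the standard library's statistics module; the helper feet_inches is a hypothetical name introduced only for display.

```python
from statistics import mean, median, mode

# Heights from the examples above, converted to inches (1 foot = 12 inches).
heights_in = [66, 67, 70, 68, 68]

avg = mean(heights_in)      # sum of the values divided by how many there are
mid = median(heights_in)    # middle value after sorting
common = mode(heights_in)   # most frequent value

# The even-sized list from the median example: the two middle values are averaged.
even_median = median([7, 2, 43, 16, 11, 5])


def feet_inches(total_inches):
    """Format a height given in inches as feet and inches (illustrative helper)."""
    feet, inches = divmod(total_inches, 12)
    return f"{int(feet)} feet {inches:g} inches"


print("mean:  ", feet_inches(avg))
print("median:", feet_inches(mid))
print("mode:  ", feet_inches(common))
print("median of the even-sized list:", even_median)
```

Working in a single unit (inches) keeps the arithmetic simple; the conversion back to feet and inches is only for presentation.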
Significant Difference
Significance
The measure of whether the results of research were due to chance. The more
statistical significance assigned to an observation, the less likely the observation
occurred by chance.
p-value
The way in which significance is reported statistically (i.e. p<.01 means that, if
there were no real effect, results at least as extreme as those observed would occur
less than 1% of the time through random chance alone).
Note that in general p-values need to be fairly low (.01 and .05 are common thresholds) in
order for a study to make any strong claims based on the results.
Example:
A study had one group of students (Group A) study using notes they took in
class; the other group (Group B) studied using notes they took after class
using a recording of the lecture. Students in Group A scored higher on a test
than Group B. The study reports a significance of p<.01 for the results.
This means that if the way of taking notes truly made no difference, a gap in
test scores at least this large would arise by chance alone (for example, from
stronger students happening to end up in Group A) less than 1% of the time.
It does not by itself explain why students who took notes in class did better.
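One way to see where such a p-value can come from is a permutation test, sketched below in Python. This is only an illustration: the study described above does not say which test it used, and the scores, group sizes, and function names here are hypothetical. The idea is to ask how often a random relabelling of the students produces a difference in group means at least as large as the one observed.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical test scores, invented for illustration only.
group_a = [88, 92, 85, 91, 90]   # notes taken in class
group_b = [78, 82, 80, 75, 84]   # notes taken from a recording


def mean_diff(a, b):
    """Difference between the two group means, kept exact with fractions."""
    return Fraction(sum(a), len(a)) - Fraction(sum(b), len(b))


def permutation_p_value(a, b):
    """One-sided exact permutation p-value for mean(a) - mean(b)."""
    pooled = a + b
    observed = mean_diff(a, b)
    at_least_as_large = 0
    total = 0
    # Try every way of relabelling len(a) of the pooled scores as "group A".
    for idx in combinations(range(len(pooled)), len(a)):
        chosen = set(idx)
        new_a = [pooled[i] for i in chosen]
        new_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if mean_diff(new_a, new_b) >= observed:
            at_least_as_large += 1
    return Fraction(at_least_as_large, total)


p_value = permutation_p_value(group_a, group_b)
print(f"p = {float(p_value):.4f}")
print("significant at the .01 level:", p_value < Fraction(1, 100))
```

With these invented scores only the original labelling reaches the observed difference, so the p-value is 1 out of the 252 possible relabellings.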
Correlation
The degree to which two factors appear to be related. Correlation should not be
confused with causation. Just because two factors are reported as being correlated,
you cannot say that one factor causes the other. For example, you might find a
correlation between going to the library at least 40 times per semester and getting
high scores on tests. However, you cannot say from these findings what about
going to the library, or what about people who go to libraries often, is responsible
for higher test scores.
r-value
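The r-value most often reported is the Pearson correlation coefficient, which ranges from -1 (perfect negative linear relationship) through 0 (no linear relationship) to +1 (perfect positive linear relationship). The sketch below computes it from its definition; the library-visit and test-score figures are hypothetical and only illustrate the calculation.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of paired deviations from the two means.
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / sqrt(ss_x * ss_y)


# Hypothetical data: library visits per semester and test scores.
visits = [10, 20, 30, 40, 50]
scores = [65, 70, 72, 80, 88]

r = pearson_r(visits, scores)
print(f"r = {r:.3f}")
```

A value of r close to +1 here would still say nothing about whether library visits cause higher scores.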
Variable - property of an object or event that can take on different values. For example, college
major is a variable that takes on values like mathematics, computer science, English,
psychology, etc.
Graphs - visual display of data used to present frequency distributions so that the shape of the
distribution can easily be seen.
o Bar graph - a form of graph that uses bars separated by an arbitrary amount of
space to represent how often elements within a category occur. The higher the
bar, the higher the frequency of occurrence. The underlying measurement scale
is discrete (nominal or ordinal-scale data), not continuous.
o Histogram - a form of a bar graph used with interval or ratio-scaled data. Unlike
the bar graph, bars in a histogram touch with the width of the bars defined by
the upper and lower limits of the interval. The measurement scale is continuous,
so the lower limit of any one interval is also the upper limit of the previous
interval.
o Boxplot - a graphical representation of dispersions and extreme
scores. Represented in this graphic are minimum, maximum, and quartile scores
in the form of a box with "whiskers." The box includes the range of scores falling
into the middle 50% of the distribution (Inter Quartile Range = 75th percentile -
25th percentile) and the whiskers are lines extended to the minimum and
maximum scores in the distribution or to mathematically defined (+/-1.5*IQR)
upper and lower fences.
o Scatterplot - a form of graph that presents information from a bivariate
distribution. In a scatterplot, each subject in an experimental study is
represented by a single point in two-dimensional space. The underlying scale of
measurement for both variables is continuous (measurement data). This is one
of the most useful techniques for gaining insight into the relationship between
two variables.
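The quantities a boxplot displays can be computed directly. The sketch below reuses the six numbers from the earlier median example and assumes the "inclusive" quartile method of Python's statistics.quantiles; other quartile conventions exist and give slightly different values.

```python
from statistics import quantiles

scores = [7, 2, 43, 16, 11, 5]

# Quartiles: 25th percentile, median, 75th percentile (inclusive method assumed).
q1, q2, q3 = quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1                      # width of the box (middle 50% of scores)

# Fences at 1.5 * IQR beyond the box, as described above.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme scores that are still inside the fences.
inside = [x for x in scores if lower_fence <= x <= upper_fence]
outliers = [x for x in scores if x < lower_fence or x > upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print("Q1, median, Q3:", q1, q2, q3)
print("IQR:", iqr)
print("fences:", lower_fence, upper_fence)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```

Under these assumptions the score 43 lies beyond the upper fence, so it would be drawn as a separate point rather than as the end of a whisker.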
Measures of Center - Plotting data in a frequency distribution shows the general shape of the
distribution and gives a general sense of how the numbers are bunched. Several statistics can
be used to represent the "center" of the distribution. These statistics are commonly referred to
as measures of central tendency.
o Mode - The mode of a distribution is simply defined as the most frequent or
common score in the distribution. The mode is the point or value of X that
corresponds to the highest point on the distribution. If the highest frequency is
shared by more than one value, the distribution is said to be multimodal. It is
not uncommon to see distributions that are bimodal reflecting peaks in scoring
at two different points in the distribution.
o Median - The median is the score that divides the distribution into halves; half of
the scores are above the median and half are below it when the data are
arranged in numerical order. The median is also referred to as the score at
the 50th percentile in the distribution. The median location of N numbers can be
found by the formula (N + 1) / 2. When N is an odd number, the formula yields an
integer that gives the position of the median in a numerically ordered
distribution. For example, in the distribution of
numbers (3 1 5 4 9 9 8) the median location is (7 + 1) / 2 = 4. When applied to
the ordered distribution (1 3 4 5 8 9 9), the value 5 is the median: three scores
are above 5 and three are below 5. If there were only 6 values (1 3 4 5 8 9), the
median location is (6 + 1) / 2 = 3.5. In this case the median is half-way between
the 3rd and 4th scores (4 and 5), or 4.5.
o Mean - The mean is the most common measure of central tendency and the one
that can be mathematically manipulated. It is defined as the average of a
distribution and is equal to ΣX / N. Simply, the mean is computed by summing all
the scores in the distribution (ΣX) and dividing that sum by the total number of
scores (N). The mean is the balance point in a distribution such that if you
subtract each value in the distribution from the mean and sum all of
these deviation scores, the result will be zero.
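The median-location formula and the balance-point property of the mean can both be checked with a few lines of Python. This is a minimal sketch using the two small distributions from the median example; the function name median_by_location is hypothetical.

```python
def median_by_location(values):
    """Median found through the (N + 1) / 2 location rule."""
    ordered = sorted(values)
    location = (len(ordered) + 1) / 2        # 1-based position of the median
    if location.is_integer():
        return ordered[int(location) - 1]    # odd N: a single middle score
    lower = ordered[int(location) - 1]       # score just below the location
    upper = ordered[int(location)]           # score just above the location
    return (lower + upper) / 2               # even N: half-way between them


print(median_by_location([3, 1, 5, 4, 9, 9, 8]))   # location 4
print(median_by_location([1, 3, 4, 5, 8, 9]))      # location 3.5

# Balance point: deviations from the mean sum to zero.
scores = [1, 3, 4, 5, 8, 9]
mean = sum(scores) / len(scores)
deviations = [mean - x for x in scores]
print(deviations, sum(deviations))
```

The six-value list is used for the deviation check because its mean is a whole number; with other data the sum can differ from zero by a tiny floating-point rounding error.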
Measures of Spread - Although the average value in a distribution is informative about how
scores are centered in the distribution, the mean, median, and mode alone do not show how
widely the scores vary. Measures of variability provide information about the degree to
which individual scores are clustered about or deviate from the average value in a distribution.
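The text does not name specific measures of spread; two common ones are the range and the standard deviation. The sketch below, using two hypothetical sets of scores, shows how distributions with the same mean can differ sharply in variability.

```python
from statistics import mean, pstdev

# Two hypothetical sets of scores with the same mean but different spread.
tight = [48, 49, 50, 51, 52]
wide = [10, 30, 50, 70, 90]

for name, scores in (("tight", tight), ("wide", wide)):
    spread_range = max(scores) - min(scores)   # largest minus smallest score
    spread_sd = pstdev(scores)                 # population standard deviation
    print(f"{name}: mean={mean(scores)} range={spread_range} sd={spread_sd:.2f}")
```

Here pstdev treats each list as a complete population; statistics.stdev would be the choice if the scores were a sample from a larger group.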
Measures of Shape - For distributions summarizing data from continuous measurement scales,
statistics can be used to describe how the distribution rises and drops.
o Symmetric - Distributions that have the same shape on both sides of the center
are called symmetric. The normal distribution is a familiar example of a
symmetric distribution with only one peak.
o Skewness - Refers to the degree of asymmetry in a distribution. Asymmetry
often reflects extreme scores in a distribution.
Positively skewed - A distribution is positively skewed when it has a tail
extending out to the right (larger numbers). When a distribution is
positively skewed, the mean is greater than the median reflecting the fact
that the mean is sensitive to each score in the distribution and is subject
to large shifts when the sample is small and contains extreme scores.
Negatively skewed - A negatively skewed distribution has an extended tail
pointing to the left (smaller numbers) and reflects bunching of numbers
in the upper part of the distribution with fewer scores at the lower end of
the measurement scale.
o Kurtosis - Like skewness, kurtosis has a specific mathematical definition, but
generally it refers to how scores are concentrated in the center of the
distribution, the upper and lower tails (ends), and the shoulders (between the
center and tails) of a distribution.
Mesokurtic - A normal distribution is called mesokurtic. The tails of a
mesokurtic distribution are neither too thin nor too thick, and there are
neither too many nor too few scores in the center of the distribution.
Platykurtic - Starting with a mesokurtic distribution and moving scores
from both the center and tails into the shoulders, the distribution flattens
out and is referred to as platykurtic.
Leptokurtic - If you move scores from shoulders of a mesokurtic
distribution into the center and tails of a distribution, the result is a
peaked distribution with thick tails. This shape is referred to as
leptokurtic.
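Skewness and kurtosis can be put into numbers using moments about the mean. The sketch below uses one common convention (moment-based skewness, and excess kurtosis, which is 0 for a normal distribution); other textbooks and software apply small-sample corrections, and the data sets are hypothetical.

```python
def central_moment(values, k):
    """Average of the k-th powers of deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** k for x in values) / n


def skewness(values):
    """Positive for a right tail, negative for a left tail, 0 if symmetric."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """About 0 for mesokurtic, negative for platykurtic, positive for leptokurtic."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


symmetric = [1, 2, 3, 4, 5]
right_tail = [1, 2, 2, 3, 10]                  # one large extreme score
left_tail = [1, 8, 9, 9, 10]                   # one small extreme score
peaked = [-10, 0, 0, 0, 0, 0, 0, 0, 10]        # heavy center and tails

print("skewness:", skewness(symmetric), skewness(right_tail), skewness(left_tail))
print("excess kurtosis:", excess_kurtosis(symmetric), excess_kurtosis(peaked))
```

With so few values these figures are only illustrative; in practice the shape statistics are computed on much larger samples.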
State how the potential cause could have resulted in the described problem.
Establish what type of data can most easily prove or disprove the potential cause.
Develop a plan on how the study will be conducted. Identify the actions on an action
plan.
Prepare the required materials to conduct the study. Training may also be required.
Analyze the data. Use simple statistical tools emphasizing graphical illustrations of the
data.
State conclusions. Outline conclusions from the study. Does the data establish the
potential cause as being the reason for the problem?
The second level of measurement is the ordinal level of measurement. This level of
measurement depicts some ordered relationship among the variable's observations. Suppose a
student scores the highest grade of 100 in the class. In this case, he would be assigned the first
rank. Then, another classmate scores the second highest grade of 92; she would be assigned
the second rank. A third student scores an 81 and he would be assigned the third rank, and so
on. The ordinal level of measurement indicates an ordering of the measurements.
The third level of measurement is the interval level of measurement. The interval level
of measurement not only classifies and orders the measurements, but it also specifies that the
distances between each interval on the scale are equivalent along the scale from low interval to
high interval. For example, on an interval-level measure of anxiety, the difference
between a student's scores of 10 and 11 is the same as the difference between
scores of 40 and 41. A popular example of this level of measurement
is temperature in centigrade, where, for example, the distance between 94°C and 96°C is the
same as the distance between 100°C and 102°C.
The fourth level of measurement is the ratio level of measurement. In this level of
measurement, the observations, in addition to having equal intervals, have a true zero point:
a value of zero that indicates the absence of the quantity being measured. The true zero
makes this type of measurement unlike the other types of
measurement, although its other properties are similar to those of the interval level of
measurement. In the ratio level of measurement, the divisions between the points on the scale
have an equivalent distance between them.
PRESENTATION OF DATA
FREQUENCY DISTRIBUTION:
Data can be presented in various forms depending on the type of data collected. A frequency
distribution is a table showing how often each value (or set of values) of the variable in
question occurs in a data set. A frequency table is used to summarize categorical or numerical
data. Frequencies are also presented as relative frequencies, that is, the percentage of the total
number in the sample.
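A frequency table with relative frequencies can be built in a few lines. The sketch below assumes a hypothetical sample of ten students' college majors (the variable used as an example earlier) and uses collections.Counter from the standard library.

```python
from collections import Counter

# Hypothetical sample of ten students' majors, invented for illustration.
majors = [
    "psychology", "mathematics", "psychology", "English", "computer science",
    "psychology", "mathematics", "English", "psychology", "mathematics",
]

counts = Counter(majors)            # frequency of each value
total = sum(counts.values())        # number of observations in the sample

# Map each value to (frequency, relative frequency in percent).
table = {
    value: (count, 100 * count / total)
    for value, count in counts.most_common()
}

print(f"{'Major':<18}{'Freq':>5}{'Percent':>9}")
for value, (count, percent) in table.items():
    print(f"{value:<18}{count:>5}{percent:>8.1f}%")
```

The relative frequencies add up to 100%, which is a quick check that no observation was dropped.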
GRAPHICAL METHODS:
Frequency distributions are usually illustrated graphically by plotting various types of
graphs:
This method comprises presenting data with the help of a paragraph or a number of
paragraphs. The official report of an inquiry commission is usually made by textual
presentation.
In 1999, out of a total of five thousand workers of a factory, four thousand and two hundred
were members of a Trade Union. The number of female workers was twenty per cent of the
total workers, out of which thirty per cent were members of the
Trade Union.
In 2000, the number of workers belonging to the trade union increased by twenty per cent
as compared to 1999, of which four thousand and two hundred were male.
The number of workers not belonging to the trade union was nine hundred and fifty, of which four
hundred and fifty were females.
The merit of this mode of presentation lies in its simplicity, and
even a layman can present data by this method.
The observations with exact magnitude can be presented with the help of textual presentation.
Furthermore, this type of presentation can be taken as the first step towards the other
methods of presentation.
Now, let us see, how the same data can be represented using tabular representation.
Status of the workers of the factory on the basis of their trade union membership for 1999 and
2000.
TU, M, F and T stand for trade union, male, female and total respectively.
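The cell values of such a table can be derived from the textual presentation above. The Python sketch below does the arithmetic under two assumed readings of the text: that thirty per cent refers to the female workers in 1999, and that the nine hundred and fifty non-members belong to the year 2000. The function and variable names are hypothetical.

```python
# Figures for 1999 as stated in the text.
total_1999 = 5000
tu_total_1999 = 4200
female_1999 = total_1999 * 20 // 100          # 20% of all workers
female_tu_1999 = female_1999 * 30 // 100      # 30% of the female workers

# Figures for 2000 as stated in the text (assumed reading, see above).
tu_total_2000 = tu_total_1999 * 120 // 100    # 20% more members than in 1999
male_tu_2000 = 4200
non_tu_total_2000 = 950
female_non_tu_2000 = 450


def complete(tu, non_tu):
    """Add totals to (male, female) counts for members and non-members."""
    rows = {
        "TU": tu,
        "Non-TU": non_tu,
        "All": (tu[0] + non_tu[0], tu[1] + non_tu[1]),
    }
    return {name: {"M": m, "F": f, "T": m + f} for name, (m, f) in rows.items()}


male_1999 = total_1999 - female_1999
male_tu_1999 = tu_total_1999 - female_tu_1999
table = {
    1999: complete(
        tu=(male_tu_1999, female_tu_1999),
        non_tu=(male_1999 - male_tu_1999, female_1999 - female_tu_1999),
    ),
    2000: complete(
        tu=(male_tu_2000, tu_total_2000 - male_tu_2000),
        non_tu=(non_tu_total_2000 - female_non_tu_2000, female_non_tu_2000),
    ),
}

print("Year Group      M     F     T")
for year, groups in table.items():
    for name, cells in groups.items():
        print(f"{year} {name:<7}{cells['M']:>6}{cells['F']:>6}{cells['T']:>6}")
```

Every cell follows from a figure stated in the text plus addition or subtraction, which is why the tabular form is more compact than the paragraphs it summarises.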
For example, given below are the weights of 20 students in grade 10:
To find the frequency of in this data, count the number of times that appears in the list. There
are students that have this weight.
Example:
THE MEDIAN
The Median is the 'middle value' in your list. When the number of values in the list is odd, the median is
the middle entry in the list after sorting the list into increasing order. When the number of values
is even, the median is equal to the sum of the two middle numbers (after sorting the list into increasing
order) divided by two. Thus, remember to line up your values: the middle number is
the median! Be sure to remember the odd and even rule.
Examples:
Find the Median of: 8, 3, 44, 17, 12, 6 (Even amount of numbers)
Line up your numbers: 3, 6, 8, 12, 17, 44
Add the 2 middle numbers and divide by 2: 8 + 12 = 20, and 20 ÷ 2 = 10
The Median is 10.
THE MODE
The mode in a list of numbers refers to the number (or numbers) that occur most frequently. A trick to
remember this one is to remember that mode starts with the same first two letters that most
does. Most frequently - Mode. You'll never forget that one!
Examples:
*It is important to note that there can be more than one mode and if no number occurs more
than once in the set, then there is no mode for that set of numbers.
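The note above can be expressed as a small Python function. This is a sketch of the convention stated in the text (no repeated value means no mode, and ties give several modes); the function name modes is hypothetical, and other references treat the no-repeat case differently.

```python
from collections import Counter


def modes(values):
    """Return every value tied for the highest frequency, or [] if none repeats."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []                      # no number occurs more than once
    return [value for value, count in counts.items() if count == highest]


print(modes([66, 67, 70, 68, 68]))     # one mode
print(modes([1, 2, 2, 3, 3]))          # two modes
print(modes([8, 3, 44, 17, 12, 6]))    # no mode
```

Returning a list in every case keeps the three situations (one mode, several modes, no mode) easy to tell apart.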
THE RANGE
Occasionally in Statistics you'll be asked for the 'range' in a set of numbers. The range is simply
the smallest number subtracted from the largest number in your set. Thus, if your set is 9,
3, 44, 15, 6 - The range would be 44-3=41. Your range is 41.
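The same calculation as a one-line check in Python, using the set from the example; the function name value_range is hypothetical.

```python
def value_range(values):
    """Range of a set of numbers: largest value minus smallest value."""
    return max(values) - min(values)


numbers = [9, 3, 44, 15, 6]
print(value_range(numbers))
```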
PROBABILITY
AND
STATISTICS