Brief History of Statistics


The Oxford English etymological dictionary defines statistics as follows:

Statistics - first applied to the political science concerned with the facts of a state or community
XVIII; all derived immediately from German statistisch adj., statistik sb.; whence statistician XIX.

Statistics is concerned with exploring, summarising, and making inferences about the state of
complex systems. As summarised in Table 1.1, the development of statistics in Europe was
strongly motivated by the need to make sense of the large amount of data collected by
population surveys in the emerging nation states. At the same time, the mathematical
foundations for statistics advanced significantly due to breakthroughs in probability theory
inspired by games of chance (gambling). For more information about the history of statistics
refer to the books by Johnson and Kotz (1998) and Kotz and Johnson (1993).

Table 1.1: Summary of some key events in the development of statistics in Europe. For more
historical details, refer to Johnson and Kotz (1998).

Year | Event | Person
1532 | First weekly data on deaths in London | Sir W. Petty
1539 | Start of data collection on baptisms, marriages, and deaths in France |
1608 | Beginning of parish registry in Sweden |
1662 | First published demographic study based on bills of mortality | J. Graunt
1693 | Publ. of An estimate of the degrees of mortality of mankind drawn from curious tables of the births and funerals at the city of Breslaw with an attempt to ascertain the price of annuities upon lives | E. Halley
1713 | Publ. of Ars Conjectandi | J. Bernoulli
1714 | Publ. of Libellus de Ratiociniis in Ludo Aleae | C. Huygens
1714 | Publ. of The Doctrine of Chances | A. De Moivre
1735 | Start of demographic data collection in Norway |
1763 | Publ. of An essay towards solving a problem in the Doctrine of Chances | Rev. Bayes
1809 | Publ. of Theoria Motus Corporum Coelestium | C.F. Gauss
1812 | Publ. of Théorie analytique des probabilités | P.S. Laplace
1834 | Establishment of the Statistical Society of London |
1839 | Establishment of the American Statistical Association (Boston) |
1889 | Publ. of Natural Inheritance | F. Galton
1900 | Development of the χ² test | K. Pearson
1901 | Publ. of the first issue of Biometrika | F. Galton et al.
1903 | Development of Principal Component Analysis | K. Pearson
1908 | Publ. of The probable error of a mean | "Student"
1910 | Publ. of An introduction to the theory of statistics | G.U. Yule
1933 | Publ. of On the empirical determination of a distribution | A.N. Kolmogorov
1935 | Publ. of The Design of Experiments | R.A. Fisher
1936 | Publ. of Relations between two sets of variables | H. Hotelling
1972 | Publ. of Regression models and life tables | D.R. Cox
1972 | Publ. of Generalized linear models | J.A. Nelder and R.W.M. Wedderburn
1979 | Publ. of Bootstrap methods: another look at the jackknife | B. Efron

Functions or Uses of Statistics

(1) Statistics helps in providing a better understanding and exact description of a phenomenon
of nature.

(2) Statistics helps in proper and efficient planning of a statistical inquiry in any field of study.

(3) Statistics helps in collecting appropriate quantitative data.

(4) Statistics helps in presenting complex data in a suitable tabular, diagrammatic and graphic
form for an easy and clear comprehension of the data.

(5) Statistics helps in understanding the nature and pattern of variability of a phenomenon
through quantitative observations.

(6) Statistics helps in drawing valid inferences about the population parameters from the
sample data, along with a measure of their reliability.

Definitions of Basic Statistical Terms

"N" is usually used to indicate the number of subjects in a study. Example:

If you have 76 participants in a study, N=76.

The Three Ms

Mean

The average result of a test, survey, or experiment.

Example:
Heights of five people: 5 feet 6 inches, 5 feet 7 inches, 5 feet 10 inches, 5 feet 8 inches, 5 feet 8
inches.
The sum is: 339 inches.
Divide 339 by 5 people = 67.8 inches or 5 feet 7.8 inches.
The mean (average) is 5 feet 7.8 inches.

Median

The score that divides the results in half - the middle value.

Examples:
Odd amount of numbers: Find the median of 5 feet 6 inches, 5 feet 7 inches, 5 feet 10 inches, 5
feet 8 inches, 5 feet 8 inches.
Line up your numbers from smallest to largest: 5 feet 6 inches, 5 feet 7 inches, 5 feet 8 inches, 5
feet 8 inches, 5 feet 10 inches.
The median is: 5 feet 8 inches (the number in the middle).
Even amount of numbers: Find the median of 7, 2, 43, 16, 11, 5
Line up your numbers in order: 2, 5, 7, 11, 16, 43
Add the 2 middle numbers and divide by 2: (7 + 11) ÷ 2 = 18 ÷ 2 = 9
The median is 9.

Mode

The most common result (the most frequent value) of a test, survey, or
experiment.

Example:
Find the mode of 5 feet 6 inches, 5 feet 7 inches, 5 feet 10 inches, 5 feet 8 inches, 5 feet 8
inches.
Put the numbers in order to make it easier to visualize: 5 feet 6 inches, 5 feet 7 inches, 5 feet 8
inches, 5 feet 8 inches, 5 feet 10 inches.
The mode is 5 feet 8 inches (it occurs the most - two times).
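
The three worked examples above can be checked with Python's standard statistics module. This is only a sketch: the heights are converted to inches first, and the variable names are illustrative.

```python
import statistics

# Heights from the examples above, converted to inches
# (5 ft 6 in = 66 in, and so on).
heights_in = [66, 67, 70, 68, 68]

mean_height = statistics.mean(heights_in)      # 339 / 5
median_height = statistics.median(heights_in)  # middle of the sorted list
mode_height = statistics.mode(heights_in)      # most frequent value

print(mean_height, median_height, mode_height)

# Even number of values: the median is the average of the two middle values.
even_median = statistics.median([7, 2, 43, 16, 11, 5])
print(even_median)
```

A mean of 67.8 inches corresponds to 5 feet 7.8 inches, and a median and mode of 68 inches correspond to 5 feet 8 inches, matching the hand calculations.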

Significant Difference

Significance

The measure of whether the results of research could plausibly be due to chance. The more
statistically significant an observation, the less likely it is that a result this extreme
would arise by chance alone.

p-value

The way in which significance is reported statistically (i.e. p<.01 means that, if there
were no real effect, results at least as extreme as those observed would occur less than
1% of the time by random chance). Note that in general p-values need to be fairly low
(thresholds of .01 and .05 are common) in order for a study to make any strong claims
based on the results.

Example:

• A study had one group of students (Group A) study using notes they took in
class; the other group (Group B) studied using notes they took after class
using a recording of the lecture. Students in Group A scored higher on a test
than Group B. The study reports a significance of p<.01 for the results.
• This means that if the note-taking method made no difference, a gap this large
between the groups would arise by chance alone (for example, from Group A
happening to contain stronger students than Group B) less than 1% of the time.
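
One way to make the meaning of a p-value concrete is a permutation test: if the group labels did not matter, how often would a relabelling of the same scores produce a gap at least as large as the one observed? The sketch below uses made-up scores, since the study above reports no raw data, so every number in it is hypothetical.

```python
from itertools import combinations
from statistics import mean

# Hypothetical test scores, for illustration only.
group_a = [88, 92, 85, 91, 90]
group_b = [78, 82, 80, 85, 79]

observed_gap = mean(group_a) - mean(group_b)
pooled = group_a + group_b

# Enumerate every way of choosing which 5 of the 10 scores are labelled "A".
at_least_as_large = 0
total = 0
for idx in combinations(range(len(pooled)), len(group_a)):
    chosen = set(idx)
    a = [pooled[i] for i in chosen]
    b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
    total += 1
    if mean(a) - mean(b) >= observed_gap:
        at_least_as_large += 1

# One-sided p-value: share of relabellings with a gap this large or larger.
p_value = at_least_as_large / total
print(f"observed gap = {observed_gap:.1f}, p = {p_value:.4f}")
```

With these invented scores only a handful of the 252 possible relabellings reach the observed gap, which is what a small p-value expresses.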

Correlation

The degree to which two factors appear to be related. Correlation should not be
confused with causation. Just because two factors are reported as being correlated,
you cannot say that one factor causes the other. For example, you might find a
correlation between going to the library at least 40 times per semester and getting
high scores on tests. However, you cannot say from these findings what about
going to the library, or what about people who go to libraries often, is responsible
for higher test scores.

r-value

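The way in which correlation is reported statistically (a number between -1 and
+1). As a rough rule of thumb, r-values beyond about +/-.3 are treated as indicating
a meaningful correlation; whether a correlation is statistically significant also
depends on the sample size.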

• An r-value of -1 indicates a perfect negative correlation between two
variables - as one variable's value increases, the other variable's
value decreases.
• An r-value of +1 indicates a perfect positive correlation between two
variables - as one variable's value increases, the other variable's
value also increases.
• An r-value of 0 means there is no linear correlation between the elements
being studied.
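
As a rough illustration of how an r-value is obtained, the sketch below computes the Pearson correlation coefficient from its definition. The library-visit and test-score figures are hypothetical, chosen only to echo the example above.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: library visits per semester and test scores.
visits = [5, 12, 20, 33, 41, 48]
scores = [61, 70, 68, 80, 85, 90]

print(round(pearson_r(visits, scores), 3))
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # rise together perfectly
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # one rises as the other falls
```

Even a value close to +1 here would say nothing about whether library visits cause higher scores.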

Statistics - a set of concepts, rules, and procedures that help us to:

o organize numerical information in the form of tables, graphs, and charts;
o understand statistical techniques underlying decisions that affect our lives and
well-being; and
o make informed decisions.

Data - facts, observations, and information that come from investigations.

o Measurement data, sometimes called quantitative data -- the result of using
some instrument to measure something (e.g., test score, weight);
o Categorical data, also referred to as frequency or qualitative data -- things are
grouped according to some common property or properties and the number of
members of each group is recorded (e.g., males/females, vehicle type).

Variable - property of an object or event that can take on different values. For example, college
major is a variable that takes on values like mathematics, computer science, English,
psychology, etc.

o Discrete Variable - a variable with a limited number of values (e.g., gender
(male/female), college class (freshman/sophomore/junior/senior)).
o Continuous Variable - a variable that can take on many different values, in
theory, any value between the lowest and highest points on the measurement
scale.
o Independent Variable - a variable that is manipulated, measured, or selected by
the researcher as an antecedent condition to an observed behavior. In a
hypothesized cause-and-effect relationship, the independent variable is the
cause and the dependent variable is the outcome or effect.
o Dependent Variable - a variable that is not under the experimenter's control --
the data. It is the variable that is observed and measured in response to the
independent variable.
o Qualitative Variable - a variable based on categorical data.
o Quantitative Variable - a variable based on quantitative data.

Graphs - visual display of data used to present frequency distributions so that the shape of the
distribution can easily be seen.

o Bar graph - a form of graph that uses bars separated by an arbitrary amount of
space to represent how often elements within a category occur. The higher the
bar, the higher the frequency of occurrence. The underlying measurement scale
is discrete (nominal or ordinal-scale data), not continuous.
o Histogram - a form of a bar graph used with interval or ratio-scaled data. Unlike
the bar graph, bars in a histogram touch with the width of the bars defined by
the upper and lower limits of the interval. The measurement scale is continuous,
so the lower limit of any one interval is also the upper limit of the previous
interval.
o Boxplot - a graphical representation of dispersion and extreme
scores. Represented in this graphic are minimum, maximum, and quartile scores
in the form of a box with "whiskers." The box includes the range of scores falling
into the middle 50% of the distribution (interquartile range = 75th percentile -
25th percentile), and the whiskers are lines extended to the minimum and
maximum scores in the distribution or to mathematically defined lower and upper
fences (25th percentile - 1.5*IQR and 75th percentile + 1.5*IQR).
o Scatterplot - a form of graph that presents information from a bivariate
distribution. In a scatterplot, each subject in an experimental study is
represented by a single point in two-dimensional space. The underlying scale of
measurement for both variables is continuous (measurement data). This is one
of the most useful techniques for gaining insight into the relationship between
two variables.
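
The quantities a boxplot draws can be computed directly. The sketch below is one possible approach using the standard library; the scores are hypothetical, and the "inclusive" quartile method is just one of several conventions, so other software may report slightly different cut points.

```python
import statistics

# Hypothetical scores, including one unusually large value.
scores = [52, 55, 58, 60, 61, 63, 64, 66, 70, 95]

# Quartiles (25th, 50th and 75th percentiles).
q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1

# Fences: 1.5 * IQR below the first quartile and above the third.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Scores outside the fences are usually drawn individually as outliers.
outliers = [x for x in scores if x < lower_fence or x > upper_fence]

print(q1, q2, q3, iqr)
print(lower_fence, upper_fence, outliers)
```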

Measures of Center - Plotting data in a frequency distribution shows the general shape of the
distribution and gives a general sense of how the numbers are bunched. Several statistics can
be used to represent the "center" of the distribution. These statistics are commonly referred to
as measures of central tendency.
o Mode - The mode of a distribution is simply defined as the most frequent or
common score in the distribution. The mode is the point or value of X that
corresponds to the highest point on the distribution. If the highest frequency is
shared by more than one value, the distribution is said to be multimodal. It is
not uncommon to see distributions that are bimodal reflecting peaks in scoring
at two different points in the distribution.
o Median - The median is the score that divides the distribution into halves; half of
the scores are above the median and half are below it when the data are
arranged in numerical order. The median is also referred to as the score at
the 50th percentile in the distribution. The median location of N numbers can be
found by the formula (N + 1) / 2. When N is an odd number, the formula yields an
integer that gives the position of the median in the numerically ordered
distribution. For example, in the distribution of numbers (3 1 5 4 9 9 8) the
median location is (7 + 1) / 2 = 4. When applied to the ordered distribution
(1 3 4 5 8 9 9), the value 5 is the median; three scores are above 5 and three
are below 5. If there were only 6 values (1 3 4 5 8 9), the median location is
(6 + 1) / 2 = 3.5. In this case the median is half-way between the 3rd and 4th
scores (4 and 5), or 4.5.
o Mean - The mean is the most common measure of central tendency and the one
that can be mathematically manipulated. It is defined as the average of a
distribution and is equal to ΣX / N. Simply, the mean is computed by summing all
the scores in the distribution (ΣX) and dividing that sum by the total number of
scores (N). The mean is the balance point in a distribution such that if you
subtract each value in the distribution from the mean and sum all of
these deviation scores, the result will be zero.
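
The median-location rule and the balance-point property of the mean can be sketched in a few lines of Python. The function name is illustrative, and the data are the example distributions used above.

```python
def median_by_location(values):
    """Median found via the location formula (N + 1) / 2 (1-based position)."""
    ordered = sorted(values)
    location = (len(ordered) + 1) / 2
    if location.is_integer():
        # Odd N: the location points directly at one score.
        return ordered[int(location) - 1]
    # Even N: the location falls halfway between two neighbouring scores.
    lower = ordered[int(location) - 1]
    upper = ordered[int(location)]
    return (lower + upper) / 2

print(median_by_location([3, 1, 5, 4, 9, 9, 8]))  # location 4 -> 5
print(median_by_location([1, 3, 4, 5, 8, 9]))     # location 3.5 -> 4.5

# The mean as a balance point: deviations from it sum to zero
# (up to floating-point rounding).
scores = [3, 1, 5, 4, 9, 9, 8]
mean = sum(scores) / len(scores)
deviation_sum = sum(x - mean for x in scores)
print(abs(deviation_sum) < 1e-9)
```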

Measures of Spread - Although the average value in a distribution is informative about how
scores are centered in the distribution, the mean, median, and mode lack context for
interpreting those statistics. Measures of variability provide information about the degree to
which individual scores are clustered about or deviate from the average value in a distribution.

o Range - The simplest measure of variability to compute and understand is the
range. The range is the difference between the highest and lowest score in a
distribution. Although it is easy to compute, it is not often used as the sole
measure of variability due to its instability. Because it is based solely on the
most extreme scores in the distribution and does not fully reflect the pattern of
variation within a distribution, the range is a very limited measure of variability.
o Interquartile Range (IQR) - Provides a measure of the spread of the middle 50%
of the scores. The IQR is defined as the 75th percentile - the 25th percentile. The
interquartile range plays an important role in the graphical method known as
the boxplot. The advantage of using the IQR is that it is easy to compute and
extreme scores in the distribution have much less impact on it. Its strength is also
a weakness, however: it suffers as a measure of variability because it discards too
much data. Researchers want to study variability while eliminating scores that
are likely to be accidents. The boxplot allows for this distinction and is an
important tool for exploring data.
o Variance - The variance is a measure based on the deviations of individual scores
from the mean. As noted in the definition of the mean, however, simply
summing the deviations will result in a value of 0. To get around this problem
the variance is based on squared deviations of scores about the mean. When
the deviations are squared, the rank order and relative distance of scores in the
distribution is preserved while negative values are eliminated. Then to control
for the number of subjects in the distribution, the sum of the squared
deviations, Σ(X - X̄)², is divided by N (population) or by N - 1 (sample). The result
is the average of the squared deviations and it is called the variance.
o Standard deviation - The standard deviation (σ or s) is defined as the positive
square root of the variance. The variance is a measure in squared units and is
hard to interpret with respect to the data. The standard deviation, by contrast, is a
measure of variability expressed in the same units as the data. The standard
deviation is very much like a mean or an "average" of these deviations. In a
normal (symmetric and mound-shaped) distribution, about two-thirds of the
scores fall between +1 and -1 standard deviations from the mean and the
standard deviation is approximately 1/4 of the range in small samples (N < 30)
and 1/5 to 1/6 of the range in large samples (N > 100).
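
The range, variance, and standard deviation can be worked through step by step. The sketch below uses the ordered distribution (1 3 4 5 8 9 9) from the median example above and checks the hand calculation against Python's statistics module; the variable names are illustrative.

```python
import statistics

scores = [1, 3, 4, 5, 8, 9, 9]

value_range = max(scores) - min(scores)

mean = sum(scores) / len(scores)
sum_sq_dev = sum((x - mean) ** 2 for x in scores)   # sum of squared deviations

pop_variance = sum_sq_dev / len(scores)             # divide by N
sample_variance = sum_sq_dev / (len(scores) - 1)    # divide by N - 1
sample_sd = sample_variance ** 0.5                  # positive square root

# The standard library gives the same results.
assert abs(pop_variance - statistics.pvariance(scores)) < 1e-9
assert abs(sample_sd - statistics.stdev(scores)) < 1e-9

print(value_range, round(pop_variance, 3), round(sample_variance, 3), round(sample_sd, 3))
```

Dividing by N - 1 instead of N always gives the slightly larger value, and the difference shrinks as N grows.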

Measures of Shape - For distributions summarizing data from continuous measurement scales,
statistics can be used to describe how the distribution rises and drops.

o Symmetric - Distributions that have the same shape on both sides of the center
are called symmetric. The normal distribution is the best-known example of a
symmetric distribution with only one peak.
o Skewness - Refers to the degree of asymmetry in a distribution. Asymmetry
often reflects extreme scores in a distribution.
• Positively skewed - A distribution is positively skewed when it has a tail
extending out to the right (larger numbers). When a distribution is
positively skewed, the mean is greater than the median, reflecting the fact
that the mean is sensitive to each score in the distribution and is subject
to large shifts when the sample is small and contains extreme scores.
• Negatively skewed - A negatively skewed distribution has an extended tail
pointing to the left (smaller numbers) and reflects bunching of numbers
in the upper part of the distribution with fewer scores at the lower end of
the measurement scale.
o Kurtosis - Like skewness, kurtosis has a specific mathematical definition, but
generally it refers to how scores are concentrated in the center of the
distribution, the upper and lower tails (ends), and the shoulders (between the
center and tails) of a distribution.
• Mesokurtic - A normal distribution is called mesokurtic. The tails of a
mesokurtic distribution are neither too thin nor too thick, and there are
neither too many nor too few scores in the center of the distribution.
• Platykurtic - Starting with a mesokurtic distribution and moving scores
from both the center and tails into the shoulders, the distribution flattens
out and is referred to as platykurtic.
• Leptokurtic - If you move scores from the shoulders of a mesokurtic
distribution into the center and tails of a distribution, the result is a
peaked distribution with thick tails. This shape is referred to as
leptokurtic.
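
The "specific mathematical definition" mentioned above is usually given in terms of central moments. The sketch below uses the simple moment-based formulas, which is an assumption: statistical packages often apply small-sample corrections and may report slightly different values. The data are hypothetical.

```python
def central_moment(values, k):
    """k-th central moment: the average of (x - mean) ** k."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)

def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5 (positive for a long right tail)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores about 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

# Hypothetical scores: one set with a long right tail, one symmetric and flat.
right_tailed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]
symmetric = [1, 2, 3, 4, 5, 6, 7]

print(skewness(right_tailed) > 0)        # positively skewed
print(abs(skewness(symmetric)) < 1e-9)   # no skew
print(excess_kurtosis(symmetric) < 0)    # flatter than normal (platykurtic)
```

In the right-tailed set the mean (4.7) exceeds the median (3), as described for positively skewed distributions.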

Six Steps Of Investigation

• State how the potential cause could have resulted in the described problem.

• Establish what type of data can most easily prove or disprove the potential cause.
Develop a plan on how the study will be conducted. Identify the actions on an action
plan.

• Prepare the required materials to conduct the study. Training may also be required.

• Collect the required data.

• Analyze the data. Use simple statistical tools emphasizing graphical illustrations of the
data.

• State conclusions. Outline conclusions from the study. Does the data establish the
potential cause as being the reason for the problem?

Data Levels of Measurement


A variable has one of four different levels of measurement: Nominal, Ordinal, Interval, or
Ratio. (Interval and Ratio levels of measurement are sometimes called Continuous or Scale). It
is important for the researcher to understand the different levels of measurement, as these
levels of measurement, together with how the research question is phrased, dictate what
statistical analysis is appropriate.

In ascending order of precision, the four different levels of measurement are:

• Nominal--Latin for name only (Republican, Democrat, Green, Libertarian)
• Ordinal--Think ordered levels or ranks (small--8oz, medium--12oz, large--32oz)
• Interval--Equal intervals among levels (1 dollar to 2 dollars is the same interval as 88 dollars to
89 dollars)
• Ratio--Let the "o" in ratio remind you of a zero in the scale (Day 0, day 1, day 2, day 3, ...)

The first level of measurement is the nominal level of measurement. In this level of
measurement, the numbers in the variable are used only to classify the data. Words, letters,
and alpha-numeric symbols can be used as well. Suppose there are data about people belonging
to three different gender categories. In this case, a person belonging to the female gender
could be classified as F, a person belonging to the male gender could be classified as M, and
a transgender person classified as T. This type of classification is the nominal level of
measurement.

The second level of measurement is the ordinal level of measurement. This level of
measurement depicts some ordered relationship among the variable's observations. Suppose a
student scores the highest grade of 100 in the class. In this case, he would be assigned the first
rank. Then, another classmate scores the second highest grade of 92; she would be assigned
the second rank. A third student scores an 81 and he would be assigned the third rank, and so
on. The ordinal level of measurement indicates an ordering of the measurements.

The third level of measurement is the interval level of measurement. The interval level
of measurement not only classifies and orders the measurements, but it also specifies that the
distances between each interval on the scale are equivalent along the scale from low interval to
high interval. For example, on an interval-level measure of anxiety, the interval between
scores of 10 and 11 is the same as the interval between scores of 40 and 41. A popular
example of this level of measurement is temperature in centigrade, where, for example, the
distance between 94°C and 96°C is the same as the distance between 100°C and 102°C.

The fourth level of measurement is the ratio level of measurement. In this level of
measurement, the observations, in addition to having equal intervals, have a true zero point:
a value of zero means that none of the quantity is present. This true zero sets the ratio level
apart from the other levels and makes ratios between values meaningful, although its other
properties are the same as those of the interval level of measurement: the divisions between
the points on the scale have an equivalent distance between them.
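
The practical difference between the interval and ratio levels shows up when values are divided. The short sketch below is only an illustration: the temperatures and weights are made-up numbers, and the Kelvin conversion is used simply because Kelvin is a temperature scale with a true zero.

```python
# Interval scale: differences are meaningful, ratios are not.
celsius_a, celsius_b = 10.0, 20.0
print(celsius_b - celsius_a)          # a 10-degree difference is meaningful

# Converting to Kelvin (a ratio scale with a true zero) shows that
# 20 °C is not "twice as hot" as 10 °C.
kelvin_a = celsius_a + 273.15
kelvin_b = celsius_b + 273.15
print(celsius_b / celsius_a)          # 2.0, but only because of where 0 °C sits
print(round(kelvin_b / kelvin_a, 3))  # about 1.035

# Ratio scale: weight has a true zero, so ratios make sense.
weight_a_kg, weight_b_kg = 30.0, 60.0
print(weight_b_kg / weight_a_kg)      # 2.0: twice as heavy
```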

PRESENTATION OF DATA

FREQUENCY DISTRIBUTION:

Data can be presented in various forms depending on the type of data collected. A frequency
distribution is a table showing how often each value (or set of values) of the variable in
question occurs in a data set. A frequency table is used to summarize categorical or numerical
data. Frequencies are also presented as relative frequencies, that is, the percentage of the total
number in the sample.

EXAMPLE: Frequency distribution of peptic ulcer according to site of ulcer

Site of ulcer | Frequency | Percent
Gastric ulcer | 24 | 30.0
Duodenal ulcer | 50 | 62.5
Gastric and duodenal ulcer | 6 | 7.5
TOTAL | 80 | 100.0
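
The percent column is simply each frequency divided by the total. A minimal sketch using the figures from the example above (the variable names are illustrative):

```python
# Frequencies from the peptic ulcer example above.
frequencies = {
    "Gastric ulcer": 24,
    "Duodenal ulcer": 50,
    "Gastric and duodenal ulcer": 6,
}

total = sum(frequencies.values())

# Relative frequency = frequency / total, expressed as a percentage.
percentages = {site: 100 * count / total for site, count in frequencies.items()}

for site, count in frequencies.items():
    print(f"{site:<28}{count:>4}{percentages[site]:>8.1f}")
print(f"{'TOTAL':<28}{total:>4}{sum(percentages.values()):>8.1f}")
```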

GRAPHICAL METHODS:

Frequency distributions are usually illustrated graphically by plotting various types of
graphs:

Bar graph - A bar graph is a way of summarizing a set of categorical data. It displays the
data using a number of rectangles, of the same width, each of which represents a particular
category. Bar graphs can be displayed horizontally or vertically and they are usually drawn
with a gap between the bars (rectangles).

Histogram - A histogram is a way of summarizing data that are measured on an interval scale
(either discrete or continuous). It is often used in exploratory data analysis to illustrate
the features of the distribution of the data in a convenient form.

Pie chart - A pie chart is used to display a set of categorical data. It is a circle, which is
divided into segments. Each segment represents a particular category. The area of each
segment is proportional to the number of cases in that category.

Line graph - A line graph is particularly useful when we want to show the trend of a variable
over time. Time is displayed on the horizontal axis (x-axis) and the variable is displayed on
the vertical axis (y-axis).

Textual presentation of data :

This method comprises presenting data with the help of a paragraph or a number of
paragraphs. The official report of an inquiry commission is usually made by textual
presentation.

Textual presentation of data - Example

Following is an example of textual presentation.

In 1999, out of a total of five thousand workers of a factory, four thousand and two hundred
were members of a Trade Union. The number of female workers was twenty per cent of the
total workers, out of which thirty per cent were members of the Trade Union.

In 2000, the number of workers belonging to the trade union increased by twenty per cent
as compared to 1999, of which four thousand and two hundred were male. The number of
workers not belonging to the trade union was nine hundred and fifty, of which four hundred
and fifty were females.

The merit of this mode of presentation lies in its simplicity: even a layman can present
data by this method.

The observations with exact magnitude can be presented with the help of textual presentation.
Furthermore, this type of presentation can be taken as the first step towards the other
methods of presentation.

Textual presentation, however, is not preferred by statisticians because it is dull and
monotonous, and comparison between different observations is difficult in this method.
For manifold classification, this method cannot be recommended and tabulation is usually
preferred.

Now let us see how the same data can be represented in tabular form.

Status of the workers of the factory on the basis of their trade union membership for 1999 and
2000.

TU, M, F and T stand for trade union, male, female and total respectively.
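
Every cell of such a table follows from the figures in the textual example. The sketch below derives them, assuming that "thirty per cent" refers to thirty per cent of the female workers; the variable names and the printed layout are illustrative only.

```python
# Figures stated in the textual example for 1999.
total_1999 = 5000
tu_1999 = 4200
female_1999 = total_1999 * 20 // 100          # 20% of all workers
female_tu_1999 = female_1999 * 30 // 100      # assumed: 30% of female workers
male_1999 = total_1999 - female_1999
male_tu_1999 = tu_1999 - female_tu_1999

# Figures stated for 2000.
tu_2000 = tu_1999 * 120 // 100                # 20% more members than in 1999
male_tu_2000 = 4200
female_tu_2000 = tu_2000 - male_tu_2000
non_tu_2000 = 950
female_non_tu_2000 = 450
male_non_tu_2000 = non_tu_2000 - female_non_tu_2000

# Each entry is a (male, female) pair.
rows = {
    1999: {
        "TU": (male_tu_1999, female_tu_1999),
        "Non-TU": (male_1999 - male_tu_1999, female_1999 - female_tu_1999),
    },
    2000: {
        "TU": (male_tu_2000, female_tu_2000),
        "Non-TU": (male_non_tu_2000, female_non_tu_2000),
    },
}

print("Year  Status     M     F     T")
for year, statuses in rows.items():
    for status, (m, f) in statuses.items():
        print(f"{year}  {status:<7}{m:>5}{f:>6}{m + f:>6}")
    m_total = sum(m for m, _ in statuses.values())
    f_total = sum(f for _, f in statuses.values())
    print(f"{year}  {'Total':<7}{m_total:>5}{f_total:>6}{m_total + f_total:>6}")
```

Laid out this way, comparisons that take several sentences in the textual version (for example, male union membership in 1999 against 2000) can be read directly from a column.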

The tabulation method is usually preferred to textual presentation because:

(i) It facilitates comparison between rows and columns.

(ii) Complicated data can also be represented using tabulation.

(iii) It is a must for diagrammatic representation.

(iv) Without tabulation, statistical analysis of data is not possible.

Data Presentation - Tables


Tables are a useful way to organize information using rows and columns. Tables are a versatile
organization tool and can be used to communicate information on their own, or they can be
used to accompany another data representation type (like a graph). Tables support a variety of
parameters and can be used to keep track of frequencies, variable associations, and more.

For example, given below are the weights of 20 students in grade 10:

To find the frequency of a particular weight in this data, count the number of times that
weight appears in the list. That count is the number of students who have this weight.

The list above has information about the weights of the students, but since the data has been
arranged haphazardly, it is difficult to classify the students properly.

To make the information clearer, tabulate the given data.

This table makes the data easier to understand.
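
Counting how often each value appears is easy to automate. The sketch below tallies a list of 20 weights into a frequency table; the weights are hypothetical values made up for illustration, not the data referred to above.

```python
from collections import Counter

# Hypothetical weights (kg) of 20 students, for illustration only.
weights = [48, 52, 50, 48, 55, 52, 50, 50, 47, 52,
           48, 55, 50, 52, 47, 50, 48, 52, 50, 55]

# Counter maps each distinct weight to the number of times it occurs.
frequency = Counter(weights)

print("Weight (kg)  Frequency")
for weight in sorted(frequency):
    print(f"{weight:>11}  {frequency[weight]:>9}")
print(f"{'Total':>11}  {sum(frequency.values()):>9}")
```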


THE MEAN

Example:

Four test results: 15, 18, 22, 20
The sum is: 75
Divide 75 by 4: 18.75

The 'Mean' (Average) is 18.75

(Often rounded to 19)

THE MEDIAN

The Median is the 'middle value' in your list. When the number of values in the list is odd,
the median is the middle entry in the list after sorting the list into increasing order. When
the number of values is even, the median is equal to the sum of the two middle numbers (after
sorting the list into increasing order) divided by two. Thus, remember to line up your values:
the middle number is the median! Be sure to remember the odd and even rule.

Examples:

Find the Median of: 9, 3, 44, 17, 15 (Odd amount of numbers)
Line up your numbers: 3, 9, 15, 17, 44 (smallest to largest)
The Median is: 15 (The number in the middle)

Find the Median of: 8, 3, 44, 17, 12, 6 (Even amount of numbers)
Line up your numbers: 3, 6, 8, 12, 17, 44
Add the 2 middle numbers and divide by 2: (8 + 12) ÷ 2 = 20 ÷ 2 = 10
The Median is 10.

THE MODE

The mode in a list of numbers refers to the number or numbers that occur most frequently. A
trick to remember this one is to remember that mode starts with the same first two letters
that most does. Most frequently - Mode. You'll never forget that one!

Examples:

Find the mode of:
9, 3, 3, 44, 17, 17, 44, 15, 15, 15, 27, 40, 8
Put the numbers in order for ease:
3, 3, 8, 9, 15, 15, 15, 17, 17, 27, 40, 44, 44
The Mode is 15 (15 occurs the most, at 3 times)

*It is important to note that there can be more than one mode and if no number occurs more
than once in the set, then there is no mode for that set of numbers.

THE RANGE

Occasionally in Statistics you'll be asked for the 'range' in a set of numbers. The range is simply
the smallest number subtracted from the largest number in your set. Thus, if your set is 9,
3, 44, 15, 6, the range would be 44 - 3 = 41. Your range is 41.
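
The mode and range rules above, including the cases of several modes and no mode, can be sketched as follows. The helper name modes is illustrative; statistics.multimode is part of the Python standard library (3.8 and later).

```python
import statistics

def modes(values):
    """Return the list of modes, or [] if no value occurs more than once."""
    if len(set(values)) == len(values):
        return []
    # multimode returns every value tied for the highest count.
    return statistics.multimode(values)

data = [9, 3, 3, 44, 17, 17, 44, 15, 15, 15, 27, 40, 8]

print(modes(data))              # [15]
print(modes([1, 1, 2, 2, 3]))   # [1, 2]: more than one mode
print(modes([4, 5, 6]))         # []: no value repeats, so no mode

# Range: largest value minus smallest value.
range_set = [9, 3, 44, 15, 6]
value_range = max(range_set) - min(range_set)
print(value_range)              # 41
```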