Levels of Data
USE OF STATISTICAL TOOLS IN SOCIAL SCIENCE RESEARCH
1
2
When analysing the data, there are different types of
statistical measures that can be used to help to
interpret the numbers.
However, based on what type of data is available, not
all statistics can be used.
It is important to understand what type of data one
has available in order to determine what type of
statistical analysis can be completed.
3
The most elementary scale of measurement is the nominal scale, which does no more than
identify the category into which an individual, event or object may
be classified.
For example, when tabulating a survey, one could assign the
number 1 to one group of respondents and the number 2 to another group of
respondents. These numbers are completely arbitrary. One could just
as easily have assigned the numbers 51 and 885. The
numbers don't mean anything in relation to the categories.
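The arbitrariness of nominal codes can be checked directly. The sketch below assumes a made-up list of survey answers (the group labels "A" and "B" are purely illustrative). It shows that recoding the groups from 1 and 2 to 51 and 885 leaves the group counts unchanged, while an average of the codes changes and carries no meaning.

```python
from collections import Counter

# Hypothetical survey answers; the group labels are illustrative only.
responses = ["A", "B", "A", "A", "B"]

coding_small = {"A": 1, "B": 2}
coding_large = {"A": 51, "B": 885}

def group_sizes(coding):
    """Count respondents per group, reported by group label."""
    counts = Counter(coding[r] for r in responses)
    return {label: counts[code] for label, code in coding.items()}

def mean_code(coding):
    """Average of the numeric codes (not a meaningful statistic)."""
    return sum(coding[r] for r in responses) / len(responses)

sizes_small = group_sizes(coding_small)
sizes_large = group_sizes(coding_large)

print(sizes_small == sizes_large)  # True: counts do not depend on the codes
print(mean_code(coding_small), mean_code(coding_large))  # differ with the coding
```

Only counting (and therefore the mode) survives a change of codes, which is why arithmetic on nominal codes is avoided.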
4
Ordinal data requires that there is order in the data.
Ordinal numbers are used to indicate rank and nothing
more. The ordinal scale is used to arrange individuals or
objects in a series ranging from the highest to the lowest
according to the particular characteristic being measured.
One cannot assume that the intervals are equal.
For example, when ranking the outcomes of a race, the
first-place winner is 1 and the second-place winner is 2.
There is a reason and an order behind the assignment of the
numbers. However, this doesn't tell us that the number 1
finisher is twice as good as the number 2 finisher.
Example: Intelligence, job satisfaction
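The race example can be sketched in a few lines. The finishing times and runner names below are hypothetical; the point is only that ranks keep the order while discarding the size of the gaps.

```python
# Hypothetical finishing times in seconds; the runner names are made up.
times = {"Asha": 9.8, "Ben": 10.4, "Chen": 10.5}

# Rank 1 goes to the fastest runner.
ordered = sorted(times, key=times.get)
ranks = {name: position for position, name in enumerate(ordered, start=1)}

# The gaps between consecutive finishers are not equal,
# even though the ranks are one step apart each time.
gap_1_to_2 = round(times[ordered[1]] - times[ordered[0]], 2)
gap_2_to_3 = round(times[ordered[2]] - times[ordered[1]], 2)

print(ranks)
print(gap_1_to_2, gap_2_to_3)
```

The ranks alone cannot tell us that the first gap is several times larger than the second.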
5
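Apart from rank-ordering the data, the interval
scale allows one to state precisely how far apart
the individuals, objects or events that form
the focus of the enquiry are. It permits certain
mathematical procedures, such as addition and subtraction, that are untenable at the
other two levels.
Example: Temperature in degrees Celsius, calendar year. (Age, income, height and weight
also have equal intervals, but because they have a true zero they are strictly ratio-level.)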
Interestingly enough, the way you phrase the
question is directly related to the type of data that
you get back.
6
The ratio scale is the highest level of measurement. It subsumes
the other three. A ratio scale includes an absolute
zero.
Because it has an absolute zero, it is possible to perform all
the arithmetic operations: addition, subtraction,
multiplication and division. Educational and
psychological tests can rarely assume ratio-level
measurement, because a score of zero seldom means a total absence of the trait.
The ratio scale is most used in the physical sciences.
Example: If weight is used as an example, no mass at
all is zero. 1000 grams is 400 grams heavier than 600
grams and twice as heavy as 500 grams.
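The arithmetic in the weight example can be written out as a small sketch. The conversion to kilograms is added here as an illustration (it is not in the example above). It shows that a ratio stays the same when the unit changes, which holds because zero means "no mass" in either unit.

```python
# Weights from the example, in grams.
grams = {"a": 1000, "b": 600, "c": 500}

difference = grams["a"] - grams["b"]   # how much heavier a is than b
ratio_grams = grams["a"] / grams["c"]  # how many times heavier a is than c

# Illustrative unit change: the same weights in kilograms.
kilograms = {name: g / 1000 for name, g in grams.items()}
ratio_kg = kilograms["a"] / kilograms["c"]

print(difference)             # 400
print(ratio_grams, ratio_kg)  # 2.0 2.0
```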
7
A discrete variable is a variable that can take on
only values that are specific, distinct points on
the scale.
Example: Gender is a discrete variable. It can take
only a fixed number of values. Other examples: eye colour,
religion, caste.
8
A continuous variable can take on any value between the points on a
scale. It can be measured with differing degrees of
exactness depending on the measuring instrument.
Example: Weight. It can take any value
from zero to infinity. Other examples: height, time, age.
9
The first step in any data analysis strategy is to
calculate summary measures to get a general feel for
the data. Summary measures for a data set are often
referred to as descriptive statistics. Descriptive
statistics fall into three main categories:
Measures of position (or central tendency)
Measures of variability
Measures of skewness
They can be useful for beginning data analysis, for
comparing multiple data sets, and for reporting final
results of a survey.
10
The mean is simply the average of all the items in a sample.
To compute a sample mean, add up all the sample values
and divide by the size of the sample.
The cotinine values for seven smokers are:
73, 58, 67, 93, 33, 18, and 147.
If you added up these values you would get a sum of 489.
Divide that sum by 7 to get a mean of 69.9.
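The calculation above can be reproduced with a short sketch. The variable names are illustrative.

```python
# The seven values listed in the text.
values = [73, 58, 67, 93, 33, 18, 147]

total = sum(values)                # add up all the sample values
sample_mean = total / len(values)  # divide by the size of the sample

print(total)                  # 489
print(round(sample_mean, 1))  # 69.9
```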
We will sometimes make the distinction between the
sample mean and the population mean. The population
mean (often represented by the Greek letter mu) is simply
the average of all the items in a population. Because a
population is usually very large, the population mean is
usually an unknown constant.
11
The median is the middle observation in a data set.
That is, 50 per cent of the observations are above the
median and 50 per cent are below it (for
sets with an even number of observations, the median
is the average of the middle two observations).
The median is often used when a data set is not
symmetrical, or when there are outlying observations.
For example, median income is generally reported
rather than mean income because a few very high incomes
pull the mean upward.
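A minimal sketch of the median, reusing the seven values from the mean example and Python's standard `statistics` module. The inflated value of 1470 is a made-up outlier, added only to show how differently the mean and the median react.

```python
import statistics

values = [73, 58, 67, 93, 33, 18, 147]

# Odd number of observations: the middle value after sorting.
median_odd = statistics.median(values)

# Even number of observations (largest value dropped):
# the average of the middle two.
median_even = statistics.median(sorted(values)[:-1])

# Hypothetical outlier: replace 147 with 1470.
with_outlier = [1470 if v == 147 else v for v in values]
median_outlier = statistics.median(with_outlier)
mean_outlier = statistics.mean(with_outlier)

print(median_odd)      # 67
print(median_even)     # 62.5
print(median_outlier)  # 67, unchanged by the outlier
print(mean_outlier > statistics.mean(values))  # True: the mean is pulled upward
```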
12
The mode is the value around which the greatest
number of observations are concentrated, or quite
simply the most common observation. The mode is often
used with nominal data, but is not the preferred
measure for other types of data.
A data set can have more than one mode; for example, it can have two or three.
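Because the mode only requires counting, it works on nominal data. The sketch below uses a hypothetical list of eye colours and `statistics.multimode` (available from Python 3.8), which returns every value that ties for most common.

```python
import statistics

# Hypothetical nominal data: eye colour of six respondents.
eye_colours = ["brown", "blue", "brown", "green", "blue", "hazel"]

# "brown" and "blue" both occur twice, so there are two modes.
modes = statistics.multimode(eye_colours)

print(modes)  # ['brown', 'blue']
```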
13
While measures of position describe where the data
points are concentrated, measures of variability
measure the dispersion (or spread) of the data set.
14
The range is the difference between the largest and the
smallest observations in the data set. This is a limited
measure because it depends on only two of the numbers in
the data set.
Using the above data set again, the range is 129 (147 minus 18), but that
does not provide any information regarding the
concentration of the data at the low end of the scale.
Another limitation of range is that it is affected by the
number of observations in the data set. Generally, the
more observations there are, the more spread out they will
be. One use of range in everyday life is in newspaper stock
market summaries, which give the day's high and low
numbers.
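A short sketch of the range for the seven values used in the mean example (largest 147, smallest 18). The second list is hypothetical: it keeps the same extremes but crowds every other value at the low end, and the range cannot tell the two apart.

```python
values = [73, 58, 67, 93, 33, 18, 147]

data_range = max(values) - min(values)

# Hypothetical data with the same extremes but a very different shape.
clustered = [18, 19, 20, 21, 22, 23, 147]
clustered_range = max(clustered) - min(clustered)

print(data_range)       # 129
print(clustered_range)  # 129
```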
15
Unlike range, variance takes into consideration all the
data points in the data set. If all the observations are
the same, the variance is zero. The more
spread out the observations are, the larger the
variance.
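The text does not give a formula, so the sketch below assumes the usual sample variance: the sum of squared deviations from the mean, divided by n - 1. The lists `identical` and `tight` are made-up data for comparison.

```python
import statistics

def sample_variance(data):
    """Sum of squared deviations from the mean, divided by n - 1 (assumed convention)."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data) / (len(data) - 1)

values = [73, 58, 67, 93, 33, 18, 147]  # the seven values used earlier
identical = [5, 5, 5, 5]                # hypothetical: no spread at all
tight = [66, 68, 70, 72, 74]            # hypothetical: little spread

print(sample_variance(identical))  # 0.0
print(sample_variance(tight) < sample_variance(values))  # True

# The standard library uses the same n - 1 convention in statistics.variance.
print(abs(sample_variance(values) - statistics.variance(values)) < 1e-9)  # True
```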
16
Standard deviation is the positive square root of the
variance, and is the most common measure of
variability. Standard deviation indicates how close to
the mean the observations are. The larger the
standard deviation, the more variation there is in the
data set.
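As a sketch of the relationship just described, the standard deviation can be obtained by taking the positive square root of the variance. The sample (n - 1) convention is assumed here, and the seven values from the mean example are reused.

```python
import math
import statistics

values = [73, 58, 67, 93, 33, 18, 147]

variance = statistics.variance(values)  # sample variance (n - 1 in the denominator)
std_dev = math.sqrt(variance)           # positive square root of the variance

# statistics.stdev computes the same quantity directly.
print(abs(std_dev - statistics.stdev(values)) < 1e-9)  # True
```

Unlike the variance, the standard deviation is expressed in the same units as the original observations, which makes it easier to interpret.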
17
Measures of position and variability tell us where the
data are located and how dispersed they are.
Measures of skewness are concerned with the shape of
the distribution, in particular whether the data are
symmetrically distributed.
Most people are familiar with the distribution
referred to as the normal, or bell-shaped, curve. Many
of the statistics we use assume the data are
distributed normally.
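The text does not say which skewness measure is intended. The sketch below assumes one common definition, the moment coefficient of skewness (third central moment divided by the second central moment raised to the power 1.5). A value near zero suggests symmetry; a positive value suggests a longer tail to the right.

```python
def skewness(data):
    """Moment coefficient of skewness (one common definition, assumed here)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]             # hypothetical symmetric data
values = [73, 58, 67, 93, 33, 18, 147]  # the seven values used earlier

print(skewness(symmetric))   # 0.0
print(skewness(values) > 0)  # True: the value 147 stretches the right tail
```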
18
Before choosing a statistical test, the first step is to
determine what you want to know, or your purpose in
analyzing the data. Ideally, you should have
determined this before collecting your data, but all
too frequently this is not the case. Many of the
commonly used statistical tests can be classified into
one of three categories:
Description
Comparison
Association
19
The purpose of descriptive statistics is to describe the
data. The type of data will determine which
descriptive statistic is appropriate. Specifically, one
can only calculate a mean with interval or ratio data,
whereas a mode can be calculated with nominal,
ordinal, interval or ratio data.
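The rule above can be written as a small lookup table. The entries for the mean and the mode follow the text; allowing the median from the ordinal level upward is a standard convention that is added here as an assumption. The function name `can_use` is illustrative.

```python
# Measures of central tendency permitted at each level of measurement.
# Mean and mode follow the text; the median entries are an added assumption.
PERMISSIBLE = {
    "nominal": ["mode"],
    "ordinal": ["mode", "median"],
    "interval": ["mode", "median", "mean"],
    "ratio": ["mode", "median", "mean"],
}

def can_use(statistic, level):
    """Return True if the statistic is appropriate for data at this level."""
    return statistic in PERMISSIBLE[level]

print(can_use("mean", "nominal"))   # False
print(can_use("mean", "interval"))  # True
print(can_use("mode", "ordinal"))   # True
```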
20
A common goal in conducting research is to
determine if differences exist between two or more
groups.
For example, we may be interested in determining if
people who defect to another service provider are
different from those who choose to remain. While the
most common examples of this type of analysis focus
on differences of means and variances, it is important
to note we can analyze many types of differences
including correlation coefficients, proportions and
percentages. The statistic used is determined by the
type of data you have.
21
Examples
Chi-Square
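As an illustration of a chi-square test on counts, the sketch below computes the Pearson chi-square statistic by hand for a 2 x 2 table. The counts, and the idea of splitting customers by a "plan type", are entirely hypothetical; only the defected/remained grouping comes from the example above.

```python
# Hypothetical counts. Rows: plan type X, plan type Y (made up).
# Columns: defected, remained.
observed = [
    [30, 20],
    [20, 30],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Sum of (observed - expected)^2 / expected over all cells, where the
# expected count assumes the two classifications are independent.
chi_square = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_square += (count - expected) ** 2 / expected

degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)

# Critical value of chi-square for 1 degree of freedom at the 0.05 level.
CRITICAL_05_DF1 = 3.841

print(chi_square)                    # 4.0
print(degrees_of_freedom)            # 1
print(chi_square > CRITICAL_05_DF1)  # True
```

With these made-up counts the statistic exceeds the critical value, so the two groups would be judged to differ at the 0.05 level.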
22
Example
t-test
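The t-test compares the means of two groups. The sketch below computes a t statistic by hand, assuming Welch's form (which does not require equal variances), since the text does not specify a variant. The scores for the two training groups are hypothetical.

```python
import math
import statistics

# Hypothetical test scores for two training groups.
on_site = [78, 82, 85, 88, 90]
off_site = [70, 75, 80, 72, 78]

def welch_t(group_1, group_2):
    """Difference in means divided by its estimated standard error (Welch's form)."""
    mean_difference = statistics.mean(group_1) - statistics.mean(group_2)
    standard_error = math.sqrt(
        statistics.variance(group_1) / len(group_1)
        + statistics.variance(group_2) / len(group_2)
    )
    return mean_difference / standard_error

t_statistic = welch_t(on_site, off_site)
print(t_statistic)
```

Turning the statistic into a p-value requires the t distribution, which Python's standard library does not provide; a statistics package would normally be used for that step.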
23
Example: One-way ANOVA (interval)
While the t-test is useful for testing differences
between two groups, frequently we are interested in
more than two groups. In those cases, we often rely
on the Analysis of Variance (ANOVA) to tell us if
those groups are different on some variable of
interest. For example, if the training example from
above included a third group (i.e., a combination of
on- and off-site training) it would require use of the
ANOVA instead of the t-test.
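A minimal sketch of the one-way ANOVA F statistic, computed by hand as the ratio of between-group to within-group mean squares. The scores for the three training groups are hypothetical and chosen so that the arithmetic is easy to follow.

```python
# Hypothetical scores for three training groups.
groups = {
    "on_site": [71, 72, 73],
    "off_site": [72, 73, 74],
    "combined": [73, 74, 75],
}

all_scores = [x for scores in groups.values() for x in scores]
grand_mean = sum(all_scores) / len(all_scores)

# Between-group sum of squares: how far each group mean is from the grand mean.
ss_between = sum(
    len(scores) * (sum(scores) / len(scores) - grand_mean) ** 2
    for scores in groups.values()
)

# Within-group sum of squares: how far each score is from its own group mean.
ss_within = sum(
    (x - sum(scores) / len(scores)) ** 2
    for scores in groups.values()
    for x in scores
)

df_between = len(groups) - 1
df_within = len(all_scores) - len(groups)

f_statistic = (ss_between / df_between) / (ss_within / df_within)
print(f_statistic)  # 3.0
```

A large F indicates that the group means differ by more than the variation inside the groups would suggest; judging significance requires the F distribution with the two degrees of freedom shown.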
24
Example: Factorial ANOVA
Frequently we are interested in understanding the
effects of varying levels of two or more variables on a
third variable. In such a case, we are unable to use the
one-way ANOVA because it is limited to comparing
the effect of a single variable on another. Essentially, a
factorial ANOVA analyzes the impact of both the
variables independently as well as jointly to
determine how they affect another variable of
interest.
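The independent and joint effects can be sketched for the simplest case, a balanced 2 x 2 design. Everything below is an assumption made for illustration: the two factors (training site and prior experience), the scores, and the equal cell sizes that the formulas rely on.

```python
from statistics import mean

# Hypothetical scores for a balanced 2 x 2 design (two scores per cell).
# Factor A: training site. Factor B: prior experience. Both are made up.
cells = {
    ("on_site", "novice"): [4, 6],
    ("on_site", "expert"): [8, 10],
    ("off_site", "novice"): [2, 4],
    ("off_site", "expert"): [2, 4],
}
n = 2  # scores per cell; the formulas below assume every cell has n scores

a_levels = sorted({a for a, _ in cells})
b_levels = sorted({b for _, b in cells})

grand = mean(x for scores in cells.values() for x in scores)
cell_mean = {key: mean(scores) for key, scores in cells.items()}
a_mean = {a: mean(cell_mean[(a, b)] for b in b_levels) for a in a_levels}
b_mean = {b: mean(cell_mean[(a, b)] for a in a_levels) for b in b_levels}

# Main effect of each factor, taken independently.
ss_a = n * len(b_levels) * sum((a_mean[a] - grand) ** 2 for a in a_levels)
ss_b = n * len(a_levels) * sum((b_mean[b] - grand) ** 2 for b in b_levels)

# Interaction: what the cell means show beyond the two main effects.
ss_ab = n * sum(
    (cell_mean[(a, b)] - a_mean[a] - b_mean[b] + grand) ** 2
    for a in a_levels
    for b in b_levels
)

# Variation of the scores inside each cell.
ss_within = sum(
    (x - cell_mean[key]) ** 2 for key, scores in cells.items() for x in scores
)

df_a = len(a_levels) - 1
df_b = len(b_levels) - 1
df_ab = df_a * df_b
df_within = len(cells) * (n - 1)
ms_within = ss_within / df_within

f_a = (ss_a / df_a) / ms_within
f_b = (ss_b / df_b) / ms_within
f_ab = (ss_ab / df_ab) / ms_within

print(f_a, f_b, f_ab)  # 16.0 4.0 4.0
```

The three F values correspond to the effect of each variable on its own and to their joint (interaction) effect. With unequal cell sizes the sums of squares no longer partition this simply, and a statistics package would be the safer choice.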
25
26