By the end of this topic, you should be able to:

Define statistics;
Differentiate between descriptive and inferential statistics;
Compare the different types of variables;
Explain the importance of sampling; and
Differentiate between the types of sampling procedures.


This topic introduces the meaning of statistics and explains the difference between
descriptive and inferential statistics. As inferential statistics is used to make
inference about the population on specific variables based on a sample, this topic
also explains the meanings of different types of variables and highlights the
different sampling techniques in educational research.
Let us refer to some definitions of statistics:

The American Heritage Dictionary defines statistics as:

The mathematics of the collection, organisation and interpretation of numerical
data, especially the analysis of population characteristics by inference from
sampling.

Merriam-Webster's Collegiate Dictionary defines statistics as:

A branch of mathematics dealing with the collection, analysis, interpretation and
presentation of masses of numerical data.

Jon Kettenring, President of the American Statistical Association, defines statistics as:

The science of learning from data. Statistics is essential for the proper running of
government, central to decision making in industry, and a core component of
modern educational curricula at all levels.
Note that the word "mathematics" is mentioned in two of the definitions above,
while "science" is stated in the other definition. Some students are afraid of
mathematics and science. These students feel that since they are from the fields of
humanities and social sciences, they are weak in mathematics. Being terrified of
mathematics does not just happen overnight. Chances are that you may have had
bad experiences with mathematics in earlier years (Kranzler, 2007).

Fear of mathematics can lead to a defeatist attitude which may affect the way you
approach statistics. In most cases, the fear of statistics is due to irrational beliefs.
Just because you had difficulty in the past does not mean that you will always have
difficulty with quantitative subjects. You have come this far in your education and are
now taking this course in statistics, so it is unlikely that you are an incapable person.

You have to convince yourself that statistics is not a difficult subject and you need
not worry about the mathematics involved. Identify your irrational beliefs and
thoughts about statistics. Are you telling yourself: "I'll never be any good in
statistics," "I'm a loser when it comes to anything dealing with numbers," or "What
will other students think of me if I do badly?"

For each of these irrational beliefs about your abilities, ask yourself what evidence is
there to suggest that "you will never be good in statistics" or that "you are weak at
mathematics." When you do that, you will begin to replace your irrational beliefs
with positive thoughts and you will feel better. You will realise that your earlier
beliefs about statistics are the cause of your unpleasant emotions. Each time you
feel anxious or emotionally upset, question your irrational beliefs. This may help you
to overcome your initial fears.

With this in mind, this course presents statistics in a form that appeals to those
who fear mathematics. The emphasis is on the applied aspects of statistics and, with
the aid of statistical software called the Statistical Package for the Social Sciences
(better known as SPSS), you need not worry too much about the intricacies of
mathematical formulas. Computation of mathematical formulas has been kept to a
minimum. Nevertheless, you still need to know about the different formulas used,
what they mean and when they are used.


Statistics are all around you. Television uses a lot of statistics: for example, when it
reports that during the holidays, a total of 134 people died in traffic accidents; the
stock market fell by 26 points; or that the number of violent crimes in the city has
increased by 12%. Imagine a football game between Manchester United and
Liverpool and no one kept score! Without statistics, you could not plan your budget,
pay your taxes, enjoy games to their fullest, evaluate classroom performance and
so forth. Are you beginning to get the picture? We need statistics. Generally, there
are two kinds of statistics:
Descriptive Statistics
Inferential Statistics
1.2.1 Descriptive Statistics
Descriptive statistics are used to describe the basic features of the data in a study.
Historically, descriptive statistics began during Roman times when the empire
undertook censuses of births, deaths, marriages and taxes. Descriptive statistics
provide simple summaries about the sample and the measures. Together with simple
graphical analysis, they form the basis of virtually every quantitative analysis of
data. With descriptive statistics, you are simply describing what is or what the data show.

Descriptive statistics are used to present quantitative descriptions in a manageable
form. In a research study, we may have lots of measures. Or we may measure a
large number of people on any measure. Descriptive statistics help us to simplify
large amounts of data in a sensible way. Each descriptive statistic reduces lots of
data into a simple summary. For instance, the Grade Point Average (GPA) for a
student describes the general performance of a student across a wide range of
subjects or courses.

Descriptive statistics includes the construction of graphs, charts and tables and the
calculation of various descriptive measures such as averages (e.g. mean) and
measures of variation (e.g. standard deviation). The purpose of descriptive statistics
is to summarise, arrange and present a set of data in such a way that facilitates
interpretation. Most of the statistical presentations appearing in newspapers and
magazines are descriptive in nature.
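
The descriptive measures mentioned above can be illustrated with a short Python sketch. The scores are hypothetical and serve only to show a mean, a standard deviation and a simple frequency table being produced from the same set of data; in this course the same output would normally be generated with SPSS.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical test scores for 12 students (illustrative, not from the module).
scores = [55, 60, 60, 65, 70, 70, 70, 75, 80, 80, 85, 90]

average = mean(scores)             # measure of central tendency
spread = stdev(scores)             # sample standard deviation (n - 1 in the denominator)
frequency_table = Counter(scores)  # number of students obtaining each score

print(f"Mean: {average:.2f}")
print(f"Standard deviation: {spread:.2f}")
print("Score | Frequency")
for score, count in sorted(frequency_table.items()):
    print(f"{score:>5} | {'*' * count}")
```

Each printed line is a summary: twelve separate scores are reduced to two numbers and a small table.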
1.2.2 Inferential Statistics
Inferential statistics or statistical induction comprises the use of statistics to make
inferences concerning some unknown aspect of a population. Inferential statistics
are relatively new. Major development began with the works of Karl Pearson (1857-1936) and the works of Ronald Fisher (1890-1962) who published their findings in
the early years of the 20th century. Since the work of Pearson and Fisher, inferential
statistics has evolved rapidly and is now applied in many different fields and

Inference is the act or process of deriving a conclusion based solely on what one
already knows. In other words, you are trying to reach conclusions that extend
beyond data obtained from your sample towards what the population might think.
You are using methods for drawing and measuring the reliability of conclusions
about a population based on information obtained from a sample of the population.
Among the widely used inferential statistical tools are the t-test, analysis of variance,
Pearson's correlation, linear regression and multiple regression.
1.2.3 Descriptive or Inferential Statistics
Descriptive statistics and inferential statistics are interrelated. You must always use
techniques of descriptive statistics to organise and summarise the information
obtained from a sample before carrying out an inferential analysis. Furthermore, the
preliminary descriptive analysis of a sample often reveals features that lead you to
the choice of the appropriate inferential method.

As you proceed through this course, you will obtain a more thorough understanding
of the principles of descriptive and inferential statistics. You should establish the
intent of your study. If the intent of your study is to examine and explore the data
obtained for its own intrinsic interest only, the study is descriptive. However, if the
information is obtained from a sample of a population and the intent of the study is
to use that information to draw conclusions about the population, the study is
inferential. Thus, a descriptive study may be performed on a sample as well as on a
population. Only when an inference is made about the population, based on data
obtained from the sample, does the study become inferential.


Define statistics.
Explain the differences between descriptive and inferential statistics.
When would you use the two types of statistics?
Explain two ways in which descriptive statistics and inferential statistics are
interrelated.
Before you can use a statistical tool to analyse data, you need to have data which
have been collected. What is data? Data is defined as pieces of information which
are processed or analysed to enable interpretation. Quantitative data consist of
numbers, while qualitative data consist of words and phrases. For example, the
scores obtained from 30 students in a mathematics test are referred to as data. To
explain the performance of these students you need to process or analyse the
scores (or data) using a calculator or computer or manually. We collect and analyse
data to explain a phenomenon. A phenomenon is explained based on the interaction
between two or more variables. The following is an example of a phenomenon:

Intelligence Quotient (IQ) and Attitude Influence Performance in Mathematics

Note that there are THREE variables explaining the particular phenomenon, namely,
Intelligence Quotient, Attitude and Mathematics Performance.

What is a Variable?

A variable is a construct that is deliberately and consciously invented or adopted for
a special scientific purpose. For example, the variable Intelligence is a construct
based on observation of presumably intelligent and less intelligent behaviours.
Intelligence can be specified by observing and measuring using intelligence tests,
as well as interviewing teachers about intelligent and less intelligent students.
Basically, a variable is something that varies and has a value. A variable is a
symbol to which are assigned numerals or values. For example, the variable
mathematics performance is assigned scores obtained from performance on a
mathematics test and may vary or range from 0 to 100.

A variable can be either a continuous variable or a categorical variable. The
variable gender has only two values, i.e. male and female, and is called
a categorical variable. Other examples of categorical variables include graduate/
non-graduate, low income/high income and citizen/non-citizen. There are also
categorical variables which have more than two values. For example, religion may
take several values such as Islam, Christianity, Sikhism, Buddhism and Hinduism.
Categorical variables are also known as nominal variables. A continuous variable has
numeric values such as 1, 2, 3, 4, 10 and so on. An example is the scores on mathematics
performance which range from 0 to 100. Other examples are salary, age, IQ, weight,

When you use any statistical tool, you should be very clear on which variables have
been identified as independent and which are dependent variables.
1.3.1 Independent Variable
An independent variable (IV) is the variable that is presumed to cause a change in
the dependent variable (DV). The independent variables are the antecedents, while
the dependent variable is the consequent. See Figure 1.1 which describes a study to
determine which teaching method (independent variable) is effective in enhancing
the academic performance in history (dependent variable) of students.

An independent variable (teaching method) can be manipulated. Manipulated
means the variable can be manoeuvred, and in this case it is divided into discovery
method and lecture method. Other examples of independent variables are gender
(male and female), race (Malay, Chinese and Indian) and socioeconomic status
(high, middle and low). Other names for the independent variable are treatment,
factor and predictor variable.
1.3.2 Dependent Variable
A dependent variable is a variable dependent on other variable(s). The dependent
variable in this study is academic performance, which cannot be manipulated by
the researcher. Academic performance is a score; other examples of dependent
variables are IQ (score from IQ tests), attitude (score on an attitude scale),
self-esteem (score from a self-esteem test) and so forth. Other names for the dependent
variable are outcome variable, results variable and criterion variable.

Figure 1.1: An example of independent variables and dependent variables
Put another way, the DV is the variable predicted to, whereas the independent
variable is the variable predicted from. The DV is the presumed effect, which varies
with changes or variation in the independent variable.
As mentioned earlier, a variable is deliberately constructed for a specific purpose.
Hence, a variable used in your study may be different from a variable used in
another study even though they have the same name. For example, the variable
academic achievement used in your study may be computed based on
performance in the UPSR examination; while in another study, it may be computed
using a battery of tests you developed. Operational definition (Bridgman, 1927)
means that the variables used in a study must be defined as they are used in the
context of that study. This is done to facilitate measurement and to eliminate confusion.

Thus, it is essential that you stipulate clearly how you have defined variables
specific to your study. For example, in an experiment to determine the effectiveness
of the discovery method in teaching science, the researcher will have to explain in
great detail the variable discovery method used in the experiment. Even though
there are general principles of the discovery method, its application in the
classroom may vary. In other words, you have to define the variable operationally or
how it is used in the experiment.

What is a variable?
Explain the differences between a continuous variable and nominal variable.
Why should variables be operationally defined?

Every day, we make judgments and decisions based on samples. For example, when
you pick a grape and taste it before buying the whole bunch of grapes, you are
doing a sampling. Based on the one grape you have tasted, you will make the
decision whether to buy the grapes or not. Similarly, when a teacher asks a student
two or three questions, he is trying to determine the student's grasp of an entire
subject. People are not usually aware that such a pattern of thinking is called
sampling.

Population (Universe) is defined as an aggregate of people, objects, items, etc.
possessing common characteristics. It is the complete group of people, objects, items,
etc. that we want to study. Every person, object, item, etc. has certain
specified attributes. In Figure 1.2, the population consists of #, $, @, & and %.
Sample is that part of the population or universe which we select for the purpose of
investigation. The sample is used as an "example" and in fact the word sample is
derived from the Latin exemplum, which means example. A sample should exhibit
the characteristics of the population or universe; it should be a "microcosm," a word
which literally means "small universe." In Figure 1.2, the sample also consists of one
#, $, @, & and %.


Figure 1.2: Drawing a sample from the population

We use samples to make inferences about the population. Reasoning from a sample
to the population is called statistical induction or inference. Based on the
characteristics of a specifically chosen sample (a small part of the population of the
group that we observe), we make inferences concerning the characteristics of the
population. We measure the trait or characteristic in a sample and generalise the
finding to the population from which the sample was taken.
Why is a sample used in educational research?
The study of a sample offers several advantages over a complete study of the
population. Why and when is it desirable to study a sample rather than the
population or universe?

In most studies, investigation of the sample is the only way of finding out about a
particular phenomenon. In some cases, due to financial, time and physical
constraints, it is practically impossible to study the whole population. Hence, an
investigation of the sample is the only way of making a study.
If one were to study the population, then every item in the population is studied.
Imagine having to study 500,000 Form 5 students in Malaysia! Wonder what the
costs will be! Even if you have the money and time to study the entire population of
Form 5 students in the country, it may take so much time that the findings will be
of no use by the time they become available.
Studying the population may not be necessary, since we have sound sampling
techniques that will yield satisfactory results. Of course, we cannot expect from a
sample exactly the same answer that might be obtained from studying the whole
population. However, by using statistics, we can establish, based on the results
obtained from a sample, the limits within which the true answer lies, with a known
probability.
We are able to generalise logically and precisely about different kinds of phenomena
which we have never seen simply based upon a sample of, say, 200 students.
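
The idea of establishing limits with a known probability can be sketched in Python. This is only an illustration under stated assumptions: the 200 scores are simulated rather than real, and the limits use the large-sample normal approximation (sample mean plus or minus about 1.96 standard errors for 95% confidence), which is one common way of computing such limits and not necessarily the procedure used later in this course.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

# Hypothetical data: simulated mathematics scores for a sample of 200 students.
rng = random.Random(42)  # fixed seed only so that the run is repeatable
sample = [rng.gauss(60, 12) for _ in range(200)]

n = len(sample)
sample_mean = mean(sample)
standard_error = stdev(sample) / sqrt(n)

# z value leaving 2.5% in each tail of the normal curve (about 1.96).
z = NormalDist().inv_cdf(0.975)

lower = sample_mean - z * standard_error
upper = sample_mean + z * standard_error

print(f"Sample mean: {sample_mean:.2f}")
print(f"95% limits for the population mean: {lower:.2f} to {upper:.2f}")
```

The limits become narrower as the sample size grows, which is why a well-chosen sample of modest size can still say something useful about a very large population.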
What is the difference between a population and a sample?
Why is a study of the population practically impossible?
The sample should be representative of the population. Explain.
Provide a scenario of your own, in which a sample is not representative.

Explain why a sample of 30 doctors from Kuala Lumpur taken to estimate the
average income of all Kuala Lumpur residents is not representative.
When some students are asked how they selected the sample for a study, quite a
few are unable to explain convincingly the techniques used and the rationale for
selecting the sample. If you have to draw a sample, you must choose the method
for obtaining the sample from the population. In making that choice, keep in mind
that the sample will be used to draw conclusions about the entire population.
Consequently, the sample should be a representative sample, that is, it should
reflect as closely as possible the relevant characteristics of the population under
1.6.1 Simple Random Sampling
In simple random sampling, all individuals in the defined population have an equal
and independent chance of being selected as a member of the sample. Independent
means that the selection of one individual does not affect in any way the selection
of any other individual. So, each individual, event or object has an equal probability
of being selected. Suppose, for example, that there are 10,000 Form 1 students in a
particular district and you want to select a simple random sample of 500 students.
When we select the first case, each student has one chance in 10,000 of being
selected. Once that student is selected, each remaining student has a 1 in 9,999
chance of being selected next. Thus, as each case is selected, the probability of being
selected next changes slightly because the population from which we are selecting
has become one case smaller.

To select a sample using a table of random numbers (refer to Figure 1.3), obtain a
list of all Form 1 students in Daerah Petaling and assign a number to each student.
Then, get a table of random numbers, which consists of a long series of three- or
four-digit numbers generated randomly by a computer. Using the table, you randomly
select a row or column as a starting point, then select all the numbers that follow in
that row or column. If more numbers are needed, proceed to the next row or column
until enough numbers have been selected to make up the desired sample.

Figure 1.3: Table of Random Numbers
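
In practice, a computer's random number generator can stand in for the printed table of random numbers. The following Python sketch assumes the 10,000 Form 1 students have simply been numbered 1 to 10,000; the numbering, the seed and the variable names are illustrative assumptions.

```python
import random

# Assumption: each of the 10,000 students has been assigned a number from 1 to 10,000.
population = list(range(1, 10_001))

rng = random.Random(2024)               # fixed seed only so that the run is repeatable
sample = rng.sample(population, k=500)  # draws 500 students without replacement

# Chance of a given student being picked on the first and on the second draw.
p_first = 1 / len(population)           # 1 in 10,000
p_second = 1 / (len(population) - 1)    # 1 in 9,999

print(f"Sample size: {len(sample)}")
print(f"First five student numbers drawn: {sample[:5]}")
print(f"Chance on first draw: {p_first:.6f}; on second draw: {p_second:.6f}")
```

Because `sample` draws without replacement, no student number can appear twice, which mirrors crossing a name off the list once it has been selected.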

1.6.2 Systematic Sampling
Systematic sampling is random sampling with a system. From the sampling frame, a
starting point is chosen at random, and thereafter subjects are chosen at regular
intervals. If it can be ensured that the list of students from the accessible
population is randomly listed, then systematic sampling can be used. First, divide
the accessible population (1,000) by the desired sample size (100), which gives a
sampling interval of 10. Next, choose a random starting point no larger than the
interval, i.e. a number from 1 to 10. If the random starting point is 8, then you
select every tenth name on the list from that point on: the subjects selected are
8, 18, 28, 38, 48, 58, 68, 78 and so on up to 998, which gives you your sample of
100 subjects. This method differs from simple random sampling because each member
of the population is not chosen independently. The advantage is that it spreads the
sample more evenly over the population and it is easier to select than a simple
random sample.
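
As an illustration, the sketch below implements the conventional form of systematic sampling, in which the sampling interval is the population size divided by the sample size (10 for a population of 1,000 and a sample of 100) and the random starting point falls within the first interval. The function name `systematic_sample` and the numbered list are hypothetical.

```python
import random

def systematic_sample(frame, sample_size, rng):
    """Pick every k-th member of `frame` after a random start within the first interval."""
    interval = len(frame) // sample_size   # sampling interval k
    start = rng.randint(1, interval)       # random position from 1 to k, inclusive
    positions = range(start, len(frame) + 1, interval)
    # Keep only the first `sample_size` picks in case the frame is not an exact multiple.
    return [frame[p - 1] for p in positions][:sample_size]

# Assumption: the accessible population is a randomly ordered list numbered 1 to 1,000.
frame = list(range(1, 1001))
rng = random.Random(5)  # fixed seed only so that the run is repeatable
sample = systematic_sample(frame, 100, rng)

print(f"Sample size: {len(sample)}")
print(f"First five subjects: {sample[:5]}")
```

Only the starting point is random; every later pick is fixed by the interval, which is why the members are not chosen independently of one another.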

Briefly discuss how you would select a sample of 300 teachers from a population of
5,000 teachers in a district using systematic sampling.
What are some advantages of using systematic sampling?
1.6.3 Stratified Sampling
In certain studies, the researcher wants to ensure that certain sub-groups or
strata of individuals are included in the sample; for this, stratified sampling is
preferred. For example, if you intend to study differences in reasoning skills among
students in your school according to socio-economic status and gender, random
sampling may not ensure that you have a sufficient number of male and female
students at each socio-economic level. The size of the sample in each stratum is
taken in proportion to the size of the stratum. This is called proportional allocation.
Suppose that Table 1.1 shows the population of students in your school.

Table 1.1: Population of Students in Your School

Group                   Number of Students
Male, High Income       160
Female, High Income     140
Male, Low Income        360
Female, Low Income      340
Total                   1,000


The first step is to calculate the percentage in each group.

% male, high income = ( 160 / 1,000 ) x 100 = 16%

% female, high income = ( 140 / 1,000 ) x 100 = 14%
% male, low income = ( 360 / 1,000 ) x 100 = 36%
% female, low income = ( 340 / 1,000) x 100 = 34%

If you want a sample of 100 students, you should ensure that:

16% should be male, high income = 16 students

14% should be female, high income = 14 students
36% should be male, low income = 36 students
34% should be female, low income = 34 students

When you take a sample from each stratum randomly, it is referred to as stratified
random sampling. The advantage of stratified sampling is that it ensures better
coverage of the population than simple random sampling. Also, it is often
administratively more convenient to stratify a sample so that interviewers can be
specifically trained to deal with a particular age group or ethnic group.
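
The proportional allocation worked out above, followed by a random draw within each stratum, might be sketched in Python as follows. The stratum sizes are those used in the percentage calculations; the student identifiers and the seed are made up for the illustration.

```python
import random

# Stratum sizes as used in the percentage calculations above.
strata_sizes = {
    ("male", "high income"): 160,
    ("female", "high income"): 140,
    ("male", "low income"): 360,
    ("female", "low income"): 340,
}
population_size = sum(strata_sizes.values())  # 1,000 students
sample_size = 100

# Proportional allocation: each stratum gets its share of the sample.
# Note: with other numbers, rounding may make the shares add up to slightly
# more or less than the intended sample size.
allocation = {
    stratum: round(size * sample_size / population_size)
    for stratum, size in strata_sizes.items()
}

# Stratified random sampling: draw randomly within each stratum.
rng = random.Random(7)  # fixed seed only so that the run is repeatable
stratified_sample = {}
for stratum, size in strata_sizes.items():
    # Hypothetical identifiers for the students in this stratum.
    members = [f"{stratum[0]}/{stratum[1]}/{i}" for i in range(1, size + 1)]
    stratified_sample[stratum] = rng.sample(members, k=allocation[stratum])

for stratum, chosen in stratified_sample.items():
    print(f"{stratum[0]}, {stratum[1]}: {len(chosen)} students")
```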


Male, full-time teachers = 90
Male, part-time teachers = 18
Female, full-time teachers = 63
Female, part-time teachers = 9

The data above shows the number of full-time and part-time teachers in a school
according to gender.

Select a sample of 40 teachers using stratified sampling.

1.6.4 Cluster Sampling
In cluster sampling, the unit of sampling is not the individual but rather a naturally
occurring group of individuals. Cluster sampling is used when it is more feasible or
convenient to select groups of individuals than it is to select individuals from a
defined population. Clusters are chosen to be as heterogeneous as possible, that is,
the subjects within each cluster are diverse and each cluster is somewhat
representative of the population as a whole. Thus, only a sample of the clusters
needs to be taken to capture all the variability in the population.

For example, in a particular district there are 10,000 households clustered into 25
sections. In cluster sampling, you draw a random sample of five sections or clusters
from the list of 25 sections or clusters. Then, you study every household in each of
the five sections or clusters. The main advantage of cluster sampling is that it saves
time and money. However, it may be less precise than simple random sampling.
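
The household example can be sketched in Python. The source states only that 10,000 households are grouped into 25 sections; the assumption that every section holds exactly 400 households, and the identifiers used, are introduced purely for the illustration.

```python
import random

# Assumption: 25 sections of 400 households each (25 x 400 = 10,000).
sections = {
    section: [f"S{section:02d}-H{house:03d}" for house in range(1, 401)]
    for section in range(1, 26)
}

rng = random.Random(11)  # fixed seed only so that the run is repeatable

# Step 1: randomly select five of the 25 sections (the clusters).
chosen_sections = rng.sample(sorted(sections), k=5)

# Step 2: study every household in each selected section.
households = [h for section in chosen_sections for h in sections[section]]

print(f"Sections chosen: {sorted(chosen_sections)}")
print(f"Households to be studied: {len(households)}")
```

Note that the random draw is made over sections, not over households, so fieldwork is concentrated in five areas instead of being scattered across the whole district.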
SPSS software is frequently used by educational researchers for data analysis. It can
be used to generate both descriptive and inferential statistical output to answer
research questions and test hypotheses. The software is modular with the base
module as its core. The other more commonly used modules are Regression Models
and Advanced Models.

To use SPSS, you have to create the SPSS data file. Once this data file is created and
data entered, you can run statistical procedures to generate your statistical output.
Refer to Appendix A at the end of this module on how to go about creating this SPSS
data file.

Statistics is a branch of mathematics dealing with the collection, analysis,
interpretation and presentation of masses of numerical data.
Descriptive statistics include the construction of graphs, charts and tables and the
calculation of various descriptive measures such as averages (means) and
measures of variation (standard deviations).
Inferential statistics or statistical induction comprises the use of statistics to make
inferences concerning some unknown aspect of a population.
A variable is a construct that is deliberately and consciously invented or adopted for
a special scientific purpose.
A variable can be either a continuous variable or a categorical variable (also
known as a nominal variable).
An independent variable (IV) is the variable that is presumed to cause a change in
the dependent variable (DV).
A dependent variable is a variable dependent on other variable(s).
Operational definition means that the variables used in a study must be defined as
they are used in the context of that study.
Population (universe) is defined as an aggregate of people, objects, items, etc.
possessing common characteristics, while sample is that part of the population or
universe we select for the purpose of investigation.
In simple random sampling, all individuals in the defined population have an equal
and independent chance of being selected as a member of the sample.
Systematic sampling is random sampling with a system. From the sampling frame, a
starting point is chosen at random, and thereafter at regular intervals.
In a stratified sample, the sampling frame is divided into non-overlapping groups or
strata and a sample is taken from each stratum.
In cluster sampling, the unit of sampling is not the individual but rather a natural
group of individuals.
Cluster sampling
Dependent variable
Descriptive statistics
Independent variable

Inferential statistics
Nominal variable
Ordinal variable
Random sampling
Stratified sampling
Systematic sampling
Creating an SPSS Data File

After you have developed your questionnaire, you need to create an SPSS data file
so that you can enter data in a format which can be read by SPSS. You can do
this via the SPSS Data Editor, which is built into the SPSS package. When creating
an SPSS data file, the items/questions in your questionnaire have to be
translated into variables. For example, if you have the question "What is your
occupation?" with several response options such as 1. Salesman 2. Clerk 3. Teacher
4. Accountant 5. Others, you need to translate the question into a variable name,
perhaps occu. In the context of SPSS data entry, the response options are called
value labels: the value 1 carries the label Salesman, 2 Clerk, 3 Teacher,
4 Accountant and 5 Others. If the respondent is a teacher, you enter 3 when
inputting data into the variable occu in your data file. Sometimes you may have a
question which requires the respondent to state a figure in absolute terms, such as
"Your annual salary is _________". In this case, you can create a variable named
salary. Since this variable only requires the respondent to state his/her salary,
you do not need to create response options; just enter the actual salary figure.
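
The coding scheme described above can be pictured with a short Python sketch. This is not SPSS syntax; it only illustrates how questionnaire answers turn into the numbers typed into the data file. The two respondents and their salaries are hypothetical.

```python
# Value labels for the variable occu, as listed in the example above.
value_labels = {1: "Salesman", 2: "Clerk", 3: "Teacher", 4: "Accountant", 5: "Others"}

# Reverse lookup: from the answer ticked on the questionnaire to the value entered.
label_to_value = {label: value for value, label in value_labels.items()}

# Hypothetical questionnaire responses.
responses = [
    {"occupation": "Teacher", "annual_salary": 48000},
    {"occupation": "Clerk", "annual_salary": 30000},
]

# Each respondent becomes one row: occu holds a code, salary holds the actual figure.
data_rows = [
    {"occu": label_to_value[r["occupation"]], "salary": r["annual_salary"]}
    for r in responses
]

for row in data_rows:
    print(row, "->", value_labels[row["occu"]])
```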

When defining the variable name, you have to consider the following:

it can only have a maximum of 8 characters (however, SPSS version 12.0 and above
allows up to 64 characters);
it must begin with a letter;
it cannot end with a full stop or underscore;
it must be unique, i.e. no duplication is allowed;
it cannot include blanks or special characters such as !, ? and *.
When defining a variable name, an uppercase character does not differ from a
lowercase character.
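
The naming rules listed above can be expressed as a small checking routine. The Python function below, `check_variable_name`, is a hypothetical helper written for this illustration; it is not part of SPSS and covers only the rules stated in this appendix.

```python
def check_variable_name(name, existing_names=(), max_length=8):
    """Return a list of the rules (as stated above) that `name` breaks."""
    problems = []
    if len(name) > max_length:
        problems.append(f"longer than {max_length} characters")
    if not name[:1].isalpha():
        problems.append("does not begin with a letter")
    if name.endswith((".", "_")):
        problems.append("ends with a full stop or underscore")
    if any(ch.isspace() or ch in "!?*" for ch in name):
        problems.append("contains a blank or special character")
    # Uppercase and lowercase are treated as the same when checking for duplicates.
    if name.lower() in {existing.lower() for existing in existing_names}:
        problems.append("duplicates an existing name")
    return problems

print(check_variable_name("occu"))                           # no rules broken
print(check_variable_name("annual_salary"))                  # too long for the 8-character limit
print(check_variable_name("annual_salary", max_length=64))   # acceptable under the 64-character limit
print(check_variable_name("OCCU", existing_names=["occu"]))  # duplicate, ignoring case
```

Passing `max_length=64` reflects the longer limit mentioned for SPSS 12.0 and above.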
Descriptive statistics are used to summarise a collection of data and present it in a
way that can be easily and clearly understood. For example, a researcher
administered a scale via a questionnaire to measure self-esteem among 500
teenagers. How might these measurements be summarised? There are two basic
methods: numerical and graphical. Using the numerical approach, one might
compute the mean and the standard deviation. Using the graphical approach, one
might create a frequency table, bar chart, a line graph or a box plot. These
graphical methods display detailed information about the distribution of the scores.
Graphical methods are better suited than numerical methods for identifying
patterns in the data. Numerical approaches are more precise and objective.

Descriptive statistics are typically distinguished from inferential statistics. With
descriptive statistics you are simply describing what is or what the data show based
on the sample. With inferential statistics, you are trying to reach conclusions based
on the sample that extend beyond the immediate data. For instance, we use
inferential statistics to infer from the sample data what the population might think.
Or, we use inferential statistics to make judgments of the probability that an
observed difference between groups is dependable or might have happened by
chance in this study. Thus, we use inferential statistics to make inferences from our
data to more general conditions; we use descriptive statistics simply to describe
what is going on in our data.

Descriptive statistics are used to present quantitative descriptions in a manageable
form. In a research study, we may have lots of measures or we may measure a
large number of people on any measure. Descriptive statistics help us to simply
depict large amounts of data in a sensible way. Each descriptive statistic reduces
lots of data into a simpler summary. For instance, consider Grade Point Average
(GPA). This single number describes the general performance of a student across a
potentially wide range of course experiences. The number describes a large number
of discrete events such as the grade obtained for each subject taken. However,
every time you try to describe a large set of observations with a single indicator you
run the risk of distorting the original data or losing important details. The GPA does
not tell you whether a student was in a difficult or easy course, or whether the
student was taking courses in his major field or in other disciplines. Even given
these limitations, descriptive statistics provide a powerful summary of phenomena
that may enable comparisons across people or other units.


2.2.1 Mean
The mean and the standard deviation are the most widely used statistical tools in
educational and psychological research. The mean is the most frequently used measure
of central tendency, while the standard deviation is the most frequently used measure
of variability or dispersion.

Computing the Mean

The mean, written $\bar{X}$ (pronounced "X bar"), is the figure obtained when the sum
of all the items in the group is divided by the number of items (N). Say, for example,
you have the scores of 10 students on a science test.

The sum ($\sum X$) of all the ten scores is

$$\sum X = 23 + 22 + 26 + 21 + 30 + 24 + 20 + 27 + 25 + 32 = 250$$

so the mean is

$$\bar{X} = \frac{\sum X}{N} = \frac{250}{10} = 25$$

In the computation of the mean, every item counts. As a result, extreme values at
either end of the group or series of scores severely affect the value of the mean.
The mean could be "pulled towards" the extreme scores, which may give a distorted
picture of the group or series of scores or data.

However, in general, the mean is a good measure of central tendency for roughly
symmetric distributions but can be misleading in skewed distributions (see the
example on page 20) since it can be greatly influenced by extreme scores.
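
The computation of the mean for the ten science scores in this section can be checked with a few lines of Python. This is only a sketch of the arithmetic; the variable names are arbitrary.

```python
from statistics import mean

# The ten science test scores used in the example above.
scores = [23, 22, 26, 21, 30, 24, 20, 27, 25, 32]

total = sum(scores)   # sum of all the scores
n = len(scores)       # number of scores (N)
x_bar = total / n     # mean = sum divided by N

print(f"Sum = {total}, N = {n}, mean = {x_bar}")  # Sum = 250, N = 10, mean = 25.0

# The library function gives the same value.
assert mean(scores) == x_bar
```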
2.2.2 Median
Median is the score found at the exact middle of the set of values. One way to
compute the median is to list all scores in ascending order and then locate the score
in the centre of the sample. For example, if we order the following seven scores as
shown below, we would get:

Score 25 is the median because it represents the halfway point for the distribution
of scores.

Look at this set of eight scores. What is the median score?

There are eight scores. The fourth score (20) and the fifth score (20) represent the
halfway point. Since both of these scores are 20, the median is 20.

If the two middle scores had different values, you have to interpolate to determine
the median by adding up the two values and dividing the sum by 2. For example,

The median is (18 + 20)/2 = 19.

2.2.3 Mode
Mode is the most frequently occurring value in the set of scores. To determine the
mode, you might again order the scores as shown below and then count each one.

The most frequently occurring value is the mode. In our example, the value 15
occurs three times and is the mode. In some distributions, there is more than one
modal value. For instance, in a bimodal distribution there are two values that occur
most frequently.

If the distribution is truly normal (i.e. bell-shaped), the mean, median and mode are
all equal to each other.
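
The median and mode rules described above can be sketched with Python's standard `statistics` module. The score lists below are hypothetical and chosen only to illustrate an odd-sized set, an even-sized set whose two middle scores differ, a single mode and a bimodal set.

```python
import statistics

# Hypothetical score lists, for illustration only.
odd_scores = [12, 18, 25, 30, 41]              # five scores: the middle one is the median
even_scores = [10, 15, 18, 20, 24, 30]         # six scores: average the two middle ones
single_mode = [12, 15, 15, 15, 18, 20, 20]     # 15 occurs most often
two_modes = [10, 12, 12, 15, 18, 18, 20]       # 12 and 18 both occur twice

median_odd = statistics.median(odd_scores)     # middle score of the ordered list
median_even = statistics.median(even_scores)   # (18 + 20) / 2
modes_single = statistics.multimode(single_mode)
modes_double = statistics.multimode(two_modes)

print(median_odd, median_even, modes_single, modes_double)
```

`statistics.multimode` returns a list, so a bimodal distribution shows up as a list with two values.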

Should You Use the Mean or the Median?

The mean and median are two common measures of central tendency that describe the
typical score in a sample. Which of these two should you use when describing your data? It
depends on your data. In other words, you should ask yourself whether the measure
of central tendency you have selected gives a good indication of the typical score in
your sample. If you suspect that the measure of central tendency selected does not
give a good indication of the typical score, then you most probably have chosen the
wrong one.

The mean is the most frequently used measure of central tendency and it should be
used if you are satisfied that it gives a good indication of the typical score in your
sample. However, there is a problem with the mean. Since it uses all the scores in a
distribution, it is sensitive to extreme scores.


The mean for this set of nine scores:

20 + 22 + 25 + 26 + 30 + 31 + 33 + 40 + 42 is 29.89

If we were to change the last score from 42 to 70, see what happens to the mean:

20 + 22 + 25 + 26 + 30 + 31 + 33 + 40 + 70 is 33.00

Obviously, this mean is not a good indication of the typical score in this set of data.
The extreme score has changed the mean from 29.89 to 33.00. If these were test
scores, it may give the impression that students performed better in the later test
when in fact only one student scored highly.

NOTE: Keep in mind this characteristic when interpreting the mean obtained from a
set of data.

If you find that you have an extreme score and you are unable to use the mean,
then you should use the median. The median is not sensitive to extreme scores. If
you examine the above example, the median is 30 in both distributions. The reason
is simply that the median score does not depend on the actual scores themselves
beyond putting them in ascending order. So the last score in a distribution could be
80, 150 or 5,000 and the median still would not change. It is this insensitivity to
extreme scores that makes the median useful when you cannot use the mean.
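
The effect of the extreme score can be reproduced as follows. This is a small sketch using the two sets of nine scores from the example above; the variable names are chosen for illustration.

```python
import statistics

# The two sets of nine scores from the example above.
original = [20, 22, 25, 26, 30, 31, 33, 40, 42]
with_extreme = [20, 22, 25, 26, 30, 31, 33, 40, 70]   # last score changed from 42 to 70

mean_original = round(sum(original) / len(original), 2)
mean_extreme = round(sum(with_extreme) / len(with_extreme), 2)

median_original = statistics.median(original)
median_extreme = statistics.median(with_extreme)

print(mean_original, mean_extreme)      # the mean shifts upwards
print(median_original, median_extreme)  # the median stays the same
```

Only one score was changed, yet the mean moves by more than three marks while the median does not move at all.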
Variability or dispersion refers to the spread of the values around the central
tendency. There are two common measures of dispersion, the range and the
standard deviation.
2.3.1 Range
Range is simply the highest value minus the lowest value. For example, in a
distribution, if the highest value is 36 and the lowest is 15, the range is 36 - 15 = 21.
2.3.2 Standard Deviation
Standard deviation is a more accurate and detailed estimate of dispersion because
an outlier can greatly exaggerate the range. The standard deviation shows the
relation that a set of scores has to the mean of the sample. For instance, when you
give a test, there is bound to be variation in the scores obtained by students.
Variability, variation or dispersion is determined by the distance of a particular score
from the norm or measure of central tendency such as the mean. The standard
deviation is a statistic that shows the extent of variability or variation for a given
series of scores from the mean.

Standard deviation makes use of the deviations of the individual scores from the
mean. Each individual deviation is then squared so that the plus and minus
deviations do not cancel each other out. Standard deviation is the most often used
measure of variability or variation in educational and psychological research.

The following is the formula for calculating standard deviation:

$$\text{Standard deviation} = \sqrt{\frac{\sum (X - \bar{X})^2}{N - 1}}$$

a. Interpretation of the Formula

Standard deviation is found by:

Taking the difference between each item $X$ and the mean $\bar{X}$, i.e. $X - \bar{X}$;
Squaring this difference, $(X - \bar{X})^2$;
Summing all the squared differences, $\sum (X - \bar{X})^2$;
Dividing by the number of scores (N) minus 1; and
Extracting the square root.
b. Computing Standard Deviation

Example: A mathematics test was given to a group of 10 students. Their scores are
shown in Column 1 of Table 2.1.

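Table 2.1: Example of Computing Standard Deviation

Column 1     Column 2            Column 3
$X$          $X - \bar{X}$       $(X - \bar{X})^2$

23           23 - 25 = -2         4
22           22 - 25 = -3         9
26           26 - 25 = +1         1
21           21 - 25 = -4        16
30           30 - 25 = +5        25
24           24 - 25 = -1         1
20           20 - 25 = -5        25
27           27 - 25 = +2         4
25           25 - 25 =  0         0
32           32 - 25 = +7        49

$\sum X = 250$, $\bar{X} = 25$

$\sum (X - \bar{X})^2 = 134$

Apply the formula:

$$\text{Standard deviation} = \sqrt{\frac{134}{10 - 1}} = \sqrt{14.89} \approx 3.86$$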
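
The steps in Table 2.1 can be traced one column at a time in Python. This is a minimal sketch that assumes the ten scores of the worked example; `statistics.stdev` is used only as a cross-check because it also divides by N minus 1.

```python
import math
import statistics

# Column 1: the ten scores from the worked example.
scores = [23, 22, 26, 21, 30, 24, 20, 27, 25, 32]

n = len(scores)
mean = sum(scores) / n

# Column 2: deviation of each score from the mean.
deviations = [x - mean for x in scores]

# Column 3: squared deviations.
squared = [d ** 2 for d in deviations]

sum_of_squares = sum(squared)
sd = math.sqrt(sum_of_squares / (n - 1))   # divide by N - 1, then take the square root

# The range, for comparison with the standard deviation.
score_range = max(scores) - min(scores)

print(mean, sum_of_squares, round(sd, 2), score_range)
print(round(statistics.stdev(scores), 2))  # library cross-check
```

The deviations in Column 2 always sum to zero, which is why they are squared before being added up.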

c. Differences in Standard Deviations

A mathematics test was administered to Class A and Class B. The distributions of the
scores are shown below.

In Class A (Figure 2.1), the scores are widely spread out, which means there is high
variance or a bigger standard deviation, i.e. most of the scores are within ± 6 of the
mean. If the mean is 50, then you can say that approximately 95% of the students
scored between 44 and 56.

Figure 2.1: Standard deviation

In Class B (Figure 2.2), there is low variance or a small standard deviation, which
explains why most of the scores are clustered around the mean, i.e. most of the
scores are within ± 3 of the mean. If the mean is 50, approximately 95% of the
students scored between 47 and 53.

Figure 2.2: Standard deviation

Below are the scores obtained by students in two classes on a history test:

Class A marks: 15, 25, 20, 20, 18, 22, 16, 24, 28, 12

Class B marks: 10, 30, 13, 27, 16, 24, 5, 35, 28, 12

Compute the mean of the two classes.

Compute the standard deviation of the two classes.
Explain the implication of differences in standard deviations.
Frequency distribution is a way of displaying numbers in an organised manner. A
frequency distribution is simply a table that, at the minimum, displays how many
times in a data set each response or "score" occurs. A good frequency distribution
will display more information than this; although with just this minimal information,
many other bits of information can be computed.
2.4.1 Tables
Tables can contain a great deal of information but they also take up a lot of space
and may overwhelm readers with details. How should tables be presented in a
manner that can be easily understood? In general, frequency tables are best for
variables with different numbers of categories (see Table 2.2).

Table 2.2: Question: Should Sex Education be Taught in Secondary School?

                        Frequency   Percent   Valid Percent   Cumulative Percent
4. Strongly Agree            1         7.7         7.7                7.7
3. Agree                     3        23.1        23.1               30.8
2. Disagree                  4        30.8        30.8               61.5
1. Strongly Disagree         5        38.5        38.5              100.0
Total                       13       100.0       100.0

Table 2.2 summarises the responses of 13 teachers with regard to the teaching of
sex education in secondary school.
The first column contains the values or categories of the variable (opinion on
teaching sex education in schools, i.e. extent of agreement).
The frequency column indicates the number of respondents in each category.
The percent column lists the percentage of the whole sample in each category.
These percentages are based on the total sample size, including those who did not
answer the question. Those who did not answer will be shown as missing cases in
this column.
The valid percent column contains the percentage of those who gave a valid
response to the question that belongs to each category. When there are no missing
cases, the valid percent column is identical to the percent column.
The cumulative percentage column provides the rolling addition of percentages
from the first category to the last valid category. For example, 7.7 percent of
teachers strongly agree that sex education should be taught in secondary school. A
further 23.1 percent of them simply agree that sex education should be taught. The
cumulative percentage column adds up the percentage of those who strongly agree
with those who agree (7.7 + 23.1 = 30.8). Thus, 30.8 percent at least agree (either
agree or strongly agree) that sex education should be taught in secondary school.
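
The percent and cumulative percent columns can be rebuilt from raw responses as in the sketch below. The individual counts used here (1, 3, 4 and 5) are an assumption inferred from the total of 13 teachers and the percentages quoted in the text; the coding of 4 for Strongly Agree down to 1 for Strongly Disagree follows Table 2.2.

```python
from collections import Counter

# Value labels as coded in Table 2.2.
labels = {4: "Strongly Agree", 3: "Agree", 2: "Disagree", 1: "Strongly Disagree"}

# Assumed raw responses of the 13 teachers (counts inferred, not taken from raw data).
responses = [4] + [3] * 3 + [2] * 4 + [1] * 5

counts = Counter(responses)
total = len(responses)

rows = []
cumulative = 0.0
for code in (4, 3, 2, 1):                    # same order as the table
    frequency = counts[code]
    percent = 100 * frequency / total
    cumulative += percent                    # rolling addition of percentages
    rows.append((labels[code], frequency, round(percent, 1), round(cumulative, 1)))

for row in rows:
    print(row)
```

There are no missing cases in this sketch, so the valid percent would equal the percent in every row.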
2.4.2 SPSS Procedure
To obtain a frequency table, measure of central tendency and variability:

Select the Analyze menu.

Click on Descriptive Statistics and then on Frequencies to open the Frequencies
dialogue box.
Select the variable(s) you require (i.e. opinion on sex education) and click on the
arrow button to move the variable into the Variable(s) box.
Click on the Statistics... command push button to open the Frequencies: Statistics
sub-dialogue box.
In the Central Tendency box, select the Mean, Median and Mode check boxes.
In the Dispersion box, select the Std. deviation and Range check boxes.
Click on Continue and then OK.
Graphs are widely used in describing data. However, they should be used
appropriately. There is a tendency for graphs to be cluttered, confusing and downright
2.5.1 Bar Charts
The following are elements of a graph that should be given due consideration (refer
to Figure 2.3):

The X-axis represents the values of the variables being displayed. The X-axis may
be divided into discrete categories (bar charts) or continuous values (line graphs).
Which units are used depends on the level of measurement of the variable being
displayed.
In the example in Figure 2.3, the X-axis represents the students' gain scores after
undergoing an innovative instructional programme.

The Y-axis, which appears either in percentages or frequencies, as in Figure 2.3,
shows the frequency of students who obtained the various scores indicated on the
X-axis.
Interpretation of the graph on Students' Gain Scores:
A total of 275 students obtained between 1 and 5 marks as a result of the
innovative instructional programme; 199 obtained between 6 and 10 marks; 77
between 11 and 15 marks; and 28 between 16 and 20 marks.
The number of students who obtained high gain scores decreases gradually.
Figure 2.3: Students' gain scores

2.5.2 Histogram
Histograms are different from bar charts because they are used to display
continuous variables (see the histogram in Figure 2.4).

Figure 2.4: Percentage who agreed that sex education should be taught in
secondary schools
The X-axis represents the different age groups, while the Y-axis represents the
percentages of respondents.

Each bar on the X-axis represents one age group, in ascending order.
The Y-axis in this case represents the percentage of respondents in the Sex
Education survey who agreed.
Interpretation of the graph "Sex Education Should be Taught in Secondary School":
Among the 18 to 28 age group, only 20% agreed that sex education should be
taught in schools compared to 60% in the 51 to 61 age group.
About 40% in the 40 to 50 age group and 50% among the 29 to 39 age group
agreed that sex education should be taught in secondary schools.
Only 10% of those aged 73 years and older agreed that secondary school students
should be taught sex education.

Figure 2.5: Example of a line graph

Interpret the line graph (Figure 2.5) showing the frequency of a group of
respondents visiting the library. A separate line is used for male and female

Descriptive statistics are used to summarise a collection of data and present it in a
way that can be easily and clearly understood.
Mean, median and mode are common descriptive statistics used to measure central
tendency, while standard deviation is the commonly used statistic to measure
variability or dispersion of data.
A frequency distribution is a table that, at the minimum, displays how many times in
a data set each response or "score" occurs.
Graphs are also used to condense large sets of data and these include the use of
bar charts, histograms and line graphs.

Frequency distribution
Measures of central tendency
Measures of variability or dispersion
Standard deviation
Creating an SPSS Data File

After you have developed your questionnaire, you need to create an SPSS data file
to enable you to enter data into a format which can be read by SPSS. You can do
this via the SPSS Data Editor which is inbuilt into the SPSS package. When creating
an SPSS data file, your items/questions in the questionnaire will have to be
translated into variables. For example, if you have a question "What is your
occupation?" and this question has several response options such as 1. Salesman 2.
Clerk 3. Teacher 4. Accountant 5. Others, what you need to do is to translate your
question into a variable name, perhaps called occu. In the context of SPSS data
entry, these response options are called value labels; for example, Salesman is
assigned the value 1, Clerk 2, Teacher 3, Accountant 4 and Others 5. If the
respondent is a teacher, you enter 3 when inputting data into the variable occu in
your data file. Sometimes you may have a question which requires the respondent
to state a figure in absolute terms, such as "Your annual salary is _________". In this
case, you can create a variable name called salary. Since this variable only requires the
respondent to state his/her salary, you do not need to create response options; just
enter the actual salary figure.

When defining the variable name, you have to consider the following:

it can only have a maximum of 8 characters (however, SPSS version 12.0 and above
allows up to 64 characters);
it must begin with a letter;
it cannot end with a full stop or underscore;
it must be unique, i.e. no duplication is allowed;
it cannot include blanks or special characters such as !, ? and *.
When defining a variable name, an uppercase character does not differ from a lower
case character.
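
The naming rules listed above can be expressed as a small checker. This is only a sketch of the rules as stated in this topic, not SPSS's own validation; the function name `check_variable_name` and its parameters are hypothetical.

```python
def check_variable_name(name, existing=(), max_len=64):
    """Return a list of rule violations; an empty list means the name passes.

    Hypothetical helper that encodes only the rules listed in the text.
    """
    problems = []
    if not name or not name[0].isalpha():
        problems.append("must begin with a letter")
    if len(name) > max_len:
        problems.append(f"longer than {max_len} characters")
    if name.endswith((".", "_")):
        problems.append("ends with a full stop or underscore")
    if any(ch.isspace() or ch in "!?*" for ch in name):
        problems.append("contains a blank or special character")
    # Upper and lower case are treated as the same, so compare in lower case.
    if name.lower() in {other.lower() for other in existing}:
        problems.append("duplicates an existing name")
    return problems


print(check_variable_name("occu"))
print(check_variable_name("my var"))
print(check_variable_name("OCCU", existing=["occu"]))
```

Passing `max_len=8` would mimic the older 8-character limit mentioned above.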

Besides understanding the variable name convention and value labels, you will also
need to know other variable definitions such as variable label, variable type,
missing values, column format and measurement level. A variable label describes
the variable name, for example, if the variable name is occu, the variable label can
be "Respondent's occupation". You need not specify the variable label if you do not wish
to, but a variable label improves the interpretability of your output, especially if you
have many variables. Missing values can also be assigned to a variable. It is rare for
one to obtain a questionnaire without any item being left blank. By convention, a
missing value is usually assigned a value of 9, but for statistical analysis it would be
preferable to assign a value which is equivalent to the mean of the variable to fill up
all the missing values. However, this can only be done for interval or ratio level
variables. For example, if you have the variable income, data were collected from
150 respondents and 20 of them did not provide their income information, then
compute the mean of the income via SPSS for the 130 respondents who answered and
recode all missing values as the computed mean value.
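
The mean-substitution step described above can be sketched in Python. The income figures are hypothetical, and `None` is used here to stand for a missing answer; this is an illustration of the idea, not of the SPSS recode procedure itself.

```python
import statistics

# Hypothetical incomes; None marks a respondent who left the item blank.
incomes = [2500, None, 3200, 4100, None, 2800]

# The mean is computed only from respondents who answered.
valid = [x for x in incomes if x is not None]
mean_income = sum(valid) / len(valid)

# Recode every missing value as the computed mean.
filled = [mean_income if x is None else x for x in incomes]

print(mean_income)
print(filled)
print(round(statistics.stdev(valid), 1), round(statistics.stdev(filled), 1))
```

The last line shows a side effect worth keeping in mind: filling gaps with the mean leaves the mean unchanged but makes the standard deviation smaller, because the substituted values sit exactly at the centre.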

The type of variable relates closely to your items in the questionnaire. For example,
the item age is a numeric variable, meaning you can input the variable using only
numbers; if a person's age is 34, you type 34 under the age variable column for
this particular case. However, sometimes there is a need to use
alphanumeric characters to input data into a variable. A good example is the
respondent's address. In this case, alphanumeric characters constitute what is
called a string variable type. For example, a short open-ended question might be
"Please state your address." The respondent will write his/her address using
alphanumeric characters such as 23 Jalan SS2/75, 47301 Petaling Jaya, Selangor. So
this address is actually a combination of letters and numbers.

The column format in the data editor allows you to specify the alignment of your
data in a column, for example left, centre or right. Measurement in the SPSS
variable definition convention differs slightly from that used in the statistics
textbook as SPSS uses scale to refer to both interval and ratio measurement.
Ordinal and nominal levels of measurement are maintained as they are. In statistical
analysis, it is extremely important to know what the level of measurement for a
particular variable is. A nominal variable (also called categorical variable) classifies
persons or objects into two or more categories, for example, the variable gender is
categorised as 1 for Male and 2 for Female, marital status as 1 for Single, 2 for
Married and 3 for Divorced. Numbering in nominal variables does not indicate that
one category is higher or better than another; for example, coding Male as 1
and Female as 2 does not mean that male is lower than female by virtue of the
number being smaller. In nominal measurement the numbers are only labels. On the
other hand, an ordinal variable not only classifies persons or objects; it also ranks
them in terms of degree. Ordinal variables put persons or objects in order from
highest to lowest or from most to least. In an ordinal scale, intervals between ranks
are not necessarily equal; the difference between rank 1 and rank 2 is not necessarily
the same as the difference between rank 2 and rank 3. For example, person A with a
height of 5'10" is ranked 1, person B with a height of 5'5" is ranked 2 and person C
with a height of 4'8" is ranked 3. The difference in height between adjacent ranks is
not equal (5 inches between A and B, 9 inches between B and C) but there is an
order, i.e. A is taller than B and B is taller than C.

Interval variables have all the characteristics of nominal and ordinal variables but
also have equal intervals. For example, achievement test is treated as an interval
variable. The difference in a score of 50 and a score of 60 is essentially the same as
the difference between the score of 80 and 90. Interval scales, however, do not
have a true zero point. Thus, if Ahmad has a score of 0 for Mathematics, it does not
mean he has no knowledge of mathematics at all, nor does Muthu scoring 100
mean he has total knowledge of mathematics. Likewise, if a person scores 90 marks,
we know his score is twice as high as that of one who scores 45, but we cannot say
that a person scoring 90 knows twice as much as a person scoring 45.

Ratio variables are the highest, most precise level of measurement. This type of
variable has all the properties of the other types of variables above. In addition, it
has a true zero point. Take a person's height, for example: a person who is 6 feet tall is
twice as tall as a person who is 3 feet tall. A person who weighs 50 kg is one third the
weight of another who is 150 kg. Since ratio scales encompass mostly physical
measures, they are not used very often in social science research.

In SPSS, interval and ratio measurements are classified as scale variables. Nominal
and ordinal measurements remain as they are, i.e. nominal and ordinal variables.

A good understanding of the level of measurement will be useful when defining the
variables via the SPSS Data Editor and in the data analysis process. But before you
proceed to the next phase of data analysis, you need to enter data into a format
which can be read by SPSS. There are several ways you may do this, using (i) the SPSS
Data Editor, (ii) Excel, (iii) Access or (iv) Word. The steps to enter data via the SPSS
Data Editor are described below.

How to define variables and enter data using the SPSS Data Editor?


Click Start → All Programs → SPSS for Windows → SPSS 12.0 for Windows → select
Type in data → OK → Variable View. Start defining your variables by specifying the
following:
Name: Type Gender <Enter>
Type: Select Numeric → OK
Width: 8
Decimal: 0
Label: Respondent's gender
Values: Under Value, type 1; under Value Label, type Male; click Add
Under Value again, type 2; under Value Label, type Female
Click Add
Missing: No missing values → OK
Columns: 8
Align: Right
Measure: Nominal
Proceed to define the second variable and so forth until you have completed all
variables in your questionnaire. Do note that certain variables such as ID do not
have value labels. If you are not sure what the level of measurement for that
particular variable is, you may want to keep the default which is Scale. Do
remember that if the particular variable you are defining shares the same
specification, such as the variable label, as a variable you have already defined, then
you may merely copy it into the relevant cells.
After you have completed defining all your variables, the next step is to enter data
into the data cells by doing the following:
Click Data View
Click row 1, column 1 (note the variable name as shown)
Type in the data, e.g. if the respondent's gender is male, then type 1 and then
proceed to the next variable by pressing the right arrow key (→) on your keyboard.
Input the next variable and so on and so forth until you have completed all your data.
By the end of this topic, you should be able to:

Explain what normal distribution means;

Assess normality using graphical techniques: histogram;
Assess normality using graphical techniques: box plots;
Assess normality using graphical techniques: normality plots; and
Assess normality using statistical techniques.


This topic explains what normal distribution is and introduces the graphical as well
as the statistical techniques used in assessing normality. It also presents SPSS
procedures for assessing normality.


Now that you know what mean stands for, as well as the standard deviation of a set
of scores, we can proceed to examine the concept of normal distribution. The
normal curve was developed mathematically in 1733 by DeMoivre as an
approximation to the binomial distribution. Laplace used the normal curve in 1783
to describe the distribution of errors. However, it was Gauss who popularised the
normal curve when he used it to analyse astronomical data in 1809 and it became
known as the Gaussian distribution.

The term normal distribution refers to a particular way in which scores or
observations tend to pile up or distribute around a particular value rather than be
scattered all over. The normal distribution, which is bell-shaped, is based on a
mathematical equation (which we will not get into).

While some argue that in the real world, scores or observations are seldom normally
distributed, others argue that in the general population, many variables such as
height, weight, IQ scores, reading ability, job satisfaction and blood pressure turn
out to have distributions that are bell-shaped or normal.
Normal distribution is important for the following reasons:

Many physical, biological and social phenomena or variables are normally
distributed. However, some variables are only approximately normally distributed.
Many kinds of statistical tests (such as t-test, ANOVA) are derived from a normal
distribution. In other words, most of these statistical tests work best when the
sample tested is distributed normally.
Fortunately, these statistical tests work very well even if the distribution is only
approximately normally distributed. Some tests work well even with very wide
deviations from normality. They are described as robust tests that are able to
tolerate the lack of a normal distribution.

3.3.1 Mean, Median and Mode

The centre of the distribution is the mean. The mean of a normal distribution is also
the most frequently occurring value (i.e. the mode) and it is also the value that
divides the distribution of scores into two equal parts (i.e. the median). In any
normal distribution, the mean, median and the mode all have the same value (i.e.
100 in the example above).
Normal distribution shows the area under the curve. The three-standard-deviations
rule, when applied to a variable, states that almost all the possible observations or
scores of the variable lie within three standard deviations to either side of the mean.
The normal curve is close to (but does not touch) the horizontal axis outside the
range of the three standard deviations to either side of the mean. Based on the
graph in Figure 3.1, you will notice that with a mean of 100 and a standard deviation
of 15;

About 68% of all IQ scores fall between 85 (i.e. one standard deviation less than the
mean, which is 100 - 15 = 85) and 115 (i.e. one standard deviation more than the
mean, which is 100 + 15 = 115).
About 95% of all IQ scores fall between 70 (i.e. two standard deviations less than the
mean, which is 100 - 30 = 70) and 130 (i.e. two standard deviations more than the
mean, which is 100 + 30 = 130).
About 99.7% of all IQ scores fall between 55 (i.e. three standard deviations less than
the mean, which is 100 - 45 = 55) and 145 (i.e. three standard deviations more than
the mean, which is 100 + 45 = 145).
A normal distribution can have any mean and standard deviation. However, the
percentage of cases or individuals falling within one, two or three standard
deviations from the mean is always the same. The shape of a normal distribution
does not change. Means and standard deviations will differ from variable to variable
but the percentage of cases or individuals falling within specific intervals is always
the same in a true normal distribution.
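
These percentages can be verified with `NormalDist` from Python's standard `statistics` module. The sketch below uses the IQ example (mean 100, standard deviation 15) and, for comparison, a second normal distribution whose mean of 50 and standard deviation of 6 are assumed purely for illustration.

```python
from statistics import NormalDist


def share_within(dist, k):
    """Proportion of a normal distribution lying within k standard deviations of its mean."""
    lower = dist.mean - k * dist.stdev
    upper = dist.mean + k * dist.stdev
    return dist.cdf(upper) - dist.cdf(lower)


iq = NormalDist(mu=100, sigma=15)    # the IQ example from the text
other = NormalDist(mu=50, sigma=6)   # assumed values, for comparison only

for k in (1, 2, 3):
    print(k, round(share_within(iq, k), 4), round(share_within(other, k), 4))
```

Both columns of output agree for every k, which illustrates that the proportions depend only on the number of standard deviations and not on the particular mean or standard deviation.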

What is meant by the statement that a population is normally distributed?
Two normally distributed variables have the same means and the same standard
deviations. What can you say about their distributions? Explain your answer.
Which normal distribution has a wider spread: the one with mean 1 and standard
deviation 2 or the one with mean 2 and standard deviation 1? Explain your answer.
The mean of a normal distribution has no effect on its shape. Explain.
What are the parameters for a normal curve?
Often in statistics, one would like to assume that the sample under investigation has
a normal distribution or an approximate normal distribution. However, such an
assumption should be supported in some way by some techniques. As mentioned
earlier, the use of several inferential statistics such as the t-test and ANOVA requires
that the variables analysed are normally distributed or at least
approximately normally distributed. However, as discussed in Topic 1, if a simple
random sample is taken from a population, the distribution of the observed values
of a variable in the sample will approximate the distribution of the population.
Generally, the larger the sample, the better the approximation tends to be. In other
words, if the population is normally distributed, the sample of observed values
would also be approximately normally distributed if the sample is randomly selected
and it is large enough.
3.5.1 Assessing Normality using Graphical Methods
Assessing normality means determining whether the samples of students, teachers,
parents or principals you are studying are normally distributed. When you draw a
sample from a population that is normally distributed, it does not mean that your
sample will necessarily have a distribution that is exactly normal. Samples vary, so
the distribution of each sample may also vary. However, if a sample is reasonably
large and it comes from a normal population, its distribution should look more or
less normal.

For example, when you administer a questionnaire to a group of school principals,
you want to be sure that your sample of 250 principals is normally distributed. Why?
The assumption of normality is a prerequisite for many inferential statistical
techniques and there are two main ways of determining the normality of a
distribution.

The normality of a distribution can be determined using graphical methods (such as
histograms, stem-and-leaf plots and boxplots) or using statistical procedures (such
as the Kolmogorov-Smirnov statistic and the Shapiro-Wilk statistic).

SPSS Procedures for Assessing Normality

There are several procedures to obtain the different graphs and statistics to assess
normality, for example the EXPLORE procedure is the most convenient when both
graphs and statistics are required.

From the main menu, select Analyze.

Click Descriptive Statistics and then Explore to open the Explore dialogue box.

Select the variable you require and click the arrow button to move this variable into
the Dependent List: box.

Click the Plots... command push button to obtain the Explore: Plots sub-dialogue
box.

Click the Histogram check box and the Normality plots with tests check box, and
ensure that the Factor levels together radio button is selected in the Boxplots box.

Click Continue.

In the Display box, ensure that Both is activated.

Click the Options... command push button to open the Explore: Options sub-dialogue box.

In the Missing Values box, click Exclude cases pairwise (if not selected by default).

Click Continue and then OK.

3.5.1 (b) Assessing Normality using Skewness

Skewness is the degree of departure from the symmetry of a distribution. A normal
distribution is symmetrical. A non-symmetrical distribution is described as being
either negatively or positively skewed. A distribution is skewed if one of its tails is
longer than the other or the tail is pulled to either the left or the right.

Refer to Figure 3.3, which shows the distribution of the scores obtained by students
on a test. There is a positive skew because it has a longer tail in the positive
direction or the long tail is on the right side (towards the high values on the
horizontal axis).

What does it mean? It means that more students were getting low scores in the test
and this indicates that the test was too difficult. Alternatively, it could mean that the
questions were not clear or the teaching methods and materials did not bring about
the desired learning outcomes.

Figure 3.3: Distribution of scores obtained by students on a test

Refer to Figure 3.4 which shows the distribution of the scores obtained by students
on a test. There is a negative skew because it has a longer tail in the negative
direction or to the left (towards the lower values on the horizontal axis).

What does it mean? It means that more students were getting high scores on the
test. This may indicate that either the test was too easy or the teaching methods
and materials were successful in bringing about the desired learning outcomes.

Figure 3.4: Distribution of scores obtained by students on a test

Interpreting the Statistics for Skewness

Besides graphical methods, you can also determine skewness by examining the
statistics reported. A normal distribution has a skewness of 0. See the table on the
right in Figure 3.5, which reports the skewness statistics for three independent
groups. A positive value indicates a positive skew, while a negative value indicates
a negative skew.

Among the three groups, Group 3 departs most from a normal distribution. Its
skewness value of -1.200 has an absolute value greater than 1, which normally indicates
that the distribution is non-symmetrical (Rule of thumb: an absolute skewness value
greater than 1 indicates a non-symmetrical distribution).

The distribution of Group 2, with a skewness value of .235, is closest to the normal
value of 0, followed by Group 1 with a skewness value of .973.
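
To make the sign of the skewness statistic concrete, the sketch below computes a simple moment-based skewness for three small hypothetical data sets. Note the assumption: this is the unadjusted formula, whereas SPSS reports a sample-adjusted version, so values for small samples will differ somewhat even though the sign agrees.

```python
def skewness(data):
    """Moment-based skewness: third central moment divided by the 1.5 power of the second."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5


# Hypothetical scores, for illustration only.
symmetric = [1, 2, 3, 4, 5]       # balanced around the mean
long_right = [1, 2, 2, 3, 10]     # one high score stretches the right tail
long_left = [1, 8, 9, 9, 10]      # one low score stretches the left tail

print(skewness(symmetric), skewness(long_right), skewness(long_left))
```

A longer right tail gives a positive value and a longer left tail gives a negative value, matching the interpretation of Figures 3.3 and 3.4.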

Figure 3.5: Skewness statistics for three independent groups

3.5.1 (c) Assessing Normality using Kurtosis

Kurtosis indicates the degree of "flatness" or "peakedness" in a distribution relative
to the shape of normal distribution. Refer to the graphs in Figure 3.6.

Figure 3.6: Kurtosis

Low Kurtosis: Data with low kurtosis tend to have a flat top near the mean rather
than a sharp peak.
High Kurtosis: Data with high kurtosis tend to have a distinct peak near the mean,
decline rather rapidly and have a heavy tail.
See the graphs in Figure 3.7:

A normal distribution has a kurtosis of 0 and is called mesokurtic (Graph A). (Strictly
speaking, a mesokurtic distribution has a value of 3 but in line with the practice
used in SPSS, the adjusted version is 0).
If a distribution is peaked (tall and skinny), its kurtosis value is greater than 0 and it
is said to be leptokurtic (Graph B) and has a positive kurtosis.
If, on the other hand, the distribution is flat, its kurtosis value is less than 0 and
it is said to be platykurtic (Graph C) and has a negative kurtosis.
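The adjustment from 3 to 0 mentioned above can be seen in a small Python sketch. It uses the simple moment-based definition of kurtosis, not the bias-corrected formula that SPSS applies to samples, so the numbers will differ slightly from SPSS output. The function names and both data sets are hypothetical.

```python
def kurtosis(scores, excess=True):
    """Moment-based kurtosis; excess=True subtracts 3 so a normal curve scores 0."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((x - mean) ** 2 for x in scores) / n   # second central moment
    m4 = sum((x - mean) ** 4 for x in scores) / n   # fourth central moment
    raw = m4 / m2 ** 2
    return raw - 3 if excess else raw

def shape(excess_kurtosis):
    """Name the shape from the sign of the adjusted (excess) kurtosis."""
    if excess_kurtosis > 0:
        return "leptokurtic"
    if excess_kurtosis < 0:
        return "platykurtic"
    return "mesokurtic"

# Hypothetical data: evenly spread scores versus scores bunched at the mean
# with two far-out values (heavy tails).
flat = list(range(1, 11))
peaked = [20, 48, 49, 50, 50, 50, 50, 51, 52, 80]

print(shape(kurtosis(flat)))     # platykurtic
print(shape(kurtosis(peaked)))   # leptokurtic
```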

Figure 3.7: Mesokurtic, Leptokurtic and Platykurtic

Interpreting the Statistics for Kurtosis

Besides graphical methods, you can also determine kurtosis by examining the
statistics reported. A normal distribution has a kurtosis of 0. See the table below in
Figure 3.8, which reports the kurtosis statistics for three independent groups.

Figure 3.8: Kurtosis statistics for three independent groups

Group 1, with a kurtosis value of 0.500 (positive value), is more normally distributed
than the other two groups because its value is closer to 0.
Group 2, with a kurtosis value of -1.58, has a distribution that is more flattened and
not as normally distributed as Group 1.
Group 3, with a kurtosis value of +1.65, has a distribution that is more peaked and not
as normally distributed as Group 1.
3.5.1 (d) Assessing Normality using Box Plot

The boxplot also provides information about the distribution of scores. Unlike the
histogram which plots actual values, the boxplot summarises the distribution using
the median, the 25th and 75th percentiles, and extreme scores in the distribution.
See Figure 3.9, which shows a boxplot for the same set of data on scientific literacy
discussed earlier. Note that the lower boundary of the box is the 25th percentile and
the upper boundary is the 75th percentile.
3.5.1 (e) Assessing Normality using Normal Probability Plot
Besides the histogram and the box plot, another frequently used graphical
technique for determining normality is the "Normal Probability Plot" or "Normal
Q-Q Plot". The idea behind a normal probability plot is simple. It compares the
observed values of the variable to the observations expected for a normally
distributed variable. More precisely, a normal probability plot is a plot of the
observed values of the variable versus the normal scores (the observations
expected for a variable having the standard normal distribution).
In a normal probability plot, each observed value (score) is paired with the value
expected for it under the normal distribution. If the sample is from a normal
distribution, then the paired values form a linear pattern, that is, the points fall
more or less in a straight line. The normal probability plot is formed by:

Vertical axis: Expected normal values

Horizontal axis: Observed values
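How each observed score gets its expected normal value can be sketched in Python. This is an illustration only: the function name and the eight scores are hypothetical, and the plotting position used (Blom's formula) is an assumption, since several formulas are in use and the text does not say which one applies.

```python
from statistics import NormalDist

def normal_scores(scores):
    """Pair each observed score with its expected standard normal value."""
    n = len(scores)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    pairs = []
    for rank, observed in enumerate(sorted(scores), start=1):
        # Blom's plotting position (an assumption; other formulas exist).
        proportion = (rank - 0.375) / (n + 0.25)
        expected = std_normal.inv_cdf(proportion)
        pairs.append((observed, expected))
    return pairs

# Hypothetical mathematics scores for eight students.
scores = [48, 55, 59, 62, 64, 67, 71, 78]
pairs = normal_scores(scores)
for observed, expected in pairs:
    # observed goes on the horizontal axis, expected on the vertical axis
    print(observed, round(expected, 3))
```

Plotting these pairs gives the normal probability plot. The closer the points lie to a straight line, the closer the scores are to a normal distribution.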
SPSS Procedures
Select Analyze from the main menu.
Click Descriptive Statistics and then open the Explore dialogue box.
Select the variable you require (i.e. mathematics score) and click on the arrow
button to move this variable to the Dependent List: box.
Click the Plots... command push button to obtain the Explore: Plots sub-dialogue box.
Click the Histogram check box and the Normality plots with tests check box, and
ensure that the Factor levels together radio button is selected in the Boxplots box.
Click Continue.
In the Display box, ensure that Both is selected.

Click the Options... command push button to open the Explore: Options sub-dialogue box.
In the Missing Values box, click on the Exclude cases pairwise radio button. If this
option is not selected then, by default, any case with missing data on any variable
will be excluded from the analysis. That is, plots and statistics will be generated
only for cases with complete data.
Click on Continue and then OK.
Note that these commands will give you the 'Histogram', 'Stem-and-leaf plots',
'Boxplots' and 'Normality Plots'.

Figure 3.10: Example of a normal probability plot
When you use a normal probability plot to assess the normality of a variable, you
must remember that judging whether the plot is roughly linear, and hence whether
the distribution is normal, is subjective. The graph in Figure 3.10 is an example of
a normal probability plot. Though none of the values falls exactly on the line, most
of the points are very close to the line.

Values that are above the line represent units for which the observation is larger
than its normal score.
Values that are below the line represent units for which the observation is smaller
than its normal score.
Note that there is one value that falls well outside the overall pattern of the plot. It
is called an outlier and you will have to remove the outlier from the sample data
and redraw the normal probability plot.

Even with the outlier, the values are close to the line and you can conclude that the
distribution will look like a bell-shaped curve. If the normal scores plot departs only
slightly from having all of its dots on the line, then the distribution of the data
departs only slightly from a bell-shaped curve. If one or more of the dots departs
substantially from the line, then the distribution of the data is substantially different
from a bell-shaped curve.

Refer to the normal probability plot in Figure 3.11. Note that there are possible
outliers which are values lying off the hypothetical straight line.
Outliers are anomalous values in the data which may be due to recording errors,
which may be correctable, or they may be due to the sample not being entirely from
the same population.

Figure 3.11: Outliers
Skewness to the left:
If both ends of the normality plot fall below the straight line passing through the
main body of the values of the probability plot, then the population distribution
from which the data were sampled may be skewed to the left. Refer to Figure 3.12.

Figure 3.12: Skewness to the left
Skewness to the right:
If both ends of the normality plot bend above the straight line passing through the
values of the probability plot, then the population distribution from which the data
were sampled may be skewed to the right. Refer to Figure 3.13.

Figure 3.13: Skewness to the right

Figure 3.14: Normal probability plot for the distribution of mathematics scores
Refer to the output of a Normal Probability Plot for the distribution of mathematics
scores by eight students in Figure 3.14.

Comment on the distribution of scores.

Would you consider the distribution normal?
Are there outliers?

Figure 3.9: Boxplot for the set of data on scientific literacy
The box has hinges that form the outer boundaries of the box. The hinges are the
scores that cut off the top and bottom 25% of the data. Thus, 50% of the scores fall
within the hinges. The thick horizontal line through the box represents the median.
In the case of a normal distribution, the line runs through the centre of the box.
If the median is closer to the top of the box, then the distribution is negatively
skewed.

If it is closer to the bottom of the box, then it is positively skewed.

The smallest and largest observed values within the distribution are represented by
the horizontal lines at either end of the box, commonly referred to as whiskers.

The two whiskers indicate the spread of the scores.

Scores that fall outside the upper and lower whiskers are classified as outliers or
extreme scores. In SPSS output, outliers, i.e. scores between 1.5 and 3 box lengths
from the upper or lower hinge, are represented by a circle (o), while extreme scores,
i.e. scores 3 or more box lengths from the hinge, are represented by an asterisk (*).
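The parts of a boxplot described above can be worked out by hand, as in the Python sketch below. The function name and the scores are hypothetical. Two assumptions are made: the hinges are taken as the 25th and 75th percentiles computed by Python's `statistics.quantiles` (packages differ slightly in how they interpolate), and a score is flagged as an outlier when it lies more than 1.5 box lengths from a hinge, which is a common convention. The 3 box-length cut-off for extreme scores follows the text.

```python
from statistics import quantiles

def boxplot_summary(scores):
    """Summarise scores the way a boxplot does."""
    q1, median, q3 = quantiles(scores, n=4)   # 25th, 50th and 75th percentiles
    box_length = q3 - q1                      # distance between the hinges
    lower_fence = q1 - 1.5 * box_length       # assumed outlier cut-offs
    upper_fence = q3 + 1.5 * box_length
    inside = [x for x in scores if lower_fence <= x <= upper_fence]
    outliers = sorted(x for x in scores if x < lower_fence or x > upper_fence)
    extremes = [x for x in outliers
                if x <= q1 - 3 * box_length or x >= q3 + 3 * box_length]
    return {
        "hinges": (q1, q3),
        "median": median,
        "whiskers": (min(inside), max(inside)),  # ends of the two whiskers
        "outliers": outliers,
        "extremes": extremes,
    }

# Hypothetical scores with one unusually high value.
scores = [45, 52, 55, 58, 60, 61, 63, 65, 68, 70, 72, 120]
summary = boxplot_summary(scores)
print(summary)
```

In this example the score of 120 lies beyond the upper whisker and would be drawn as a separate point on the boxplot.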

An outlier tells us that we should find out why the score is so extreme. Could it be
that you have made an error in data entry?

Why is it important to identify outliers? This is because many of the statistical
techniques used involve calculation of means. The mean is sensitive to extreme
scores and it is important to be aware whether your data contain such extreme
scores if you are to draw conclusions from the statistical analysis conducted.
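The sensitivity of the mean to a single extreme score can be seen in a brief Python sketch. The scores are hypothetical, and the extreme value imitates a data-entry error in which 72 is typed as 720.

```python
from statistics import mean, median

# Hypothetical test scores.
scores = [61, 63, 65, 68, 70, 72]
# The same scores plus one data-entry error (72 typed as 720).
with_error = scores + [720]

mean_shift = mean(with_error) - mean(scores)
median_shift = median(with_error) - median(scores)

print("mean:", mean(scores), "->", round(mean(with_error), 2))
print("median:", median(scores), "->", median(with_error))
```

One wrong entry moves the mean by more than 90 points, while the median moves by less than 2 points. This is why extreme scores should be checked before any analysis that relies on means.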

