STA 111 - Topic One - Lecture 1

STA 111: Probability and Statistics 1



Lecture One: Introduction and Nature of Statistical Data

1.1 Objectives
By the end of the lecture, the learner should be able to:
i) Understand the meaning, nature, importance and limitations of statistics
ii) Explain the types of variables
iii) Classify measurements and data into various types

1.2 Introduction

1.2.1 Meaning and Definition of Statistics

Statistics means different things to different people, depending on the purpose. Writers
have also defined it in different ways, because the scope of statistics has changed with
the passage of time.
Statistics is used in two senses:
• In the plural sense, it means a collection of facts or estimates, that is, the figures
themselves (numerical data).
• As a singular noun, it means the scientific method of collecting, organizing,
summarizing, presenting and analyzing data, as well as interpreting data.
(Interpretation means drawing valid conclusions and making reasonable decisions on
the basis of such analysis.)
1. Collection of data: Once an investigator has collected data through a survey, it is
necessary to edit these data in order to correct any apparent inconsistencies,
ambiguities, recording errors or any other mistake that could enter into the
actual computations. Even before the data are collected and edited, it is assumed
that they can be suitably classified according to some common characteristic of the
population sampled.
2. Description of data: The organized data can now be presented in the form of
tables or diagrams or graphs. This presentation in an orderly manner facilitates the
understanding as well as analysis of data.
3. Analysis of data: The basic purpose of data analysis is to make the data useful for
drawing conclusions. The analysis may simply be a critical observation of the data to
draw some meaningful conclusions about it, or it may involve highly complex and
sophisticated mathematical techniques. Simple statistical tools such as averages,
dispersion of data around averages and percentages are commonly used to analyze
data.

4. Interpretation of data: Interpretation means drawing conclusions from the data
which form the basis of decision making. Correct interpretation requires a high
degree of skill and experience and is necessary in order to draw valid conclusions.
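
The simple tools mentioned under analysis of data (averages, dispersion around the
average and percentages) can be sketched in a few lines of Python. This is only an
illustration: the marks and the pass mark of 50 are assumed values, not data from the
lecture.

```python
from statistics import mean, pstdev

# Hypothetical exam marks for ten students (illustrative values only).
marks = [45, 52, 58, 60, 61, 64, 67, 70, 73, 80]

average = mean(marks)      # a measure of central tendency
spread = pstdev(marks)     # dispersion of the marks around the average
pass_mark = 50             # assumed pass mark, not taken from the lecture
percent_passed = 100 * sum(m >= pass_mark for m in marks) / len(marks)

print(f"Average mark: {average:.1f}")
print(f"Standard deviation: {spread:.1f}")
print(f"Percentage passed: {percent_passed:.0f}%")
```

The three printed figures condense ten raw observations into a form that is easier to
interpret, which is the purpose of the analysis step.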

1.2.2 Uses of statistics

Statistics is an increasingly important subject that is useful in many types of scientific
investigations. It is particularly useful in situations where there is experimental
uncertainty, and may be defined as "the science of making decisions in the face of
uncertainty". It is applicable in various fields including education, business and
agriculture. Statistics is used:

• To present data in a concise and definite form: it helps in classifying and tabulating
raw data for processing and further tabulation for other users.
• To make large and complex data easy to understand: it permits summarization and
presentation of large quantities of information, i.e. it condenses voluminous data
into a few presentable, understandable and precise figures. For example, the prices
of individual stocks and their trends are highly complex to comprehend, but a graph
of price trends gives us the overall picture at a glance.
• To undertake and understand research in our areas of interest: it helps in
determining the functional relationship between two or more phenomena. Statistical
techniques such as correlation analysis assist in establishing the degree of
association between two or more variables. For example, the coefficient of
correlation between literacy and employment, or between extent of training and
industrial productivity, gives us the degree of association between the two.
• To formulate new programmes and policies in government and other organizations,
and to support administration, i.e. it helps central management and the government
in formulating policies. For example, the recently conducted census will be used by
the government as a source of information for planning over the next 10 years, until
another census is conducted in 2019.
• To compare variables in different sets of data: arranging data with respect to
different characteristics facilitates comparison and interpretation. For example,
data on age, height, gender and family income of college students give us a much
better picture of the students when the data are categorized relative to these
characteristics.
• To forecast outcomes of future events: statistical methods are highly useful tools
for analyzing past data and predicting future trends, e.g. they help businesses in
decision making by providing future estimates and expectations. For example, the
sales of a particular product for the next year can be estimated from the sales of
the same product over the previous years, the current market trends and the possible
changes in the variables that affect the demand for the product.
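
The degree of association mentioned in the research use above is usually measured with
the Pearson correlation coefficient. The sketch below computes it from first principles.
The training and productivity figures are hypothetical, and `pearson_r` is a helper
name introduced here for illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of cross-products and sums of squares of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: weeks of training and units produced per day by six workers.
training_weeks = [1, 2, 3, 4, 5, 6]
units_per_day = [12, 15, 15, 19, 22, 24]

r = pearson_r(training_weeks, units_per_day)
print(f"Correlation coefficient r = {r:.2f}")
```

A value of r close to +1 indicates a strong positive association in these made-up data.
It does not by itself show that training causes higher productivity.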

1.2.3 Scope of Statistics

Some of the important areas where the knowledge of statistics is usefully applied are as
follows:

1. Government. Various departments of the government collect and interpret vast
amounts of data and information for efficient functioning and decision making.

2. Economics. Statistics are widely used in economic study and research. The subject of
economics is mainly concerned with production and distribution of wealth as well as
savings and investments. Some of the areas of economic interest in which statistical
tools are used are as follows:

(a) Statistical methods are extensively used in measuring and forecasting Gross
National Product (GNP).
(b) Economic stability is primarily judged by statistical studies of business cycles.
(c) Statistical analyses of population growth, unemployment figures, rural or urban
population shifts and so on influence much of the economic policy making.
(d) Econometric models, which involve the application of statistical methods, are used
for optimum utilization of available resources.
(e) Financial statistics are necessary in the fields of money and banking including
consumer savings and credit availability.
3. Physical, Natural and Social Sciences. In physical sciences, as an example, the science
of meteorology uses statistics in analyzing the data gathered by satellites in predicting
weather conditions.
4. Statistics and Research. There is hardly any advanced research going on without the
use of statistics in one form or another. Statistics are used extensively in medical,
pharmaceutical and agricultural research. The effectiveness of a new drug is
determined by statistical experimentation and evaluation.

5. Other Areas. Statistics are commonly used by insurance companies, stock brokerage
firms, banks, public utility companies and so on. Statistics are also immensely useful to
politicians, since they can predict their chances of winning by using sampling
techniques to select voters at random and studying their attitudes on issues
and policies.

1.2.4 Limitations of Statistics

Statistics has a number of limitations. The most pertinent are as follows:
1. It does not deal with individual values. Statistics only deals with aggregate values.
For example, the marks obtained by one student in a class do not carry any
meaning in themselves, unless they can be compared with a set standard, with other
students in the same class or with the student's own earlier marks.
2. It cannot deal with qualitative characteristics. Statistics is not applicable to
qualitative characteristics such as honesty, kindness, goodness, colour, poverty,
beauty, and so on, since these cannot be expressed in quantitative terms. Such
characteristics can, however, be dealt with statistically if quantitative values
can be assigned to them using a logical criterion.
3. Statistical conclusions are not universally true. Since statistics is not an exact science,
as is the case with natural sciences, the statistical conclusions are true only under
certain assumptions.
4. Statistical interpretation requires a high degree of skill and understanding of the
subject. In order to get meaningful results, it is necessary that the data be properly
and professionally collected and critically interpreted. It requires extensive training
to read and analyze statistics in their proper context.
5. Statistics can be misused. The famous statement that "figures don't lie, but liars
can figure" is a testimony to the misuse of statistics. Inaccurate or incomplete
figures can be manipulated to support desired inferences. For example, advertising
slogans such as "4 out of 5 dentists recommend brand X toothpaste" give us the
impression that 80% of all dentists recommend this brand. This may not be true,
since we don't know how big the sample is or whether the sample represents the
entire population. Another example is the opinion polls conducted after the news. We
are normally given a percentage but not told the sample size, that is, the total
number of people who called to respond to the questions.
6. There are certain phenomena or concepts where statistics cannot be used. This is
because these phenomena or concepts are not amenable to measurement. For
example, beauty, intelligence, courage cannot be quantified. Statistics has no place in
all such cases where quantification is not possible.
7. Statistics reveal the average behaviour, the normal or the general trend. Applying
the "average" concept to an individual or a particular situation may lead to a
wrong conclusion, and may sometimes be disastrous. For example, one may be misled
when told that the average depth of a river from one bank to the other is four feet,
when there may be some points in between where its depth is far more than four
feet. On this understanding, one may enter at those deeper points, which may be
hazardous.
8. Since statistics are collected for a particular purpose, such data may not be relevant
or useful in other situations or cases. For example, secondary data (i.e., data
originally collected by someone else) may not be useful for the other person.
9. Statistics are not 100 per cent precise as is Mathematics or Accountancy. Those who
use statistics should be aware of this limitation.
10. In statistical surveys, sampling is generally used as it is not physically possible to
cover all the units or elements comprising the universe. The results may not be
appropriate as far as the universe is concerned. Moreover, different surveys based
on the same size of sample but different sample units may yield different results.
11. At times, association or relationship between two or more variables is studied in
statistics, but such a relationship does not indicate a "cause and effect" relationship.
It simply shows the similarity or dissimilarity in the movement of the two variables. In
such cases, it is the user who has to interpret the results carefully, pointing out the
type of relationship obtained.
12. A major limitation of statistics is that it does not reveal everything pertaining to
a certain phenomenon. There is some background information that statistics does not cover.
Similarly, there are some other aspects related to the problem on hand, which are
also not covered. The user of Statistics has to be well informed and should interpret
Statistics keeping in mind all other aspects having relevance on the given problem.
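
The river example in limitation 7 can be made concrete with a small sketch. The depth
readings below are hypothetical and were chosen so that the average is four feet, as in
the text.

```python
# Hypothetical depth readings (in feet) taken at equal intervals across a river.
depths_ft = [1, 2, 3, 9, 6, 3, 4]

average_depth = sum(depths_ft) / len(depths_ft)
deepest_point = max(depths_ft)

print(f"Average depth: {average_depth:.1f} ft")
print(f"Deepest point: {deepest_point} ft")
```

The average summarizes the whole crossing, but the maximum shows that the average says
nothing about any single point. Reporting a measure of spread alongside the average
helps guard against this kind of misreading.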

1.2.5 Misuses
Sometimes people, knowingly or unknowingly, use statistical data wrongly. Such
forms of misuse include:
i) Failure to give the sources of data: this may compromise the reliability of the data,
because the user will not know how well the data fit his/her situation and cannot
refer to the original source if he/she wants to.
ii) Defective data: data may be distorted knowingly in order to defend one's position or
to prove a particular point. Apart from this, the definition used to denote a certain
phenomenon may be defective. For example, in the case of data relating to unemployed
persons, the definition may include even those who are employed, though partially.
The question here is how far it is justified to include partially employed persons
amongst unemployed ones.
iii) Unrepresentative sample: In statistics, one often has to conduct a survey, which
requires choosing a sample from the given population or universe. The sample may
turn out to be unrepresentative of the universe. For example, one may choose a
sample just on the basis of convenience, collecting the desired information from
friends or nearby respondents in the neighbourhood, even though such respondents
do not constitute a representative sample.
iv) Inadequate sample: At times one may conduct a survey based on an extremely
inadequate sample. For example, in a city we may find that there are 100,000
households. When we have to conduct a household survey, we may take a sample
of merely 100 households comprising only 0.1 per cent of the universe. A survey
based on such a small sample may not yield right information.
v) Unfair Comparisons: For instance, one may construct an index of production
choosing a base year in which production was much lower than usual, and then compare
subsequent years' production against this low base. Such a comparison will give a
misleading picture of production, showing growth that is larger than it really is.
Another source of unfair comparisons is making absolute comparisons instead of
relative ones. An absolute comparison of two figures, say,
of production or export, may show a good increase, but in relative terms it may
turn out to be negligible. Another example of unfair comparison is when the
population in two cities is different, but a comparison of overall death rates and
deaths by a particular disease is attempted. Such a comparison is wrong. Likewise,
when data are not properly classified or when changes in the composition of
population in the two years are not taken into consideration, comparisons of such
data would be unfair as they would lead to misleading conclusions.
vi) Unwarranted conclusions: These may result from making false assumptions. For
example, while making projections of population for the next five years, one may
assume a lower rate of growth though the past two years indicate otherwise.
Sometimes one may not be sure about the changes in the business environment in the
near future. In such a case, one may use an assumption that may turn out to be
wrong. Another source of unwarranted conclusions may be the use of the wrong
average. Suppose a series contains extreme values, one too high and the
other too low, such as 800 and 50. The use of an arithmetic average in such a case
may give a wrong idea. Instead, the harmonic mean would be proper in such a case.
vii) Confusion of correlation and causation: In statistics, one often has to
examine the relationship between two variables. A close relationship between the
two variables may not establish a cause-and-effect relationship in the sense that one
variable is the cause and the other is the effect. Correlation should be taken as a
measure of the degree of association rather than as evidence of a causal relationship.
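
The point about choosing the wrong average (misuse vi) can be explored with Python's
standard `statistics` module. The series below is hypothetical: it simply contains the
two extreme values 800 and 50 mentioned in the text. The median is included as an extra
point of comparison because it is widely used when a series has extreme values. It is
not part of the original discussion.

```python
from statistics import harmonic_mean, mean, median

# Hypothetical series containing the extreme values 50 and 800.
series = [50, 200, 210, 220, 800]

arithmetic = mean(series)          # pulled upwards by the extreme value 800
middle = median(series)            # unaffected by how extreme 50 and 800 are
harmonic = harmonic_mean(series)   # pulled towards the smaller values

print(f"Arithmetic mean: {arithmetic:.1f}")
print(f"Median:          {middle:.1f}")
print(f"Harmonic mean:   {harmonic:.1f}")
```

The three averages differ noticeably for the same data, so a conclusion can change
depending on which one is reported. Which average is appropriate depends on the nature
of the data and the question being asked.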

1.2.6 Branches of statistics

Statistics can be divided into two branches:

1. Descriptive: statistics that summarize the characteristics of given data, without
trying to extrapolate or make predictions. It utilizes numerical and graphical methods
to summarize the information, look for patterns in the data set and present the
information in a convenient form. (Describes or summarizes things you definitely
know.)
2. Inferential: statistics used to make claims or predictions about the larger
population based on a subset (sample) of that population. It utilizes sample data to
make estimates, decisions, predictions and other generalizations about a larger set of
data. (Compares groups, tests hypotheses, predicts or infers.)
The conclusions made are called statistical inferences. They cannot be absolutely
certain, hence the need to use probability in drawing conclusions.
In this course, you will study numerical and graphical ways to describe and display
your data. This area of statistics is what we have called "Descriptive Statistics." You
will learn how to calculate, and even more importantly, how to interpret these
measurements and graphs.
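
The difference between the two branches can be sketched as follows. The heights are
hypothetical, and the interval assumes a simple random sample from a roughly normal
population. The multiplier 2.365 is the t value for 95% confidence with 7 degrees of
freedom.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical sample: heights (cm) of 8 students drawn from a larger class.
sample_heights = [158, 162, 165, 167, 170, 171, 174, 177]

# Descriptive statistics: summarize only the data in hand.
sample_mean = mean(sample_heights)
sample_sd = stdev(sample_heights)

# Inferential statistics: estimate the population mean, with a margin of error
# that expresses the uncertainty of generalizing from a sample.
standard_error = sample_sd / sqrt(len(sample_heights))
margin = 2.365 * standard_error   # assumed 95% confidence, 7 degrees of freedom
interval = (sample_mean - margin, sample_mean + margin)

print(f"Sample mean: {sample_mean} cm, sample SD: {sample_sd:.2f} cm")
print(f"Estimated population mean: {interval[0]:.1f} to {interval[1]:.1f} cm")
```

The first print line is a description of the eight students. The second is an inference
about the whole class, which is why it is stated as a range rather than a single value.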

1.3 Data

1.3.1 Definition of some terms.

• Organization of data: Data organization, in broad terms, refers to the method of
classifying and organizing data sets to make them more useful. Some IT experts
apply this primarily to physical records, although some types of data organization
can also be applied to digital records.
• Data: a collection of observations from an experiment or a survey.
• Population: a set of units (people, objects, transactions or events); the entire set
of all possible outcomes or measurements of interest.
• Sample: in collecting data, it is often not possible to observe the whole group,
referred to as the target population; hence one observes a smaller representative
part of the group called a sample (a subset of the population for which we have
data, and that we hope is representative of the population).
- If the whole group is observed, a census has been conducted.
- If a smaller group is observed, a sample survey has been conducted.
- If the sample is representative of the population, then important conclusions
about the population can be drawn from it.
• The target population may be finite or infinite.
- Finite population: e.g. number of students in ABC University.
- Infinite population: e.g. number of insects in ABC University.
• Variable: a characteristic or property of an individual population unit; a quantity
that can assume a prescribed set of values. It may be discrete, continuous or constant.
- Discrete variable: takes on a finite or countable number of values, e.g. size of a
family.
- Continuous variable: takes any value within a specified range, e.g. height of a
student.
- Constant: takes only one value, e.g. number of hours in a day.
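
The terms population, census and sample survey defined above can be illustrated with a
short simulation. The population of 500 student ages is generated artificially, and the
seed and sample size of 50 are arbitrary choices made for this sketch.

```python
import random
from statistics import mean

rng = random.Random(111)   # fixed seed so the sketch is reproducible

# Hypothetical finite population: ages of 500 students.
population = [rng.randint(17, 30) for _ in range(500)]

# Census: every unit in the population is observed.
census_mean = mean(population)

# Sample survey: only 50 units, chosen at random without replacement.
sample = rng.sample(population, 50)
sample_mean = mean(sample)

print(f"Census mean age: {census_mean:.2f}")
print(f"Sample mean age: {sample_mean:.2f}")
```

The sample mean will usually be close to, but not exactly equal to, the census mean.
Age in completed years is treated here as a discrete variable.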

1.3.2 Levels of Measurement

Measurement is the process we use to assign numbers to variables of individual
population units according to a set of rules.
a) Nominal measurement: classifies data into mutually exclusive (non-overlapping),
exhaustive categories in which no order or ranking can be imposed on the data,
e.g. gender (male, female), blood group (O, A, B, AB), eye colour (blue, brown),
religion, etc.
b) Ordinal measurement: classifies data into categories that can be ranked or ordered
with respect to each other. For example, a guest speaker might be ranked as good,
average or poor; the health condition of a patient can be good, better or best. A
precise difference between ranks does not exist. More examples: grades (A, B, …),
ranking scales (poor, good, excellent, etc.), judging (1st, 2nd, etc.).
c) Interval measurement: classifies and ranks data, and a precise difference between
units of measurement exists. However, there is no meaningful zero. For example,
temperature has a meaningful difference between each unit, but 0 degrees Celsius
does not mean there is no heat. Other examples: IQ, exam score.
d) Ratio measurement: there is a precise difference between units and a true zero
exists. Examples: height, time, age, salary, etc.
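
One way to see the practical difference between interval and ratio measurement is to
check whether a ratio survives a change of units. The sketch below uses assumed
temperatures and heights. The Celsius to Fahrenheit conversion is the standard one.

```python
def celsius_to_fahrenheit(c):
    """Standard conversion between two interval scales for temperature."""
    return c * 9 / 5 + 32

# Interval scale: differences are meaningful, ratios are not.
morning_c, noon_c = 10.0, 20.0
ratio_celsius = noon_c / morning_c
ratio_fahrenheit = celsius_to_fahrenheit(noon_c) / celsius_to_fahrenheit(morning_c)

# Ratio scale: a true zero exists, so ratios do not depend on the unit.
child_cm, adult_cm = 90.0, 180.0
ratio_cm = adult_cm / child_cm
ratio_m = (adult_cm / 100) / (child_cm / 100)

print(f"Temperature ratio in Celsius:    {ratio_celsius:.2f}")
print(f"Temperature ratio in Fahrenheit: {ratio_fahrenheit:.2f}")
print(f"Height ratio in cm: {ratio_cm:.2f}, in m: {ratio_m:.2f}")
```

The temperature ratio changes with the unit, so saying that 20 degrees is "twice as hot"
as 10 degrees has no meaning. The height ratio is the same in both units, so "twice as
tall" is a valid statement.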

1.3.3 Types of Data

All data can be classified as one of two general types: Quantitative Data and
Qualitative Data.
1. Quantitative data (numerical data: it yields numerical responses, for example,
“What is your age?”)
These are data measured on a naturally occurring numerical scale. They
represent a measurable quantity. Observations are numbers representing an amount or
count of a certain characteristic such as height, weight, etc.
Examples: the number of patients admitted to the County hospital, the current
unemployment rate for each county, the scores of a sample of 150 students in an exam,
the number of male students in the class.
Ratio and interval measurements fall under the quantitative category.
These data can be classified into two types: discrete and continuous.
a) Discrete Data
Discrete data can only take on particular values and thus have clear boundaries. They
assume only a countable number of values.
Example: You can have 30 students or 31 students, but not 30.5 students, so “number of
students” is a discrete variable, as is family size. In fact, any variable based on
counting is discrete, whether you are counting the number of books purchased in a year
or the number of motor accidents reported in a year.

b) Continuous Data
Continuous data can take any value, or any value within a range or an interval. Most
data measured on interval and ratio scales, other than those based on counting, are
continuous.
Example: the weight and height of students, the distance from town to campus and the
income received by an employee are all continuous.

2. Qualitative data (categorical data: that which yields responses such as Yes or No,
for example, “Did you buy the books?”)
Qualitative data cannot be measured on a natural numerical scale; they can only be
classified into groups or categories. They take on values that are names or labels. The
categories are non-overlapping and may or may not suggest an order or rank.
Examples: the political party affiliations in a sample of 50 chief executive officers, the
size of a car (subcompact, compact, mid-size or full-size) rented by each of a sample of
30 business travelers, a coffee tester's ranking (best, worst, etc.) of four brands of
coffee for a panel of 10 testers.
These data can be classified into three types: Attribute, Nominal and Ordinal.
a) Attribute Data: Also known as dichotomous data. These data have only two
categories.
Example: yes/no, male/female.
b) Nominal Data: These data have several unordered categories.
Example: type of insurance policy (motor, medical, fire, burglary, life insurance,
etc.).
c) Ordinal or Ranked Data: These data have several ordered categories.
Example: Questionnaire responses such as Strongly Agree ......... Strongly Disagree to
statements like: I am the best student in my class; My classmates are very co-operative;
I live in the best hostel. Other examples: muscle response (none, partial, complete),
tree vigor (healthy, sick, dead), income (< KSh 9,999, KSh 10,000 - KSh 19,999,
KSh 20,000 - KSh 49,999, > ...).
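
The distinction between nominal and ordinal data affects which summaries make sense.
The sketch below uses hypothetical responses. The intermediate labels on the agreement
scale (Disagree, Neutral, Agree) are assumed, since the text gives only the two end
points.

```python
from collections import Counter
from statistics import median_low

# Nominal data: categories with no order, so only counts and the mode make sense.
policies = ["motor", "life", "medical", "motor", "fire", "motor", "life"]
policy_counts = Counter(policies)
most_common_policy = policy_counts.most_common(1)[0][0]

# Ordinal data: categories with an order, so ranks and a median also make sense.
scale = ["Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree"]
rank = {label: position for position, label in enumerate(scale)}
responses = ["Agree", "Neutral", "Strongly Agree", "Agree", "Disagree"]
median_response = scale[median_low(rank[r] for r in responses)]

print(f"Most common policy type: {most_common_policy}")
print(f"Median questionnaire response: {median_response}")
```

The ranks are used only to order the responses. Averaging them would treat the gaps
between categories as equal, which ordinal data do not guarantee.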

In economics, data is also often categorized by how it relates to time.
Cross-sectional data.
In cross-sectional data, all observations come from the same point in time. The
observations typically correspond to individuals or groups like states or countries. For
instance, a survey of Americans on who they support in the upcoming presidential
election is cross-sectional data. So is a data set with the homicide rate for each state in a
single year.
Longitudinal or time-series data.
In longitudinal or time-series data, each data point corresponds to a particular point in
time – usually for a single individual or group. For instance, if you recorded your
income every day for a year, that would give you a longitudinal data set. The GDP of
the U.S. from 1945 to the present is also a longitudinal data set.

Panel data.
Panel data is both cross-sectional and longitudinal. It involves getting cross-sectional
data for many time periods (or, alternatively, time-series data for many different
individuals or groups). For instance, if you recorded the income for each one of your
classmates every year for the next 20 years, that would be a panel data set.
One way to think of this is in terms of dimensions. Both cross-sectional and time-series
data are one-dimensional; panel data is two-dimensional.
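
The idea of dimensions can be sketched with plain Python dictionaries. All names and
income figures below are hypothetical.

```python
# Cross-sectional data: many individuals, one point in time (incomes in 2020).
cross_section = {"Amina": 30000, "Brian": 42000, "Chebet": 38000}

# Time-series data: one individual, many points in time.
time_series = {2018: 28000, 2019: 29000, 2020: 30000}

# Panel data: many individuals over many periods, indexed by (individual, year).
panel = {
    ("Amina", 2019): 29000, ("Amina", 2020): 30000,
    ("Brian", 2019): 40000, ("Brian", 2020): 42000,
}

# Fixing the year in a panel gives a cross-section; fixing the person gives a series.
slice_2020 = {person: v for (person, year), v in panel.items() if year == 2020}
amina_series = {year: v for (person, year), v in panel.items() if person == "Amina"}

print(slice_2020)
print(amina_series)
```

Each one-dimensional data set needs a single key (a person or a year), while the panel
needs both.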

1.3.4 Data Sources and Collection Tools

Data Collection

N/B: In experimental methods, the researcher controls the independent variables,
while in non-experimental methods there is no such control.

Sources of Data

There are two main sources of data: primary and secondary sources. There is also a
third source known as internal data.

(a) Primary Data

Primary data are measurements observed and recorded as part of an original study.
Data are primary if they have been collected by the same person or entity that is using
them. Because such data have not yet been published, changed or altered, they are
generally considered more reliable, authentic and objective. The work of collecting
original data is usually limited by the time, money and manpower available for the study.
The basic methods of obtaining primary data include:
i) Surveys: the most commonly used method in social sciences, management,
psychology, etc.
ii) Questionnaires: commonly used in surveys to ask people questions (questioning).
A questionnaire is a formal list of open-ended or closed-ended questions to which the
respondent gives answers. It may be administered by telephone, mail, in person,
electronic mail, fax, etc.
iii) Direct observation: when data are collected by observation, the investigator asks
no questions, and may or may not let the person being observed know that he or she is
being observed.
iv) Interviews: conducted face to face with the respondent. They are slow and expensive
and may take respondents away from their working hours, but allow in-depth and
follow-up questioning.
v) Experiments: subjects are divided into treatment groups and control groups to
measure the difference between them after some kind of treatment is given to the
former group. This is very common in medical testing.
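
The comparison made in an experiment can be sketched as a difference between group
averages. The recovery times below are hypothetical. A real study would also assign
subjects to groups at random and test whether the difference could be due to chance.

```python
from statistics import mean

# Hypothetical recovery times (days) for patients in a drug trial.
control = [14, 15, 13, 16, 15, 17]     # no treatment given
treatment = [11, 12, 13, 10, 12, 14]   # received the new drug

# Estimated effect: difference between the two group means.
effect = mean(treatment) - mean(control)

print(f"Mean recovery, control:   {mean(control)} days")
print(f"Mean recovery, treatment: {mean(treatment)} days")
print(f"Estimated effect of treatment: {effect} days")
```

A negative effect here means that the treated group recovered faster on average in
these made-up data.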

(b) Secondary Data

Secondary data are data that have already been collected by, and are available from,
other sources. They are primary data collected for another purpose and reused for our
purpose. Secondary data can be obtained
from journals, reports, government publications, publications of research organizations,
trade and professional bodies, compilations from computerized databases and
information systems, magazines, newspapers, the internet, stories told by people, etc.
Using such data is also referred to as data mining. (Data mining, sometimes called data
or knowledge discovery, is the process of analyzing data from different perspectives and
summarizing it into useful information, that is, information that can be used to
increase revenue, cut costs, or both.)
N/B: Information from the Census, Bureau of Labor Statistics, Dept. of Commerce, etc.
is secondary if you use it. If they (that is, employees of the Census
Bureau) use it, it is primary.

(c) Internal Data

Internal data refer to the measurements that are the by-product of routine business
record keeping, such as accounting, finance, production, personnel, quality control,
sales, etc.

Exercise 1.1
1. Describe the meaning of each of the following terms:
i) Statistics.
ii) Data
iii) Frequency distribution
2. Discuss four functions of statistics.
3. What are the major limitations of Statistics? Explain with suitable examples
4. Distinguish between the following terms as used in statistics:
i) Descriptive and inferential statistics.
ii) Target population and sample.
iii) Census and sample survey.
iv) Nominal and interval measurement.
v) Quantitative Data and Qualitative Data.
5. Explain the two main sources of data.
6. Categorize these measurements according to their level:
i) Students' performance: Distinction, Pass, Fail
ii) Annual net income for Afya Insurance in 2012
iii) Names of insurance products
iv) Religious preference of tourists
v) Room temperature measured on the Kelvin scale
vi) The length of time spent in a restaurant
vii) The rank of an army officer
viii) The type of vehicle driven by the president
ix) The mass of a pig
7. State which of the following variables are discrete and which are continuous:
i) Height of a person
ii) Number of employees in ABC bank
iii) Temperature on a certain day
iv) Age of a building
v) Length of a train journey
vi) Time taken to complete a project
vii) Volume of water in a container
viii) Number of children in a family
8. Classify the following examples of data as nominal, ordinal, interval or ratio giving
reasons for each:
i) The species of trees growing in a farm
ii) The grades of students at the end of semester exams
iii) The financial stability of banks in Kenya
iv) The number of years of service of all employees in Karatina University
v) Favorite rainbow colours among a sample of 50 pupils in ABC school.
vi) The number of defective bulbs produced by XYZ factory between January and
May 2000.
9. List the various methods of data collection you know of.
10. Research question:
Write down the advantages of data classification.

