A. Importance of Statistics
Statistics is important in almost all fields. Its study has grown in importance because of the increased
amount of data that is collected and disseminated to the public.
Statistics
- the collection, organization, presentation, analysis, or interpretation of numerical data, especially as a
branch of mathematics in which deductions are made on the assumption that the relationships between
a sufficient sample of numerical data are characteristic of those between all such data.
- it is a science which deals with the collection, presentation, analysis, and use of data to make decisions,
solve problems, and design products and processes.
B. Categories of Statistics
Descriptive Statistics
- It comprises those methods concerned with collecting and describing a set of data so as to yield
meaningful information.
- It deals with the methods of organizing, summarizing and presenting a mass of data.
Inferential Statistics
- It is concerned with making generalizations about a population or other groups of data based on the study
of the sample.
- It comprises those methods concerned with the analysis of a subset of data leading to predictions or
inferences about the entire set of data.
Population
- it consists of the totality of the observations with which we are concerned.
- it refers to a group or a total number of people, objects, or reactions that can be described as having a
unique quality or combination of qualities.
- It refers to an entire group that is being studied. Each member of the population is called a unit.
Sample
- It refers to a finite number of objects selected from a well-defined population.
- It is a collection of some elements of a population that is representative of the entire population.
D. Variable
A variable is any property, characteristic or attribute which is of interest about each individual unit of a population or of
a sample.
Types:
1. Qualitative – also called categorical variables. They describe data that fit into categories.
2. Quantitative – They represent a measurable quantity.
E. Data
- it is the raw material for the statistical investigation.
1. Nominal Scale – It involves categorizing cases according to the presence or absence of some attribute. It is generally
used for the purpose of classification. Data gathered from variables measured at a nominal level can be categorized
but cannot be ranked, as there are no quantitative differences between and among them.
2. Ordinal Scale – The simplest scale which orders people, objects, or events along some continuum. The name of this
level is derived from the use of ordinal numbers for ranking. Numbers are used only to place objects in order and the
difference between consecutive values does not have a meaning.
3. Interval Scale – The scale on which zero is arbitrary. It does not reflect the absence of an attribute. Data gathered
from variables measured at an interval scale can be categorized, ranked, and can be added or subtracted.
Difference between two values has a meaning.
4. Ratio Scale – Possesses all of the characteristics of interval scales but has a true zero point. A variable measured in
this level does not only include the concept of order and interval but it also adds the idea of “nothingness”. Thus, a case
where 0 is on a scale indicates the total absence of the property being measured.
G. Parameter and Statistic
Parameter – it is a summary measure which is calculated from population data. It is represented by Greek letters.
Statistic – it is a summary measure or value which is calculated from sample data. The letters of the English alphabet are
used to represent statistics.
A. SAMPLING
Sampling is the process of selecting units, like people, organizations, or objects from a population of interest in order to
study and fairly generalize the results back to the population from which the sample was taken.
There are usually three criteria that need to be specified to determine the appropriate sample size: the level of
precision, the level of confidence or risk, and the degree of variability in the attributes that are being measured (Miaoulis and
Michener, 1976).
3. Degree of Variability
It refers to the distribution of the attributes in the population.
The more heterogeneous a population is, the larger the sample size required to obtain a given level of
precision. The more homogeneous a population is, the smaller the sample size.
The variance is usually an unknown quantity.
Several formulas to calculate the sample size for a certain study have been developed and suggested such as Parten’s
formula (1950), Lehr’s rule (1992), the formula by Berlowitz, Watson’s formula (2001), and Cochran’s formula (1963) with most
of these formulas considering the case where the population distribution is approximately normal.
The following are some formulas that are used to calculate the size of a sample.
For populations that are large, Cochran (1963:75) developed the following equation to yield a representative
sample for proportions.
$$n_0 = \frac{Z^2 p q}{e^2}$$
Where: n0 - sample size
Z - is the abscissa of the normal curve that cuts off an area α at the tails (1 - α equals
the desired confidence level).
e - is the desired level of precision.
p - is the estimated proportion of an attribute that is present in the population
q - is 1 - p
* The value for Z is found in the statistical table of areas under the normal curve.
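As a rough illustration of Cochran's equation (n0 equals Z squared times p times q, divided by e squared), the Python sketch below computes the sample size for a large population. The inputs Z = 1.96 (95% confidence), p = 0.5, and e = 0.05 are assumed values chosen only for illustration; they do not come from these notes.

```python
import math

def cochran_sample_size(z, p, e):
    """Cochran's n0 = Z^2 * p * q / e^2 for a large population."""
    q = 1 - p
    return (z ** 2) * p * q / (e ** 2)

# Assumed inputs: 95% confidence (Z = 1.96), maximum variability (p = 0.5),
# and a desired precision of 5 percentage points (e = 0.05).
n0 = cochran_sample_size(z=1.96, p=0.5, e=0.05)
n0_rounded = math.ceil(n0)  # round up to a whole number of units

print(round(n0, 2), n0_rounded)
```

Rounding up rather than to the nearest integer is a common convention so that the achieved precision is not worse than the one requested.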
If the population is small, then the sample size can be reduced slightly.
$$n = \frac{n_0}{1 + \frac{n_0 - 1}{N}}$$
Where: n - is the sample size.
N - is the population size
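The adjustment for a small population, n = n0 / (1 + (n0 - 1) / N), can be sketched in Python as follows. The starting value n0 = 384.16 and the population size N = 2000 are hypothetical numbers used only to show the arithmetic.

```python
import math

def adjust_for_finite_population(n0, population_size):
    """Reduce Cochran's n0 for a finite population: n = n0 / (1 + (n0 - 1) / N)."""
    return n0 / (1 + (n0 - 1) / population_size)

n0 = 384.16  # assumed result of Cochran's formula (Z = 1.96, p = 0.5, e = 0.05)
N = 2000     # hypothetical population size

n = adjust_for_finite_population(n0, N)
n_rounded = math.ceil(n)  # round up to a whole number of units

print(round(n, 2), n_rounded)
```

The adjusted n is always smaller than n0, and the reduction becomes negligible as N grows large relative to n0.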
If the behavior of the population is not certain or the researcher is not familiar with the population’s behavior, Yaro
Yamen’s formula (1980) or Taro Yamane’s formula (1967) may be used. The formula is:
$$n = \frac{N}{1 + N e^2}$$
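A minimal Python sketch of this formula, n = N / (1 + N e^2), is given here. The population size N = 1075 is borrowed from the company example later in these notes, and the margin of error e = 0.05 is an assumed value, since the notes do not specify one.

```python
import math

def yamane_sample_size(population_size, e):
    """Yamane's formula: n = N / (1 + N * e^2)."""
    return population_size / (1 + population_size * e ** 2)

N = 1075   # total employees in the company example of these notes
e = 0.05   # assumed margin of error (not given in the notes)

n = yamane_sample_size(N, e)
n_rounded = math.ceil(n)  # round up to a whole number of units

print(round(n, 2), n_rounded)
```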
For polytomous or continuous variables, one way of determining the sample size is to combine responses into
two categories and then compute for the sample size based on proportion (Smith, 1983).
Another method to determine the sample size for the mean is to use the following formula:
$$n_0 = \frac{Z^2 \sigma^2}{e^2}$$
For the aforementioned methods/approaches of determining the sample size, the sampling design is
assumed to be simple random sampling. More complex designs must take into account the variances of
subpopulation strata, or clusters, before making an estimate of the variability in the population as a whole.
The sample size should be appropriate for the analysis that is planned.
An adjustment in the sample size may be needed to accommodate a comparative analysis of subgroups.
o Sudman (1976) suggests a minimum of 100 elements for each major group or subgroup in the
sample and a sample of 20 to 50 elements is necessary for each minor subgroup.
o According to Kish (1965), when the attribute of interest is present 20% to 80% of the time (the
distribution approaches normality), 30 to 200 elements are then sufficient. For skewed distributions, a
larger sample or a census is required. Skewed distributions can result in serious departures from
normality even for moderate size samples (Kish, 1965).
Researchers commonly add 10% to the sample size to compensate for persons that the researcher is unable
to contact.
The sample size is also often increased by 30% to compensate for nonresponse.
Probability Sampling
A probability sampling method is any method of sampling that utilizes some form of random selection. Random selection
is performed by selecting a group of subjects (a sample) for study from a larger group (a population). Each individual is chosen
entirely by chance.
Example. From a population size of 300 items, 30 are to be selected randomly using systematic random sampling. Which
elements or units in the population are to be taken for the sample?
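One possible way to work this example is sketched in Python below. It assumes the usual procedure for systematic random sampling: compute the sampling interval k = N / n, pick a random start between 1 and k, then take every k-th unit. The function name and the fixed seed are illustrative choices, and the sketch assumes N is an exact multiple of n.

```python
import random

def systematic_sample(population_size, sample_size, rng=None):
    """Return the 1-based positions selected by systematic random sampling."""
    rng = rng or random.Random()
    k = population_size // sample_size   # sampling interval (300 / 30 = 10)
    start = rng.randint(1, k)            # random start between 1 and k, inclusive
    return [start + i * k for i in range(sample_size)]

# A fixed seed is used only so that the run is repeatable.
positions = systematic_sample(300, 30, rng=random.Random(0))
print(positions)
```

Whatever the random start turns out to be, the selected units are spaced exactly 10 positions apart and the last one never exceeds 300.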
Example. From the data below on the number of employees per department of a certain company, determine the
number of employees that are to be taken from each of the departments needed to represent the population using
equal allocation and proportional allocation of samples.
Department No. of Employees
Engineering 150
Production 500
Marketing 325
Management 100
Total 1,075
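The example does not state the total sample size, so the Python sketch below assumes a hypothetical total of n = 200 purely to show how the two allocations differ. Equal allocation gives every department n divided by the number of departments; proportional allocation gives each department n times its share of the population.

```python
departments = {
    "Engineering": 150,
    "Production": 500,
    "Marketing": 325,
    "Management": 100,
}
total_sample = 200  # hypothetical total sample size (not given in the example)

population = sum(departments.values())  # 1,075 employees

# Equal allocation: the same number of employees from every department.
equal = {name: total_sample // len(departments) for name in departments}

# Proportional allocation: n_h = n * N_h / N, rounded to whole employees.
proportional = {
    name: round(total_sample * size / population)
    for name, size in departments.items()
}

print(equal)
print(proportional)
```

With other totals the rounded proportional counts may not add up exactly to n, in which case one or two departments are adjusted by hand.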
5. Multi-Stage sampling
- Uses several stages or phases in the process of sampling from a population. Very useful in conducting
nationwide surveys or any survey that involves a very large population.
Non-Probability Sampling
This is a sampling method that does not involve random selection of samples. With non-probability samples, the
population may or may not be represented well, and it will often be difficult to know how well the population has been
represented. Some forms of non-probability sampling are:
2. Purposive sampling
- sampling is based on certain criteria laid down by the researcher. People who satisfy the criteria are
interviewed.
b. Expert sampling
- Involves the assembling of a sample of persons with known or demonstrable experience and expertise in
some area.
Two reasons we might do expert sampling:
1. It would be the best way to elicit the views of persons who have specific expertise.
2. To provide evidence for the validity of another sampling approach you’ve chosen.
c. Quota sampling
- Select items nonrandomly according to some fixed quota.
d. Snowball sampling
- Begin by identifying someone who meets the criteria for inclusion in your study. You then ask them to
recommend others who they may know who also meet the criteria.
3. Registration method
- a method enforced by certain laws.
4. Observation method
- it is a method which observes the behavior of individuals or organizations in the study. This is also used when the
respondents can neither read nor write.
5. Experiment Method
- used when the objective of the study is to determine the cause and effect of certain phenomena or events.
Preparing a codebook involves deciding (and documenting) how you will go about:
1. Defining and labeling each of the variables.
2. Assigning numbers to each of the possible categorical responses.
Variable                 Windows Excel Variable Name    Coding instructions
Identification number    ID                             Number assigned to each questionnaire
Sex (or Gender)          Sex                            1 = Males
                                                        2 = Females
Marital Status           Marital                        1 = Single
                                                        2 = Steady relationship
                                                        3 = Married for the first time
                                                        4 = Remarried
                                                        5 = Divorced/Separated
                                                        6 = Widowed
*Variable Names
- Each question or item must have a unique variable name (these names should clearly identify the information).
Note: The first variable in any data set should be the ID (respondent number)
IDs and demographic variables are usually placed at the beginning of the file. Then, entry of other variables should follow
some sort of logical order.
Each row in the data file represents one case. Each column represents a separate variable and the variable labels at
the top of the columns should make it clear what is being measured.
A. Textual
- Collected data may be organized and presented in a narrative or textual form.
Example:
2000 Census of Population
The population of the Philippines as of May 1, 2000 is 75.33 million. This figure is higher by 6.71 million
than the 1995 population.
The annual growth rate from 1995 to 2000 is 2.02 percent, which is lower by 0.30 percentage points
than the 1995 figure of 2.32 percent and by 0.33 percentage points than the 1990 figure of 2.35 percent.
Spread
o This refers to the variability of the data. If the observations cover a wide range, then the spread
is larger. The spread is smaller, on the other hand, when the observations are clustered around a
single value.
Shape
o It is described by the following characteristics:
Symmetry. Graph can be divided at the center so that each half is a mirror image of
the other.
Skewness. Some distributions have more observations on one side of the graph than
the other. A distribution with fewer observations on the right (toward higher values) is
said to be skewed to the right. On the other hand, a distribution with fewer observations
on the left (toward lower values) is said to be skewed to the left.
Uniform. Data distribution is equally spread across the range of the distribution.
Unusual features.
o Gaps. Areas of a distribution where there are no observations.
o Outliers. Distributions of data are sometimes characterized by extreme values that differ greatly
from the other observations.
1. Dotplot
- A graphic display that is used to compare frequency counts within a small number of categories or groups,
usually with small sets of data.
- The pattern of data in a dotplot can be described in terms of symmetry and skewness, only if the categories
are quantitative. If the categories are qualitative, the dotplot cannot be described in those terms.
Example:
Bar Chart. It represents the frequency or magnitudes of quantities of each of the categories as a bar rising vertically
from the horizontal axis with the height of each bar proportional to the frequency or magnitude of the
corresponding category.
It may be simple or compound and can be arranged vertically or horizontally. It is used for both qualitative and
quantitative data.
Example:
Figure 1. Monthly mean particulate matter (PM10) level in Baguio City for 2010.
Histogram. It is made up of columns plotted on a graph where there is no space between adjacent columns. The
columns are positioned over a label that represents a continuous, quantitative variable.
A histogram is distinct from a bar chart based on the type of variable that is being presented. With this distinction, it
can be appropriate to talk about skewness of a histogram.
Example:
Example:
Interquartile Range (IQR). The interquartile range is represented by the width of the box.
5. Scatterplot
- A graphic tool used to display the relationship between two quantitative variables.
- Each dot on the scatterplot represents an ordered pair of observations from a data set.
- Used to analyze patterns in bivariate data. The patterns are described in terms of linearity, slope, and strength.
Examples:
6. Line chart.
- Graphical presentation of data especially useful for showing trends over a period of time.
Example:
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$
Sample Mean: If a set of data 𝑥1 , 𝑥2 … 𝑥𝑛 represents a finite sample of size 𝑛, then the sample mean 𝑥̅ is
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Example 1
Suppose you are to choose ten people who enter the campus and whose ages are as follows:
15 25 18 20 25 18 18 20 25 15
What is the mean age of this sample?
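A short Python sketch of the computation for Example 1, using the sample mean definition (sum of the observations divided by the sample size):

```python
ages = [15, 25, 18, 20, 25, 18, 18, 20, 25, 15]

n = len(ages)              # sample size
mean_age = sum(ages) / n   # sample mean: sum of observations divided by n

print(n, sum(ages), mean_age)
```

The ten ages add up to 199, so the mean age of the sample is 19.9 years.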
2. Weighted Mean – if the data set 𝑥1 , 𝑥2 … 𝑥𝑘 have assigned weights 𝑤1 , 𝑤2 … 𝑤𝑘 , respectively, then the weighted mean is
computed as follows:
$$\bar{x} = \frac{\sum_{i=1}^{k} w_i x_i}{\sum_{i=1}^{k} w_i}$$
Example 2
The table provides the grades obtained by a student in the different criteria for grading and the corresponding weight
for each criterion. Find his weighted average.
Criteria Grade Weight
Long Tests 80 0.30
Quizzes 85 0.20
Departmental Exam 82 0.25
Class Participation 88 0.10
Homework and Projects 85 0.15
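Example 2 can be worked with a few lines of Python. This is only a sketch of the weighted mean computation (sum of weight times grade, divided by the sum of the weights); the variable names are illustrative.

```python
# (criterion, grade, weight) taken from the table in Example 2
criteria = [
    ("Long Tests", 80, 0.30),
    ("Quizzes", 85, 0.20),
    ("Departmental Exam", 82, 0.25),
    ("Class Participation", 88, 0.10),
    ("Homework and Projects", 85, 0.15),
]

total_weight = sum(weight for _, _, weight in criteria)
weighted_sum = sum(grade * weight for _, grade, weight in criteria)
weighted_average = weighted_sum / total_weight

print(round(weighted_average, 2))
```

Because the weights add up to 1, the weighted average is simply the weighted sum, about 83.05.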
Example 3
Mall goers were asked to rate the level of effectiveness of the inspection being done by security forces in preventing
crimes in malls.
Level of Effectiveness Very Effective (4) Moderately Effective (3) Least Effective (2) Not Effective (1)
Number of Mall goers 97 132 176 170
*Likert Scale: Interval Scale = (highest rate – lowest rate)/ no. of ratings = ( 4 - 1 )/ 4 = 0.75
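For Example 3, the ratings act as the values and the numbers of mall goers act as the weights. The Python sketch below computes the weighted mean and then maps it to a verbal description. The mapping is an assumption: it supposes that the intervals start at 1.00 and each has the width 0.75 computed above (1.00 to 1.75 for Not Effective, 1.75 to 2.50 for Least Effective, and so on), which the notes do not spell out.

```python
# rating -> number of mall goers, from the table in Example 3
responses = {4: 97, 3: 132, 2: 176, 1: 170}

total_respondents = sum(responses.values())
weighted_mean = sum(r * count for r, count in responses.items()) / total_respondents

# Assumed interpretation: equal-width intervals starting at the lowest rating.
width = (4 - 1) / 4   # 0.75, as computed in the note on the Likert scale
labels = ["Not Effective", "Least Effective", "Moderately Effective", "Very Effective"]
index = min(int((weighted_mean - 1) / width), len(labels) - 1)
interpretation = labels[index]

print(round(weighted_mean, 2), interpretation)
```

Under this assumed convention, a weighted mean of about 2.27 falls in the second interval.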
B. MEDIAN, 𝜇̃ or 𝑥̃
- a value that divides the distribution into two equal parts (after arranging the values/scores in ascending or descending order).
As such, it is a positional average. The median is defined by
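$$\tilde{\mu} \text{ or } \tilde{x} = \begin{cases} x_{\frac{n+1}{2}} & \text{if } n \text{ is odd} \\ \dfrac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2} & \text{if } n \text{ is even} \end{cases}$$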
Example 4
Find the median: (a) 12, 15, 18, 8, 9,10, 6; (b) 23, 18, 15, 12, 10, 9, 8, 6
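The positional definition of the median can be sketched in Python as follows and applied to both parts of Example 4. The function name is illustrative; the values are sorted first, then the middle value (odd n) or the mean of the two middle values (even n) is taken.

```python
def median(values):
    """Median by position after sorting the values in ascending order."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # odd n: the ((n + 1) / 2)-th value, shifted to a 0-based index
        return ordered[(n + 1) // 2 - 1]
    # even n: the mean of the (n / 2)-th and (n / 2 + 1)-th values
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

median_a = median([12, 15, 18, 8, 9, 10, 6])
median_b = median([23, 18, 15, 12, 10, 9, 8, 6])

print(median_a, median_b)
```

For (a) the sorted data are 6, 8, 9, 10, 12, 15, 18, so the median is the 4th value, 10. For (b) the two middle values are 10 and 12, so the median is 11.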
C. MODE, 𝜇̂ or 𝑥̂
- the value in the distribution with the highest frequency. It locates the point where the observation values occur with the greatest
density. It can be used for quantitative as well as qualitative data.
Example 5
Find the mode of the following data: 15 12 4 9 6 10 5 15
12 4 12 6 12 5 15 12 4 15 4 6 5
Evidently, a distribution can have no mode, one mode, or more than one mode. Thus, the mode is not a very reliable measure of
central tendency. However, there are instances when no other measure can be used except the mode. In determining the
prevalent gender, civil status, or highest educational attainment, only the mode can be used because no numerical values can
be assigned to these variables.
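The three cases (no mode, one mode, more than one mode) can be illustrated with a small Python sketch. It reuses data already in these notes: Example 4(a), the first row of Example 5 taken on its own (an assumption about how that example is split), and the ages in Example 1. The convention that a data set has no mode when every value occurs equally often is also an assumption of this sketch.

```python
from collections import Counter

def modes(values):
    """Return the list of modes; an empty list means there is no mode."""
    counts = Counter(values)
    # Assumed convention: no mode when all distinct values occur equally often.
    if len(counts) > 1 and len(set(counts.values())) == 1:
        return []
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)

no_mode = modes([12, 15, 18, 8, 9, 10, 6])                      # Example 4(a)
one_mode = modes([15, 12, 4, 9, 6, 10, 5, 15])                  # first row of Example 5
two_modes = modes([15, 25, 18, 20, 25, 18, 18, 20, 25, 15])     # ages in Example 1

print(no_mode, one_mode, two_modes)
```

The same function works for qualitative data, since it only counts how often each value occurs.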
D. MIDRANGE
- the mean of the largest and smallest values in the data set.
Remarks
Mean:
1. All the scores or measurements are considered in the computation of the mean.
2. Very high or very low scores or measurements affect the mean.
Median:
1. Only the middle scores or measurements are considered in the computation of the median.
2. Very high or very low scores do not affect the median.
Mode:
1. It is very easy to compute but is seldom used because it is very unstable.
2. It is most appropriate for nominal scale as a measure of popularity.
Exercises
1. Find the mean, median and mode of the following examination scores given in a stem-and-leaf plot.
Exam Scores
4 | 568
5 | 34569
6 | 2356699
7 | 01133455578
8 | 122369
2. The numbers of incorrect answers on a true or false competency test for a random sample of 15 students were recorded
as follows: 2, 1, 3, 0, 1, 3, 6, 0, 3, 3, 5, 2, 1, 4, and 2. Find a. mean, b. median, c. mode, d. midrange.
3. A student had accumulated 20 credits with the grade of A, 25 credits with B’s, 10 credits with C’s, and 2 credits with D’s.
The school uses the grading scale in which A = 4 grade points, B = 3, C = 2 and D = 1. Determine the grade point average
of the student.
4. A student was taking six subjects in college during the first semester. Find his average grade if his final grades were as
follows:
Subject Math Physics English Speech Statistics
Grade 1.75 2.50 2.25 1.50 3.0
Units 3 5 3 2 4
5. An economist studying trends in gasoline prices within a city takes a sample of 30 of the city’s gas stations, determining
for each station the price per litre (in dollars) of unleaded regular gasoline. The results are given below. Find the mean
and median.
Price ($) 1.05 1.07 1.08 1.10 1.11 1.12
Frequency 1 3 1 12 8 5
MEASURES OF VARIABILITY OR DISPERSION
The measures of central tendency do not by themselves give an adequate description of the data. It is also very important for us
to know how the observations spread out from the average. The measures of variation indicate the extent to which individual
items in a series are scattered about the average. It is used to determine the extent of the scatter so that steps may be taken to
control the existing variation.
Both samples have the same mean, but it is quite obvious that the measurements for sample A are more uniform, that is, the values are
closer to each other than those of sample B.
RANGE
The range measures the distance between the largest and the smallest values and, as such, gives an idea of the spread of the
data set. However, the range does not use the concept of deviation. It is affected by outliers and does not consider all values in
the data set. Thus, it is not a very useful measure of variability.
𝑅𝑎𝑛𝑔𝑒 (𝑅) = 𝑚𝑎𝑥𝑖𝑚𝑢𝑚 𝑣𝑎𝑙𝑢𝑒 – 𝑚𝑖𝑛𝑖𝑚𝑢𝑚 𝑣𝑎𝑙𝑢𝑒
Population Variance: Given the finite population 𝑥1 , 𝑥2 … 𝑥𝑁 , the population variance, which is exact, is
$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N} \qquad \sigma^2 = \frac{N \sum x_i^2 - \left(\sum x_i\right)^2}{N^2}$$
Sample Variance: Given a random sample 𝑥1 , 𝑥2 … 𝑥𝑛 , the sample variance is
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
$$s^2 = \frac{n \sum x_i^2 - \left(\sum x_i\right)^2}{n(n - 1)}$$
where: 𝜎 = population standard deviation 𝑥𝑖 = 𝑖th observation
𝑠 = sample standard deviation 𝜇 = population mean
𝑥̅ = sample mean 𝑁 = population size
𝑛 = sample size
If the data are clustered around the mean, then the variance and the standard deviation will be somewhat small. If, however,
the data are widely scattered about the mean, the variance and the standard deviation will be somewhat large.
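To make the variance formulas concrete, the Python sketch below applies them to the ten ages from Example 1. It computes the sample variance both from the deviations about the mean and from the computational formula, and confirms that the two agree. Treating the same ten values as a complete population, to show the population formula, is an assumption made only for illustration.

```python
import math

ages = [15, 25, 18, 20, 25, 18, 18, 20, 25, 15]   # data from Example 1
n = len(ages)
mean = sum(ages) / n

# Definitional form: sum of squared deviations divided by n - 1.
sample_var_def = sum((x - mean) ** 2 for x in ages) / (n - 1)

# Computational form: (n * sum(x^2) - (sum x)^2) / (n * (n - 1)).
sum_x = sum(ages)
sum_x2 = sum(x * x for x in ages)
sample_var_comp = (n * sum_x2 - sum_x ** 2) / (n * (n - 1))
sample_sd = math.sqrt(sample_var_comp)

# Population form, assuming (for illustration only) that these ten values
# are the whole population: (N * sum(x^2) - (sum x)^2) / N^2.
pop_var = (n * sum_x2 - sum_x ** 2) / n ** 2
pop_sd = math.sqrt(pop_var)

print(round(sample_var_comp, 4), round(sample_sd, 4), pop_var, pop_sd)
```

Dividing by n - 1 rather than by n makes the sample variance slightly larger than the population figure computed from the same numbers.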
Notes:
1. We divide by the quantity 𝑛 − 1 in order to make the sample variance an unbiased estimator of the population variance.
(An estimator is unbiased if its average value is equal to the parameter it is estimating.)
2. The unit of the standard deviation is the same as that of the raw data, so it is preferable to use the standard deviation
as a measure of variability instead of the variance.
3. The range is a quick but rough measure of variation since it considers only the highest value and the lowest value of the
observations.
In attempting to develop a sense for the standard deviation, we consider the following results from Chebyshev’s theorem:
At least 75% of all scores will fall within two standard deviations of the mean.
At least 89% of all scores will fall within three standard deviations of the mean.
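Both figures follow from the standard statement of Chebyshev's theorem, which these notes do not write out: for any k greater than 1, at least 1 - 1/k^2 of the scores lie within k standard deviations of the mean. A minimal Python sketch of that bound:

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of scores within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's bound is informative only for k > 1")
    return 1 - 1 / k ** 2

within_two = chebyshev_lower_bound(2)     # 1 - 1/4
within_three = chebyshev_lower_bound(3)   # 1 - 1/9

print(f"{within_two:.0%} {within_three:.0%}")
```

Unlike the empirical rule, this bound holds for any distribution, which is why its percentages are lower.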
Let us also consider the empirical rule, which applies to data that is approximately bell shaped. For these bell-shaped
distributions, the empirical rule states that:
About 68% of all scores fall within one standard deviation of the mean.
About 95% of all scores fall within two standard deviations of the mean.
About 99.7% of all scores fall within three standard deviations of the mean.
Exercises
1. A sample of seven taxicabs from a large fleet of taxicabs used the following amounts of gasoline in one day: 10.9, 19.3,
14.7, 13.8, 15.3, 11.4, and 12.6 gallons. Compute for the range, mean absolute deviation, variance and standard
deviation of the sample data.
2. The manager of a small dry cleaner employs six people. As part of their personnel file, she asked each one to record to
the nearest one-tenth of a mile the distance they travel one way from home to work. The six distances are listed below:
17.6, 22.9, 29.8, 29.7, 12.2, and 15.8. Determine the range and the standard deviation.
3. The following are the gains and losses (in thousands of pesos) of two commodities for 10 business days.
Commodity 1: 6 4 2 -3 4 0 -2 5 4 5
Commodity 2: 3 2 0 -1 -4 3 5 6 5 5
a.) Calculate the mean and standard deviation of each of the samples.
b.) Which commodity shows the more consistent performance?
4. A written test administered to 2 sections of Math 5C gave the following mean and standard deviation.
Section A Section B
Mean 60 76
Standard Deviation 10 12
Determine which of the 2 sections has greater variability of scores.
5. The mean stature of college women is 5’2” with standard deviation of 2.5” while their mean weight is 105 lbs. with a
standard deviation of 8 lbs. Which is more variable, height or weight of college women?
The numbers of minutes spent in the computer lab by a sample of 20 students working on a project are given below.
Find the mean, range, variance, standard deviation, and coefficient of variation for this sample.
Numbers of Minutes
30 | 0 2 5 5 6 6 6 8
40 | 0 2 2 5 7 9
50 | 0 1 3 5
60 | 1 3
6. Find the mean, range, variance, standard deviation, and coefficient of variation for the following data set given in a
stem-and-leaf plot.
4 | 568
5 | 34569
6 | 2356699
7 | 01133455578
8 | 12369
9 | 3578
7. The following scores represent the final examination scores for a business statistics course:
23 60 79 32 57 74 52 70 82 36
80 77 81 95 41 65 92 85 55 76
52 10 64 75 78 25 80 98 81 67
41 71 83 54 64 72 88 62 74 43
60 78 89 76 84 48 84 90 15 79
34 67 17 82 69 74 63 80 85 61
Compute the mean, variance, standard deviation and coefficient of variation of the data.
8. A study of the effects of smoking on sleep patterns is conducted. The measure observed is the time, in minutes, that it
takes to fall asleep. These data are obtained:
Smokers: 69.3, 56.0, 22.1, 47.6, 53.2, 48.1, 52.7, 34.4, 60.2, 43.8, 23.2, 13.8
Nonsmokers: 28.6, 25.1, 26.4, 34.9, 29.8, 28.4, 38.5, 30.2, 30.6, 31.8, 41.6, 21.1, 36.0, 37.9, 13.9
a. Find the sample mean for each group.
b. Find the sample standard deviation for each group.
c. Find the coefficient of variation for each group.
d. Comment on what kind of impact smoking appears to have on the time required to fall asleep.
9. The weights of 10 boxes of a certain brand of cereal have a mean content of 278 grams with a standard deviation of
9.64 grams. If these boxes were purchased at 10 different stores and the average price per box is $1.29 with a standard
deviation of $0.09, can you conclude that the weights are relatively more homogeneous than the prices?
SLU-SAMCIS\sir h