STATISTICS


INTRODUCTION

In the modern world of information and communication technology, the importance of statistics
is well recognized by all disciplines. Statistics originated as a science of statehood
and has slowly and steadily found applications in Agriculture, Economics, Commerce, Biology,
Medicine, Industry, Planning, Education and so on. Today there is hardly any walk of human
life where statistics cannot be applied.
Statistics is concerned with the scientific method of collecting, organizing, summarizing,
presenting and analyzing statistical information (data), as well as drawing valid conclusions on
the basis of such analysis. It can be simply defined as the "science of data". Thus, statistics
uses facts or numerical data, assembled, classified and tabulated so as to present significant
information about a given subject. Statistics is a science of understanding data and making
decisions in the face of randomness.
The study of statistics is therefore essential for sound reasoning, precise judgment and objective
decisions based on up-to-date, accurate and reliable data. Thus many researchers,
educationalists, businessmen and government agencies at the national, state or local levels rely
on data to answer questions about their operations and programs. Statistics is usually divided into
two categories, which are not mutually exclusive, namely: descriptive statistics and inferential statistics.
DESCRIPTIVE STATISTICS
This is the act of summarizing and giving a descriptive account of numerical information in the form
of reports, charts and diagrams. The goal of descriptive statistics is to gain information from
collected data. It begins with the collection of data by either counting or measurement in an inquiry.
It involves summarizing specific aspects of the data, such as average values and measures of
dispersion (spread). Suitable graphs, diagrams and charts are then used to gain understanding
and a clear interpretation of the phenomenon under investigation, keeping firmly in mind where
the data come from. Normally, a descriptive statistic should:
i. be single-valued
ii. be algebraically tractable
iii. consider every observed value.
INFERENTIAL STATISTICS
This is the act of making inductive statements about a population from quantities computed
from a representative sample of it. It is the process of making inferences or generalizing about the
population under certain conditions and assumptions. Statistical inference involves the
processes of estimating parameters and testing hypotheses.
Statistics in Physical Sciences
Physical sciences such as Chemistry, Physics, Meteorology and Astronomy are based on
statistical concepts. For example, it is evident that the pressure exerted by a gas is actually an
average pressure, an average effect of forces exerted by individual molecules as they strike the
wall of a container. The modern science of meteorology is to a great degree dependent upon
statistical methods for its existence. The methods that give weather forecasting the accuracy it
has today have been developed using modern sample survey techniques.
Statistics in Engineering
Statistics also plays an important role in Engineering. For example, such topics as the study of
heat transfer through insulating materials per unit time, performance guarantee testing
programs, production control, inventory control, standardization of fits and tolerances of
machine parts, job analysis of technical personnel, studies involving the fatigue of metals
(endurance properties), corrosion studies, quality control, reliability analysis and many other
specialized problems in research and development make great use of probabilistic and statistical
methods.
Data can be described as a mass of unprocessed information obtained from the measurement or
counting of a characteristic or phenomenon. They are raw facts that have to be processed. When
data are presented in numerical form, they are called quantitative data. For instance, the collection
of ages of students offering STS 202 in a particular session is an example of quantitative data. When
data are not presented in numerical form, they are called qualitative data, e.g. status, sex, religion, etc.

SOURCES OF STATISTICAL DATA


1. Primary data: These are first-hand data, obtained directly from
respondents by personal interview, questionnaire, measurement or observation. Such data
can be obtained from:
(i) Census – complete enumeration of all the units of the population
(ii) Surveys – the study of a representative part of a population
(iii) Experimentation – observations from experiments carried out in laboratories and research
centers
(iv) Administrative processes, e.g. records of births and deaths.
ADVANTAGES
• Comprises the actual data needed
• More reliable and clearer
• Provides more detailed information
DISADVANTAGES
• The cost of data collection is high
• Time consuming
• There may be a larger rate of non-response
2. Secondary data: These are data obtained from publications, newspapers and annual
reports. They are usually summarized data, used for a purpose other than the one for which they
were originally collected. They can be obtained from the following:
(i) Publications, e.g. extracts from publications
(ii) Research/media organizations
(iii) Educational institutions
ADVANTAGES
• The outcome is timely
• The information is gathered more quickly
• It is less expensive to gather

DISADVANTAGES
• Information is often suppressed when working with secondary data
• The information may not be reliable
METHODS OF COLLECTION OF DATA
There are various methods we can use to collect data. The method used depends on the problem
and type of data to be collected. Some of these methods include:
1. Direct observation
2. Interviewing
3. Questionnaire
4. Abstraction from published statistics.

DIRECT OBSERVATION
Observational methods are used mostly in scientific enquiry, where data are observed directly
from controlled experiments. They are used more in the natural sciences, through laboratory work,
than in the social sciences, but they are also very useful for studying small communities and institutions.
INTERVIEWING
In this method, the person collecting the data (the interviewer) asks the person being questioned
(the interviewee) direct questions. The interviewer has to go to the interviewees personally to collect
the required information verbally. This distinguishes it from the next method, the
questionnaire method.
QUESTIONNAIRE
A set of questions or statements is assembled to get information on a variable (or a set of
variables). The entire package of questions or statements is called a questionnaire. Respondents
are required to respond to the questions or statements on the questionnaire. Copies of
the questionnaire can be administered personally by its user or sent to people by post. Both
interviewing and questionnaire methods are used in the social sciences, where human
populations are mostly involved.
ABSTRACTION FROM PUBLISHED STATISTICS
These are pieces of data (information) found in published materials, such as population
figures or accident figures. This method of collecting data can be useful as a preliminary to
other methods.
Other methods include: telephone method, document/report method, mail or postal
questionnaire, on-line interview method, etc.
PRESENTATION OF DATA
When raw data are collected, they are organized numerically by distributing them into classes or
categories in order to determine the number of individuals belonging to each class. In most cases,
it is necessary to present data in tables, charts and diagrams in order to have a clear
understanding of the data, and to illustrate the relationship existing between the variables being
examined.
FREQUENCY TABLE
This is a tabular arrangement of data into various classes together with their corresponding
frequencies.
Procedure for forming frequency distribution
Given a set of observations $x_1, x_2, x_3, \ldots, x_n$ for a single variable:
1. Determine the range, R = L – S, where L = largest observation in the raw data and S =
smallest observation in the raw data.
2. Determine the appropriate number of classes or groups (K). The choice of K is arbitrary,
but as a general rule it should be an integer between 5 and 20, depending on the size of
the data set. There are several suggested guidelines to help one decide how
many class intervals to employ. Two such rules are:
(a) $K = 1 + 3.322 \log_{10} n$
(b) $K = \sqrt{n}$, where $n$ = number of observations.
3. Determine the width (W) of the class interval: $W = R / K$.
4. Determine the number of observations falling into each class interval, i.e. find the class
frequencies.
NOTE: With the advent of computers, all these steps can be accomplished easily.
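The four steps above can be sketched in Python. This is only an illustration under stated assumptions: the function names `frequency_table` and `sturges_classes`, the decision to start the first class at the smallest observation, the use of inclusive integer classes, rounding the width up, and the sample data are all choices made for the example, not part of the procedure itself.

```python
import math

def sturges_classes(n):
    """Rule (a): K = 1 + 3.322 log10(n), rounded to the nearest integer."""
    return round(1 + 3.322 * math.log10(n))

def frequency_table(data, k=None):
    """Group integer observations into inclusive classes of equal width."""
    smallest, largest = min(data), max(data)
    data_range = largest - smallest                  # Step 1: R = L - S
    if k is None:
        k = round(math.sqrt(len(data)))              # Step 2, rule (b): K = sqrt(n)
    width = max(1, math.ceil(data_range / k))        # Step 3: W = R / K, rounded up
    table = []
    lower = smallest
    while lower <= largest:                          # Step 4: count each class
        upper = lower + width - 1
        freq = sum(lower <= x <= upper for x in data)
        table.append((lower, upper, freq))
        lower = upper + 1
    return table

# Hypothetical data, used only to exercise the function.
sample = [3, 7, 8, 12, 13, 14, 18, 21, 22]
table = frequency_table(sample)
for lower, upper, freq in table:
    print(f"{lower}-{upper}: {freq}")
```

Because the width is rounded up and the classes start at the minimum, the loop can occasionally produce one class more than K; that is acceptable here since K is only a guide.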

SOME BASIC DEFINITIONS


Variable: This is a characteristic of a population which can take different values. Basically, we
have two types, namely continuous variables and discrete variables. A continuous variable is a
variable which may take any value within a given range. Its values are obtained by
measurement, e.g. height, volume, time, exam score, etc.
A discrete variable is one whose values change by steps. Its values may be obtained by counting. It
normally takes integer values, e.g. number of cars, number of chairs.
Class interval: This is a sub-division of the total range of values which a (continuous) variable
may take. It is a symbol defining a class, e.g. 0-9, 10-19, etc. There are three types of class interval,
namely the exclusive, inclusive and open-end class methods.
Exclusive method:
When the class intervals are fixed so that the upper limit of one class is the lower limit of the
next class, it is known as the exclusive method of classification. E.g. let the expenditures of
some families be grouped as follows:
0 – 1000, 1000 – 2000, etc. The exclusive method ensures continuity of the data, since
the upper limit of one class is the lower limit of the next class. In the above example,
the class 0 – 1000 contains the families whose expenditure is between 0 and 999.99. A family whose
expenditure is 1000 would be included in the class interval
1000 – 2000.
Inclusive method:
In this method, the overlapping of the class intervals is avoided. Both the lower and upper limits
are included in the class interval. This type of classification may be used for a grouped frequency
distribution for discrete variable like members in a family, number of workers in a factory etc.,
where the variable may take only integral values. It cannot be used with fractional values like
age, height, weight etc. In case of continuous variables, the exclusive method should be used.
The inclusive method should be used in case of discrete variable.
Open end classes:
A class limit is missing either at the lower end of the first class interval or at the upper end of the
last class interval or both are not specified. The necessity of open end classes arises in a number
of practical situations, particularly relating to economic and medical data when there are few
very high values or few very low values which are far apart from the majority of observations.

Class limit: This represents the end points of a class interval (lower class limit and upper class
limit). A class interval in which either the upper class limit or the lower class limit is not indicated is
called an open class interval, e.g. less than 25, 25 and above.
Class boundaries: The point of demarcation between a class interval and the next class interval
is called a class boundary. For example, the class boundaries of 10-19 are 9.5 and 19.5.

Cumulative frequency: This is the sum of the frequency of a particular class and the frequencies of
all the classes before it.
Example 1: The following are the marks of 50 students in STS 102:

48 70 60 47 51 55 59 63 68 63 47 53 72 53 67 62 64 70 57
56 48 51 58 63 65 62 49 64 53 59 63 50 61 67 72 56 64 66 49
52 62 71 58 53 63 69 59 64 73 56.
(a) Construct a frequency table for the above data.
(b) Answer the following questions using the table obtained:
(i) how many students scored between 51 and 62?
(ii) how many students scored above 50?
(iii) what is the probability that a student selected at random from the class will score less
than 63?
Solution:
(a) Range (R) = 73 − 47 = 26
No of classes (𝑘) = √𝑛 = √50 = 7.07 ≈ 7
Class size (W) = 26/7 = 3.7 ≈ 4
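One way to carry the solution through is to tally the marks by class. The sketch below assumes inclusive classes of width 4 starting at the smallest mark (47-50, 51-54, ..., 71-74); a different starting point would give a different but equally valid table. It also assumes that "between 51 and 62" includes both end points.

```python
marks = [
    48, 70, 60, 47, 51, 55, 59, 63, 68, 63, 47, 53, 72, 53, 67, 62, 64, 70, 57,
    56, 48, 51, 58, 63, 65, 62, 49, 64, 53, 59, 63, 50, 61, 67, 72, 56, 64, 66, 49,
    52, 62, 71, 58, 53, 63, 69, 59, 64, 73, 56,
]

width = 4  # class size W from the solution above
classes = [(lo, lo + width - 1) for lo in range(min(marks), max(marks) + 1, width)]
freq = [sum(lo <= m <= hi for m in marks) for lo, hi in classes]

# (a) Frequency table
for (lo, hi), f in zip(classes, freq):
    print(f"{lo}-{hi}: {f}")

# (b) Answers counted from the raw marks
between_51_and_62 = sum(51 <= m <= 62 for m in marks)    # (i), end points included
above_50 = sum(m > 50 for m in marks)                    # (ii)
p_less_than_63 = sum(m < 63 for m in marks) / len(marks) # (iii) relative frequency
print(between_51_and_62, above_50, p_less_than_63)
```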
Example 2: The following data represent the ages (in years) of people living in a housing estate
in Abeokuta.
18 31 30 6 16 17 18 43 2 8 32 33 9 18 33 19 21 13 13 14
14 6 52 45 61 23 26 15 14 15 14 27 36 19 37 11 12 11
20 12 39 20 40 69 63 29 64 27 15 28.
Present the above data in a frequency table showing the following columns; class interval, class
boundary, class mark (mid-point), tally, frequency and cumulative frequency in that order.
Solution:
Range (R) = 69 − 2 = 67
Number of classes (𝑘) = √𝑛 = √50 = 7.07 ≈ 7
Class width (W) = 𝑅/𝑘 = 67/7 = 9.57 ≈ 10
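The requested columns can be generated as follows. This is a sketch that assumes inclusive classes of width 10 starting at the smallest age (2-11, 12-21, ..., 62-71) and class boundaries placed half a unit outside the class limits; the tally column is omitted because it carries the same information as the frequency.

```python
ages = [
    18, 31, 30, 6, 16, 17, 18, 43, 2, 8, 32, 33, 9, 18, 33, 19, 21, 13, 13, 14,
    14, 6, 52, 45, 61, 23, 26, 15, 14, 15, 14, 27, 36, 19, 37, 11, 12, 11,
    20, 12, 39, 20, 40, 69, 63, 29, 64, 27, 15, 28,
]

width = 10  # class width W from the solution above
rows = []
cumulative = 0
for lo in range(min(ages), max(ages) + 1, width):
    hi = lo + width - 1
    f = sum(lo <= a <= hi for a in ages)
    cumulative += f
    rows.append({
        "class": f"{lo}-{hi}",
        "boundaries": (lo - 0.5, hi + 0.5),   # class boundaries
        "mark": (lo + hi) / 2,                # class mark (mid-point)
        "frequency": f,
        "cumulative": cumulative,
    })

for row in rows:
    print(row)
```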
Observation from the Table
The data have been summarized and we now have a clearer picture of the distribution of the
ages of inhabitants of the Estate.
Exercise 1
Below are the weights of 40 women randomly selected in Ogun state. Prepare a
table showing the following columns: class interval, frequency, class boundary, class mark and
cumulative frequency.
96 84 75 80 64 105 87 62 105 101 108 106 110 64 105 117
103 76 93 75 110 88 97 69 94 117 99 114 88 60 98 77
96 96 91 73 82 81 91 84
Use your table to answer the following questions:
i. How many women weigh between 71 and 90?
ii. How many women weigh more than 80?
iii. What is the probability that a woman selected at random from Ogun state would
weigh more than 90?
MEASURES OF LOCATION
These are measures of the centre of a distribution. They are single values that give a description
of the data. They are also referred to as measures of central tendency. Some of them are
arithmetic mean, geometric mean, harmonic mean, mode, and median.

THE ARITHMETIC MEAN (A.M)


The arithmetic mean (average) of a set of observations is the sum of the observations divided by the
number of observations. Given a set of numbers
$x_1, x_2, \ldots, x_n$, the arithmetic mean, denoted by $\bar{x}$, is defined by
$\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$
Example 1: The ages of ten students in STS 102 are
16, 20, 19, 21, 18, 20, 17, 22, 20, 17. Determine the mean age.
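As a quick check of the definition, the mean age can be computed directly as the sum of the observations divided by their number (a minimal sketch; the variable names are illustrative):

```python
ages = [16, 20, 19, 21, 18, 20, 17, 22, 20, 17]

# Arithmetic mean: sum of the observations divided by the number of observations
mean_age = sum(ages) / len(ages)
print(mean_age)  # 190 / 10 = 19.0
```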
Calculation of mean from grouped data
If the items of a frequency distribution are classified in intervals, we assume that
every item in an interval takes the mid-value of the interval, and we use this midpoint for 𝑥.
The mean is then $\bar{x} = \dfrac{\sum f x}{\sum f}$, where $f$ is the class frequency.
Example 3: The table below shows the distribution of the waiting times for some customers in a
certain petrol station in Abeokuta.

Use of Assumed Mean


Sometimes large values of the variable are involved in the calculation of the mean. To make the
computation easier, we may assume one of the values as the mean. Thus, if A = assumed mean
and d = deviation of 𝑥 from A, i.e. 𝑑 = 𝑥 – 𝐴, then $\bar{x} = A + \dfrac{\sum f d}{\sum f}$.
Example 5: Consider the data in Example 3. Using a suitable assumed mean, compute the mean.
NOTE: It is usually easiest to select the class mark with the highest frequency as the assumed
mean.
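The direct and assumed-mean calculations for grouped data can be compared in a few lines. The class limits and frequencies below are hypothetical, invented only for this sketch; they are not the petrol-station data of Example 3.

```python
# Hypothetical grouped data: (lower limit, upper limit, frequency)
classes = [(10, 19, 4), (20, 29, 10), (30, 39, 5), (40, 49, 1)]

marks = [(lo + hi) / 2 for lo, hi, _ in classes]   # class marks x
freqs = [f for _, _, f in classes]
total = sum(freqs)

# Direct method: mean = sum(f * x) / sum(f)
direct_mean = sum(f * x for f, x in zip(freqs, marks)) / total

# Assumed-mean method: mean = A + sum(f * d) / sum(f), with d = x - A
A = marks[freqs.index(max(freqs))]   # class mark with the highest frequency
shortcut_mean = A + sum(f * (x - A) for f, x in zip(freqs, marks)) / total

print(direct_mean, shortcut_mean)
```

Both methods give the same value; the assumed mean only shifts the numbers so that the deviations are smaller to work with by hand.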

ADVANTAGE OF MEAN
The mean is an average that considers all the observations in the data set. It is simple and easy to
compute, and it is the most widely used average.

DISADVANTAGE OF MEAN
Its value is greatly affected by extremely large or extremely small observations.
THE HARMONIC MEAN (H.M)
The H.M of a set of numbers $x_1, x_2, \ldots, x_n$ is the reciprocal of the arithmetic mean of the
reciprocals of the numbers. It is used when dealing with rates of the type 𝑥 per 𝑑 (such as
kilometers per hour, Naira per liter). The formula is expressed thus:
$\text{H.M} = \dfrac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$
Note:
(i) The calculation takes into account every value.
(ii) Extreme values have the least effect.
(iii) The formula breaks down when 0 is one of the observations.
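A short illustration of the harmonic mean as the reciprocal of the mean of reciprocals. The speeds are a hypothetical example (the same distance covered at 40 km/h and then at 60 km/h); `statistics.harmonic_mean` from the Python standard library is used only as a cross-check.

```python
from statistics import harmonic_mean

# Hypothetical rates: the same distance travelled at 40 km/h and at 60 km/h
speeds = [40, 60]

# H.M = n / sum(1 / x_i)
by_formula = len(speeds) / sum(1 / v for v in speeds)
by_library = harmonic_mean(speeds)

print(by_formula, by_library)
```

The result (48 km/h) is lower than the arithmetic mean of 50 km/h because more time is spent travelling at the slower speed.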

THE GEOMETRIC MEAN (G.M)


The G.M is an analytical method of finding the average rate of growth or decline in the values of
an item over a particular period of time. The geometric mean of a set of numbers $x_1, x_2, \ldots, x_n$ is
the $n$th root of the product of the numbers. Thus
$\text{G.M} = \sqrt[n]{x_1 \times x_2 \times \cdots \times x_n}$
Example: The rates of inflation in five successive years in a country were
5%, 8%, 12%, 25% and 34%. What was the average rate of inflation per year?

∴ Average rate of inflation is 16%
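The stated answer of 16% is consistent with taking the geometric mean of the yearly growth factors (1.05, 1.08, ...) rather than of the raw percentages. That reading is an assumption, since the working is not shown here; the sketch below follows it.

```python
import math

rates = [0.05, 0.08, 0.12, 0.25, 0.34]       # yearly inflation rates
factors = [1 + r for r in rates]             # growth factors 1.05, 1.08, ...

# Geometric mean of the growth factors: n-th root of their product
gm = math.prod(factors) ** (1 / len(factors))
average_rate = gm - 1

print(f"{average_rate:.1%}")   # about 16%
```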


Note: (1) The calculation takes into account every value.
(2) It cannot be computed when 0 is one of the observations.

Relation between Arithmetic, Geometric and Harmonic Means


In general, the geometric mean for a set of data is always less than or equal to the corresponding
arithmetic mean but greater than or equal to the harmonic mean.
That is, H.M ≤ G.M ≤ A.M
The equality signs hold only if all the observations are identical.
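The inequality can be checked numerically on any set of positive values. The data below are arbitrary illustrative numbers, and the functions come from Python's standard `statistics` module (`geometric_mean` needs Python 3.8 or later).

```python
from statistics import mean, geometric_mean, harmonic_mean

data = [4, 5, 6, 8, 9, 10, 12]          # arbitrary positive observations
hm, gm, am = harmonic_mean(data), geometric_mean(data), mean(data)
print(hm, gm, am)                       # H.M <= G.M <= A.M

identical = [7, 7, 7]                   # all observations equal
hm_eq, gm_eq, am_eq = harmonic_mean(identical), geometric_mean(identical), mean(identical)
print(hm_eq, gm_eq, am_eq)              # the three means coincide (up to rounding)
```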

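THE MEDIAN
The median is the middle value of a set of observations arranged in order of magnitude; it divides
the distribution into two equal parts.
Example 1: The values of a random variable 𝑥 are given as 8, 5, 9, 12, 10, 6 and 4.
Find the median.
Solution: In an array: 4, 5, 6, 8, 9, 10, 12. 𝑛 = 7 is odd, therefore the median is the
$\frac{n+1}{2}$th = 4th item, which is 8.
Example 2: The values of a random variable 𝑥 are given as
15, 15, 17, 19, 21, 22, 25 and 28. Find the median.
Solution: 𝑛 = 8 is even, therefore the median is the mean of the $\frac{n}{2}$th and $\left(\frac{n}{2} + 1\right)$th
items, i.e. of the 4th and 5th items: (19 + 21)/2 = 20.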
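The two cases, an odd and an even number of observations, can be handled by position in the sorted array. The helper name `median_by_position` is illustrative; the standard-library `statistics.median` gives the same results.

```python
def median_by_position(values):
    """Median of ungrouped data, located by position in the ordered array."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # the (n + 1)/2-th item
    return (ordered[mid - 1] + ordered[mid]) / 2   # mean of the n/2-th and (n/2 + 1)-th items

example_1 = [8, 5, 9, 12, 10, 6, 4]            # n = 7 (odd)
example_2 = [15, 15, 17, 19, 21, 22, 25, 28]   # n = 8 (even)

median_1 = median_by_position(example_1)
median_2 = median_by_position(example_2)
print(median_1, median_2)
```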

Calculation of Median from a grouped data


The formula for calculating the median from grouped data is defined as
$\text{Median} = L_1 + \left(\dfrac{\frac{N}{2} - C f_b}{f_m}\right) W$
where: $L_1$ = lower class boundary of the median class
$N = \sum f$ = total frequency
$C f_b$ = cumulative frequency before the median class
$f_m$ = frequency of the median class
$W$ = class size or width.
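A sketch of the grouped-median calculation using the quantities just defined. The table of class boundaries and frequencies is hypothetical, and the function name `grouped_median` is an illustrative choice.

```python
# Hypothetical grouped data: (lower boundary, upper boundary, frequency)
table = [(9.5, 19.5, 4), (19.5, 29.5, 10), (29.5, 39.5, 5), (39.5, 49.5, 1)]

def grouped_median(table):
    n = sum(f for _, _, f in table)              # N = total frequency
    if n == 0:
        raise ValueError("no observations")
    cum_before = 0                               # Cfb, updated class by class
    for lower, upper, f in table:
        if cum_before + f >= n / 2:              # median class: first to reach N/2
            width = upper - lower                # W
            return lower + (n / 2 - cum_before) / f * width
        cum_before += f

median = grouped_median(table)
print(median)   # 19.5 + (10 - 4) / 10 * 10 = 25.5
```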
Example 3: The table below shows the heights of 70 men randomly selected at Sango Ota.
ADVANTAGE OF THE MEDIAN
(i) Its value is not affected by extreme values; thus it is a resistant measure of central tendency.
(ii) It is a good measure of location in a skewed distribution
DISADVANTAGE OF THE MEDIAN
(i) It does not take into consideration all the values of the variable.
THE MODE
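The mode is the value of the data which occurs most frequently. A set of data may have no mode, one
mode, two modes or more. A distribution is said to be unimodal, bimodal or multimodal if it has one,
two or more than two modes respectively. E.g. the mode of the scores 2, 5, 2, 6, 7 is 2.
Calculation of mode from grouped data
From a grouped frequency distribution, the mode can be obtained from the formula
$\text{Mode} = L_1 + \left(\dfrac{\Delta_1}{\Delta_1 + \Delta_2}\right) W$
where $L_1$ = lower class boundary of the modal class, $\Delta_1$ = difference between the frequency of
the modal class and that of the class before it, $\Delta_2$ = difference between the frequency of the
modal class and that of the class after it, and $W$ = class width.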
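A common way to compute the mode of grouped data interpolates within the modal class, using the differences between its frequency and those of the neighbouring classes. The sketch below implements that commonly used rule on a hypothetical table; the function name `grouped_mode` and the data are assumptions made for illustration.

```python
# Hypothetical grouped data: (lower boundary, upper boundary, frequency)
table = [(9.5, 19.5, 4), (19.5, 29.5, 10), (29.5, 39.5, 5), (39.5, 49.5, 1)]

def grouped_mode(table):
    freqs = [f for _, _, f in table]
    m = freqs.index(max(freqs))                       # modal class (first one if tied)
    lower, upper, f_m = table[m]
    f_before = freqs[m - 1] if m > 0 else 0
    f_after = freqs[m + 1] if m < len(freqs) - 1 else 0
    d1 = f_m - f_before                               # excess over the class before
    d2 = f_m - f_after                                # excess over the class after
    # Note: d1 + d2 is zero for a completely flat distribution, which has no mode.
    return lower + d1 / (d1 + d2) * (upper - lower)

mode = grouped_mode(table)
print(round(mode, 2))   # 19.5 + 6 / (6 + 5) * 10
```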
ADVANTAGE OF THE MODE
1) It is easy to calculate.
DISADVANTAGE OF THE MODE
(i) It is not a unique measure of location.
(ii) It may present a misleading picture of the distribution.
(iii) It does not take into account all the available data.

Exercise 2
1. Find the mean, median and mode of the following observations: 5, 6, 10, 15, 22, 16, 6, 10, 6.
2. The six numbers 4, 9, 8, 7, 4 and Y have a mean of 7. Find the value of Y.
3. From the data below

Calculate the (i)Mean (ii)Mode (iii) Median


MEASURES OF PARTITION
From the previous section, we have seen that the median is an average that divides a distribution
into two equal parts. Similarly, there are other quantities that divide a set of data (in an array) into
different equal parts. Such data must have been arranged in order of magnitude. Some of the
partition values are the quartiles, deciles and percentiles.
THE QUARTILES
Quartiles divide a set of data in an array into four equal parts.
For ungrouped data, the distribution is first arranged in ascending order of magnitude.
Then
$Q_i = \dfrac{i(N+1)}{4}\text{th item}, \quad i = 1, 2, 3$
For grouped data,
$Q_i = L_{qi} + \left(\dfrac{\frac{iN}{4} - C f_{qi}}{f_{qi}}\right) W$
where
$i$ = the quartile in reference
$L_{qi}$ = lower class boundary of the class containing the quartile
$N$ = total frequency
$C f_{qi}$ = cumulative frequency before the $Q_i$ class
$f_{qi}$ = the frequency of the $Q_i$ class
$W$ = class size of the $Q_i$ class.
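The grouped-data calculation can be sketched as follows. The table is hypothetical, and the function name `grouped_partition` and its `parts` argument (4 for quartiles) are assumptions made so that the same interpolation could be reused for other partition values.

```python
# Hypothetical grouped data: (lower boundary, upper boundary, frequency)
table = [(9.5, 19.5, 4), (19.5, 29.5, 10), (29.5, 39.5, 5), (39.5, 49.5, 1)]

def grouped_partition(table, i, parts=4):
    """i-th partition value of grouped data (parts=4 gives the quartile Q_i)."""
    n = sum(f for _, _, f in table)                  # N
    target = i * n / parts                           # iN / 4 for quartiles
    cum_before = 0                                   # cumulative frequency before the class
    for lower, upper, f in table:
        if f > 0 and cum_before + f >= target:       # class containing the partition value
            return lower + (target - cum_before) / f * (upper - lower)
        cum_before += f
    raise ValueError("partition value not found")

q1 = grouped_partition(table, 1)   # 19.5 + (5 - 4) / 10 * 10
q3 = grouped_partition(table, 3)   # 29.5 + (15 - 14) / 5 * 10
print(q1, q3)
```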

DECILES
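The values of the variable that divide the frequency of the distribution into ten equal parts are
known as deciles and are denoted by $D_1, D_2, \ldots, D_9$. The fifth decile is the median.
For ungrouped data, the distribution is first arranged in ascending order of magnitude. Then
$D_i = \dfrac{i(N+1)}{10}\text{th item}, \quad i = 1, 2, \ldots, 9$
For grouped data,
$D_i = L_{Di} + \left(\dfrac{\frac{iN}{10} - C f_{Di}}{f_{Di}}\right) W$
where $L_{Di}$ is the lower class boundary of the class containing $D_i$, $C f_{Di}$ is the cumulative
frequency before that class, $f_{Di}$ is its frequency, $N$ is the total frequency and $W$ is the class size.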
PERCENTILE
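The values of the variable that divide the frequency of the distribution into a hundred equal parts
are known as percentiles and are generally denoted by $P_1, \ldots, P_{99}$.
The fiftieth percentile is the median.
For ungrouped data, the distribution is first arranged in ascending order of magnitude. Then
$P_i = \dfrac{i(N+1)}{100}\text{th item}, \quad i = 1, 2, \ldots, 99$
For grouped data,
$P_i = L_{Pi} + \left(\dfrac{\frac{iN}{100} - C f_{Pi}}{f_{Pi}}\right) W$
where $L_{Pi}$ is the lower class boundary of the class containing $P_i$, $C f_{Pi}$ is the cumulative
frequency before that class, $f_{Pi}$ is its frequency, $N$ is the total frequency and $W$ is the class size.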
Example: For the table below, find by calculation (using appropriate expression)
(i) Lower quartile, Q1
(ii) Upper Quartile, Q3
(iii) 6th decile, $D_6$
(iv) 45th percentile, $P_{45}$, of the following distribution
MEASURES OF DISPERSION
Dispersion or variation is the degree of scatter of the individual values of a variable about a
central value such as the median or the mean. Measures of dispersion include the range, mean
deviation, semi-interquartile range, variance, standard deviation and coefficient of variation.
THE RANGE
This is the simplest method of measuring dispersions. It is the difference between the largest
and the smallest value in a set of data. It is commonly used in statistical quality control.
However, the range may fail to discriminate if the distributions are of different types.

SEMI – INTERQUARTILE RANGE


This is half of the difference between the first (lower) and third (upper) quartiles, i.e.
$\dfrac{Q_3 - Q_1}{2}$. It is a good measure of spread for the midrange and the quartiles.
THE MEAN/ABSOLUTE DEVIATION
Mean deviation is the mean of the absolute deviations from the centre. The measure of centre can be
the arithmetic mean or the median.
Given a set of observations $x_1, x_2, \ldots, x_n$, the mean deviation from the arithmetic mean is defined by:
$\text{M.D} = \dfrac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|$
Example 1: Below are the values observed for 6 heads of household randomly selected from a country: 47,
45, 56, 60, 41, 54. Find the
(i) Range
(ii) Mean
(iii) Mean deviation from the mean
(iv) Mean deviation from the median.
Solution:
(i) Range = 60 − 41 = 19
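The remaining parts follow the same pattern as the range. The sketch below computes all four quantities from the six observations; the mean deviation divides by the number of observations n, and the variable names are illustrative.

```python
from statistics import mean, median

values = [47, 45, 56, 60, 41, 54]

data_range = max(values) - min(values)                              # (i)
avg = mean(values)                                                  # (ii)
med = median(values)
md_from_mean = sum(abs(x - avg) for x in values) / len(values)      # (iii)
md_from_median = sum(abs(x - med) for x in values) / len(values)    # (iv)

print(data_range, avg, med, round(md_from_mean, 2), round(md_from_median, 2))
```

For this data set the mean and the median coincide, so the two mean deviations are equal.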
Example 2: The table below shows the frequency distribution of the scores of 42 students in MTS
201
THE STANDARD DEVIATION
The standard deviation, usually denoted by the Greek letter $\sigma$ (small sigma, for a
population), is defined as the "positive square root of the arithmetic mean of the squares of the
deviations of the given observations from their arithmetic mean". Thus, given $x_1, \ldots, x_n$ as a
set of $n$ observations, the standard deviation is given by:
$\sigma = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$

MERIT
(i) It is well defined and uses all the observations in the distribution.
(ii) It has wide application in other statistical techniques such as skewness, correlation
and quality control, etc.
DEMERIT
(i) It cannot be used for comparing the dispersion of two or more distributions given
in different units.
THE VARIANCE
The variance of a set of observations is defined as the square of the standard deviation and is
thus given by $\sigma^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$.
COEFFICIENT OF VARIATION/DISPERSION
This is a dimensionless quantity that measures the relative variation between two series
observed in different units. The coefficient of variation is obtained by dividing the standard
deviation by the mean and multiplying by 100. Symbolically,
$\text{C.V} = \dfrac{\sigma}{\bar{x}} \times 100\%$
The distribution with the smaller C.V is said to be better, i.e. more consistent.


EXAMPLE3: Given the data 5, 6, 9, 10, 12. Compute the variance, standard deviation and
coefficient of variation
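These three quantities can be computed with the population formulas (division by n), which match the definition of the standard deviation given earlier. The sketch uses `pvariance` and `pstdev` from Python's standard `statistics` module; using the sample versions (division by n − 1) would give different numbers.

```python
from statistics import mean, pvariance, pstdev

data = [5, 6, 9, 10, 12]

variance = pvariance(data)        # mean of squared deviations from the mean
sd = pstdev(data)                 # positive square root of the variance
cv = sd / mean(data) * 100        # coefficient of variation, in percent

print(round(variance, 2), round(sd, 4), round(cv, 2))
```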
EXAMPLE 4: Given the following data. Compute the
(i) Mean
(ii) Standard deviation
(iii) Coefficient of variation.
SOLUTION

Exercise 3
The data below represent the scores of 150 applicants in an achievement test for the post of
Botanist in a large company:
Estimate
(i) The mean score
(ii) The median score
(iii) The modal score
(iv) Standard deviation
(v) Semi – interquartile range
(vi) D4
(vii) P26
(viii) coefficient of variation
PROBABILITY
Probability Theory is a mathematical model of uncertainty. We shall briefly consider the
following terminologies:
Experiment: This is an act that can be repeated under given conditions and whose result is uncertain.
Trial: This is a single performance of an experiment.
Outcome: This is a result realized from a trial.
Sample space: This is the list of all the possible outcomes of an experiment. Each
outcome in a sample space is called a sample point.
Event: This is a subset of the sample space of an experiment.
E.g. when a coin is tossed twice, the sample space is 𝑆 = {𝐻𝐻, 𝐻𝑇, 𝑇𝐻, 𝑇𝑇}. Define the event 𝐴 as: at
least one head is observed. We have 𝐴 = {𝐻𝐻, 𝐻𝑇, 𝑇𝐻}.
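The sample space and the event A can be enumerated directly. The probability computed at the end assumes a fair coin, so that the four outcomes are equally likely; that assumption is not stated in the example itself.

```python
from itertools import product

# Sample space for two tosses of a coin
sample_space = ["".join(t) for t in product("HT", repeat=2)]

# Event A: at least one head is observed
event_a = [outcome for outcome in sample_space if "H" in outcome]

# Assuming equally likely outcomes (fair coin)
p_a = len(event_a) / len(sample_space)

print(sample_space, event_a, p_a)
```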

Axioms of Probability
Let 𝑆 be a sample space and let P be a real-valued function defined on the class of events of 𝑆.
Then P is called a probability function, and 𝑃(𝐴) is called the probability of the event 𝐴, if the
following axioms hold:

(I) For every event 𝐴, 0 ≤ 𝑃(𝐴) ≤ 1.
(II) 𝑃(𝑆) = 1.
(III) If 𝐴 and 𝐵 are mutually exclusive events, then 𝑃(𝐴 ∪ 𝐵) = 𝑃(𝐴) + 𝑃(𝐵).

Random variables and their properties


A random variable is a function X that assigns to every element 𝑐 ∈ 𝑆 one and only one real value
X(𝑐) = 𝑥. It can also be defined simply as a function that assigns a
numerical value to each outcome in the sample space. Consider the following table,
which gives the frequency distribution and relative frequencies for all 2000 families living in a small
town. Consider also X, the number of heads obtained in 3 tosses of a coin. The sample space is 𝑆 =
{𝐻𝐻𝐻, 𝐻𝐻𝑇, 𝐻𝑇𝐻, 𝐻𝑇𝑇, 𝑇𝐻𝐻, 𝑇𝐻𝑇, 𝑇𝑇𝐻, 𝑇𝑇𝑇}, and X takes the values 0, 1, 2 or 3. This variable is
random in the sense that the value that will occur in a given instance cannot be predicted with
certainty; we can, however, make a list of the elementary outcomes associated with each value of X.
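For the three-toss example, the mapping from outcomes to values of X and the resulting probabilities can be listed as follows. The probabilities assume a fair coin (eight equally likely outcomes), which is an assumption added for this sketch.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space for three tosses and the random variable X = number of heads
outcomes = ["".join(t) for t in product("HT", repeat=3)]
X = {outcome: outcome.count("H") for outcome in outcomes}

# Probability of each value of X, assuming equally likely outcomes
counts = Counter(X.values())
distribution = {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

for x, p in distribution.items():
    print(x, p)
```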

Remark: The possible values of a random variable X can be determined directly from the
description of the random variable without listing the sample space. However, to assign
probabilities to these values, treated as events, it is sometimes helpful to refer to the sample space.
A random variable can be discrete or continuous.
Discrete: A random variable whose values are countable is called a discrete random variable,
e.g. the number of cars sold in a day.
Continuous: A random variable that can assume any value contained in one or more intervals
is called a continuous random variable, e.g. the time taken to complete an examination.
Cumulative distribution/ Distribution function
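If X is a discrete random variable, the function given by
$F(x) = P(X \le x) = \sum_{t \le x} f(t), \quad -\infty < x < \infty$
where $f(t)$ is the value of the probability distribution of X at $t$, is called the distribution
function, or cumulative distribution, of X.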
(2) A lot of 12 television sets chosen at random are defective, if 3 of the sets are chosen at
random for shipment to a hotel, how many defective sets can they expect?
BERNOULLI DISTRIBUTION
A random variable 𝑥 has a Bernoulli distribution if and only if its probability distribution is
given as $f(x; p) = p^x (1 - p)^{1 - x}$; $x = 0, 1$. In this context, 𝑝 may be the probability of passing or
failing an examination.

BINOMIAL DISTRIBUTION
Consider an experiment consisting of 𝑛 repeated trials such that
(1) the trials are independent and identical
(2) each trial results in only one of two possible outcomes
(3) the probability of success 𝑝 remains constant
(4) the random variable of interest is the total number of successes.
The binomial distribution is one of the most widely used distributions in statistics; it is used to find the
probability that an outcome will occur 𝑥 times in 𝑛 performances of an experiment. For
example, consider the experiment of flipping a coin 10 times. When a coin is tossed, the
probability of getting a head is 𝑝 and that of a tail is 1 − 𝑝 = 𝑞.
A random variable 𝑥 has a binomial distribution, and is referred to as a binomial random
variable, if and only if its probability distribution is given by
$f(x) = \binom{n}{x} p^x q^{n-x}, \quad x = 0, 1, \ldots, n$
Example: Observation over a long period of time has shown that a particular salesman can
make a sale on a single contact with a probability of 20%. Suppose the salesman contacts
four prospects.
(a) What is the probability that exactly 2 prospects purchase the product?
(b) What is the probability that at least 2 prospects purchase the product?
(c) What is the probability that all the prospects purchase the product?
(d) What is the expected number of prospects that would purchase the product?
Solution: Let X denote the number of prospects who purchase the product: 𝑥 = 0, 1, 2, 3, 4. Let 𝑝
denote the probability of a purchase (success) = 0.2. Hence, X ~ 𝐵(4, 0.2).
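With X ~ B(4, 0.2), the four parts can be evaluated from the binomial probability function C(n, x) p^x (1 − p)^(n − x). The sketch below is one way to do the arithmetic; the helper name `binomial_pmf` is illustrative, and part (d) uses the standard result E(X) = np.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for a binomial random variable with parameters n and p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 4, 0.2

p_exactly_2 = binomial_pmf(2, n, p)                                  # (a)
p_at_least_2 = sum(binomial_pmf(x, n, p) for x in range(2, n + 1))   # (b)
p_all_4 = binomial_pmf(4, n, p)                                      # (c)
expected = n * p                                                     # (d) E(X) = np

print(round(p_exactly_2, 4), round(p_at_least_2, 4), round(p_all_4, 4), expected)
```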
POISSON DISTRIBUTION
When the size of the sample (𝑛) is very large and the probability of obtaining a success in any one
trial is very small, the Poisson distribution is adopted.
Given an interval of real numbers, assume counts occur at random throughout the interval. If
the interval can be partitioned into sub-intervals of small enough length such that
(1) the probability of more than one count in a sub-interval is 0,
(2) the probability of one count in a sub-interval is the same for all sub-intervals and
proportional to the length of the sub-interval, and
(3) the count in each sub-interval is independent of all other sub-intervals,
then a random experiment of this type is called a Poisson process. If the mean number of counts in the
interval is 𝜆 > 0, the random variable 𝑥 that equals the number of counts in the interval has a
Poisson distribution with parameter 𝜆, and its probability function is given by
$f(x) = \dfrac{e^{-\lambda} \lambda^x}{x!}, \quad x = 0, 1, 2, \ldots$

Example: Flaws occur at random along the length of a wire. Suppose that the number of flaws
follows a Poisson distribution with a mean of 2.3 flaws per mm. Determine the probability of exactly
2 flaws in one mm of wire.
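The required probability is the Poisson probability function e^(−λ) λ^x / x! evaluated at x = 2 with λ = 2.3. A minimal sketch of the arithmetic (the helper name `poisson_pmf` is illustrative):

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson random variable with mean lam."""
    return exp(-lam) * lam**x / factorial(x)

p_two_flaws = poisson_pmf(2, 2.3)
print(round(p_two_flaws, 3))   # about 0.265
```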

The Poisson distribution can be used to approximate the binomial distribution when 𝑛 ≥ 30 and 𝑛𝑝 < 5.
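Example: Suppose 3% of the electric doors manufactured by a company are defective. Find the
probability that in a sample of 120 doors, at most 3 doors are defective.
(a) Use the binomial distribution to solve the problem.
(b) Use the Poisson distribution and compare your results.
Solution: Let 𝑝 denote the probability that an electric door is defective = 0.03.
Let 𝑞 = 1 − 𝑝 = 0.97 denote the probability that an electric door is not defective. Let 𝑛 denote the
number of electric doors in the sample = 120.
Let 𝑥 denote the number of defective doors being considered.
(a) By binomial distribution:
$P(X \le 3) = \sum_{x=0}^{3} \binom{120}{x} (0.03)^x (0.97)^{120-x} \approx 0.513$
(b) By Poisson distribution, with $\lambda = np = 120 \times 0.03 = 3.6$:
$P(X \le 3) = \sum_{x=0}^{3} \dfrac{e^{-3.6} (3.6)^x}{x!}$
= 0.515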
Hence, the Poisson distribution result is very close to the binomial distribution result, showing that
it can be used to approximate the binomial distribution in this problem.
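The comparison can be reproduced numerically by summing each probability function over x = 0, 1, 2, 3, taking λ = np for the Poisson approximation. This is a sketch of the arithmetic only; the variable names are illustrative.

```python
from math import comb, exp, factorial

n, p = 120, 0.03
lam = n * p                      # Poisson mean used for the approximation

# P(X <= 3) under each model
binomial = sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(4))
poisson = sum(exp(-lam) * lam**x / factorial(x) for x in range(4))

print(round(binomial, 3), round(poisson, 3))
```

The two values differ only in the third decimal place, which is why the approximation is considered adequate here.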
GEOMETRIC DISTRIBUTION
Consider a random experiment in which all the conditions of a binomial distribution hold.
However, instead of a fixed number of trials, trials are conducted until the first success occurs. Hence,
by definition, in a series of independent Bernoulli trials with constant probability 𝑝 of success, let
the random variable 𝑥 denote the number of trials until the first success. Then 𝑥 is said to have a
geometric distribution with parameter 𝑝, given by
$f(x) = p (1 - p)^{x-1}, \quad x = 1, 2, \ldots$

Examples:
(1) If the probability that a wafer contains a large particle of contamination is 0.01 and it is assumed
that the wafers are independent, what is the probability that exactly 125 wafers need to be
analyzed before a large particle is detected?
Solution: Let X denote the number of samples analyzed until a large particle is detected. Then
X is a geometric random variable with 𝑝 = 0.01. Hence, the required probability is
$f(125) = (0.01)(0.99)^{124} = 0.0029$.
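The value 0.0029 can be confirmed from the geometric probability function p(1 − p)^(x − 1), in which the first x − 1 trials are failures and the x-th is a success. A minimal sketch (the helper name `geometric_pmf` is illustrative):

```python
def geometric_pmf(x, p):
    """P(first success occurs on trial x): x - 1 failures, then a success."""
    return p * (1 - p) ** (x - 1)

p_125 = geometric_pmf(125, 0.01)
print(round(p_125, 4))   # about 0.0029
```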
(2) Each sample has a 10% chance of containing a particular rare molecule. Assume the
samples are independent with regard to the presence of the rare molecule. Find the probability that
in the next 18 samples, (a) exactly 2 contain the rare molecule, (b) at least 4 samples contain it.
Solution: Left as an exercise.

HYPERGEOMETRIC DISTRIBUTION
Suppose we have a relatively small lot consisting of 𝑁 items, of which 𝑘 (= 𝑁𝑃) are defective.
If two items are sampled sequentially, then the outcome of the second draw is very much
influenced by what happened on the first draw, provided that the first item drawn is not returned to
the lot. We need a formula similar to that of the binomial distribution which applies to
sampling without replacement.
A random variable 𝑥 is said to have a hypergeometric distribution if and only if its probability
function is given by
$f(x) = \dfrac{\binom{k}{x}\binom{N-k}{n-x}}{\binom{N}{n}}, \quad x = 0, 1, \ldots, n$
= 0 otherwise
where 𝑁 = total number of items
𝑛 = number of items chosen without replacement from the 𝑁 items
𝑘 = number of the 𝑁 items that are looked upon as successes
𝑁 − 𝑘 = number of items looked upon as failures.
The mean and variance of the hypergeometric distribution are given as:
$\mu = \dfrac{nk}{N}, \qquad \sigma^2 = \dfrac{nk(N-k)(N-n)}{N^2(N-1)}$

Examples:
(1) A random sample of 3 oranges is taken from a basket containing 12 oranges. If 4 of the
oranges in the basket are bad, what is the probability of getting (a) no bad oranges in the
sample, (b) more than 2 bad oranges in the sample?

(2) A batch contains 100 parts from a local supplier of tubing and 200 parts from a
supplier of tubing in the next state. If 4 parts are selected at random without replacement, what
is the probability that they are all from the local supplier?
Solution: Let X equal the number of parts in the sample from the local supplier. Then X has a
hypergeometric distribution and the required probability is f(4). Consequently,
$f(4) = \dfrac{\binom{100}{4}\binom{200}{0}}{\binom{300}{4}} = 0.0119$
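Both examples can be evaluated with the hypergeometric probability C(k, x) C(N − k, n − x) / C(N, n), where N is the lot size, k the number of "successes" in the lot and n the sample size. The sketch below uses `math.comb` (Python 3.8 or later); the helper name `hypergeom_pmf` is illustrative.

```python
from math import comb

def hypergeom_pmf(x, N, k, n):
    """P(X = x): x successes in a sample of n drawn without replacement
    from N items of which k are successes."""
    return comb(k, x) * comb(N - k, n - x) / comb(N, n)

# Example 1: 12 oranges, 4 bad, sample of 3
p_no_bad = hypergeom_pmf(0, 12, 4, 3)            # (a) no bad oranges
p_more_than_2_bad = hypergeom_pmf(3, 12, 4, 3)   # (b) more than 2 bad means all 3 bad

# Example 2: 300 parts, 100 from the local supplier, sample of 4
p_all_local = hypergeom_pmf(4, 300, 100, 4)

print(round(p_no_bad, 4), round(p_more_than_2_bad, 4), round(p_all_local, 4))
```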

NEGATIVE BINOMIAL DISTRIBUTION


A generalization of the geometric distribution in which the random variable is the number of
Bernoulli trials required to obtain 𝑘 successes results in the negative binomial distribution.
For example, we may be interested in the probability that the 8th child exposed to measles is the 3rd to contract it.
If the 𝑘th success is to occur on the 𝑥th trial, there must be 𝑘 − 1 successes on the first 𝑥 − 1 trials, and
the probability of this is given by
$$b(k-1;\, x-1,\, \theta) = \binom{x-1}{k-1}\theta^{k-1}(1-\theta)^{x-k}$$
The probability of a success on the 𝑥th trial is θ, and so the probability that the 𝑘th success occurs on
the 𝑥th trial is:
$$\theta \cdot b(k-1;\, x-1,\, \theta) = \binom{x-1}{k-1}\theta^{k}(1-\theta)^{x-k}$$
Hence, we say that a random variable 𝑥 has a negative binomial distribution if and only if its
probability density function is given by
$$b^{*}(x;\, k,\, \theta) = \binom{x-1}{k-1}\theta^{k}(1-\theta)^{x-k}, \quad x = k, k+1, k+2, \ldots$$
Exercise 5
(1) The probability that an experiment will succeed is 0.6. If the experiment is repeated until 5
successful outcomes have occurred, what are the mean and variance of the number of repetitions
required?
(2) A high-performance aircraft contains 3 identical computers. Only one is used to operate the
aircraft; the other two are spares that can be activated in case the primary system fails. During
one hour of operation, the probability of failure in the primary computer or any activated spare
system is 0.0005. Assume that each hour represents an identical trial.
(a) What is the expected time to failure of all 3 computers?
(b) What is the probability that all 3 computers fail within a 5-hour flight?
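Negative binomial probabilities of the kind discussed in this section can be computed directly from the counting argument (𝑘 − 1 successes in the first 𝑥 − 1 trials, followed by a success). The sketch below is illustrative only. The notes do not give the probability that an exposed child contracts measles, so the value theta = 0.4 is a purely hypothetical assumption, and the function name is an illustrative choice.

```python
from math import comb


def neg_binomial_pmf(x, k, theta):
    """Probability that the k-th success occurs on the x-th Bernoulli trial."""
    if x < k:
        return 0.0
    # k - 1 successes in the first x - 1 trials, then a success on trial x.
    return comb(x - 1, k - 1) * theta**k * (1 - theta) ** (x - k)


# Hypothetical value: the notes do not state theta for the measles illustration.
theta = 0.4
p_third_case_on_eighth_child = neg_binomial_pmf(8, 3, theta)
print(round(p_third_case_on_eighth_child, 4))
```

The same function with 𝑘 = 1 reduces to the geometric distribution.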
NORMAL DISTRIBUTION
The normal distribution is the most important and the most widely used of all continuous
distributions in statistics. It is considered the cornerstone of statistical theory. The graph of
a normal distribution is a bell-shaped curve that extends indefinitely in both directions.

Features (Properties) of Normal Curve


1. The curve is symmetrical about the vertical axis through the mean 𝜇.
2. The mode, the point on the horizontal axis at which the curve reaches its maximum, occurs
where 𝑥 = 𝜇.
3. The normal curve approaches the horizontal axis asymptotically.
4. The total area under the curve is one (1) or 100%.
5. About 68% of all the possible 𝑥- values (observations) lie between 𝜇 − 𝜎 and 𝜇 + 𝜎, or the area
under the curve between 𝜇 − 𝜎 and 𝜇 + 𝜎 is 68% of the total area.
6. About 95% of the observations lie between 𝜇 − 2𝜎 and 𝜇 + 2𝜎.
7. 99.7% (almost all) of the observations lie between 𝜇 − 3𝜎 and 𝜇 + 3𝜎.
Note: The last three of the above properties are arrived at through advanced mathematical
treatment.
It is clear from these properties that a knowledge of the population mean and standard
deviation gives a complete picture of the distribution of all the values.

Notation: Instead of saying that the values of a variable 𝑥 are normally distributed with mean 𝜇
and standard deviation 𝜎, we simply say that 𝑥 has an
𝑁(𝜇, 𝜎²) distribution, or 𝑥 is 𝑁(𝜇, 𝜎²), or 𝑥 ~ 𝑁(𝜇, 𝜎²).
A random variable 𝑥 is said to have a normal distribution if its probability density function is
given by
$$n(x;\, \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}, \quad -\infty < x < \infty$$
where 𝜎 > 0 and 𝜇 are the parameters of the distribution. Note that since 𝑛(𝑥; 𝜇, 𝜎) is a p.d.f., the
total area under it is 1. The cumulative distribution function is given by
$$F(x) = \int_{-\infty}^{x} n(t;\, \mu, \sigma)\, dt$$
Example: Find the probability that a random variable having the standard normal distribution will take
a value
(a) less than 1.72 (b) less than –0.88
(c) between 1.19 and 2.12 (d) between –0.36 and 1.21
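These four probabilities are normally read from the standard normal table. As a check, they can also be computed from the error function; the sketch below is an illustration rather than the method used in the notes, and the helper name phi is an assumption.

```python
from math import erf, sqrt


def phi(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


p_a = phi(1.72)               # (a) P(Z < 1.72)
p_b = phi(-0.88)              # (b) P(Z < -0.88)
p_c = phi(2.12) - phi(1.19)   # (c) P(1.19 < Z < 2.12)
p_d = phi(1.21) - phi(-0.36)  # (d) P(-0.36 < Z < 1.21)

print(round(p_a, 4), round(p_b, 4), round(p_c, 4), round(p_d, 4))
```

Small differences from table-based answers in the fourth decimal place are expected, since tables round each entry.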
STANDARDIZED NORMAL VARIABLE
In real-world applications, the given continuous random variable may have a normal distribution
with a mean and standard deviation different from 0 and 1. To overcome this difficulty, we
obtain a new variable denoted by 𝑧, which is given by
$$z = \frac{x - \mu}{\sigma}$$
Example:
(1) Suppose the current measurements in a strip of wire are assumed to be normally distributed with
a mean of 10 mA and a variance of 4 (mA)². What is the probability that the current is greater than
13 mA?
The above problem means that 68% of the 𝑥 −values are within 1 standard deviation of the mean
0.
(3) It is known that the marks in a University direct entry examination are normally distributed
with mean 70 and standard deviation 8. Given that your score is 66, what percentage of all the
candidates will be expected to score more than you?
Solution: We have 𝑥~𝑁(70, 64), and the required proportion of 𝑥-values that are above 66 is
𝑃(𝑥 > 66).
$$P(x > 66) = P\left(z > \frac{66 - 70}{8}\right) = P(z > -0.5) = 0.6915$$
∴ About 69% of all the candidates will be expected to do better than you.
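The standardization used in these examples can be checked with Python's standard library. This is a minimal sketch under the stated parameters: mean 70 and standard deviation 8 for the examination marks, and mean 10 mA with variance 4 (standard deviation 2 mA) for the current measurements.

```python
from statistics import NormalDist

# Examination marks: x ~ N(70, 8^2); proportion scoring above 66.
marks = NormalDist(mu=70, sigma=8)
p_above_66 = 1 - marks.cdf(66)

# Current in a strip of wire: mean 10 mA, variance 4, so sigma = 2 mA.
current = NormalDist(mu=10, sigma=2)
p_current_above_13 = 1 - current.cdf(13)

print(round(p_above_66, 4), round(p_current_above_13, 4))
```

NormalDist performs the conversion to 𝑧 internally, so no separate standardization step is needed in the code.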


PP: It is known from previous examination results that the marks of candidates have a
normal distribution with mean 55 and standard deviation 10. If the pass mark in a new
examination is set at 45, what percentage of the candidates will be expected to fail?
Normal Approximation to Binomial distribution
Recall that the binomial distribution applies to a discrete random variable. As the number of trials 𝑛
increases, the use of the binomial formula becomes tedious, and when this happens the normal
distribution can be used to approximate the binomial probability.
The probability obtained by using the normal approximation to the binomial is an
approximation to the exact value, and the conditions under which we can use the normal approximation for a
binomial distribution are as follows:

Example: In a digital communication channel, assume that the number of bits received in error
can be modelled by a binomial random variable, and that the probability that a bit is received in error
is 1 × 10^−5. If 16 million bits are transmitted, what is the probability that more than 150 errors
occur?
Solution: Let X denote the number of errors.
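A numerical sketch of the normal approximation for this example is given below. It is an illustration under stated assumptions, not the worked solution from the notes: it uses mean 𝑛𝑝 and standard deviation √(𝑛𝑝(1 − 𝑝)), and it applies a continuity correction (150.5 instead of 150), which the notes may or may not use.

```python
from math import sqrt
from statistics import NormalDist

n = 16_000_000   # number of bits transmitted
p = 1e-5         # probability that a bit is received in error

mean = n * p
sd = sqrt(n * p * (1 - p))

# P(X > 150) = P(X >= 151); with a continuity correction this is
# approximated by P(Z > (150.5 - mean) / sd).
z = (150.5 - mean) / sd
prob_more_than_150 = 1 - NormalDist().cdf(z)

print(round(mean, 2), round(sd, 3), round(z, 3), round(prob_more_than_150, 3))
```

Without the continuity correction the 𝑧 value would be (150 − 160)/𝑠𝑑, which gives a slightly larger probability.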

Exercise 6
(1) Two fair dice are tossed 600 times. Let X denote the number of times a total of 7 occurs. Find
the probability that X lies between 80 and 110.
(2) A manufacturer of machine parts claims that at most 10% of his parts are defective. A
purchaser needs 120 of such parts and, to be sure of getting enough good ones, he places an order
for 140 parts. If the manufacturer's claim is valid, what is the probability that the purchaser would
receive at least 120 good parts?
Problem with simple linear equation
If X is normally distributed with mean = 2 and variance = 4, find the value of 𝜆 such that
𝑃(X > 𝜆) = 0.10. Hint: X~𝑁(2,4).
REGRESSION AND CORRELATION
In the physical sciences and engineering, even social sciences, it is possible to formulate models
connecting several quantities such as temperature and pressure of a gas, the size and height of a
flowering plant, the yield of a crop and weather conditions. It may also be necessary to examine
what appears to be the cause of variations in one variable in relation to the other. One technique
for examining such relationships is regression.
Regression analysis is the study of the nature and extent of the association between two or more
variables on the basis of an assumed relationship between them, with a view to predicting the value
of one variable from the other.
Scatter diagram: The first step in studying the relationship between two variables is to draw a
scatter diagram. This is a graph that shows visually the relationship between two variables, in
which each point corresponds to a pair of observations, one variable being plotted against the
other. The way in which the dots lie on the scatter diagram shows the type of relationship that
exists.

Regression Models
In order to predict one variable from the other, it is necessary to construct a line or curve that
passes through the middle of the points, such that the sum of the signed vertical deviations of the
points from the line is equal to 0. Such a line is called the line of best fit.
The simple regression equation of 𝑌 on X is defined as
𝑌 = 𝑎 + 𝑏X + e
while the multiple regression equation of 𝑌 on X₁, X₂, … , X𝑘 is
𝑌 = 𝑎 + 𝑏₁X₁ + 𝑏₂X₂ + … + 𝑏𝑘X𝑘 + e
where: 𝑌 is the observed dependent variable
X is the observed independent (explanatory) variable
𝑎 is the intercept (the point at which the regression line cuts the 𝑌 axis)
𝑏 is the slope (regression coefficient). It gives the rate of change in 𝑌 per unit change in X.
e is the error term.
Ordinary Least Square Method (OLS)
Although there are other techniques for obtaining these parameters (𝑎 and the 𝑏's), such as the
maximum likelihood method, we shall use the method of least squares to estimate the parameters of a
simple regression equation. This method involves finding the values of the regression
coefficients 𝑎 and 𝑏 that minimize the sum of squares of the residuals (errors).
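The least squares estimates for the simple regression equation can be sketched in a few lines. The formulas used are the standard ones, 𝑏 = (𝑛ΣX𝑌 − ΣXΣ𝑌)/(𝑛ΣX² − (ΣX)²) and 𝑎 = mean of 𝑌 − 𝑏 × mean of X. The data below are hypothetical values chosen to lie exactly on a line; they are not the maize or cassava data from the examples, and the function name ols_fit is an illustrative choice.

```python
def ols_fit(xs, ys):
    """Least squares estimates (a, b) for the line Y = a + bX."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x**2)
    a = sum_y / n - b * sum_x / n
    return a, b


# Hypothetical data lying exactly on Y = 1 + 2X.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
a, b = ols_fit(xs, ys)

# Residuals of a least squares line with an intercept sum to zero.
residual_sum = sum(y - (a + b * x) for x, y in zip(xs, ys))
predicted_at_10 = a + b * 10

print(a, b, predicted_at_10)
```

Prediction, as in part (c) of the examples, is then a matter of substituting the chosen X value into the fitted equation.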
Assumptions of OLS
i. The relationship between X and 𝑌 is assumed to be linear.
ii. The X values are fixed.
iii. There is no relationship between X and the error term, i.e. 𝐸(Xe) = 0.
iv. The error is assumed to be normally distributed with mean zero and constant variance 𝜎².
Examples:
(1) The following are measurements of the heights and ages (in months) of maize plants on a plantation farm.

(a) Draw the scatter diagram


(b) Obtain the regression equation of the height of the plant on its age in months.
(c) Estimate the height of a maize plant aged 10 months.
Solution:
(a) (Using SPSS)
(c) The estimated height of the maize plant would be 60.858 when the plant is aged 10 months.
(2) A study was made on the effect of a certain brand of fertilizer (X) on cassava yield (𝑌) per
plot of farm area resulting in the following data:

(a) Plot the scatter diagram and draw the line of best fit.
(b) Fit a simple regression equation to this data.
(c) What is the estimated cassava yield when the fertilizer is 12?
(d) Obtain the standard error of the regression
CORRELATION
Correlation measures the degree of linear association between two or more variables, where a
movement in one variable is associated with a movement in the other variable, either in the
same direction or in the opposite direction. The correlation coefficient is a quantity which indicates the
degree of linear association between two variables. It is given by
$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^{2} - \left(\sum X\right)^{2}\right]\left[n\sum Y^{2} - \left(\sum Y\right)^{2}\right]}}$$
Interpretation of r:
𝑟 = +1, Implies that there is a perfect positive linear (direct) relationship
𝑟 = −1, Implies a perfect negative (indirect) linear relationship
−1 < 𝑟 < −0.5, Implies there is a strong negative linear relationship
−0.5 < 𝑟 < 0, Implies there is a weak negative linear relationship
0 < 𝑟 < +0.5, Implies there is a weak positive linear relationship
+0.5 < 𝑟 < +1, Implies there is a strong positive linear relationship
𝑟 = 0, Implies there is no linear relationship between the two variables.
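The product moment correlation coefficient can be computed from the same sums used in regression. The sketch below uses the standard computational formula; the data are hypothetical and the function name pearson_r is an illustrative choice.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product moment correlation coefficient between xs and ys."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_xx - sum_x**2) * (n * sum_yy - sum_y**2))
    return numerator / denominator


# Hypothetical data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
print(round(r, 4))
```

For these hypothetical values 𝑟 lies between +0.5 and +1, which the scale above would interpret as a strong positive linear relationship.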
Example 1: Using the data in example 1 on regression above, measure the degree of association
between X and 𝑌. Also comment on your result.

Spearman’s Rank Correlation Coefficient


This is a quantity that measures the degree of association between two variables
on the basis of their ranks rather than their actual values. Qualitative variables such as
efficacy, intelligence, beauty, religion etc. cannot be measured quantitatively; ranking the observations circumvents this
problem. The Spearman’s rank correlation coefficient is given by
$$r_S = 1 - \frac{6\sum D^{2}}{n(n^{2} - 1)}$$
where 𝐷 is the difference in ranks and 𝑛 is the number of observations, −1 ≤ 𝑟S ≤ +1.
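A small sketch of the rank correlation calculation follows. It assumes there are no tied scores, in which case the formula 1 − 6Σ𝐷²/(𝑛(𝑛² − 1)) applies directly. The scores are hypothetical and are not the judges' scores from the example below; the function names are illustrative choices.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rs(xs, ys):
    """Spearman's rank correlation coefficient for untied data."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n**2 - 1))


# Hypothetical scores awarded by two judges to five candidates.
judge_1 = [86, 70, 92, 65, 78]
judge_2 = [80, 72, 90, 60, 85]
rs = spearman_rs(judge_1, judge_2)
print(round(rs, 4))
```

When ties occur, tied observations are usually given the average of the ranks they would occupy, which this sketch does not handle.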


Example 2: In a test carried out to measure the efficiency of fifteen Computer Science students
in developing software for the computation of the institution’s student results, two judges were
asked to score the candidates. Their scores are as follows. Is there any agreement in their
assessment of the candidates?

Comment: There is a fairly strong agreement between the two judges in their assessment.
Exercise 7
In a certain company, drums of mentholated spirit are kept in storage for some time before being
bottled. During storage, part of the water content of the spirit evaporates, and an
examination of such drums gives the following results.
(i) Find the regression equation of evaporation loss on the storage time for the mentholated
spirit.
(ii) Find the regression equation of storage time on the evaporation loss.
(iii) Find the product moment correlation coefficient.
(iv) Determine the Spearman’s rank correlation.

ESTIMATION
Assigning a value to a population parameter based on sample information is called
estimation. An estimate is a value assigned to a population parameter based on the value of a
statistic. A sample statistic used to estimate a population parameter is called an estimator. In
other words, the function or rule that is used to guess the value of a parameter is called an
estimator, and an estimate is a particular value calculated from a particular sample of
observations. An estimator, like any statistic, is a random variable. A parameter is represented by
a Greek letter, while a statistic is represented by a Roman letter:

Estimators are of two kinds, namely (i) point estimators and (ii) interval estimators. A point
estimator gives a single value for the population parameter based on the value of the sample
statistic, while an interval estimator consists of two numerical values between which we believe,
with some degree of confidence, the value of the parameter being estimated lies. In many
situations, a point estimate does not supply complete information to a researcher; hence, an
approach called the confidence interval is used.
The subject of estimation is concerned with the methods by which population characteristics are
estimated from sample information. The objectives are:
(i) To present properties for judging how well a given sample statistic estimates the parent
population parameter.
(ii) To present several methods for estimating these parameters.
Properties of a Good Estimator

Example:
(a) Show that X is an unbiased estimator.
(b) Show that 𝑆2 is a biased estimator.
(c) If a population is infinite or if sampling is with replacement then the
2. Efficiency
The most efficient estimator among a group of unbiased estimators is the one with the smallest
variance. This concept refers to the sampling variability of an estimator.
3. Sufficiency
An estimator is sufficient if it uses all the information that a sample can provide about a
population parameter and no other estimator can provide additional information.
4. Consistency
An estimator is consistent if, as the sample size becomes larger, the probability increases that the
estimate will approach the true value of the population parameter. Alternatively, θ̂ is
consistent if it satisfies the following:

INTERVAL ESTIMATION
This involves specifying a range of values within which we can assert, with a certain degree of
confidence, that a population parameter will fall. The confidence that
the population parameter falls within the confidence interval is 1 − 𝛼, where 𝛼 is the probability
that the interval does not contain the parameter.
Confidence Interval for 𝜇 when 𝜎 is known
A confidence interval is constructed on the basis of sample information. It depends on the size
of 𝛼, which is the level of risk that the interval may be wrong. Assume the population variance 𝜎²
is known and the population is normal; then a 100(1 − 𝛼)% confidence interval for 𝜇 is given by
$$\bar{x} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$
Example: An experiment was carried out to estimate the average number of heartbeats per
minute for a certain population. Under the conditions of the experiment, the average number of
heartbeats per minute for 49 subjects was found to be 130. It is reasonable to assume that
these 49 subjects constitute a random sample and that the population is normally distributed with a
standard deviation of 10. Determine
(a) The 90% confidence interval for 𝜇
(b) The 95% confidence interval for 𝜇
(c) The 99% confidence interval for 𝜇
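The three intervals in this example can be sketched as follows, using the sample mean plus or minus the standard normal critical value times 𝜎/√𝑛. The critical values are taken from Python's NormalDist rather than from a printed table, so the last digit may differ slightly from table-based answers; the function name z_interval is an illustrative choice.

```python
from math import sqrt
from statistics import NormalDist


def z_interval(xbar, sigma, n, confidence):
    """Confidence interval for the mean when sigma is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sigma / sqrt(n)
    return xbar - margin, xbar + margin


# Heartbeat example: n = 49, sample mean 130, sigma = 10.
ci_90 = z_interval(130, 10, 49, 0.90)
ci_95 = z_interval(130, 10, 49, 0.95)
ci_99 = z_interval(130, 10, 49, 0.99)

for name, (low, high) in [("90%", ci_90), ("95%", ci_95), ("99%", ci_99)]:
    print(name, round(low, 2), round(high, 2))
```

The interval widens as the confidence level increases, since a larger critical value is needed to cover the mean with higher confidence.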
When the population variance is unknown, the distribution used to construct a confidence interval for
𝜇 is the t-distribution. Here an estimate 𝑆 must be calculated from the sample to substitute for
the unknown standard deviation. The t-distribution also assumes that the population is normal. A 100(1
− 𝛼)% confidence interval for 𝜇 when 𝜎 is unknown is given by
$$\bar{x} \pm t_{\alpha/2,\, n-1}\,\frac{S}{\sqrt{n}}$$
Example: A sample of 25 teenage boys yielded a mean weight and standard deviation of
73 and 10 pounds respectively. Assuming a normally distributed population, find the
(i) 90%
(ii) 99% confidence interval for the mean 𝜇 of the population.
Confidence Interval for a population proportion
To estimate a population proportion, we proceed in the same manner as when estimating a
population mean. A sample is drawn from the population of interest and the sample proportion
𝑝̅ is computed. The sample proportion is used as the point estimator of the population
proportion. The sampling distribution of 𝑝̅ is approximately normal when 𝑛𝑝 and 𝑛(1 − 𝑝) are both
greater than 5, so that a 100(1 − 𝛼)% confidence interval for 𝑝 is given by
$$\bar{p} \pm z_{\alpha/2}\sqrt{\frac{\bar{p}(1 - \bar{p})}{n}}$$
Example: A survey was conducted to study the dental health practices and attitudes of a certain
urban adult population. Of 300 adults interviewed, 123 said that they regularly had a dental
check-up twice a year. Obtain a 95% confidence interval for the population proportion 𝑝 based on this data.
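A minimal sketch of the calculation for this example is shown below. It uses the sample proportion 123/300 and the large-sample normal interval, with the sample proportion substituted for 𝑝 in the standard error; the function name is an illustrative choice.

```python
from math import sqrt
from statistics import NormalDist


def proportion_interval(successes, n, confidence):
    """Large-sample confidence interval for a population proportion."""
    p_bar = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_bar * (1 - p_bar) / n)
    return p_bar - margin, p_bar + margin


# Dental survey: 123 of 300 adults have a check-up twice a year.
lower, upper = proportion_interval(123, 300, 0.95)
print(round(lower, 4), round(upper, 4))
```

Here both 𝑛𝑝̅ = 123 and 𝑛(1 − 𝑝̅) = 177 exceed 5, so the normal approximation is reasonable.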
Exercise 8
A medical records librarian drew a random sample of 100 patients’ charts and found that in 8% of
them, the face sheet had at least one item of information contradictory to other information in
the record. Construct the 90%, 95% and 99% confidence intervals for the proportion of charts in the population
containing such discrepancies.
HYPOTHESIS TESTING
This is another very important aspect of statistical inference. It involves testing the validity of a
statistical statement about a population parameter based on the available information at a given
level of significance.
Basic Definitions
Statistical hypothesis: This is a statistical statement which may or may not be true concerning
one or more populations.
Null hypothesis (H0): This is an assertion that a parameter in a statistical model takes a
particular value. This hypothesis expresses no difference between the true value and the
hypothesized value. It is denoted by 𝐻0: θ1 = θ0, where θ0 is the hypothesized value and θ1 is the true
value.
Alternative hypothesis (H1): This hypothesis expresses a deviation from the null hypothesis. It
states that the true value deviates from the hypothesized value. It is denoted by 𝐻1: θ1 ≠ θ0 or θ1 > θ0
or θ1 < θ0.

Critical Region: This is the subset of the sample space which leads to the rejection of the null
hypothesis under consideration. It is the set of all values whose total probability is small under the
null hypothesis and which are better explained by the alternative hypothesis.
Significance Level: It is the probability of making a wrong decision by rejecting a true null
hypothesis. There are two types of errors in hypothesis testing:
TYPE I ERROR: This is the error of rejecting the null hypothesis when it should be accepted. Its
probability is denoted by 𝛼; 0 < 𝛼 < 1.
TYPE II ERROR: This is the error of accepting the null hypothesis when it should be rejected. Its
probability is denoted by 𝛽; 0 < 𝛽 < 1.
Test about a single population mean, 𝜇
Example 1: A researcher is interested in the mean level of some enzyme in a certain
population. The data available to the researcher are the enzyme determinations made on a
sample of 10 individuals from the population of interest, and the sample mean is 22. Assume
the sample came from a population that is normally distributed with a known variance of 45. Can
the researcher conclude that the mean enzyme level in this population is different from 25? Take 𝛼 =
0.05.

Decision: Since |𝑍𝑐| = 1.4142 < 𝑍0.025 = 1.96, we accept 𝐻O (the null hypothesis). Conclusion: We
conclude that the mean enzyme level in the population is not significantly
different from 25.
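The test statistic and decision in Example 1 can be reproduced with a short sketch. It assumes the two-sided hypotheses 𝐻0: 𝜇 = 25 against 𝐻1: 𝜇 ≠ 25 and the statistic 𝑍 = (sample mean − 𝜇0)/√(𝜎²/𝑛); the function name is an illustrative choice.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z(xbar, mu0, variance, n):
    """Z statistic for testing a mean when the population variance is known."""
    return (xbar - mu0) / sqrt(variance / n)


alpha = 0.05
z_c = one_sample_z(xbar=22, mu0=25, variance=45, n=10)
z_critical = NormalDist().inv_cdf(1 - alpha / 2)

# Two-sided test: reject H0 only if |Z| exceeds the critical value.
reject_h0 = abs(z_c) > z_critical
print(round(z_c, 4), round(z_critical, 2), reject_h0)
```

Because the alternative is two-sided, the comparison is made on the absolute value of the statistic.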
Example 2: A manufacturer of multi-vitamin tablets claims that the riboflavin content of his tablets is
greater than 2.49 mg. A check by the food and drug administration using 82 tablets shows a
mean riboflavin content of 2.52 mg with a standard deviation of 0.18 mg. Should the manufacturer’s
claim be rejected at the 1% significance level?
𝜎 unknown
Example 3: A certain drug company claims that one brand of headache tablet is capable of
curing a headache in one hour. A random sample of 16 headache patients is given the tablets. The
mean curing time was found to be one hour and nine minutes, while the standard deviation of
the 16 times was eight minutes. Do the data support the company’s claim or not at the 95% level
of significance?

Example: In a large hospital for the treatment of the mentally retarded, a sample of 12
individuals with mongolism yielded a mean serum uric acid value of 4.5 mg/100 ml. In a general
hospital, a sample of 15 normal individuals of the same age and sex were found to have a mean
value of 3.4 mg/100 ml. If it is reasonable to assume that the two populations of values are
normally distributed with variances equal to one, do these data provide sufficient evidence to
indicate a difference in the mean levels between individuals with mongolism and normal individuals, using 𝛼 = 0.05?

Example: Serum amylase determinations were made on a sample of 15 apparently normal
subjects. The sample yielded a mean of 96 units/100 ml and a standard deviation of
35 units/100 ml. These determinations were also made on 22 hospitalized subjects, with the mean
and standard deviation from this second sample being 120 and 40 units/100 ml respectively. Would one
be justified in concluding that the population means are different at 𝛼 = 0.05?
We accept 𝐻O and conclude that there is no difference in the means of the two populations.
Testing Hypothesis for population variance
Example: A sample of 25 ten-year-old girls yielded a mean weight and standard deviation of 73
and 10 pounds respectively. Should one conclude that the population variance is different from 150
at 𝛼 = 0.05?
GOODNESS OF FIT
This test is used to compare the observed frequencies with the frequencies we might expect
(expected frequencies) from a given theoretical explanation of the phenomenon under
investigation. A measure of this discrepancy is given by the χ² (chi-square) statistic, defined as:
$$\chi^{2} = \sum_{i=1}^{k} \frac{(O_i - E_i)^{2}}{E_i}$$
where 𝑂𝑖 and 𝐸𝑖 are the observed and expected frequencies of the 𝑖th class and 𝑘 is the number of classes.
The statistic defined above has an approximate χ²-distribution with degrees of freedom equal
to:
(i) 𝑘 − 1 (if expected frequencies can be computed without having to estimate population
parameters from sample statistics)
(ii) 𝑘 − 1 − 𝑚 (if expected frequencies can be computed only by estimating 𝑚 population parameters from sample statistics).
To test the null hypothesis 𝐻O, which specifies certain (population) proportions associated with
each class or category, we compute the value of χ² and thereafter draw the necessary conclusion.
Example 1: The distribution of final grades given by STS 202 lecturers in the past was 10% 𝐴𝑠,
20% 𝐵𝑠, 30% 𝐶𝑠, 25% 𝐷𝑠 and 15% 𝐹𝑠. A new lecturer gave the following grades for the second
semester:

Is there sufficient evidence to suggest that the new lecturer’s policy is different from that of the
former lecturers? Use a 5% level of significance.
Solution: 𝐻O: 𝑝1 = 0.10, 𝑝2 = 0.20, 𝑝3 = 0.30, 𝑝4 = 0.25, 𝑝5 = 0.15
𝐻1: 𝐻O is not true
𝑛 = 12 + 20 + 26 + 14 + 8 = 80
Since we have five classes (grades), degree of freedom = 𝑘 − 1 = 5 − 1 = 4
and 𝛼 = 0.05.
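The chi-square statistic for this example can be sketched as follows. The table of observed grades did not survive in these notes, so the code assumes that the counts 12, 20, 26, 14 and 8 in the total above correspond to grades A, B, C, D and F in that order; this ordering is an assumption. The critical value 9.488 is the usual table value for 4 degrees of freedom at 𝛼 = 0.05.

```python
# Assumed ordering of the observed counts: A, B, C, D, F.
observed = [12, 20, 26, 14, 8]
proportions = [0.10, 0.20, 0.30, 0.25, 0.15]

n = sum(observed)
expected = [n * p for p in proportions]

# Chi-square statistic: sum of (O - E)^2 / E over the classes.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
degrees_of_freedom = len(observed) - 1

critical_value = 9.488  # chi-square table value, 4 d.f., alpha = 0.05
reject_h0 = chi_square > critical_value

print(round(chi_square, 3), degrees_of_freedom, reject_h0)
```

Under the assumed ordering the statistic falls below the critical value, so 𝐻O would not be rejected.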

CONTINGENCY TABLE ANALYSIS


Another use of the χ² statistic is in contingency testing, where 𝑛 randomly selected items are
classified according to two different criteria. Here, it is desired to determine, for example, whether some protective
measure or sample preparation technique has been effective or not. Instead of the 1 × 𝑘 table of the
goodness-of-fit test, we have a two-way classification table, or h × 𝑘 table, in which the observed
frequencies occupy h rows and 𝑘 columns. Such tables are called contingency tables. Corresponding
to each observed frequency is an expected frequency, which is computed subject to some
hypothesis according to the rules of probability.
The test statistic so defined has (h − 1)(𝑘 − 1) degrees of freedom.
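A sketch of the contingency table calculation is given below. Under the hypothesis of independence, the expected frequency of each cell is taken as (row total × column total)/𝑛, which is the usual rule. The 2 × 2 table used is hypothetical, not the inspectors' data from the example that follows, and the function name is an illustrative choice.

```python
def chi_square_contingency(table):
    """Chi-square statistic and degrees of freedom for an h x k table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi_square = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected frequency under independence of the two criteria.
            expected = row_totals[i] * col_totals[j] / n
            chi_square += (observed - expected) ** 2 / expected
    degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
    return chi_square, degrees_of_freedom


# Hypothetical 2 x 2 table of observed frequencies.
table = [[10, 20],
         [20, 10]]
chi_square, df = chi_square_contingency(table)
print(round(chi_square, 4), df)
```

The computed value is then compared with the tabulated χ² value for the given degrees of freedom and significance level.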
Example 2: Consider the case where pressure gauges are being hydraulically tested by 3
inspectors prior to shipment. It has been noted that their acceptance and rejection for some
period of time have been as follows:
Decision: Since χ²𝑐𝑎𝑙 > χ²𝑡𝑎𝑏, the null hypothesis is rejected and we conclude that some
inspectors are stricter and more demanding than the others.
Exercise 9
A random sample of 1200 students who live in Saudi hall of the Crescent University were asked
about their daily eating habits by means of questionnaire X. Similarly, information was obtained from
another sample of 1050 students living in Yola hall of the same institution by means of
questionnaire 𝑌. The results were as follows. Can the difference in these distributions be purely
due to chance? Support your answer from a statistical point of view.
INTRODUCTION TO EXPERIMENTAL DESIGN
Experimental design has been defined as the order in which an experiment is run such that its
analysis will lead to valid statistical inference. The design of an experiment has three essential
components:
(a) Estimate of the error
(b) Control of the error
(c) Proper interpretation of the error
Experimental design methods are also useful in engineering design when new products are
developed and existing ones improved. Some typical applications of statistically designed
experiments in engineering include:
1. Evaluation and comparison of basic design configurations.
2. Evaluation of different materials.
3. Selection of design parameters so that the product will work well under a wide range of field
conditions.
4. Determination of the key product design parameters that affect product performance.
Terms in Experimental Design
Factor: An independent variable of interest under investigation
Factor level
Treatment
Experimental unit: This is the unit to which a single treatment combination is applied in a single
replication of the experiment.
Replication
Grouping
Blocking
Randomization
Control
Types of Experimental Design
Designs are classified according to their classification factor.
(1) Completely Randomized Design (CRD)
This is a design in which treatments are assigned completely at random, such that each experimental
unit has the same chance of receiving any one treatment.
(2) Randomized Complete Block Design (RCBD)
In this design, the experimental units are divided into homogeneous groups (blocks), and the
treatments are compared within each block, because variation is kept small within each block.

ANALYSIS OF VARIANCE (ANOVA)


The technique known as analysis of variance employs tests based on variance ratios to
determine whether or not significant differences exist among the means of several groups of
observations, where each group follows a normal distribution. ANOVA is particularly useful
when the basic differences between the groups cannot be stated quantitatively. A one-way
analysis of variance is used to determine the effect of one independent variable on a dependent
variable. A two-way analysis of variance is used to determine the effects of two independent
variables on a dependent variable, etc. As the number of independent variables increases, the calculations
become much more complex and are best carried out on a computer. An independent
variable is also referred to as a factor or treatment.

ONE WAY ANOVA (Completely Randomized Design)


This model is used when we wish to test the equality of 𝑘 population means. The procedure is
based on the assumptions that each of the 𝑘 groups of observations is a random sample from a normally
distributed population and that the population variance (𝜎²) is constant among the groups.
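The entries of a one-way ANOVA table can be computed as in the sketch below. It uses the usual decomposition into a between-groups (treatment) sum of squares and a within-groups (error) sum of squares, with 𝐹 equal to the ratio of the corresponding mean squares. The data are hypothetical and the function name is an illustrative choice.

```python
def one_way_anova(groups):
    """Return (SS between, SS within, df between, df within, F ratio)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between-groups (treatment) and within-groups (error) sums of squares.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f_ratio


# Hypothetical observations for three treatments.
groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
ss_b, ss_w, df_b, df_w, f_cal = one_way_anova(groups)
print(ss_b, ss_w, df_b, df_w, f_cal)
```

The calculated 𝐹 is compared with the tabulated 𝐹 value for (𝑘 − 1, 𝑛 − 𝑘) degrees of freedom at the chosen 𝛼.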
ANOVA TABLE

Decision: Since 𝐹𝑐𝑎𝑙 < 𝐹𝑡𝑎𝑏, we accept 𝐻0 and conclude that the treatment means are equal, that is, there is no
statistically significant difference in the treatment means.
Example 2: Specimens were randomly selected from three production processes for steel, with
each process using a different percentage of carbon. Independent observations on tensile strength
were made, with one observation coming from each specimen. The data are as follows, with
measurements in thousands of psi.

Is there evidence to say that the mean tensile strength differs for the 3 processes? Take 𝛼 = 5%.
= 15.69
ANOVA TABLE

TWO – WAY ANOVA (Randomized Complete Block Design)


Example: A sample of 200 machined parts was selected from the one-week output of a machine
shop that employs three machinists. The parts were inspected to determine whether or not they
are defective and were categorized according to which machinist did the work. The results are as
follows:
Is the Defective, Non-Defective classification independent of machinist? Conduct a test at the 1%
level of significance.
ANOVA TABLE
Exercise 10
Three kinds of tomato are grown. The yields in grams, after harvesting are given in the table
below.

Carry out the analysis of variance for the data at the 5% level of significance.
NON PARAMETRIC TEST
Statistical methods of inference that do not depend on stringent assumptions about the population
are called non-parametric methods. They are used
(i) When we do not know the mean of the population distribution
(ii) When we need a result in a hurry
(iii) When the data are measured on a scale lower than that required by parametric methods.
Non-parametric tests were developed to deal with situations where the population
distributions are non-normal or unknown, or when little is known about the distributions of the
populations under study, or when these distributions do not meet the requirements necessary
for the use of parametric tests, especially when the sample size is small (less than 30). Non-parametric
methods employ the median of the population and the method of hypothesis testing
in conducting their tests. Some non-parametric tests include the sign test, the Wilcoxon signed-rank test,
the Mann-Whitney U test, the runs test, etc.
THE SIGN TEST

This test is used to study the median of a population and to compare two populations when the
samples are dependent. We usually assume that the data are continuous.
Testing a single population median (M0)
Procedure
1. Hypotheses: 𝐻0: 𝑀𝑑 = 𝑀0, 𝐻1: 𝑀𝑑 ≠ 𝑀0 (or 𝑀𝑑 > 𝑀0, or 𝑀𝑑 < 𝑀0)
2. Test statistic: X = number of data values in the sample above the median value given in
the null hypothesis 𝐻0.
3. Significance level: Choose an appropriate level, say 𝛼.
4. Critical region: When 𝐻0 is true, X has a binomial distribution with sample size 𝑛 and 𝑝
= 0.5. We use a table of binomial probabilities to find the critical region, whose form depends on
whether 𝐻1 involves ≠, > or <. If 𝛼 is the desired level of significance, we choose the critical value
so that the probability that X falls in the critical region is as close to 𝛼 as possible.
5. Decision: We check whether the observed value of X is in the critical region or not; if X
falls into the critical region, we reject 𝐻0; otherwise, we accept 𝐻0.
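The procedure above can be sketched in code. Note that this sketch swaps the critical-region lookup for an exact two-sided p-value computed from the binomial distribution with 𝑝 = 0.5; rejecting 𝐻0 when the p-value is at most 𝛼 is a closely related decision rule, not the table-based method described in the notes. Values equal to 𝑀0 are dropped, which is a common convention and an assumption here. The data are hypothetical.

```python
from math import comb


def sign_test(data, m0):
    """Return (X, n, two-sided p-value) for H0: median = m0."""
    above = sum(1 for value in data if value > m0)
    below = sum(1 for value in data if value < m0)
    n = above + below  # observations equal to m0 are dropped

    # Under H0, X ~ Binomial(n, 0.5).
    lower_tail = sum(comb(n, i) for i in range(0, above + 1)) / 2**n
    upper_tail = sum(comb(n, i) for i in range(above, n + 1)) / 2**n
    p_two_sided = min(1.0, 2 * min(lower_tail, upper_tail))
    return above, n, p_two_sided


# Hypothetical sample; test H0: median = 10 against H1: median != 10.
data = [12, 15, 9, 20, 22, 18, 25, 30, 11, 19]
x, n, p_value = sign_test(data, m0=10)
print(x, n, round(p_value, 4))
```

For a one-sided alternative, only the relevant tail probability would be used instead of twice the smaller tail.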

Comparison of two populations


The sign test may be used to compare two populations when the samples are dependent (i.e. the
values from the two samples occur in pairs).
THE WILCOXON SIGNED-RANK TEST

Although the sign test is very simple to use, it is not a very sensitive test. Sometimes it will fail to
reject a false null hypothesis when another test would succeed in detecting that
the null hypothesis is false. This is because the sign test throws away a good deal of information about the
data: it ignores the magnitude of each data value and uses only the information about whether the
data value is above the conjectured value of the median or not.
The Wilcoxon signed-rank test is better than the sign test because it uses more information. We use
this test to investigate a single population median and to compare two populations using a
paired experiment.
Procedure for a single population median
Let M be the value of the median in question that appears in the null hypothesis. Calculate D = X
– M for each data value X. We then rank the absolute values of D and place a minus sign in front of each
rank corresponding to a negative difference D. Let W+ = sum of the positive ranks and W− =
absolute value of the sum of the negative ranks.
(a) To test 𝐻0: 𝑀𝑑 = 𝑀 against 𝐻1: 𝑀𝑑 > 𝑀
Use W− as the test statistic. Find the critical value C for a one-tailed test at the desired significance level
in the table of critical values for the Wilcoxon signed-rank test. If W− ≤ 𝐶, reject 𝐻0.
(b) To test 𝐻0: 𝑀𝑑 = 𝑀 against 𝐻1: 𝑀𝑑 < 𝑀
Use W+ as the test statistic, and if W+ ≤ 𝐶, reject 𝐻0.
(c) To test 𝐻0: 𝑀𝑑 = 𝑀 against 𝐻1: 𝑀𝑑 ≠ 𝑀
If either W+ ≤ 𝐶 or W− ≤ 𝐶, reject 𝐻0. This means that we can use the minimum of W+ and W− as the
test statistic. We denote this value by W. If W ≤ 𝐶, reject 𝐻0.
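The computation of W+ and W− can be sketched as follows. The sketch ranks the absolute differences, gives tied absolute differences the average of their ranks, and drops differences equal to zero; the handling of ties and zeros follows common convention and is an assumption, since the notes do not discuss it. The data are hypothetical, and the critical value C must still be read from the Wilcoxon signed-rank table.

```python
def signed_rank_sums(data, m):
    """Return (W+, W-) for testing a population median equal to m."""
    diffs = [x - m for x in data if x != m]  # zero differences are dropped
    ordered = sorted(diffs, key=abs)

    # Assign ranks to |D|, using average ranks for tied absolute values.
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[t] = average_rank
        i = j + 1

    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return w_plus, w_minus


# Hypothetical sample; conjectured median M = 10.
w_plus, w_minus = signed_rank_sums([12, 7, 15, 9, 14], m=10)
w = min(w_plus, w_minus)  # test statistic for the two-sided case (c)
print(w_plus, w_minus, w)
```

As a check, W+ and W− always add up to 𝑛(𝑛 + 1)/2, where 𝑛 is the number of non-zero differences.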
