Biostatistics Notes
Chapter 1
Introduction to Biostatistics
Statistics
Statistics is a field of study concerned with (1) collection, organization,
summarization and analysis of data; and (2) the drawing of inferences about a body
of data when only a part of the data is observed.
Statistic
A characteristic, or value, derived from sample data.
Data
The raw material of statistics is data. We may define data as numbers. The two
kinds of numbers that we use in statistics are
1. numbers that result from the taking of a measurement
2. numbers that result from the process of counting.
Sources of Data
1. Routinely kept records
2. Surveys
3. Experiments
4. External sources
Biostatistics
When the data analyzed are derived from the biological sciences and
medicine, we use the term biostatistics.
Variable
A characteristic that takes on different values in different persons, places or things.
A variable is any quality, characteristic or constituent of a person or thing that
can be measured.
A variable is any measured characteristic or attribute that differs for different
subjects.
Quantitative Variable
A quantitative variable is one that can be measured in the usual manner. Measurements made on quantitative variables convey information regarding amount. Eg., height of a person.
Copyright Dr. Win Khaing (2007)
Qualitative Variable
A qualitative variable is one that cannot be measured in the usual sense. Many
characteristics can only be categorized. Measurements made on qualitative
variables convey information regarding attributes. Eg., ethnic group of a person.
Random Variable
When the values obtained arise as a result of chance factors, so that they
cannot be exactly predicted in advance, the variable is called a random variable.
Eg., Adult height
Discrete Random Variable
A discrete random variable is characterized by gaps or interruptions in the
values that it can assume. These gaps or interruptions indicate the absence of
values between particular values that the variable can assume.
Eg., number of daily admission in a hospital
Continuous Random Variable
A continuous random variable does not possess the gaps or interruptions
characteristic of a discrete random variable. A continuous random variable can
assume any value within a specified relevant interval of values assumed by the
variable.
Eg., Height, Weight, Head circumference.
Population
A population of entities may be defined as the largest collection of entities for which we have
an interest at a particular time.
A population of values may be defined as the largest collection of values of a random variable
for which we have an interest at a particular time.
Sample
A sample may be defined as a part of a population.
Interval Scale
With this scale it is not only possible to order measurements, but the
distance between any two measurements is also known.
The selected zero point is not necessarily a true zero in that it does not
have to indicate a total absence of the quantity being measured.
The interval scale, unlike the nominal and ordinal scales, is a truly
quantitative scale.
Statistical Inference
Statistical inference is the procedure by which we reach a conclusion about a
population on the basis of the information contained in a sample that has been
drawn from that population.
Simple Random Sample
If a sample of size n is drawn from a population of size N in such a way that
every possible sample of size n has the same chance of being selected, that sample
is called a simple random sample.
Sampling with replacement
When sampling with replacement is employed, every member of the
population is available at each draw.
Sampling without replacement
In sampling without replacement, a drawn member is not returned, so a given
member could appear in the sample only once.
Chapter 2
Descriptive Statistics
Descriptive Statistics
Descriptive statistics are methods for organizing and summarizing a set of
data that help us to describe the attributes of a group or population.
A commonly followed rule of thumb: no fewer than six intervals and no more
than 15.
o With fewer than six, information contained in the data is lost.
o With more than 15, the data have not been summarized enough.
Sturges's Rule
To decide how many class intervals are needed, we may use the formula given by
Sturges's rule:
k = 1 + 3.322 (log10 n)
where k = number of class intervals and n = number of values in the data set.
The answer obtained by applying Sturges's rule should not be regarded as final,
but should be considered as a guide only.
The width of the class intervals:
w = R / k
where R = range (largest value minus smallest value) and k = number of class intervals.
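As a minimal Python sketch (with a hypothetical sample size and hypothetical extreme values), Sturges's rule and the class-interval width can be computed directly:

```python
import math

def sturges_k(n):
    """Suggested number of class intervals for n observations (Sturges's rule)."""
    return round(1 + 3.322 * math.log10(n))

def class_width(largest, smallest, k):
    """Width of each interval: w = R / k, where R is the range."""
    return (largest - smallest) / k

k = sturges_k(100)            # 1 + 3.322 * 2 = 7.644, so about 8 intervals
w = class_width(90, 10, k)    # R = 80, so w = 10.0
```

As the notes say, the answer is a guide only; the result is usually rounded to a convenient width.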
Statistic
A descriptive measure computed from the data of a sample is called a
statistic.
Parameter
A descriptive measure computed from the data of a population is called a
parameter.
Measures of Central Tendency
1. Mean (Arithmetic Mean)
2. Median
3. Mode
Mean
The mean is obtained by adding all the values in a population or sample and
dividing by the number of values that are added.
Formula,
Population mean:  μ = ( Σ xᵢ ) / N,  the sum taken over i = 1 to N
Sample mean:  x̄ = ( Σ xᵢ ) / n,  the sum taken over i = 1 to n
where xᵢ = the ith observation, N = population size, and n = sample size.
Median
Daniel -
The median of a finite set of values is that value which divides the set
into two equal parts such that the number of values equal to or greater than the
median is equal to the number of values equal to or less than the median. (Daniel)
Pagano -
The median of a finite set of values is that value which divides the set
into two equal parts: if all the values have been arranged in order of magnitude, half
the values are greater than or equal to the median, whereas the other half are less
than or equal to it.
If the number of values is odd, the median will be the middle value when all
values have been arranged in order of magnitude.
When the number of values is even, there is no single middle value. Instead
there are two middle values. In this case the median is taken to be the mean of these
two middle values, when all values have been arranged in the order of their
magnitudes.
Median = the ( (n + 1) / 2 )th ordered observation.
Mode
The mode of a set of values is that value which occurs most frequently.
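The three measures of central tendency above can be sketched in a few lines of Python, using a small hypothetical sample (the data values are illustrative only):

```python
import statistics

data = [2, 3, 3, 5, 7, 8, 9]        # hypothetical sample, n = 7 (odd)

mean = sum(data) / len(data)        # add all values, divide by their number
median = statistics.median(data)    # middle value once data are ordered
mode = statistics.mode(data)        # the most frequently occurring value
```

Since n is odd here, the median is the single middle value (5); with an even n, `statistics.median` returns the mean of the two middle values, as the notes describe.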
Measure of Dispersion
1. Range
2. Variance
3. Standard Deviation
Range
The range is the difference between the largest and smallest value in a set of
observations.
R = xL − xS
where xL = largest value and xS = smallest value.
Advantage
The range is simple to calculate.
Disadvantage
The range takes into account only two values and is therefore heavily influenced by extreme values.
The Variance
Population variance:
σ² = Σ (xᵢ − μ)² / N,  the sum taken over i = 1 to N
Sample variance
The sum of the squared deviations of the values from their mean, divided by the
sample size minus 1, is the sample variance:
s² = Σ (xᵢ − x̄)² / (n − 1),  the sum taken over i = 1 to n
where
s² = sample variance
xᵢ = the ith observation
x̄ = sample mean
n = sample size
Computational (shortcut) formulas:
s² = [ n Σ xᵢ² − ( Σ xᵢ )² ] / [ n (n − 1) ]
σ² = [ N Σ xᵢ² − ( Σ xᵢ )² ] / ( N · N )
Standard Deviation
The square root of the variance is called the standard deviation.
s = √s² = √[ Σ (xᵢ − x̄)² / (n − 1) ]
Coefficient of Variation (C.V.)
C.V. = ( s / x̄ ) (100)
Advantages
When one desires to compare the dispersion in two sets of data,
comparing the two standard deviations may lead to fallacious results because
the units may differ. Even when the same unit of measurement is used, the two
means may be quite different.
Since the unit of measurement cancels out in computing the C.V., it can be
used to compare variability independent of the scale of measurement. Eg.,
weight in lb vs. kg.
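A short Python sketch of the sample variance and the C.V., using hypothetical weights, illustrates why the C.V. is scale-free: converting the same data from kg to lb leaves the C.V. unchanged.

```python
import math

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def cv_percent(xs):
    """Coefficient of variation: (s / mean) * 100, unit-free."""
    return math.sqrt(sample_variance(xs)) / (sum(xs) / len(xs)) * 100

kg = [60.0, 70.0, 80.0]             # hypothetical weights in kg
lb = [x * 2.20462 for x in kg]      # the same weights in lb
# cv_percent(kg) and cv_percent(lb) are equal, so the samples are comparable
```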
Percentile
Given a set of n observations x₁, x₂, …, xₙ, the pth percentile P is the value of
X such that p percent or less of the observations are less than P and (100 − p)
percent or less of the observations are greater than P.
First quartile (or) 25th percentile
Q1 = the ( (n + 1) / 4 )th ordered observation
Second quartile (or) middle quartile (or) 50th percentile (or) Median
Q2 = the ( 2(n + 1) / 4 )th = ( (n + 1) / 2 )th ordered observation
Third quartile (or) 75th percentile
Q3 = the ( 3(n + 1) / 4 )th ordered observation
Interquartile range
The interquartile range (IQR) is the difference between the third and first
quartiles.
IQR = Q3 − Q1
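The quartile positions above can be turned into a small Python sketch; when the (q(n + 1)/4)th position is not a whole number, a common convention (assumed here, since the notes do not specify one) is to interpolate linearly between the two neighbouring ordered observations:

```python
def quartile(xs, q):
    """The (q * (n + 1) / 4)th ordered observation, interpolating
    linearly when that position is not a whole number."""
    s = sorted(xs)
    pos = q * (len(s) + 1) / 4      # 1-based position in the ordered data
    lo = int(pos)                   # integer part of the position
    frac = pos - lo                 # fractional part used for interpolation
    if frac == 0:
        return s[lo - 1]
    return s[lo - 1] + frac * (s[lo] - s[lo - 1])

data = [1, 3, 4, 5, 5, 6, 7, 11]    # hypothetical sample, n = 8
q1 = quartile(data, 1)              # 2.25th ordered observation -> 3.25
q3 = quartile(data, 3)              # 6.75th ordered observation -> 6.75
iqr = q3 - q1                       # 3.5
```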
Measures of Central Tendency Computed from Grouped Data
1. Mean computed from Grouped Data
x̄ = Σ mᵢ fᵢ / Σ fᵢ,  the sums taken over i = 1 to k
where
k = number of class intervals
mᵢ = midpoint of the ith class interval
fᵢ = frequency of the ith class interval
2. Median computed from Grouped Data
Median = Lᵢ + ( j / fᵢ ) (Uᵢ − Lᵢ)
where Lᵢ = lower limit of the interval containing the median, Uᵢ = upper limit of
that interval, fᵢ = frequency of that interval, and j = number of observations still
lacking to reach the median once the lower limit Lᵢ has been reached.
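Both grouped-data formulas can be sketched in Python with a hypothetical frequency table (the class limits and frequencies are illustrative only):

```python
# hypothetical frequency table: (lower limit, upper limit, frequency)
table = [(10, 20, 4), (20, 30, 6), (30, 40, 8), (40, 50, 2)]

mids = [(lo + hi) / 2 for lo, hi, _ in table]   # class midpoints m_i
freqs = [f for _, _, f in table]                # class frequencies f_i

# grouped mean: sum of midpoint * frequency, divided by total frequency
grouped_mean = sum(m * f for m, f in zip(mids, freqs)) / sum(freqs)

# grouped median: walk the cumulative frequency up to n/2, then interpolate
half = sum(freqs) / 2
cum = 0
for lo, hi, f in table:
    if cum + f >= half:
        # j = half - cum observations are still lacking within this interval
        grouped_median = lo + (half - cum) / f * (hi - lo)
        break
    cum += f
```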
Chapter 3
Some Basic Probability Concepts
Probability
Probability is the relative possibility, or chance, or likelihood that an event will
occur.
Event
An event is a collection of one or more outcomes of an experiment.
Outcome
An outcome is a particular result of an experiment.
Experiment
An experiment is a process that leads to the occurrence of one of several
possible observations.
Mutually exclusive event
Two events are mutually exclusive if they cannot occur simultaneously; the occurrence of
any one event means that none of the others can occur at the same time.
Independent event
The occurrence of one event has no effect on the probability of the occurrence
of any other event.
TWO Views of Probability
1. Subjective Probability
is personalistic
2. Objective Probability
(a) Classical Probability
If an event can occur in N mutually exclusive and equally likely ways, and if m of
these possess a trait, E, the probability of the occurrence of E is equal to m / N.
P(E) = m / N
(b) Relative Frequency Probability
If some process is repeated a large number of times, n, and if some resulting event with
the characteristic E occurs m times, the relative frequency of occurrence of E, m/n, will be
approximately equal to the probability of E:
P(E) = m / n
Elementary Properties of Probability
1. The probability of any event Ei is nonnegative. (Property of Nonnegativeness)
P(Ei) ≥ 0
2. The sum of probabilities of the mutually exclusive outcomes is equal to 1.
(Property of Exhaustiveness)
P(E1) + P(E2) + ... + P(En) = 1
3. For two mutually exclusive events E1 and E2, the probability of the occurrence of
either E1 or E2 is equal to the sum of their individual probabilities.
P(E1 or E2) = P(E1) + P(E2)
For two events E1 and E2 that are not mutually exclusive, the probability that event E1, or
event E2, or both occur is equal to the probability that event E1 occurs, plus
the probability that event E2 occurs, minus the probability that the events
occur simultaneously.
P(E1 or E2) = P(E1) + P(E2) − P(E1 and E2)
Rules of Probability
1. Multiplication Rule
If two events are independent,
P(A ∩ B) = P(A) P(B)
If two events are not independent,
P(A ∩ B) = P(A) P(B | A)
(or)
P(A ∩ B) = P(B) P(A | B)
2. Addition Rule
If two events are mutually exclusive,
P(A ∪ B) = P(A) + P(B)
If two events are not mutually exclusive,
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
3. Complementary Rule
P(Ā) = 1 − P(A)
(or)
P(A) = 1 − P(not A)
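The three rules above can be checked numerically with a simple, standard example, a fair six-sided die (chosen here purely for illustration):

```python
# a fair six-sided die: each face has probability 1/6
p_face = 1 / 6

# addition rule for mutually exclusive events: P(1 or 2) = P(1) + P(2)
p_1_or_2 = p_face + p_face              # 1/3

# multiplication rule for independent events: P(six, then six) = P(6) * P(6)
p_two_sixes = p_face * p_face           # 1/36

# complementary rule: P(not a six) = 1 - P(six)
p_not_six = 1 - p_face                  # 5/6
```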
Types of Probability
1. Marginal Probability
The probability of one of the marginal totals is used as the numerator and
the grand total of the group as the denominator.
Marginal Probability = Marginal Total / Grand Total
2. Joint Probability
The probability that a subject picked at random from a group of
subjects possesses two characteristics at the same time.
P(A and B)
For independent events,
P(A ∩ B) = P(A) P(B)
Joint Probability is the product of the Marginal Probabilities.
For events that are not independent,
P(A ∩ B) = P(A) P(B | A)
Joint Probability is the product of a Marginal and a Conditional Probability.
Note:
If P(A ∩ B) = P(A) P(B), the events are independent.
If P(A ∩ B) ≠ P(A) P(B), the events are not independent.
3. Conditional Probability
P(A | B) = P(A ∩ B) / P(B)
P(B | A) = P(A ∩ B) / P(A)
Conditional Probability = Joint Probability / Marginal Probability
Statistical independence
P(A ∩ B) = P(A) P(B)
Thus, if A and B are independent, then their joint probability can be expressed
as a simple product of their individual probabilities.
P(A | B) = P(A)  and  P(B | A) = P(B)
Mutual exclusivity
P(A ∩ B) = 0
as long as P(A) ≠ 0 and P(B) ≠ 0. Then
P(A | B) = 0  and  P(B | A) = 0
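Marginal, joint, and conditional probabilities, and the independence check above, can be sketched from a hypothetical 2 x 2 table of counts (exposure by disease; all counts invented for illustration):

```python
# hypothetical 2 x 2 table of counts: exposure (rows) by disease (cols)
n_ed, n_en = 30, 70          # exposed with / without disease
n_ud, n_un = 10, 90          # unexposed with / without disease
total = n_ed + n_en + n_ud + n_un      # grand total = 200

p_e = (n_ed + n_en) / total            # marginal: P(exposed) = 0.5
p_d = (n_ed + n_ud) / total            # marginal: P(disease) = 0.2
p_e_and_d = n_ed / total               # joint: P(exposed and disease) = 0.15
p_d_given_e = p_e_and_d / p_e          # conditional = joint / marginal = 0.3

# independence check: P(A and B) = P(A) P(B)?  0.15 vs 0.10 -> not independent
independent = abs(p_e_and_d - p_e * p_d) < 1e-12
```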
Chapter 4
Probability Distributions
Probability Distribution of Discrete Variables
1. Binomial Distribution
2. Poisson Distribution
Probability Distribution of Continuous Variables
3. Normal Distribution
Binomial Distribution (Swiss mathematician James Bernoulli, Bernoulli trial)
The Bernoulli Process A sequence of Bernoulli trials forms a Bernoulli process
under the following conditions.
1. Each trial results in one of two possible, mutually exclusive, outcomes. One of
the possible outcomes is denoted as a success, and the other is denoted as a
failure.
2. The probability of a success, denoted by p, remains constant from trial to trial.
The probability of a failure, 1 − p, is denoted by q.
3. The trials are independent; that is, the outcome of any particular trial is not
affected by the outcome of any other trial.
An experiment with a fixed number of independent trials, each of which can
only have 2 possible outcomes and the probability of each outcome remains
constant from trial to trial.
Formula,
P(x) = nCx · p^x · q^(n − x)
(or)
P(x) = [ n! / ( x! (n − x)! ) ] · p^x · (1 − p)^(n − x)
where
p = probability of success
q (or) 1 − p = probability of failure
n = number of trials
x = number of successes
When p > 0.5, binomial probabilities may be obtained from tables for 1 − p by using:
P(X = x | n, p) = P(X = n − x | n, 1 − p)
P(X ≤ x | n, p) = P(X ≥ n − x | n, 1 − p)
P(X ≥ x | n, p) = P(X ≤ n − x | n, 1 − p)
Binomial Parameters
Mean of Binomial Distribution
μ = np
Variance of Binomial Distribution
σ² = np(1 − p)
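The binomial formula and its parameters can be sketched in Python; n = 10 and p = 0.3 below are hypothetical values chosen only to illustrate the computation:

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) = nCx * p^x * q^(n - x), with q = 1 - p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.3
mean = n * p                    # mu = np = 3.0
var = n * p * (1 - p)           # sigma^2 = np(1 - p) = 2.1
# the probabilities over all possible x sum to 1 (Property of Exhaustiveness)
total = sum(binom_pmf(x, n, p) for x in range(n + 1))
```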
Poisson Distribution (French mathematician Simeon Denis Poisson)
Poisson Process
1. The occurrences of the events are independent. The occurrence of an event
in an interval of space or time has no effect on the probability of a second
occurrence of the event in the same, or any other, interval.
2. Theoretically, an infinite number of occurrences of the event must be possible
in the interval.
3. The probability of the single occurrence of the event in a given interval is
proportional to the length of the interval.
4. In any infinitesimally small portion of the interval, the probability of more than
one occurrence of the event is negligible.
Poisson probabilities are useful when there are a large number of independent trials
with a small probability of success on a single trial and the variables occur over a
period of time.
Formula,
P(x) = ( e^(−λ) · λ^x ) / x!
where
λ = the mean number of occurrences of the event in the interval
e = the constant 2.7183 (the base of natural logarithms)
x = the number of occurrences of the event
An interesting feature of the Poisson distribution is the fact that the mean and
variance are equal.
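This equal-mean-and-variance feature can be verified numerically with a Python sketch (λ = 3 is a hypothetical value; summing the pmf over a long enough range makes the truncated tail negligible):

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(x) = e^(-lambda) * lambda^x / x!"""
    return exp(-lam) * lam ** x / factorial(x)

lam = 3.0
pmf = [poisson_pmf(x, lam) for x in range(60)]   # tail beyond 59 is negligible
mean = sum(x * p for x, p in enumerate(pmf))
var = sum((x - mean) ** 2 * p for x, p in enumerate(pmf))
# both mean and var come out (approximately) equal to lambda
```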
The Standard Normal Distribution
The standard normal distribution is the normal distribution with a mean of zero
and a standard deviation of one.
Mean = 0
Standard Deviation = 1
Z-score
z = ( x − μ ) / σ
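The z-score transformation z = (x − μ)/σ converts any normal value to the standard normal scale; a minimal sketch with hypothetical numbers:

```python
def z_score(x, mu, sigma):
    """How many standard deviations x lies above (or below) the mean."""
    return (x - mu) / sigma

# hypothetical: a weight of 75 kg in a population with mean 70 kg, sd 10 kg
z = z_score(75, 70, 10)      # 0.5 standard deviations above the mean
```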
Chapter 5
Some Important Sampling Distributions
Sampling Distribution
The distribution of all possible values that can be assumed by some statistic,
computed from samples of the same size randomly drawn from the same population,
is called the sampling distribution of that statistic.
Sampling distributions serve two purposes:
1. they allow us to answer probability questions about sample statistics
2. they provide the necessary theory for making statistical inference
procedures valid.
Central Limit Theorem
Given a population of any nonnormal functional form with a mean μ and finite
variance σ², the sampling distribution of x̄, computed from samples of size n from
this population, will have mean μ and variance σ²/n and will be approximately
normally distributed when the sample size is large.
z = ( x̄ − μ ) / ( σ / √n )
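The theorem can be illustrated by simulation: draw many samples from a clearly nonnormal population and look at the mean and variance of the sample means. The population, sample size, and replication count below are all hypothetical choices for the sketch.

```python
import random
import statistics

random.seed(0)

# population: exponential with mean 1 and variance 1 (clearly nonnormal)
n = 50                                  # size of each sample
reps = 2000                             # number of samples drawn

means = [statistics.fmean(random.expovariate(1.0) for _ in range(n))
         for _ in range(reps)]

m = statistics.fmean(means)             # close to mu = 1
v = statistics.variance(means)          # close to sigma^2 / n = 0.02
```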
Sampling Distribution of Sample Mean
1. When sampling is from a normally distributed population with a known
population variance, the distribution of the sample mean will possess the
following properties:
a) the mean of the sampling distribution ( μ_x̄ ) will be equal to the
mean of the population ( μ ) from which the samples were drawn.
μ_x̄ = μ
b) the variance of the sampling distribution ( σ²_x̄ ) will be equal to the
variance of the population ( σ² ) divided by the sample size.
σ²_x̄ = σ² / n
2. When sampling is from a nonnormally distributed population, by the central
limit theorem:
a) μ_x̄ = μ
b) σ_x̄ = σ / √n   when n/N ≤ 0.05
σ_x̄ = ( σ / √n ) · √[ (N − n) / (N − 1) ]   otherwise (with the finite population correction)
Distribution of the Difference Between Two Sample Means
z = [ ( x̄1 − x̄2 ) − ( μ1 − μ2 ) ] / √( σ1²/n1 + σ2²/n2 )
Distribution of the Sample Proportion
z = ( p̂ − p ) / √( p(1 − p) / n )
Distribution of the Difference Between Two Sample Proportions
z = [ ( p̂1 − p̂2 ) − ( p1 − p2 ) ] / √( p1(1 − p1)/n1 + p2(1 − p2)/n2 )
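The last of these standardizations can be sketched in Python; the observed proportions, true proportions, and sample sizes below are hypothetical values chosen for illustration:

```python
from math import sqrt

def z_diff_proportions(p1_hat, p2_hat, p1, p2, n1, n2):
    """Standardize the difference between two sample proportions."""
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return ((p1_hat - p2_hat) - (p1 - p2)) / se

# hypothetical: observed 0.35 vs 0.25 when both true proportions equal 0.3
z = z_diff_proportions(0.35, 0.25, 0.3, 0.3, 100, 100)   # about 1.54
```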