Biostatistics Notes
Chapter 1
Introduction to Biostatistics
Statistics
Statistics is a field of study concerned with (1) collection, organization,
summarization and analysis of data; and (2) the drawing of inferences about a body
of data when only a part of the data is observed.
Statistic
A characteristic, or value, derived from sample data.
Data
The raw material of statistics is data. We may define data as numbers. The two
kinds of numbers that we use in statistics are
1. numbers that result from the taking of a measurement
2. numbers that result from the process of counting.
Sources of Data
1. Routinely kept records
2. Surveys
3. Experiments
4. External sources
Biostatistics
When the data analyzed are derived from the biological sciences and
medicine, we use the term biostatistics.
Variable
A characteristic that takes on different values in different persons, places or things.
A variable is any quality, characteristic or constituent of a person or thing that
can be measured.
A variable is any measured characteristic or attribute that differs for different
subjects.
Quantitative Variable
A quantitative variable is one that can be measured in the usual manner. Measurements made on quantitative variables convey information regarding amount. Eg., height of a person.
Copyright Dr. Win Khaing (2007)
Qualitative Variable
A qualitative variable is one that cannot be measured in the usual sense. Many
characteristics can only be categorized. Measurements made on qualitative
variables convey information regarding attributes. Eg., ethnic group of a person.
Random Variable
When the values obtained arise as a result of chance factors, so that they
cannot be exactly predicted in advance, the variable is called a random variable.
Eg., Adult height
Discrete Random Variable
A discrete random variable is characterized by gaps or interruptions in the
values that it can assume. These gaps or interruptions indicate the absence of
values between particular values that the variable can assume.
Eg., number of daily admission in a hospital
Continuous Random Variable
A continuous random variable does not possess the gaps or interruptions
characteristic of a discrete random variable. A continuous random variable can
assume any value within a specified relevant interval of values assumed by the
variable.
Eg., Height, Weight, Head circumference.
Population
A population of entities may be defined as the largest collection of entities for which we have
an interest at a particular time.
A population of values may be defined as the largest collection of values of a random variable
for which we have an interest at a particular time.
Sample
A sample may be defined as a part of a population.
Interval Scale
With this scale it is not only possible to order measurements, but the
distance between any two measurements is also known.
The selected zero point is not necessarily a true zero in that it does not
have to indicate a total absence of the quantity being measured.
The interval scale, unlike the nominal and ordinal scales, is a truly
quantitative scale.
Statistical Inference
Statistical inference is the procedure by which we reach a conclusion about a
population on the basis of the information contained in a sample that has been
drawn from that population.
Simple Random Sample
If a sample of size n is drawn from a population of size N in such a way that
every possible sample of size n has the same chance of being selected, that sample
is called a simple random sample.
Sampling with replacement
When sampling with replacement is employed, every member of the
population is available at each draw.
Sampling without replacement
In sampling without replacement, a drawn member is not returned, so a given
member could appear in the sample only once.
Chapter 2
Descriptive Statistics
Descriptive Statistics
Descriptive statistics are methods for organizing and summarizing a set of
data that help us to describe the attributes of a group or population.
A commonly followed rule of thumb: no fewer than six intervals and no more
than 15.
o With fewer than six, information contained in the data is lost.
o With more than 15, the data have not been summarized enough.
Sturges's Rule
To decide how many class intervals are needed, we may use the formula given by
Sturges's rule:
k = 1 + 3.322 (log10 n)
where k = number of class intervals and n = number of values in the data set.
The answer obtained by applying Sturges's rule should not be regarded as final,
but should be considered as a guide only.
The width of the class intervals:
w = R / k
where R = range (largest value minus smallest value) and k = number of class intervals.
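As a minimal Python sketch (with a hypothetical sample size and hypothetical extreme values), Sturges's rule and the class-interval width can be computed directly:

```python
import math

def sturges_k(n):
    """Suggested number of class intervals for n observations (Sturges's rule)."""
    return round(1 + 3.322 * math.log10(n))

def class_width(largest, smallest, k):
    """Width of each interval: w = R / k, where R is the range."""
    return (largest - smallest) / k

k = sturges_k(100)            # 1 + 3.322 * 2 = 7.644, so about 8 intervals
w = class_width(90, 10, k)    # R = 80, so w = 10.0
```

As the notes say, the answer is a guide only; the result is usually rounded to a convenient width.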
Statistic
A descriptive measure computed from the data of a sample is called a
statistic.
Parameter
A descriptive measure computed from the data of a population is called a
parameter.
Measures of Central Tendency
1. Mean (Arithmetic Mean)
2. Median
3. Mode
Mean
The mean is obtained by adding all the values in a population or sample and
dividing by the number of values that are added.
Formula,
Population mean:  μ = ( Σ xᵢ ) / N,  the sum taken over i = 1 to N
Sample mean:  x̄ = ( Σ xᵢ ) / n,  the sum taken over i = 1 to n
where xᵢ = the ith observation, N = population size, and n = sample size.
Median
Daniel -
The median of a finite set of values is that value which divides the set
into two equal parts such that the number of values equal to or greater than the
median is equal to the number of values equal to or less than the median. (Daniel)
Pagano -
The median of a finite set of values is that value which divides the set
into two equal parts: if all the values have been arranged in order of magnitude, half
the values are greater than or equal to the median, whereas the other half are less
than or equal to it.
If the number of values is odd, the median will be the middle value when all
values have been arranged in order of magnitude.
When the number of values is even, there is no single middle value. Instead
there are two middle values. In this case the median is taken to be the mean of these
two middle values, when all values have been arranged in the order of their
magnitudes.
Median = the ( (n + 1) / 2 )th ordered observation.
Mode
The mode of a set of values is that value which occurs most frequently.
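The three measures of central tendency above can be sketched in a few lines of Python, using a small hypothetical sample (the data values are illustrative only):

```python
import statistics

data = [2, 3, 3, 5, 7, 8, 9]        # hypothetical sample, n = 7 (odd)

mean = sum(data) / len(data)        # add all values, divide by their number
median = statistics.median(data)    # middle value once data are ordered
mode = statistics.mode(data)        # the most frequently occurring value
```

Since n is odd here, the median is the single middle value (5); with an even n, `statistics.median` returns the mean of the two middle values, as the notes describe.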
Measure of Dispersion
1. Range
2. Variance
3. Standard Deviation
Range
The range is the difference between the largest and smallest value in a set of
observations.
R = xL − xS
where xL = largest value and xS = smallest value.
Advantage
The range is simple to calculate.
Disadvantage
The range takes into account only two values and is therefore heavily influenced by extreme values.
The Variance
Population variance:
σ² = Σ (xᵢ − μ)² / N,  the sum taken over i = 1 to N
Sample variance
The sum of the squared deviations of the values from their mean, divided by the
sample size minus 1, is the sample variance:
s² = Σ (xᵢ − x̄)² / (n − 1),  the sum taken over i = 1 to n
where
s² = sample variance
xᵢ = the ith observation
x̄ = sample mean
n = sample size
Computational (shortcut) formulas:
s² = [ n Σ xᵢ² − ( Σ xᵢ )² ] / [ n (n − 1) ]
σ² = [ N Σ xᵢ² − ( Σ xᵢ )² ] / ( N · N )
Standard Deviation
The square root of the variance is called the standard deviation.
s = √s² = √[ Σ (xᵢ − x̄)² / (n − 1) ]
Coefficient of Variation (C.V.)
C.V. = ( s / x̄ ) (100)
Advantages
When one desires to compare the dispersion in two sets of data,
comparing the two standard deviations may lead to fallacious results because
the units may differ. Even when the same unit of measurement is used, the two
means may be quite different.
Since the unit of measurement cancels out in computing the C.V., it can be
used to compare variability independent of the scale of measurement. Eg.,
weight in lb vs. kg.
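A short Python sketch of the sample variance and the C.V., using hypothetical weights, illustrates why the C.V. is scale-free: converting the same data from kg to lb leaves the C.V. unchanged.

```python
import math

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def cv_percent(xs):
    """Coefficient of variation: (s / mean) * 100, unit-free."""
    return math.sqrt(sample_variance(xs)) / (sum(xs) / len(xs)) * 100

kg = [60.0, 70.0, 80.0]             # hypothetical weights in kg
lb = [x * 2.20462 for x in kg]      # the same weights in lb
# cv_percent(kg) and cv_percent(lb) are equal, so the samples are comparable
```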
Percentile
Given a set of n observations x₁, x₂, …, xₙ, the pth percentile P is the value of
X such that p percent or less of the observations are less than P and (100 − p)
percent or less of the observations are greater than P.
First quartile (or) 25th percentile
Q1 = the ( (n + 1) / 4 )th ordered observation
Second quartile (or) middle quartile (or) 50th percentile (or) Median
Q2 = the ( 2(n + 1) / 4 )th = ( (n + 1) / 2 )th ordered observation
Third quartile (or) 75th percentile
Q3 = the ( 3(n + 1) / 4 )th ordered observation
Interquartile range
The interquartile range (IQR) is the difference between the third and first
quartiles.
IQR = Q3 − Q1
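The quartile positions above can be turned into a small Python sketch; when the (q(n + 1)/4)th position is not a whole number, a common convention (assumed here, since the notes do not specify one) is to interpolate linearly between the two neighbouring ordered observations:

```python
def quartile(xs, q):
    """The (q * (n + 1) / 4)th ordered observation, interpolating
    linearly when that position is not a whole number."""
    s = sorted(xs)
    pos = q * (len(s) + 1) / 4      # 1-based position in the ordered data
    lo = int(pos)                   # integer part of the position
    frac = pos - lo                 # fractional part used for interpolation
    if frac == 0:
        return s[lo - 1]
    return s[lo - 1] + frac * (s[lo] - s[lo - 1])

data = [1, 3, 4, 5, 5, 6, 7, 11]    # hypothetical sample, n = 8
q1 = quartile(data, 1)              # 2.25th ordered observation -> 3.25
q3 = quartile(data, 3)              # 6.75th ordered observation -> 6.75
iqr = q3 - q1                       # 3.5
```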
Measures of Central Tendency Computed from Grouped Data
1. Mean computed from Grouped Data
x̄ = Σ mᵢ fᵢ / Σ fᵢ,  the sums taken over i = 1 to k
where
k = number of class intervals
mᵢ = midpoint of the ith class interval
fᵢ = frequency of the ith class interval
2. Median computed from Grouped Data
Median = Lᵢ + ( j / fᵢ ) (Uᵢ − Lᵢ)
where Lᵢ = lower limit of the interval containing the median, Uᵢ = upper limit of
that interval, fᵢ = frequency of that interval, and j = number of observations still
lacking to reach the median once the lower limit Lᵢ has been reached.
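Both grouped-data formulas can be sketched in Python with a hypothetical frequency table (the class limits and frequencies are illustrative only):

```python
# hypothetical frequency table: (lower limit, upper limit, frequency)
table = [(10, 20, 4), (20, 30, 6), (30, 40, 8), (40, 50, 2)]

mids = [(lo + hi) / 2 for lo, hi, _ in table]   # class midpoints m_i
freqs = [f for _, _, f in table]                # class frequencies f_i

# grouped mean: sum of midpoint * frequency, divided by total frequency
grouped_mean = sum(m * f for m, f in zip(mids, freqs)) / sum(freqs)

# grouped median: walk the cumulative frequency up to n/2, then interpolate
half = sum(freqs) / 2
cum = 0
for lo, hi, f in table:
    if cum + f >= half:
        # j = half - cum observations are still lacking within this interval
        grouped_median = lo + (half - cum) / f * (hi - lo)
        break
    cum += f
```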
Chapter 3
Some Basic Probability Concepts
Probability
Probability is the relative possibility, or chance, or likelihood that an event will
occur.
Event
An event is a collection of one or more outcomes of an experiment.
Outcome
An outcome is a particular result of an experiment.
Experiment
An experiment is a process that leads to the occurrence of one of several
possible observations.
Mutually exclusive event
Two events are mutually exclusive if they cannot occur simultaneously; the occurrence of
any one event means that none of the others can occur at the same time.
Independent event
The occurrence of one event has no effect on the probability of the occurrence
of any other event.
TWO Views of Probability
1. Subjective Probability
is personalistic
2. Objective Probability
(a) Classical Probability
If an event can occur in N mutually exclusive and equally likely ways, and if m of
these possess a trait, E, the probability of the occurrence of E is equal to m / N.
P(E) = m / N
(b) Relative Frequency Probability
If some process is repeated a large number of times, n, and if some resulting event with
the characteristic E occurs m times, the relative frequency of occurrence of E, m/n, will be
approximately equal to the probability of E:
P(E) = m / n
Elementary Properties of Probability
1. The probability of any event Ei is nonnegative. (Property of Nonnegativeness)
P(Ei) ≥ 0
2. The sum of probabilities of the mutually exclusive outcomes is equal to 1.
(Property of Exhaustiveness)
P(E1) + P(E2) + ... + P(En) = 1
3. For two mutually exclusive events E1 and E2, the probability of the occurrence of
either E1 or E2 is equal to the sum of their individual probabilities.
P(E1 or E2) = P(E1) + P(E2)
For two events E1 and E2 that are not mutually exclusive, the probability that event E1, or
event E2, or both occur is equal to the probability that event E1 occurs, plus
the probability that event E2 occurs, minus the probability that the events
occur simultaneously.
P(E1 or E2) = P(E1) + P(E2) − P(E1 and E2)
Rules of Probability
1. Multiplication Rule
If two events are independent,
P(A ∩ B) = P(A) P(B)
If two events are not independent,
P(A ∩ B) = P(A) P(B | A)
(or)
P(A ∩ B) = P(B) P(A | B)
2. Addition Rule
If two events are mutually exclusive,
P(A ∪ B) = P(A) + P(B)
If two events are not mutually exclusive,
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
3. Complementary Rule
P(Ā) = 1 − P(A)
(or)
P(A) = 1 − P(not A)
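The three rules above can be checked numerically with a simple, standard example, a fair six-sided die (chosen here purely for illustration):

```python
# a fair six-sided die: each face has probability 1/6
p_face = 1 / 6

# addition rule for mutually exclusive events: P(1 or 2) = P(1) + P(2)
p_1_or_2 = p_face + p_face              # 1/3

# multiplication rule for independent events: P(six, then six) = P(6) * P(6)
p_two_sixes = p_face * p_face           # 1/36

# complementary rule: P(not a six) = 1 - P(six)
p_not_six = 1 - p_face                  # 5/6
```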
Types of Probability
1. Marginal Probability
The probability of one of the marginal totals is used as the numerator and
the grand total of the group as the denominator.
Marginal Probability = Marginal Total / Grand Total
2. Joint Probability
The probability that a subject picked at random from a group of
subjects possesses two characteristics at the same time.
P(A and B)
For independent events,
P(A ∩ B) = P(A) P(B)
Joint Probability is the product of the Marginal Probabilities.
For events that are not independent,
P(A ∩ B) = P(A) P(B | A)
Joint Probability is the product of a Marginal and a Conditional Probability.
Note:
If P(A ∩ B) = P(A) P(B), the events are independent.
If P(A ∩ B) ≠ P(A) P(B), the events are not independent.
3. Conditional Probability
P(A | B) = P(A ∩ B) / P(B)
P(B | A) = P(A ∩ B) / P(A)
Conditional Probability = Joint Probability / Marginal Probability
Statistical independence
P(A ∩ B) = P(A) P(B)
Thus, if A and B are independent, then their joint probability can be expressed
as a simple product of their individual probabilities.
P(A | B) = P(A)  and  P(B | A) = P(B)
Mutual exclusivity
P(A ∩ B) = 0
as long as P(A) ≠ 0 and P(B) ≠ 0. Then
P(A | B) = 0  and  P(B | A) = 0
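Marginal, joint, and conditional probabilities, and the independence check above, can be sketched from a hypothetical 2 x 2 table of counts (exposure by disease; all counts invented for illustration):

```python
# hypothetical 2 x 2 table of counts: exposure (rows) by disease (cols)
n_ed, n_en = 30, 70          # exposed with / without disease
n_ud, n_un = 10, 90          # unexposed with / without disease
total = n_ed + n_en + n_ud + n_un      # grand total = 200

p_e = (n_ed + n_en) / total            # marginal: P(exposed) = 0.5
p_d = (n_ed + n_ud) / total            # marginal: P(disease) = 0.2
p_e_and_d = n_ed / total               # joint: P(exposed and disease) = 0.15
p_d_given_e = p_e_and_d / p_e          # conditional = joint / marginal = 0.3

# independence check: P(A and B) = P(A) P(B)?  0.15 vs 0.10 -> not independent
independent = abs(p_e_and_d - p_e * p_d) < 1e-12
```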
Chapter 4
Probability Distributions
Probability Distribution of Discrete Variables
1. Binomial Distribution
2. Poisson Distribution
Probability Distribution of Continuous Variables
3. Normal Distribution
Binomial Distribution (Swiss mathematician James Bernoulli, Bernoulli trial)
The Bernoulli Process A sequence of Bernoulli trials forms a Bernoulli process
under the following conditions.
1. Each trial results in one of two possible, mutually exclusive, outcomes. One of
the possible outcomes is denoted as a success, and the other is denoted as a
failure.
2. The probability of a success, denoted by p, remains constant from trial to trial.
The probability of a failure, 1 − p, is denoted by q.
3. The trials are independent; that is, the outcome of any particular trial is not
affected by the outcome of any other trial.
An experiment with a fixed number of independent trials, each of which can
only have 2 possible outcomes and the probability of each outcome remains
constant from trial to trial.
Formula,
P(x) = nCx · p^x · q^(n − x)
(or)
P(x) = [ n! / ( x! (n − x)! ) ] · p^x · (1 − p)^(n − x)
where
p = probability of success
q (or) 1 − p = probability of failure
n = number of trials
x = number of successes
When p > 0.5, binomial probabilities may be obtained from tables for 1 − p by using:
P(X = x | n, p) = P(X = n − x | n, 1 − p)
P(X ≤ x | n, p) = P(X ≥ n − x | n, 1 − p)
P(X ≥ x | n, p) = P(X ≤ n − x | n, 1 − p)
Binomial Parameters
Mean of Binomial Distribution
μ = np
Variance of Binomial Distribution
σ² = np(1 − p)
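The binomial formula and its parameters can be sketched in Python; n = 10 and p = 0.3 below are hypothetical values chosen only to illustrate the computation:

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) = nCx * p^x * q^(n - x), with q = 1 - p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.3
mean = n * p                    # mu = np = 3.0
var = n * p * (1 - p)           # sigma^2 = np(1 - p) = 2.1
# the probabilities over all possible x sum to 1 (Property of Exhaustiveness)
total = sum(binom_pmf(x, n, p) for x in range(n + 1))
```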
Poisson Distribution (French mathematician Simeon Denis Poisson)
Poisson Process
1. The occurrences of the events are independent. The occurrence of an event
in an interval of space or time has no effect on the probability of a second
occurrence of the event in the same, or any other, interval.
2. Theoretically, an infinite number of occurrences of the event must be possible
in the interval.
3. The probability of the single occurrence of the event in a given interval is
proportional to the length of the interval.
4. In any infinitesimally small portion of the interval, the probability of more than
one occurrence of the event is negligible.
Poisson probabilities are useful when there are a large number of independent trials
with a small probability of success on a single trial and the variables occur over a
period of time.
Formula,
P(x) = ( e^(−λ) · λ^x ) / x!
where
λ = the mean number of occurrences of the event in the interval
e = the constant 2.7183 (the base of natural logarithms)
x = the number of occurrences of the event
An interesting feature of the Poisson distribution is the fact that the mean and
variance are equal.
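This equal-mean-and-variance feature can be verified numerically with a Python sketch (λ = 3 is a hypothetical value; summing the pmf over a long enough range makes the truncated tail negligible):

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(x) = e^(-lambda) * lambda^x / x!"""
    return exp(-lam) * lam ** x / factorial(x)

lam = 3.0
pmf = [poisson_pmf(x, lam) for x in range(60)]   # tail beyond 59 is negligible
mean = sum(x * p for x, p in enumerate(pmf))
var = sum((x - mean) ** 2 * p for x, p in enumerate(pmf))
# both mean and var come out (approximately) equal to lambda
```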
The Standard Normal Distribution
The standard normal distribution is the normal distribution with a mean of zero
and a standard deviation of one.
Mean = 0
Standard Deviation = 1
Z-score
z = ( x − μ ) / σ
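The z-score transformation z = (x − μ)/σ converts any normal value to the standard normal scale; a minimal sketch with hypothetical numbers:

```python
def z_score(x, mu, sigma):
    """How many standard deviations x lies above (or below) the mean."""
    return (x - mu) / sigma

# hypothetical: a weight of 75 kg in a population with mean 70 kg, sd 10 kg
z = z_score(75, 70, 10)      # 0.5 standard deviations above the mean
```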
Chapter 5
Some Important Sampling Distributions
Sampling Distribution
The distribution of all possible values that can be assumed by some statistic,
computed from samples of the same size randomly drawn from the same population,
is called the sampling distribution of that statistic.
Sampling distributions serve two purposes:
1. they allow us to answer probability questions about sample statistics
2. they provide the necessary theory for making statistical inference
procedures valid.
Central Limit Theorem
Given a population of any nonnormal functional form with a mean μ and finite
variance σ², the sampling distribution of x̄, computed from samples of size n from
this population, will have mean μ and variance σ²/n and will be approximately
normally distributed when the sample size is large.
z = ( x̄ − μ ) / ( σ / √n )
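The theorem can be illustrated by simulation: draw many samples from a clearly nonnormal population and look at the mean and variance of the sample means. The population, sample size, and replication count below are all hypothetical choices for the sketch.

```python
import random
import statistics

random.seed(0)

# population: exponential with mean 1 and variance 1 (clearly nonnormal)
n = 50                                  # size of each sample
reps = 2000                             # number of samples drawn

means = [statistics.fmean(random.expovariate(1.0) for _ in range(n))
         for _ in range(reps)]

m = statistics.fmean(means)             # close to mu = 1
v = statistics.variance(means)          # close to sigma^2 / n = 0.02
```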
Sampling Distribution of Sample Mean
1. When sampling is from a normally distributed population with a known
population variance, the distribution of the sample mean will possess the
following properties:
a) the mean of the sampling distribution ( μ_x̄ ) will be equal to the
mean of the population ( μ ) from which the samples were drawn.
μ_x̄ = μ
b) the variance of the sampling distribution ( σ²_x̄ ) will be equal to the
variance of the population ( σ² ) divided by the sample size.
σ²_x̄ = σ² / n
2. When sampling is from a nonnormally distributed population, by the central
limit theorem:
a) μ_x̄ = μ
b) σ_x̄ = σ / √n   when n/N ≤ 0.05
σ_x̄ = ( σ / √n ) · √[ (N − n) / (N − 1) ]   otherwise (with the finite population correction)
Distribution of the Difference Between Two Sample Means
z = [ ( x̄1 − x̄2 ) − ( μ1 − μ2 ) ] / √( σ1²/n1 + σ2²/n2 )
Distribution of the Sample Proportion
z = ( p̂ − p ) / √( p(1 − p) / n )
Distribution of the Difference Between Two Sample Proportions
z = [ ( p̂1 − p̂2 ) − ( p1 − p2 ) ] / √( p1(1 − p1)/n1 + p2(1 − p2)/n2 )
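The last of these standardizations can be sketched in Python; the observed proportions, true proportions, and sample sizes below are hypothetical values chosen for illustration:

```python
from math import sqrt

def z_diff_proportions(p1_hat, p2_hat, p1, p2, n1, n2):
    """Standardize the difference between two sample proportions."""
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return ((p1_hat - p2_hat) - (p1 - p2)) / se

# hypothetical: observed 0.35 vs 0.25 when both true proportions equal 0.3
z = z_diff_proportions(0.35, 0.25, 0.3, 0.3, 100, 100)   # about 1.54
```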