Bio Statistics
Introduction to Biostatistics
• Name: Jiregna Indalu
• E-mail: [email protected]
• Mob.: 0926-987846
Introduction
• What is statistics?
• Statistics: A field of study concerned with:
– collection, organization, analysis, summarization
and interpretation of numerical data, and
– the drawing of inferences about a body of data
when only a small part of the data is observed.
• Statistics helps us use numbers to
communicate ideas.
• Statisticians try to interpret and communicate
the results to others.
Cont.
· Biostatistics: The application of statistical
methods to the fields of biological and medical
sciences.
· Concerned with interpretation of biological data
& the communication of information derived
from these data.
· Has central role in medical investigations.
Uses of biostatistics
• Provide methods of organizing information
• Assessment of health status
• Health program evaluation
• Resource allocation
• Magnitude of association
– Strong vs weak association between exposure
and outcome
Cont.
• Assessing risk factors
– Cause & effect relationship
• Evaluation of a new vaccine or drug
– What can be concluded if the proportion of people free
from the disease is greater among the vaccinated than
the unvaccinated?
– How effective is the vaccine (drug)?
– Is the effect due to chance or some bias?
• Drawing of inferences
– Information from sample to population
What does biostatistics cover?
Research Planning
Presentation
Interpretation
Publication
variable:
It is a characteristic that takes on different values in
different persons, places, or things.
For example:
- heart rate,
- the heights of adult males,
- the weights of preschool children,
- the ages of patients seen in a dental clinic.
Types of variables
Quantitative: Interval, Ratio
Qualitative: Nominal, Ordinal
Types of Statistics
1. Descriptive statistics:
Cont.
2. Inferential statistics:
Population and Sample
• Population:
– Refers to any collection of objects.
• Target population:
– A collection of items that have something in common
for which we wish to draw conclusions at a particular
time.
• E.g., All hospitals in Ethiopia.
– The whole group of interest.
Cont.
Study Population:
• The subset of the target population that has at least some chance of
being sampled.
• The specific population group from which samples are drawn and
data are collected.
Sample:
• A subset of a study population, about which information is actually
obtained.
• The individuals who are actually measured and comprise the actual
data.
Cont.
E.g.: In a study of the prevalence of HIV among adolescents in
Ethiopia, a random sample of adolescents in Ayder of Mekelle
was included.
Target population: All adolescents in Ethiopia
[Diagram: the sample sits inside the study population, which sits inside the target population]
Parameter and Statistic
• Parameter: A descriptive measure computed
from the data of a population.
– E.g., the mean (µ) age of the target population
Sampling and Sampling Distributions
Cont.
• Researchers often use sample survey methodology to
obtain information about a larger population by
selecting and measuring a sample from that population.
Cont.
[Figure labels: Sample, Information, Population]
Steps needed to select a sample and ensure that
this sample will fulfill its goals.
Cont.
2. Define the target population
Cont.
3. Decide on the data to be collected
– The data requirements of the survey must be established.
– To ensure that the requirements are operationally sound, the necessary data
terms and definitions also need to be determined.
Cont.
5. Decide on the methods of measurement
6. Prepare the sampling frame
– List of all members of the population
– The elements must not overlap
Sampling
Advantages of sampling:
• Feasibility: Sampling may be the only feasible method of
collecting information.
Disadvantages of sampling:
• There is always a sampling error.
• Sampling may create a feeling of
discrimination within the population.
• Sampling may be inadvisable where every unit
in the population is legally required to have a
record.
Errors in sampling
1) Sampling error: Error introduced by selecting a sample
rather than observing the whole population.
– They cannot be avoided or totally eliminated.
2) Non-sampling error:
- Observational error
- Respondent error
- Lack of preciseness of definition
- Errors in editing and tabulation of data
Sampling Methods
Probability sampling
• Involves random selection of a sample
Most common probability
sampling methods
1. Simple random sampling
• Involves random selection
• Each member of a population has an equal
chance of being included in the sample.
• To use a SRS method:
– Make a numbered list of all the units in the
population
– Each unit should be numbered from 1 to N
(where N is the size of the population)
– Select the required number.
Example
• Suppose your school has 500 students and
you need to conduct a short survey on the
quality of the food served in the cafeteria.
Cont.
• Ignore all random numbers greater than 500 because they do
not correspond to any of the students in the school.
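The selection described in this example can be sketched in a few lines of Python. This is only an illustration: the function name `simple_random_sample`, the seed, and the sample size of 10 are assumptions, since the slides do not state how many students are to be surveyed.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Draw distinct unit numbers from a list numbered 1..population_size."""
    rng = random.Random(seed)
    # random.sample draws without replacement, so each unit has the same
    # chance of selection and no unit can be chosen twice.
    units = range(1, population_size + 1)
    return sorted(rng.sample(units, sample_size))

# Hypothetical survey: 10 students out of the 500 on the school list.
chosen = simple_random_sample(500, 10, seed=1)
print(chosen)
```

Because the draw is restricted to 1..500, no number above 500 can appear, which plays the same role as ignoring out-of-range values from a random number table.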
2. Systematic random sampling
• Sometimes called interval sampling,
systematic sampling means that there is a gap,
or interval, between each selected unit in the
sample
Example
• To select a sample of 100 from a population of 400,
you would need a sampling interval of 400 ÷ 100 = 4.
• Therefore, K = 4.
• You will need to select one unit out of every four units
to end up with a total of 100 units in your sample.
Cont.
• If you choose 3, the third unit on your frame
would be the first unit included in your
sample;
Cont.
• Using the above example, you can see that
with a systematic sample approach there are
only four possible samples that can be
selected, corresponding to the four possible
random starts:
A. 1, 5, 9, 13...393, 397
B. 2, 6, 10, 14...394, 398
C. 3, 7, 11, 15...395, 399
D. 4, 8, 12, 16...396, 400
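A minimal sketch of the same procedure in Python, assuming the population size is an exact multiple of the sample size (as in the 400 and 100 example). The function name `systematic_sample` is hypothetical.

```python
import random

def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Select every k-th unit after a random start between 1 and k."""
    k = population_size // sample_size  # sampling interval K
    if start is None:
        # Random start: one of the first k units on the frame.
        start = random.Random(seed).randint(1, k)
    return [start + i * k for i in range(sample_size)]

# Random start of 3 with N = 400 and n = 100 reproduces sample C.
sample_c = systematic_sample(400, 100, start=3)
print(sample_c[:4], sample_c[-2:])
```

With K = 4 there are only four possible values of `start`, which is why only four distinct samples can ever be drawn.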
3. Stratified random sampling
Why do we need to create strata?
Cont.
• Equal allocation:
– Allocate equal sample size to each stratum
• Proportionate allocation:
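– $n_j = \frac{n}{N} N_j$, $j = 1, 2, \ldots, k$, where $k$ is the number of strata,
$n$ is the total sample size, $N$ is the population size, and $N_j$ is the size of stratum $j$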
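As a rough illustration of proportionate allocation, the sketch below gives each stratum a share of the sample equal to its share of the population. The stratum sizes and the total sample size are made-up numbers, and `proportionate_allocation` is a hypothetical helper.

```python
def proportionate_allocation(stratum_sizes, total_sample_size):
    """Allocate the sample to strata in proportion to stratum size."""
    population_size = sum(stratum_sizes)
    # n_j = (n / N) * N_j for each stratum j
    return [total_sample_size * size / population_size
            for size in stratum_sizes]

# Hypothetical strata of 500, 300 and 200 units with n = 100.
allocation = proportionate_allocation([500, 300, 200], 100)
print(allocation)
```

In practice the results are rarely whole numbers, so they have to be rounded while keeping the total equal to n.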
Steps in cluster sampling
• Cluster sampling divides the population into groups or clusters.
Cont.
• Sometimes a list of all units in the population is not available,
while a list of all clusters is either available or easy to create.
• Another drawback to cluster sampling is that you do not have total control
over the final sample size.
5. Multi-stage sampling
• Similar to the cluster sampling.
• In the first stage, large groups or clusters are identified and selected.
• In the second stage, population units are picked from within the
selected clusters (using any of the possible probability sampling
methods) for a final sample.
Cont.
• If more than two stages are used, the process of choosing
population units within clusters continues until there is a final
sample.
• Also, you do not need to have a list of all of the units in the
population. All you need is a list of clusters and list of the units
in the selected clusters.
B. Non-probability sampling
• The difference between probability and non-
probability sampling has to do with a basic
assumption about the nature of the population
under study.
Cont.
• Reliability cannot be measured in non-probability sampling;
the only way to address data quality is to compare some of
the survey results with available information about the
population.
The most common types of non-
probability sampling
1. Convenience or haphazard sampling
2. Volunteer sampling
• As the term implies, this type of sampling occurs
when people volunteer to be involved in the study.
Cont.
• Sampling voluntary participants as opposed to
the general population may introduce strong
biases.
Cont.
• Judgment sampling is subject to the
researcher's biases and is perhaps even more
biased than haphazard sampling.
Cont.
• Researchers often use this method in exploratory
studies like pre-testing of questionnaires and focus
groups.
• They also prefer to use this method in laboratory
settings where the choice of experimental subjects
(i.e., animal, human) reflects the investigator's pre-
existing beliefs about the population.
• One advantage of judgment sampling is the
reduced cost and time involved in acquiring the
sample.
4. Quota sampling
• This is one of the most common forms of
non-probability sampling.
• Sampling is done until a specific number of
units (quotas) for various sub-populations
have been selected.
• Since there are no rules as to how these
quotas are to be filled, quota sampling is
really a means for satisfying sample size
objectives for certain sub-populations.
Cont.
• It is also easy to administer, especially considering
that the tasks of listing the whole population, randomly
selecting the sample and following up on
non-respondents can be omitted from the procedure.
• Quota sampling is an effective sampling method
when information is urgently required, and it can be
carried out without sampling frames.
• In many cases where the population has no
suitable frame, quota sampling may be the only
appropriate sampling method.
5. Snowball sampling
• A technique for selecting a research sample
where existing study subjects recruit future
subjects from among their acquaintances.
• Thus the sample group appears to grow like
a rolling snowball.
Cont.
• This sampling technique is often used in hidden
populations which are difficult for researchers to
access; example populations would be drug users or
commercial sex workers.
• Because sample members are not selected from a
sampling frame, snowball samples are subject to
numerous biases. For example, people who have
many friends are more likely to be recruited into the
sample.
Estimation
• Up until this point, we have assumed that the values
of the parameters of a probability distribution are
known.
Estimation
• It is concerned with estimating the values of
specific population parameters based on
sample statistics.
• It is about using information in a sample to
make estimates of the characteristics
(parameters) of the source population.
Estimation, Estimator & Estimate
Point versus Interval Estimators
• Point estimation involves the calculation of a single
number to estimate the population parameter
Thus,
– A point estimate is of the form: [ Value ],
– Whereas, an interval estimate is of the form:
[ lower limit, upper limit ]
1. Point Estimate
• A single numerical value used to estimate
the corresponding population parameter.
Sample Statistics are Estimators of Population Parameters
Sample statistic (estimator)          Population parameter
Sample mean, x̄                        µ
Sample variance, S²                   σ²
Sample proportion, p̂                  P or π
Sample odds ratio, ÔR                 OR
Sample relative risk, R̂R              RR
Sample correlation coefficient, r     ρ
2. Interval Estimation
• Interval estimation specifies a range of reasonable values for the population
parameter based on a point estimate.
CIs can also answer the question of whether or not an association exists
Confidence Level
• Confidence Level
– The degree of confidence that the interval will contain
the unknown population parameter
• P(L ≤ parameter ≤ U) = 1 − α, where L and U are the lower and upper limits of the interval
Estimation for Single Population
1. CI for a Single Population Mean
(normally distributed)
A. Known variance (large sample size)
Cont.
Assumptions
Population standard deviation (σ) is known
Population is normally distributed
Margin of Error
(Precision of the estimate)
Cont.
As n increases, the width of the CI decreases.
Example:
1. Waiting times (in hours) at a particular hospital are
believed to be approximately normally distributed
with a variance of 2.25 hr².
Solution:
a. $1.52 \pm 1.96\sqrt{\frac{2.25}{20}} = 1.52 \pm 1.96(0.33) = 1.52 \pm 0.65 = (0.87, 2.17)$
Cont.
b. $1.52 \pm 1.96\sqrt{\frac{2.25}{32}} = 1.52 \pm 1.96(0.27) = 1.52 \pm 0.53 = (0.99, 2.05)$
c. A larger sample size makes the CI
narrower (more precise).
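The two intervals in this solution can be checked with a short Python sketch. It takes the sample mean of 1.52 and the sample sizes of 20 and 32 from the solution lines; the function name `z_confidence_interval` is an assumption. The code uses the unrounded z-value and standard error, so its limits may differ from the slide values in the second decimal place.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(sample_mean, variance, n, confidence=0.95):
    """CI for a mean when the population variance is known."""
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(variance / n)  # margin of error
    return sample_mean - margin, sample_mean + margin

ci_20 = z_confidence_interval(1.52, 2.25, 20)
ci_32 = z_confidence_interval(1.52, 2.25, 32)
print(ci_20, ci_32)
```

Comparing the two results shows the point made in part c: the interval based on 32 observations is narrower than the one based on 20.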
Cont.
B. Unknown variance (small sample size, n ≤ 30)
• What if the σ for the underlying population is
unknown and the sample size is small?
Example
• Standard error =
• t-value at 90% CL at 19 df =1.729
Exercise
• Compute a 95% CI for the mean birth
weight based on n = 10, sample mean =
116.9 oz and s =21.70.
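One way to work this exercise is sketched below. Python's standard library has no t distribution, so the critical value 2.262 (the usual t-table entry for a 95% CI with 9 degrees of freedom) is typed in by hand; the function name `t_confidence_interval` is hypothetical.

```python
from math import sqrt

def t_confidence_interval(sample_mean, sample_sd, n, t_critical):
    """CI for a mean when the variance is estimated from the sample."""
    standard_error = sample_sd / sqrt(n)
    margin = t_critical * standard_error
    return sample_mean - margin, sample_mean + margin

# Table value, entered by hand: t for 95% confidence with n - 1 = 9 df.
T_CRIT_95_DF9 = 2.262
lower, upper = t_confidence_interval(116.9, 21.70, 10, T_CRIT_95_DF9)
print(round(lower, 2), round(upper, 2))
```

Under these assumptions the interval runs from roughly 101.4 oz to 132.4 oz.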
2. CIs for single population proportion, p
Example 1
• A random sample of 100 people shows that 25
are left-handed. Form a 95% CI for the true
proportion of left-handers.
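A minimal sketch of this calculation, assuming the usual normal-approximation interval for a proportion (the slide's own formula did not survive extraction). The function name `proportion_confidence_interval` is an assumption.

```python
from math import sqrt
from statistics import NormalDist

def proportion_confidence_interval(successes, n, confidence=0.95):
    """Normal-approximation CI for a single population proportion."""
    p_hat = successes / n  # point estimate
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# 25 left-handers in a random sample of 100 people.
low, high = proportion_confidence_interval(25, 100)
print(round(low, 3), round(high, 3))
```

The result is an interval of about 0.165 to 0.335 for the true proportion of left-handers.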
Interpretation
Example 2
• Suppose that among 10,000 female operating-room nurses,
60 women have developed breast cancer over five years. Find
the 95% CI for p based on the point estimate.
• Point estimate = 60/10,000 = 0.006
• The 95% CI for p is given by the interval:
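The interval itself is missing from the slide. Assuming the standard normal-approximation formula for a proportion, with $\hat{p} = 0.006$ and $n = 10{,}000$, the calculation would run as follows:

```latex
\[
\hat{p} \pm 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
= 0.006 \pm 1.96\sqrt{\frac{0.006 \times 0.994}{10000}}
= 0.006 \pm 1.96(0.00077)
= 0.006 \pm 0.0015
= (0.0045,\ 0.0075)
\]
```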
• Hypothesis Testing
Examples of Research Hypotheses
Population Mean
• The average length of stay of patients
admitted to the hospital is five days
Types of Hypothesis
1. The Null Hypothesis, H0
Cont.
2. The Alternative Hypothesis, HA
Steps in Hypothesis Testing
1. Formulate the appropriate statistical hypotheses
clearly
• Specify H0 and HA
H0: µ = µ0      H0: µ ≤ µ0      H0: µ ≥ µ0
HA: µ ≠ µ0      HA: µ > µ0      HA: µ < µ0
two-tailed      one-tailed      one-tailed
2. State the assumptions necessary for computing
probabilities
• A distribution is approximately normal (Gaussian)
• Variance is known or unknown
Cont.
3. Select a sample and collect data
• Categorical, continuous
OR
Cont.
5. Specify the desired level of significance for
the statistical test (α = 0.05, 0.01, etc.)
6. Determine the critical value.
– A value the test statistic must attain to be
declared significant.
7. Obtain sample evidence and compute the
test statistic
8. Reach a decision and draw the conclusion
• If Ho is rejected, we conclude that HA is true
(or accepted).
• If Ho is not rejected, we conclude that Ho may
be true.
Rules for Stating Statistical Hypotheses
1. One population
• Indication of equality (either =, ≤ or ≥) must
appear in Ho.
Ho: μ = μo, HA: μ ≠ μo
Ho: P = Po, HA: P ≠ Po
• Can we conclude that a certain population mean
is
– not 50?
Ho: μ = 50 and HA: μ ≠ 50
– greater than 50?
Ho: μ ≤ 50 HA: μ > 50
Statistical Decision
• Reject Ho if the value of the test statistic
that we compute from our sample is one of
the values in the rejection region
Another way to state conclusion
Types of Errors in Hypothesis Tests
Type I Error
• The probability of a type I error is the
probability of rejecting the Ho when it is true
Type II Error
• The error committed when a false Ho is not
rejected
Cont.
Action (Conclusion)     Reality: Ho True      Reality: Ho False
Do not reject Ho        Correct decision      Type II error
Reject Ho               Type I error          Correct decision
Type I & II Error Relationship
Hypothesis Testing of a Single Mean
(Normally Distributed)
Known Variance
Example: Two-Tailed Test
1. A simple random sample of 10 people from a certain
population has a mean age of 27. Can we conclude that
the mean age of the population is not 30? The variance is
known to be 20. Let α = 0.05.
Cont.
C. Hypotheses
Ho: µ = 30
HA: µ ≠ 30
D. Test statistic
As the population variance is known, we use Z
as the test statistic.
Cont.
E. Decision Rule
Reject Ho if the Z value falls in the rejection region.
Don’t reject Ho if the Z value falls in the non-rejection region.
Because of the structure of Ho it is a two tail test. Therefore,
reject Ho if Z ≤ -1.96 or Z ≥ 1.96.
F. Calculation of test statistic
G. Statistical decision
We reject the Ho because Z = -2.12 is in the rejection
region. The value is significant at 5% α.
H. Conclusion
We conclude that µ is not 30. P-value = 0.0340
A Z value of -2.12 corresponds to an area of 0.0170. Since there are two
parts to the rejection region in a two tail test, the P-value is twice this
which is .0340.
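The calculation in step F is not shown on the slide. A minimal sketch that reproduces Z = -2.12 and the two-sided P-value from the data in this example (n = 10, sample mean 27, hypothesised mean 30, variance 20) could look like this; the function name `z_test` is an assumption.

```python
from math import sqrt
from statistics import NormalDist

def z_test(sample_mean, mu0, variance, n):
    """One-sample Z test with known population variance."""
    z = (sample_mean - mu0) / sqrt(variance / n)
    # Two-sided P-value: area in both tails beyond |z|.
    p_two_sided = 2 * NormalDist().cdf(-abs(z))
    return z, p_two_sided

z, p_value = z_test(27, 30, 20, 10)
reject = abs(z) >= 1.96  # decision rule at alpha = 0.05
print(round(z, 2), round(p_value, 4), reject)
```

The P-value comes out at about 0.034, in line with the slide's value from the Z table.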
Hypothesis test using
confidence interval
• A problem like the above example can also be solved
using a confidence interval.
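For the two-tailed example (n = 10, sample mean 27, variance 20), the 95% CI worked out with the known-variance formula is:

```latex
\[
\bar{x} \pm 1.96\sqrt{\frac{\sigma^2}{n}}
= 27 \pm 1.96\sqrt{\frac{20}{10}}
= 27 \pm 1.96(1.414)
= 27 \pm 2.77
= (24.23,\ 29.77)
\]
```

The hypothesised value of 30 lies outside this interval, so Ho: µ = 30 is rejected at α = 0.05, which agrees with the Z test.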
Example: One-Tailed Test
• A simple random sample of 10 people from a certain
population has a mean age of 27. Can we conclude that the
mean age of the population is less than 30? The variance is
known to be 20. Let α = 0.05.
• Data
n = 10, sample mean = 27, σ² = 20, α = 0.05
• Hypotheses
Ho: µ ≥ 30, HA: µ < 30
• Test statistic
• Rejection Region
• With α = 0.05 and the inequality, we have the entire rejection region at
the left. The critical value will be Z = -1.645. Reject Ho if Z < -1.645.
Cont.
• Statistical decision
– We reject the Ho because -2.12 < -1.645.
• Conclusion
– We conclude that µ < 30.
– p = .0170 this time because it is only a one tail test and not a two tail test.
Unknown Variance
Example: Two-Tailed Test
• A simple random sample of 14 people from a certain population gives
a sample mean body mass index (BMI) of 30.5 and sd of 10.64. Can we
conclude that the mean BMI is not 35 at α = 5%?
• Test statistic
• If the assumptions are correct and Ho is true, the test statistic follows
Student's t distribution with 13 degrees of freedom.
Cont.
• Decision rule
– We have a two tailed test. With α = 0.05 it means that each tail is 0.025. The
critical t values with 13 df are -2.1604 and 2.1604.
– We reject Ho if the t ≤ -2.1604 or t ≥ 2.1604.
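The test statistic for this example does not appear on the slides. A minimal sketch of the calculation, using the data given (n = 14, sample mean 30.5, sd 10.64, hypothesised mean 35) and the critical value 2.1604 quoted in the decision rule, is shown below. The function name `one_sample_t` is hypothetical.

```python
from math import sqrt

def one_sample_t(sample_mean, mu0, sample_sd, n):
    """t statistic for a single mean with unknown variance."""
    standard_error = sample_sd / sqrt(n)
    return (sample_mean - mu0) / standard_error

t_stat = one_sample_t(30.5, 35, 10.64, 14)
T_CRIT = 2.1604  # critical value from the decision rule, 13 df
reject_h0 = abs(t_stat) >= T_CRIT
print(round(t_stat, 2), reject_h0)
```

The statistic is about -1.58, which falls inside the non-rejection region, so on these numbers Ho would not be rejected.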
Sampling from a population that is not
normally distributed
• Here, we do not know if the population displays a
normal distribution.
Cont.
• With a large sample, we can use Z as the test
statistic calculated using the sample sd.
Hypothesis Tests for Proportions
• Involves categorical values
Proportions
Hypothesis Testing about a Single Population
Proportion
(Normal Approximation to Binomial Distribution)
Example
• We are interested in the probability of developing asthma
over a given one-year period for children 0 to 4 years of age
whose mothers smoke in the home. In the general
population of 0 to 4-year-olds, the annual incidence of
asthma is 1.4%. If 10 cases of asthma are observed over a
single year in a sample of 500 children whose mothers
smoke, can we conclude that this is different from the
underlying probability of p0 = 0.014? α = 5%
H0 : p = 0.014
HA: p ≠ 0.014
Cont.
• The test statistic is given by:
Cont.
• The critical value of Zα/2 at α=5% is ±1.96.
• P-value = 0.2548
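The test statistic formula is missing from the slide. Assuming the usual normal approximation, in which the standard error is computed under Ho, the asthma example can be sketched as follows; the function name `one_sample_proportion_z` is an assumption.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_proportion_z(successes, n, p0):
    """Z test for a single proportion (normal approximation)."""
    p_hat = successes / n
    # Standard error uses the hypothesised proportion p0.
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    p_two_sided = 2 * NormalDist().cdf(-abs(z))
    return z, p_two_sided

# 10 asthma cases among 500 children, tested against p0 = 0.014.
z_stat, p_val = one_sample_proportion_z(10, 500, 0.014)
print(round(z_stat, 2), round(p_val, 3))
```

Z is about 1.14, which is inside ±1.96, so Ho is not rejected. The computed P-value is close to, though not identical with, the table-based 0.2548 on the slide.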
Sample size determination
• Sample Size: The number of study subjects selected to
represent a given study population.
Sample size for single sample
A. Sample size for estimating a single
population mean
Examples:
1. Find the minimum sample size needed to
estimate the drop in heart rate (µ) for a new
study using a higher dose of propranolol than the
standard one. We require that the two-sided 95%
CI for µ be no wider than 5 beats per minute and
the sample sd for change in heart rate equals 10
beats per minute.
$n = \frac{(1.96)^2 (10)^2}{(2.5)^2} = 61.5$, rounded up to 62 patients
2. Suppose that for a certain group of cancer patients, we
are interested in estimating the mean age at diagnosis.
We would like a 95% CI of 5 years wide. If the population
SD is 12 years, how large should our sample be?
Cont.
• Suppose d=1
• Then the sample size increases
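Both examples follow the same formula, n = z²σ²/d², where d is half the desired CI width. The sketch below reproduces example 1 and works example 2. It assumes that the d = 1 case refers to example 2 (SD of 12 years); the function name `sample_size_for_mean` is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_mean(sd, half_width, confidence=0.95):
    """Minimum n so that the CI for a mean has the given half-width d."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Always round up: rounding down would give a CI wider than required.
    return ceil((z * sd / half_width) ** 2)

n_heart = sample_size_for_mean(10, 2.5)  # example 1: CI 5 bpm wide
n_age = sample_size_for_mean(12, 2.5)    # example 2: CI 5 years wide
n_age_d1 = sample_size_for_mean(12, 1)   # example 2 with d = 1 (assumed)
print(n_heart, n_age, n_age_d1)
```

Tightening d from 2.5 to 1 raises the required sample from 89 to 554, which shows how quickly n grows as the margin of error shrinks.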
But the population σ² is most of
the time unknown
As a result, it has to be estimated from:
• Pilot or preliminary sample:
– Select a pilot sample and estimate σ² with
the sample variance, s²
• Previous or similar studies
B. Sample size to estimate a single
population proportion
Sample Size: Two Samples
A. Sample size for estimating a difference in
two means
B. Sample size for estimating a difference
in two proportions
Data check entry
Normality
• All of the continuous data we are covering
need to follow a normal curve
Cont.
• The skewness statistic and its standard error (SE skewness) are
output by SPSS
$Z_{\text{skewness}} = \frac{S_{\text{skewness}}}{SE_{\text{skewness}}}$
$|Z_{\text{skewness}}| > 3.2$ indicates a violation of the skewness assumption
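Outside SPSS, the same check can be sketched in plain Python. The skewness and standard-error formulas below are the adjusted Fisher-Pearson coefficient and its usual large-sample standard error, which are assumed here to correspond to what SPSS reports. The function name `skewness_z` and both data sets are made up for illustration.

```python
from math import sqrt
from statistics import mean, stdev

def skewness_z(values):
    """Skewness statistic divided by its standard error."""
    n = len(values)
    m, s = mean(values), stdev(values)
    # Adjusted Fisher-Pearson sample skewness (assumed SPSS convention).
    skew = n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in values)
    # Standard error of skewness, which depends only on n.
    se_skew = sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    return skew / se_skew

z_sym = skewness_z([1, 2, 3, 4, 5])                        # symmetric data
z_skewed = skewness_z([1, 1, 1, 1, 1, 1, 1, 1, 1, 50])     # one extreme value
print(round(z_sym, 2), z_skewed > 3.2)
```

The symmetric data give a Z of essentially zero, while the data set with one extreme value exceeds the 3.2 cut-off.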
Cont.
• Kurtosis (univariate) is how peaked the data are; the kurtosis
statistic and its standard error are output by SPSS
$Z_{\text{kurtosis}} = \frac{S_{\text{kurtosis}}}{SE_{\text{kurtosis}}}$
$|Z_{\text{kurtosis}}| > 3.2$ indicates a violation of the kurtosis assumption
Outliers
Linearity
• relationships among variables are linear in
nature; assumption in most analyses
Homoscedasticity
• For grouped data this is the same as
homogeneity of variance
Multicollinearity/Singularity
• If correlations between two variables are excessive
(e.g. 0.95) then this represents multicollinearity