Sampling Methods and Sample Size Determination


Research Methods

Sampling methods and sample size determination

By Gerbaba Guta

2020


Sample
• In research terms a sample is a group of people,
objects, or items that are taken from a larger
population for measurement.
• The sample should be representative of the
population to ensure that we can generalize the
findings from the research sample to the population
as a whole

Sampling
• The procedure by which some members of a given
population are selected as representatives of
the entire population
Why sampling rather than a census?
• Greater speed
• Greater accuracy (fewer observations allow closer supervision of data collection)
• Lower resource requirements (money, personnel,
or time)

• Studying the whole population is impossible when
population contains infinitely many members
• It is the only choice when a test involves the
destruction of the item
Definition of sampling terms
Sampling unit (element)
• A subject under observation on which information is
collected
Example: children under 5 years, hospital discharges,
and health events

Sampling fraction
• A ratio between sample size and population size
Example: 300 out of 1500 individuals (20%)
Sampling frame
• A list of all the sampling units from which a sample is
drawn
Example: Lists of all children under 5 years
Sampling scheme
• The method of selecting sampling units from the sampling
frame. It can be done by either a probability or a
non-probability sampling method
Sources of error in sampling

Types of errors
Non-sampling errors (bias)
• Not random in nature
• Occur in both censuses and sample surveys
• Involve problems of sample design such as:
– Choice of sampling frame
– Choice of sampling units
• Also arise from faulty technique when recording
observations or when processing the data

Common types of non-sampling errors
Measurement errors
• Obtaining inaccurate answers to survey questions
Example:
Interviewer error
• It includes:
– Recording errors (when the interviewer fails to record
the participant's response correctly)
– Distortion of the interview (the interviewer assumes
they already know what the
respondent would say to a question based on
prior responses)
– Questions that are not clearly stated
Response error
– The tendency of respondents to give socially
acceptable answers
– Respondents do not possess the correct information.
– Respondents deliberately lie.
– Respondents may twist their responses so that they
look better
– Instructions are vague or unclear (more serious
when a questionnaire is used to collect data)

Processing error
• Using wrong measurement values for
analysis
• Transcription errors
Selection error
• An error that occurs during the selection of
sample units
Measurement errors can be controlled by
using suitable, reliable, and valid instruments

Sampling errors
• Random variation of sample estimates around
the true population parameters, or equivalently
• The difference between an estimate
obtained from a sample and the actual value of the
population parameter
• Random in nature
• Cannot be avoided
• Can be reduced by increasing the sample size
• Smaller in magnitude in a homogeneous
population

Factors that increase the magnitude of sampling error
• Non-representative sample
• Small sample size (below the optimum size)
– The larger the sample size, the smaller the
sampling error.
– However, a sample larger than the optimum is costly and
brings little additional precision
• Heterogeneity of population

Random error is unavoidable
• Different samples drawn from the same population
can have different properties
• Sample is only a portion of the population we are
trying to understand

Sampling techniques (methods)
• Procedures for taking a sample from
the population
• If the entire population is sufficiently small, census
is appropriate
• If the population is too large, sample survey is
appropriate
• Sampling methods are classified into probability and
non-probability

Probability (random) sampling techniques
• Each member of the population has a non-zero
chance of being selected
• Representativeness and generalizability can be
achieved (standard statistical tests were developed
for random samples)

The Sampling Design Process
i. Define the Target Population

ii. Determine the Sampling Frame

iii. Select a Sampling Technique

iv. Determine the Sample Size

v. Execute the Sampling Process

Types of random sampling methods

1. Simple random sampling


• Every unit of the population has an equal chance of
being selected
• The selection of units is purely a matter of chance
• A sample of size n is drawn from a population of
size N in such a way that every element
in the population has the same chance of
being selected.

• To use this method:
– The population should be homogeneous (similar)
with regard to the characteristics under
consideration
– A sampling frame should be known/available

Procedures to select the sample
• How do we actually take a random sample?
• The specific procedures that you follow may vary depending
on your resources, but all involve some type of random
process.
• Depending on the complexity of the population, we can use
different tools to select n units from the frame:
– The lottery method,
– A table of random numbers (available in the
appendix of many research methods and statistics
textbooks), or
– Computer-generated random numbers.

Advantage
• Serves as a baseline for comparing the precision of different
sampling methods and for teaching general
probability sampling rules
Disadvantage
• For large populations and wide geographical sampling
areas, it is not easy to compile a list of all units and
select from them at random
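
The computer-generated option for simple random sampling can be sketched in a few lines of Python. This is only an illustration: the frame of 1500 numeric unit IDs, the sample size of 300, and the function name `simple_random_sample` are assumptions made for the example, not part of the slides.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units from the sampling frame, each with equal chance."""
    rng = random.Random(seed)      # seeded generator so the draw can be reproduced
    return rng.sample(frame, n)    # sampling without replacement


# Hypothetical sampling frame: unit IDs 1..1500
frame = list(range(1, 1501))
sample = simple_random_sample(frame, 300, seed=42)

sampling_fraction = len(sample) / len(frame)
print(len(sample), sampling_fraction)
```

Fixing the seed is a design choice that makes the selection auditable; in fieldwork the seed would normally be recorded alongside the frame.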

2. Systematic random sampling
• Units of the population are arranged in some order
(e.g. ID number or alphabetical order)
• Starting from a random point on the sampling
frame, every kth element in the frame is
selected at equal intervals (the sampling interval, k).
• The first unit is selected randomly between 1 and k;
every kth unit after it is then selected
• A sampling frame is required to use this method
• The order of the sampling frame should not contain a
pattern related to the characteristic being measured
• The position of the last selected unit is N − k + r, where r is the
position of the first unit selected (this holds when N = nk)
Example
Suppose a population consists of 1000 units and we
wish to take a sample of 200 units by systematic
random sampling:

• $k = \frac{N}{n} = \frac{1000}{200} = 5$
• k = sampling interval (must be an integer)
• N = population size
• n = sample size
• Suppose we select the 3rd unit at random between
1 and 5 inclusive
• Every 5th unit from the list constitutes our sample:
– 3rd, 8th, 13th, 18th, …, 998th
• The last unit can be computed as:
$(1000 - 5 + 3)^{\text{th}} = 998^{\text{th}}$

Advantage
• Very easy
• Less time consuming
Disadvantage
• The chance of selecting a non-representative sample
is high when a unit's position in the population list is
correlated with the characteristic that should be
observed
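
The systematic selection in the example above (N = 1000, n = 200, random start 3) can be reproduced with a short Python sketch. The function name `systematic_positions` and the optional `start` argument are assumptions introduced for illustration; the sketch also assumes N is an exact multiple of n so that k is an integer.

```python
import random


def systematic_positions(N, n, start=None, seed=None):
    """Return the 1-based frame positions chosen by systematic sampling."""
    if N % n != 0:
        raise ValueError("this sketch assumes N is an exact multiple of n")
    k = N // n                                   # sampling interval
    if start is None:
        start = random.Random(seed).randint(1, k)  # random start between 1 and k
    if not 1 <= start <= k:
        raise ValueError("start must lie between 1 and k")
    return list(range(start, N + 1, k))          # start, start + k, start + 2k, ...


positions = systematic_positions(1000, 200, start=3)
print(positions[:4], positions[-1])
```

With the random start fixed at 3, the last position equals N − k + r = 1000 − 5 + 3.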

3. Stratified random sampling
• When individual members of a population are different
from each other, the population is considered to be
heterogeneous (having significant variation among
individuals).
• The population can be divided into subpopulations called
strata
• The strata should be non-overlapping
• The strata are internally homogeneous but differ
from one another
• Strata are usually formed based on:
– Age, sex, income level, occupation, educational status,
marital status, culture, etc.

• A sample is taken from each subgroup (stratum):
– A simple random or systematic sample is taken
from each stratum, in proportion to that
stratum's share of the population
Example: To estimate the prevalence of STI among
female sex workers (FSWs) in a city, we can have two strata
of FSWs: street-based and hotel-based
• The strata differ from each other regarding:
– Percentage with high-risk behavior and
– Medical consultation and available health care services

• A desired sample should be selected from each
group/strata to get a precise estimate of STI
prevalence of the target population

[Figure: FSWs divided into two strata, street-based and hotel-based]
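
Proportional allocation across strata can be sketched as follows. The stratum sizes (600 street-based and 400 hotel-based FSWs), the total sample of 100, and the helper names are hypothetical values chosen only to illustrate the idea; the slides do not give these figures.

```python
import random


def proportional_allocation(strata_sizes, n):
    """Split a total sample of n across strata in proportion to stratum size."""
    total = sum(strata_sizes.values())
    raw = {name: n * size / total for name, size in strata_sizes.items()}
    alloc = {name: int(share) for name, share in raw.items()}   # round down first
    leftover = n - sum(alloc.values())
    # Hand out any remaining units to the strata with the largest remainders
    by_remainder = sorted(raw, key=lambda name: raw[name] - alloc[name], reverse=True)
    for name in by_remainder[:leftover]:
        alloc[name] += 1
    return alloc


def stratified_sample(strata, n, seed=None):
    """Take a simple random sample inside each stratum."""
    rng = random.Random(seed)
    sizes = {name: len(units) for name, units in strata.items()}
    alloc = proportional_allocation(sizes, n)
    return {name: rng.sample(strata[name], alloc[name]) for name in strata}


# Hypothetical frames for the two strata
strata = {
    "street-based": [f"S{i}" for i in range(600)],
    "hotel-based": [f"H{i}" for i in range(400)],
}
allocation = proportional_allocation({k: len(v) for k, v in strata.items()}, 100)
sample = stratified_sample(strata, 100, seed=7)
print(allocation)
```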

Advantages
• Provides estimates for specific
subdivisions of the population
• Helpful in studies covering multiple administrative areas (each area
as a stratum)
• Dividing the population into subdivisions lets us define
specific methods and criteria for work in each division
• The overall estimates are more precise
Disadvantage
• The assumption of little variation within strata
is not easily achieved in the real world


4. Cluster (area division) random sampling


• A cluster sample is a simple random sample of groups
or clusters of elements (vs. a simple random sample
of individual objects).
• This method is useful when it is difficult or costly to
develop a complete list of the population members
or when the population elements are widely
dispersed geographically.

• The population is divided into subdivisions
called clusters
• Clusters are heterogeneous internally and
homogeneous externally

• Cluster sampling can be done in:
One stage
• A few clusters are selected and all units in the
selected clusters are taken
Multiple stages
• Clusters are selected first, then some units within the
selected clusters are chosen randomly
Example: In assessing the satisfaction of HIV-positive
patients with hospital-based health care services in
the capital city, assume that there are about 200
hospitals in Addis Ababa

• Suppose we decide to sample 20 hospitals (clusters)
• In the one-stage method, 20 hospitals are selected
and all patients from the selected hospitals are
included in the study
• In multistage cluster sampling, 20 hospitals are
selected first, and then patients within the selected
hospitals (clusters) are selected randomly
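
The difference between the one-stage and multistage designs in the hospital example can be sketched in Python. The 200 hospitals follow the example; the 50 patients per hospital, the 10 patients drawn per selected hospital, and the function names are assumptions made purely for illustration.

```python
import random


def one_stage_cluster_sample(clusters, n_clusters, seed=None):
    """Select clusters at random and keep every unit inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}


def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Select clusters at random, then a random subsample within each."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {
        name: rng.sample(clusters[name], min(n_per_cluster, len(clusters[name])))
        for name in chosen
    }


# Hypothetical frame: 200 hospitals, each listing 50 patient IDs
hospitals = {
    f"hospital_{h:03d}": [f"h{h:03d}_p{p:02d}" for p in range(50)]
    for h in range(200)
}

one_stage = one_stage_cluster_sample(hospitals, 20, seed=1)
two_stage = two_stage_cluster_sample(hospitals, 20, 10, seed=1)
print(sum(len(v) for v in one_stage.values()), sum(len(v) for v in two_stage.values()))
```

Note that only a list of hospitals is needed for the first stage; patient lists are required only for the hospitals actually selected.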

Advantage
• Reduces the cost of sampling from wide geographical
areas by defining neighboring regions as a cluster
Disadvantage
• Its precision is lower than that of stratified sampling,
and it needs larger sample sizes to achieve the same
precision

Selecting a sampling method
• Population to be studied
– Size/geographical distribution
– Heterogeneity with respect to variable
• Availability of list of sampling units
• Level of precision required
• Resources available

Non-probability (Non-random) sampling techniques
• Generally used in research area where computation
of sampling error can be overlooked
• Used only in preliminary research or
• Used only in studies where error rates are not
considered important
• Probability theories and concepts are not used in the
method
• Members are selected from the population in some
non-random manner

When to use non-probability sampling?
• To demonstrate that a particular trait exists in the
population
• To do a qualitative, pilot or exploratory study
• when randomization is impossible (when the
population is almost infinite)
• When the aim is not to generate results that will be
used to create generalizations pertaining to the
entire population
• If we have limited budget, time and workforce
• For initial study which will be carried out again using
a randomized, probability sampling
Types of non-random sampling
1. Convenience/accidental/haphazard sampling
• Relies upon convenience and access
• Obtain a sample of convenient elements
• Respondents are selected because they happen to be
in the right place at the right time (i.e. they are easily
available)
Example :
• Patients with specific cancer diagnosis attending a
clinic.
• Interview only people on the main street

2. Judgment (purposive) Sampling

• Subjects are chosen with a specific purpose in
mind
• Some subjects are purposively chosen because they
are believed to be more fit for the research
compared to other individuals
• The selection takes place on the basis of some
predetermined idea (e.g. clinical knowledge etc.)

Examples:
• A researcher may decide to draw the entire
sample from one "representative" city, even
though the population includes all cities
• Samples based on the clinical condition of
patients (e.g. select all hypertensives)

3. Quota sampling
• The non-probability equivalent of stratified sampling
• First the population is divided into strata, then a fixed
quota of units is selected from each stratum in a
non-random manner
• The bases of the quota are usually:
– Age, gender, education, race, religion, and
socioeconomic status
Example
Taking college year level as a base for a quota requiring
equal representation from each level, we can take a
sample size of 100, by selecting 25 1st year students,
another 25 2nd year students, 25 3rd year and 25
4th year students

4. Snowball sampling
• Used when the population size is very small
• An initial subject is used to identify other potential
subjects who also meet the eligibility criteria
of the research
• Useful when we want to reach populations that are
inaccessible or hard to find (populations with no address or
no sampling frame)
Examples:
• Sampling heroin addicts
– An addict may be asked for the names of other addicts that
he or she knows
• Studying homeless people
– Identify one or two and ask them about other homeless people in their area
Advantages and Disadvantages of Probability and
Non-probability sampling methods
Probability sampling
Advantages:
• Minimal bias
• Allows estimation of sampling errors
• More authentic
• Results can be generalized
Disadvantages:
• Expensive
• Inconvenient
• Time consuming
• Problematic with large populations
• Technically skilled operator is required

Non-probability sampling
Advantages:
• Convenient
• Economical
• Less time consuming
• Less skilled operator required
Disadvantages:
• Results cannot be generalized
• Maximum bias
• Sampling error cannot be estimated
• Authenticity very debatable
• Weaker type of sampling

Summary
• Probability sampling methods are the best; within the
available constraints they ensure
– Representativeness
– Precision
• Non-probability sampling methods could be
used for exploratory/preliminary research

Exercise

• For your proposed topic of research:


– Define the target population

– Define the study population

– What is the appropriate sampling method

Sample Size Determination
• Taking a larger sample than is needed to
achieve the desired results wastes
resources
• Very small samples often lead to results that
support wrong conclusions
• Thus, an optimum (adequate) sample size is
recommended

Rules of thumb for determining the sample size
1. The larger the population size, the smaller the percentage of the
population required to get a representative sample
2. For small populations (N < 100), there is little point in sampling.
Survey the entire population.
3. If the population size is around 500 (give or take 100), 50%
should be sampled.
4. If the population size is around 1500, 20% should be sampled.
5. Beyond a certain point (N = 5000), the population size is almost
irrelevant and a sample size of 400 may be adequate.

Methods of sample size determination
1. Precision based sample size determination
• If our aim is to estimate an unknown population parameter
(population mean or proportion),
– our sample size determination is related to estimation.
– The method is said to be precision-based sample size
determination.


• Descriptive study design, single population
• Sample size determination for a quantitative response
variable: estimating the mean of a single population (µ)
• If sampling is with replacement or from an
infinite population, the sample size is given by:
$$n = \frac{Z_{\alpha/2}^{2}\,\sigma^{2}}{E^{2}}$$
where σ is the population standard deviation and E is the margin of error
• If sampling is without replacement or from a small finite
population (N < 10,000), the finite population correction (FPC)
is applied and the sample size is given by:
$$n_f = \frac{n}{1 + \frac{n - 1}{N}}$$

Example
A health officer wishes to estimate the mean
haemoglobin level in a defined community.
Preliminary information shows that the mean is
about 150 mg/l with a standard deviation of 32 mg/l.
If a sampling error of up to 5 mg/l in the estimate is
to be tolerated at the 95% confidence level, how many
subjects should be included in the study?
Solution:
Here, σ ≈ s = 32 mg/l and E = 5 mg/l
$$\alpha = 0.05, \quad z_{\alpha/2} = z_{0.025} = 1.96, \quad E = 5$$

If the population is assumed to be very large, the
required minimum sample size would be:

$$n = \frac{z_{\alpha/2}^{2}\,\sigma^{2}}{E^{2}} = \frac{(1.96)^{2}(32)^{2}}{5^{2}} = 157.4 \approx 158$$
If the community to be sampled has 1000 people, the
required minimum sample size would be:
$$n_f = \frac{n}{1 + \frac{n - 1}{N}} = \frac{158}{1 + \frac{157}{1000}} = \frac{158}{1.157} = 136.6 \approx 137$$
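
The haemoglobin calculation above can be checked with a small Python sketch. The function names are assumptions for illustration, and exact fractions are used (rather than floats) so that rounding up to the next whole subject is not affected by floating-point noise.

```python
import math
from fractions import Fraction


def _exact(x):
    """Convert a number such as 1.96 to an exact fraction via its decimal text."""
    return Fraction(str(x))


def n_for_mean(z, sigma, E):
    """Sample size for estimating a mean: n = z^2 * sigma^2 / E^2."""
    return _exact(z) ** 2 * _exact(sigma) ** 2 / _exact(E) ** 2


def finite_population_correction(n, N):
    """Adjusted sample size: n_f = n / (1 + (n - 1) / N)."""
    return Fraction(n) / (1 + Fraction(n - 1, N))


n = math.ceil(n_for_mean(1.96, 32, 5))                   # very large population
n_f = math.ceil(finite_population_correction(n, 1000))   # community of 1000 people
print(n, n_f)
```

Sample sizes are always rounded up, since rounding down would give less precision than requested.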

Sample size determination for a qualitative/categorical response
variable: estimating a single population proportion (π)

Sample size is given by:
$$n = \frac{Z_{\alpha/2}^{2}\,p(1 - p)}{E^{2}}$$

Example: A researcher wants to estimate the proportion of adults
who are allergic to trees, weeds, flowers, and grasses. It is
desired to be 99% confident that the estimate will not differ from
the population proportion by more than 3%:

a) What sample size is required if a previous survey
shows that 15% of adults were allergic to trees,
weeds, flowers, and grasses?
b) What would the sample size be, for the same
degree of confidence and same maximum allowable
error, if no such previous survey had been taken?
Solution:
a) p = 0.15, q = 1 − p = 0.85 and E = 3% = 0.03.
$$\alpha = 0.01, \quad z_{\alpha/2} = z_{0.005} = 2.58$$
Assuming the population is very large, the
required minimum sample size would be:
$$n = \frac{z_{\alpha/2}^{2}\,p\,q}{E^{2}} = \frac{(2.58)^{2}(0.15)(0.85)}{(0.03)^{2}} = \frac{0.8487}{0.0009} = 943$$

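b) With no prior estimate, take p = 0.5 (the value that maximizes p(1 − p)),
so q = 0.5 and E = 3% = 0.03
$$\alpha = 0.01, \quad z_{\alpha/2} = z_{0.005} = 2.58$$
$$n = \frac{z_{\alpha/2}^{2}\,p\,q}{E^{2}} = \frac{(2.58)^{2}(0.5)(0.5)}{(0.03)^{2}} = \frac{1.6641}{0.0009} = 1849$$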
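
Both parts of the allergy example can be reproduced with the sketch below. The function name `n_for_proportion` is an assumption; exact fractions are used because part (b) comes out to a whole number, and floating-point error there could otherwise push the rounded-up result one too high.

```python
import math
from fractions import Fraction


def n_for_proportion(z, p, E):
    """Sample size for estimating a proportion: n = z^2 * p * (1 - p) / E^2."""
    z, p, E = (Fraction(str(v)) for v in (z, p, E))
    return z ** 2 * p * (1 - p) / E ** 2


n_a = math.ceil(n_for_proportion(2.58, 0.15, 0.03))   # prior estimate p = 0.15
n_b = math.ceil(n_for_proportion(2.58, 0.5, 0.03))    # no prior estimate, p = 0.5
print(n_a, n_b)
```

Using p = 0.5 roughly doubles the required sample here, which is the price of having no prior information.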

2. Power based sample size determination
• Analytical study design
• The primary purpose of an analytical study is to test
(one or more) null hypotheses
• Determination of the sample size requires the
specification of:
– Significance level (α: the probability of committing a type I
error, i.e. rejecting a true null hypothesis)
– Power of the test (1 − β: the probability of rejecting a false null
hypothesis, a correct decision)
– Margin of error and
– Probability distribution of the estimator.
• If our aim is to test a hypothesis about unknown
population parameter,
our sample size determination is related to
hypothesis testing.
• The method is said to be power based sample
size determination.
• By convention, the minimum statistical power required for
a hypothesis test is 80%.

1. Sample size for Testing equality of two means
(quantitative response variable)
• The hypotheses to be tested are:
$$H_0: \mu_1 - \mu_2 = 0 \quad \text{vs} \quad H_1: \mu_1 - \mu_2 \neq 0$$
• If the two groups have equal variance, the sample size
per group is given by:
$$n = \frac{2\,(Z_{\alpha/2} + Z_{\beta})^{2}\,\sigma^{2}}{(\mu_1 - \mu_2)^{2}}$$

• If the two groups have different variances, the sample
size per group is given by:
$$n = \frac{(Z_{\alpha/2} + Z_{\beta})^{2}\,(\sigma_1^{2} + \sigma_2^{2})}{(\mu_1 - \mu_2)^{2}}$$

Example
A trial was designed to assess the effectiveness of a new
therapy in the treatment of severe sepsis and
septic shock. The clinicians measure the effectiveness of
the therapies using mean arterial
pressure and wish to detect a difference of at least
14 mmHg between the two groups. Assuming the standard
deviation in both groups is 20 mmHg, how many patients
are required in the treatment (new therapy) and control
(standard therapy) groups to detect a difference of this
magnitude at the 5% significance level with a power of 80%?
The study will require 32 patients in each group:
$$n = \frac{2\,(Z_{\alpha/2} + Z_{\beta})^{2}\,\sigma^{2}}{(\mu_1 - \mu_2)^{2}} = \frac{2\,(20)^{2}\,(1.96 + 0.84)^{2}}{14^{2}} = \frac{6272}{196} = 32$$

2. Sample size for testing equality of two proportions
(qualitative response variable)
We are interested in testing the hypotheses:
$$H_0: \pi_1 - \pi_2 = 0 \quad \text{vs} \quad H_1: \pi_1 - \pi_2 \neq 0$$

• If the two groups have equal variances, the sample size
per group is given by:
$$n = \frac{2\,(Z_{\alpha/2} + Z_{\beta})^{2}\,p(1 - p)}{(p_1 - p_2)^{2}}$$
• where p is the pooled prevalence (the average of the
prevalences in the two groups), $p = \frac{p_1 + p_2}{2}$

• If the two groups have different variances, the
sample size per group is given by:
$$n = \frac{\left[Z_{\alpha/2}\sqrt{2p(1 - p)} + Z_{\beta}\sqrt{p_1(1 - p_1) + p_2(1 - p_2)}\right]^{2}}{(p_1 - p_2)^{2}}$$
• where p is the pooled prevalence (the average of the
prevalences in the two groups)

Example
• Consider a study investigating the effectiveness
of aspirin in reducing the mortality rate due to
myocardial infarction (heart attacks).
• Let $\pi_A$ denote the proportion of deaths for
aspirin users in the population and $\pi_N$ denote
the corresponding proportion for nonusers.
• We are interested in testing the hypotheses
$$H_0: \pi_A - \pi_N = 0 \quad \text{vs} \quad H_1: \pi_A - \pi_N \neq 0$$

Previous studies indicate that the proportion of
deaths due to heart attacks is 0.015 for nonusers
and 0.001 for users. Investigators wish to determine
the sample size required to detect an absolute
difference of |0.001 − 0.015| = 0.014 with 80%
power using a two-sided 5% level of significance
test.
In order to detect a difference of this magnitude,
calculate the sample size required.
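• Assuming different variances in the two populations, with pooled
proportion p = (0.001 + 0.015)/2 = 0.008:
$$n = \frac{\left[Z_{\alpha/2}\sqrt{2p(1 - p)} + Z_{\beta}\sqrt{p_1(1 - p_1) + p_2(1 - p_2)}\right]^{2}}{(p_1 - p_2)^{2}}$$
$$n = \frac{\left[1.96\sqrt{2(0.008)(1 - 0.008)} + 0.84\sqrt{0.001(1 - 0.001) + 0.015(1 - 0.015)}\right]^{2}}{(0.001 - 0.015)^{2}} = \frac{0.124206}{0.000196} = 633.7 \approx 634$$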

• Assuming the same variance in the two populations:
$$n = \frac{2\,(Z_{\alpha/2} + Z_{\beta})^{2}\,p(1 - p)}{(p_1 - p_2)^{2}} = \frac{2\,(1.96 + 0.84)^{2}\,(0.008)(1 - 0.008)}{(0.001 - 0.015)^{2}} = \frac{0.124436}{0.000196} = 634.88 \approx 635$$
• A total sample of 1,270 individuals, 635 individuals per
group, must be obtained to detect an absolute
difference of 0.014 between the proportions of deaths among aspirin users
and nonusers with 80% power using a two-sided 5%-level
test.
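
The two versions of the aspirin calculation can be compared side by side with the sketch below. The function names are assumptions introduced for illustration; the formulas are the equal-variance and different-variances expressions given above, with p taken as the simple average of the two proportions.

```python
import math


def n_equal_variance(p1, p2, z_alpha=1.96, z_beta=0.84):
    """Per-group n = 2 (z_a + z_b)^2 p (1 - p) / (p1 - p2)^2, p = pooled average."""
    p = (p1 + p2) / 2
    return 2 * (z_alpha + z_beta) ** 2 * p * (1 - p) / (p1 - p2) ** 2


def n_different_variance(p1, p2, z_alpha=1.96, z_beta=0.84):
    """Per-group n using separate variance terms under H0 and H1."""
    p = (p1 + p2) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p * (1 - p))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return numerator / (p1 - p2) ** 2


# Death proportions: 0.001 for aspirin users, 0.015 for nonusers
n_same = math.ceil(n_equal_variance(0.001, 0.015))
n_diff = math.ceil(n_different_variance(0.001, 0.015))
print(n_same, n_diff, 2 * n_same)
```

The two formulas differ by a single subject per group in this example, so the choice between them matters little when the proportions are both small.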
3. Sample size for Testing equality of exposure
proportions: A Case-control study design
• In case control study, the case group (the group with
disease/condition under consideration) is compared
with control (the group without disease/condition under
consideration) regarding their proportion exposed to
the risk factor under question (exposure).
• We are interested in testing the hypotheses:
$$H_0: p_1 - p_2 = 0 \quad \text{vs} \quad H_1: p_1 - p_2 \neq 0$$
Equivalently:
$$H_0: OR = 1 \quad \text{vs} \quad H_1: OR \neq 1$$
• The sample size required for this design can be determined
using the Kelsey formula as follows:
• Sample size for the case group:
$$n_1 = \frac{r + 1}{r} \cdot \frac{\bar{p}(1 - \bar{p})\,(z_{\alpha/2} + z_{\beta})^{2}}{(p_1 - p_2)^{2}}$$
• Sample size for the control group:
$$n_2 = r \times n_1$$
• Where:
– r = ratio of controls to cases (r = 1 for equal numbers of controls
and cases)
– $\bar{p} = \frac{p_1 + p_2}{2}$ is the average proportion exposed to the risk factor
under question for the entire pooled population
– $p_1$ is the proportion of cases exposed to the risk factor under question
– $p_2$ is the proportion of controls exposed to the risk factor under question
– $p_1 - p_2$ is the difference in proportions exposed between the cases
and controls
– $z_{\alpha/2}$ represents the value of Z for the desired level of significance
– $z_{\beta}$ represents the value of Z for the desired power of the test

Example
A researcher wants to see the effect of childhood sexual
abuse (risk factor under question) on psychiatric disorder
in adulthood. The researcher will retrospectively assess
childhood sexual abuse in cases (adult person with
psychiatric disorders) and controls (adult person without
psychiatric disorders) to compare the proportion of cases
exposed to childhood sexual abuse with the proportion of
controls exposed to childhood sexual abuse. Suppose that
35% of the cases and 20% of the controls were exposed to
childhood sexual abuse. At the 5% significance level and 80% power,
calculate the sample size for this study assuming
equal allocation to the case and control groups.
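• Sample size for the case group, with $\bar{p} = \frac{p_1 + p_2}{2} = \frac{0.35 + 0.2}{2} = 0.275$:
$$n_1 = \frac{r + 1}{r} \cdot \frac{\bar{p}(1 - \bar{p})\,(z_{\alpha/2} + z_{\beta})^{2}}{(p_1 - p_2)^{2}} = \frac{1 + 1}{1} \cdot \frac{0.275\,(1 - 0.275)\,(1.96 + 0.84)^{2}}{(0.35 - 0.2)^{2}} = 138.9 \approx 139$$
• Sample size for the control group: $n_2 = r \times n_1 = 1 \times 139 = 139$
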


The study requires a total of 278 participants:
139 controls (adult person without psychiatric disorders)
and 139 cases (adult person with psychiatric disorders).
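
The case-control calculation above can be checked with a short sketch of the Kelsey formula as stated in these slides. The function name is an assumption; the average exposure proportion is taken as (p1 + p2)/2, following the definition given above.

```python
import math


def kelsey_case_control(p1, p2, r=1.0, z_alpha=1.96, z_beta=0.84):
    """Return (cases, controls) from n1 = ((r+1)/r) * pbar(1-pbar)(za+zb)^2 / (p1-p2)^2."""
    p_bar = (p1 + p2) / 2                       # average exposure proportion
    n1 = (r + 1) / r * p_bar * (1 - p_bar) * (z_alpha + z_beta) ** 2 / (p1 - p2) ** 2
    n_cases = math.ceil(n1)
    n_controls = math.ceil(r * n_cases)         # n2 = r * n1
    return n_cases, n_controls


# 35% of cases and 20% of controls exposed, equal allocation (r = 1)
cases, controls = kelsey_case_control(0.35, 0.20, r=1)
print(cases, controls, cases + controls)
```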
When the input is provided as an odds ratio (OR, the
magnitude of effect we want to detect) rather
than the proportion of cases exposed to the risk factor, the
proportion of cases exposed to the risk factor is calculated
as:
$$p_1 = \frac{p_2\,OR}{1 + p_2\,(OR - 1)}$$

The other version of the above example is given as follows:
A researcher wants to see the effect of childhood sexual
abuse (risk factor under question) on psychiatric disorder
in adulthood. The researcher will retrospectively assess
childhood sexual abuse in cases (adult person with
psychiatric disorders) and controls (adult person without
psychiatric disorders) to compare the proportion of cases
exposed to childhood sexual abuse with the proportion of
controls exposed to childhood sexual abuse.
Suppose that 20% of the controls were exposed to childhood
sexual abuse and we want to detect an odds ratio (OR) of 2.0
or greater. At the 5% significance level and 80% power,
calculate the sample size for this study assuming equal
allocation to the case and control groups.
• The proportion of cases exposed to childhood sexual
abuse can be estimated as follows:
$$p_1 = \frac{0.2 \times 2}{1 + 0.2\,(2 - 1)} = \frac{0.4}{1.2} = 0.33$$
• Sample size for the case group, with $\bar{p} = \frac{p_1 + p_2}{2} = \frac{0.33 + 0.2}{2} = 0.265$:
$$n_1 = \frac{r + 1}{r} \cdot \frac{\bar{p}(1 - \bar{p})\,(z_{\alpha/2} + z_{\beta})^{2}}{(p_1 - p_2)^{2}} = \frac{1 + 1}{1} \cdot \frac{0.265\,(1 - 0.265)\,(1.96 + 0.84)^{2}}{(0.33 - 0.2)^{2}} = 180.7 \approx 181$$
• Sample size for the control group: $n_2 = r \times n_1 = 1 \times 181 = 181$
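
The odds-ratio version of the example can be sketched as follows. The function names are assumptions. Note that the worked solution rounds p1 to 0.33 before applying the formula; the sketch reproduces that step and also shows the result when p1 is carried unrounded, which is somewhat smaller because the formula is sensitive to the difference p1 − p2.

```python
import math


def p1_from_odds_ratio(p2, odds_ratio):
    """Exposure proportion among cases: p1 = p2*OR / (1 + p2*(OR - 1))."""
    return p2 * odds_ratio / (1 + p2 * (odds_ratio - 1))


def kelsey_n_cases(p1, p2, r=1.0, z_alpha=1.96, z_beta=0.84):
    """Case-group size from the Kelsey formula, rounded up."""
    p_bar = (p1 + p2) / 2
    n1 = (r + 1) / r * p_bar * (1 - p_bar) * (z_alpha + z_beta) ** 2 / (p1 - p2) ** 2
    return math.ceil(n1)


p1_exact = p1_from_odds_ratio(0.20, 2.0)        # 0.4 / 1.2
p1_rounded = round(p1_exact, 2)                 # 0.33, as in the worked solution

n_rounded = kelsey_n_cases(p1_rounded, 0.20)    # matches the worked solution
n_exact = kelsey_n_cases(p1_exact, 0.20)        # p1 carried without rounding
print(p1_rounded, n_rounded, n_exact)
```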

Sample size for testing equality of the proportion of
events in the exposed and unexposed groups:
cohort study design
• Our interest is to test the hypotheses:
$$H_0: RR = 1 \quad \text{vs} \quad H_1: RR \neq 1$$
• Information required:
– Power (1 − β): the probability of detecting a real effect (a true
relative risk or experimental event rate). Common
values are 80% or 90%.
– Significance level (α): the fixed probability of a type I error,
i.e. rejecting a true null hypothesis (common values are 5% or
1%).

– P0: probability of the event (e.g. death/disease) in
non-exposed/control subjects.
It can be estimated by the prevalence in the population under
study.
– P1: probability of the event (e.g. death/disease) in
exposed/experimental subjects.
– RR = P1/P0: relative risk of the event between
exposed/experimental subjects and non-exposed/control
subjects, used if P1 is not known.
• In this case, we specify the relative risk (RR) that we
want to detect with the chosen statistical power and then
calculate P1 as follows:
$$RR = \frac{p_1}{p_0} \;\Rightarrow\; p_1 = p_0 \times RR$$
– m: ratio of non-exposed/control subjects to exposed/
experimental subjects (m = 1 means equal sample sizes in both
groups)
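• The sample size for the exposed group (n1) is given
by:
$$n_1 = \frac{\left[Z_{\alpha/2}\sqrt{\left(1 + \frac{1}{m}\right)\bar{p}(1 - \bar{p})} + Z_{\beta}\sqrt{\frac{p_0(1 - p_0)}{m} + p_1(1 - p_1)}\right]^{2}}{(p_0 - p_1)^{2}}$$
$$\bar{p} = \frac{p_1 + m\,p_0}{m + 1}, \qquad \bar{p} = \frac{p_0 + p_1}{2} \text{ if } m = 1 \text{ (equal sample size)}$$
• The sample size for the unexposed group (n2) is given by:
$$n_2 = m \times n_1$$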
Example
Suppose that a researcher wants to see the impact of
weight training exercise on cardiovascular mortality.
According to the previous studies, the proportion of
cardiovascular death in people who participate in weight
training exercise (exposed) is 20% and in people who did
not participate in weight training exercise (unexposed) is
40%.
At 5% significance level, 80% statistical power with
equal number of exposed and non-exposed, calculate
sample size for this study.

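• Sample size for the exposed group (n1), with m = 1 (equal numbers of
exposed and non-exposed subjects) and $\bar{p} = \frac{0.4 + 0.2}{2} = 0.3$:
$$n_1 = \frac{\left[1.96\sqrt{\left(1 + \frac{1}{1}\right)(0.3)(1 - 0.3)} + 0.84\sqrt{\frac{0.4(1 - 0.4)}{1} + 0.2(1 - 0.2)}\right]^{2}}{(0.4 - 0.2)^{2}} = 81.13 \approx 82$$
• Sample size for the non-exposed group (n2): $n_2 = m \times n_1 = 1 \times 82 = 82$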
The study requires a total of 164 participants:
82 non-exposed (persons who will not participate in
weight training exercise) and 82 exposed (persons who will
participate in weight training exercise).
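
The cohort calculation above can be reproduced with the sketch below, which implements the exposed-group formula as reconstructed in these slides. The function name is an assumption; p0 is the event probability in the non-exposed group, p1 in the exposed group, and m the ratio of non-exposed to exposed subjects.

```python
import math


def cohort_sample_size(p0, p1, m=1.0, z_alpha=1.96, z_beta=0.84):
    """Return (exposed, non_exposed) sample sizes for a cohort study."""
    p_bar = (p1 + m * p0) / (m + 1)             # weighted average event probability
    numerator = (
        z_alpha * math.sqrt((1 + 1 / m) * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p0 * (1 - p0) / m + p1 * (1 - p1))
    ) ** 2
    n_exposed = math.ceil(numerator / (p0 - p1) ** 2)
    n_non_exposed = math.ceil(m * n_exposed)    # n2 = m * n1
    return n_exposed, n_non_exposed


# Cardiovascular death: 20% among exposed (weight training), 40% among non-exposed
exposed, non_exposed = cohort_sample_size(p0=0.40, p1=0.20, m=1)
print(exposed, non_exposed, exposed + non_exposed)
```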
Sample size determination using an online sample size
calculator

• An online open source calculator allows us to
determine the sample size for different study
designs.
• For example, we can use the Open Source Epidemiologic
Calculator, OpenEpi version 3.01, 2016 (available
from:
https://www.openepi.com/SampleSize/SSCC.htm)

Exercise
• Determine a sufficient sample size for your
proposed research topic.
