Lecture 2


Data collection

Amine Hadji

Leiden University

February 15, 2022


Outline

• Inferential statistics

• Confidence interval, margin of error

• Sampling methods and problems

• Surveys

• Introducing causality

• Experimental vs. observational studies


Inferential Statistics
Statistical techniques:
• Descriptive Statistics: using and analyzing numerical/graphical summaries.
• Inferential Statistics: using sample data to draw conclusions about a broader group
of individuals.
The fundamental rule for using data for inference: available data can be used to
make inference about a larger group if the data can be considered to be representative
with regard to the question of interest.

Examples:
• Are first ladies representative of women?
• Are LUC students representative of university students in the Netherlands?
Definitions
Goal: Use a small group of units to make inference about a larger group.
Definitions:

• Population: the larger group of individuals we want to make inference about.

• Census: every unit in the population is measured/surveyed.

• Sample: a small group of units that are measured/surveyed.

• Simple random sample: every conceivable group of units of the required size
from the population has the same chance to be the selected sample.
• Sample survey: a subgroup of a large population is questioned on a set of topics.
It is a typical use of simple random samples.
Sample survey vs. Census
Advantages of sample survey:
• A census is not always possible (e.g. quality control in a company, ecology, ...).

• A sample survey is faster: the US census takes place every 10 years (and takes years to plan).

• Accuracy: with a smaller group it is feasible to train the interviewers and track non-respondents.

Issues with sample survey:
• Selection bias: the selection procedure produces a non-representative sample.
• Nonparticipation bias: some of the selected individuals do not participate (they
cannot be contacted or do not respond).
• Response bias: participants provide incorrect information (e.g. because of the wording of the
question, a wish to please the interviewer, sensitive subjects, ...).
Selection bias
Margin of error

Sample surveys are often used to estimate the proportion (or percentage) of people who
have a certain trait or opinion.
Question: How accurate is the estimate?
Margin of error: a measure of the accuracy of a sample proportion as an estimate
of the population proportion. The difference between the sample proportion and the
population proportion is less than the margin of error (typically) 95% of the time.
Note: It assumes a representative sample.

Conservative margin of error: calculated as $1/\sqrt{n}$, where $n$ is the sample size
(as a percentage: $\frac{1}{\sqrt{n}} \times 100\%$).
For $n = 1600$ the margin of error is $1/\sqrt{1600} = 1/40 = 2.5\%$.
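
The conservative margin of error is a one-line calculation. A minimal Python sketch follows; the function name `conservative_margin_of_error` is chosen here for illustration and does not come from the lecture.

```python
import math


def conservative_margin_of_error(n: int) -> float:
    """Conservative 95% margin of error for a sample proportion: 1 / sqrt(n)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    return 1.0 / math.sqrt(n)


# The slide's example: n = 1600 gives 1/40 = 0.025, i.e. 2.5%.
margin = conservative_margin_of_error(1600)
print(f"n = 1600: margin of error = {margin:.3f} ({margin * 100:.1f}%)")
```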
Margin of error
Influence of the sample size:
• As the sample size increases, the margin of error decreases.
• The margin decreases more slowly than the sample size increases
(to halve the margin, we must multiply the sample size by 4).
• To get accurate estimates about a sub-group, large samples are required:
  • as the sample size increases, the sub-group size increases;
  • as the sub-group size increases, the margin of error for the sub-group decreases.

Effect of population size:
• No influence on the margin of error.
• The formulas were derived assuming an infinite population and sampling with replacement
(so they work even better in practice, where we sample without replacement from a finite population).
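
The sample-size bullets above can be checked numerically with the conservative formula $1/\sqrt{n}$. The sketch below is illustrative only: the function names and the 25% sub-group share are assumptions made for the example, not values from the lecture.

```python
import math


def conservative_margin_of_error(n: float) -> float:
    """Conservative 95% margin of error for a sample proportion: 1 / sqrt(n)."""
    return 1.0 / math.sqrt(n)


def subgroup_margin(n: int, subgroup_share: float) -> float:
    """Margin of error for a sub-group that makes up `subgroup_share` of the sample.

    `subgroup_share` is a hypothetical input: only the sub-group's own
    size (n * share) enters the formula.
    """
    return conservative_margin_of_error(n * subgroup_share)


# Quadrupling the sample size halves the margin of error.
for n in (100, 400, 1600, 6400):
    print(f"n = {n:5d}: margin = {conservative_margin_of_error(n):.4f}")

# A sub-group covering a quarter of a sample of 1600 behaves like a sample of 400.
print(f"sub-group margin: {subgroup_margin(1600, 0.25):.4f}")
```

Note that the population size appears nowhere in these functions, which matches the statement that it has no influence on the margin of error.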
Confidence interval
Confidence interval: an interval of values that estimates an unknown population
value.
Approximate 95% confidence interval:
$$\text{sample proportion} \pm \frac{1}{\sqrt{n}}.$$
Example: n = 2002 adult Americans were asked: “Do you favor or oppose scientific
experimentation on the cloning of human beings?” 340 people were in favor. What is
the 95% confidence interval?
Solution:
$$\frac{340}{2002} \pm \frac{1}{\sqrt{2002}} \approx 0.1698 \pm 0.0223 \approx [0.147, 0.192].$$
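
The approximate interval, sample proportion ± 1/√n, can be evaluated directly for the cloning example (340 in favor out of n = 2002). This is a minimal sketch; the function name `approx_95_ci` is an assumption made for illustration.

```python
import math


def approx_95_ci(successes: int, n: int) -> tuple[float, float]:
    """Approximate 95% confidence interval: sample proportion +/- 1/sqrt(n)."""
    p_hat = successes / n          # sample proportion
    margin = 1.0 / math.sqrt(n)    # conservative margin of error
    return p_hat - margin, p_hat + margin


low, high = approx_95_ci(340, 2002)
print(f"[{low:.3f}, {high:.3f}]")  # prints [0.147, 0.192]
```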
Simple random sample

• Similar to lottery games: each particular group has a very small chance of being selected,
but most of them will be representative (e.g. 1000 people out of 100,000,000, Example 5.5).
• Needs a list of the units in the population.
• Random numbers can be generated in R.
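
The slides mention R for generating random numbers. As an illustration of the same idea, here is a sketch using Python's standard `random` module instead. The population of 1000 numbered units and the sample size of 10 are hypothetical.

```python
import random


def simple_random_sample(population: list, k: int, seed=None) -> list:
    """Draw k units without replacement so that every group of size k is equally likely."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(population, k)


# Hypothetical list of units: IDs 1..1000 stand in for the population list.
population = list(range(1, 1001))
sample = simple_random_sample(population, 10, seed=42)
print(sample)
```

The function needs the full list of units as input, which is exactly the practical requirement noted above.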


Other sampling methods
It is not always feasible or practical to take a simple random sample (e.g. polling of voters).
• Stratified random sampling: first divide the population into subgroups (strata) and
then take a simple random sample from each of them (e.g. different regions in the
country).
  • properly represents important sub-groups.
  • reduces variance by grouping (if there is little natural variability of the answers
  within the strata).
• Cluster sampling: the population is divided into clusters, from which a random
sample of clusters is selected.
  • easier to accomplish (units in a cluster are physically close to each other).
  • many small clusters vs. few large strata.
  • uses a list of clusters (e.g. city blocks, flights), not a full list of individuals.
  • requires a different analysis, because units within a cluster tend to be similar.
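
The difference between the two designs is easiest to see side by side: stratified sampling draws a few units from every subgroup, while cluster sampling draws a few whole subgroups. The sketch below is a simplified illustration; the region names, city blocks, and sizes are invented for the example.

```python
import random


def stratified_sample(strata: dict, k_per_stratum: int, rng: random.Random) -> dict:
    """Take a simple random sample of k units from EVERY stratum."""
    return {name: rng.sample(units, k_per_stratum) for name, units in strata.items()}


def cluster_sample(clusters: dict, n_clusters: int, rng: random.Random) -> dict:
    """Select n clusters at random and keep ALL units in the selected clusters."""
    chosen = rng.sample(sorted(clusters), n_clusters)  # only a list of clusters is needed
    return {name: list(clusters[name]) for name in chosen}


rng = random.Random(1)

# Hypothetical strata: a few large regions.
regions = {
    "north": [f"N{i}" for i in range(50)],
    "south": [f"S{i}" for i in range(80)],
    "west": [f"W{i}" for i in range(30)],
}
stratified = stratified_sample(regions, 5, rng)

# Hypothetical clusters: many small city blocks with 4 households each.
city_blocks = {f"block_{b}": [f"b{b}_h{h}" for h in range(4)] for b in range(20)}
clustered = cluster_sample(city_blocks, 3, rng)

print({name: len(units) for name, units in stratified.items()})
print(sorted(clustered))
```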
Other sampling methods II

• Systematic sampling: divide the list into consecutive segments, randomly
choose a starting point, and then sample at the same point in each segment.
  • e.g. a company samples every 200th product: much simpler than listing all
  products.
• Multistage sampling: combine stratified and cluster sampling to work down through smaller
portions of the population until individual units are reached.
  • Stratify by region, then by urban, then by suburban, ..., and sample
  communities within those strata by cluster sampling (e.g. with city blocks
  or fixed areas as clusters). See Example 5.8 on a survey about the unemployment rate.
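
The systematic sampling rule described above (a random starting point, then the same position in every segment) can be sketched in a few lines. The product list and the step of 200 mirror the slide's example, but the data itself is made up.

```python
import random


def systematic_sample(units: list, step: int, rng: random.Random) -> list:
    """Pick a random start within the first segment, then take every `step`-th unit."""
    start = rng.randrange(step)  # random position inside the first segment
    return units[start::step]


# Hypothetical production run of 1000 products, sampling every 200th one.
products = [f"product_{i}" for i in range(1000)]
rng = random.Random(7)
sample = systematic_sample(products, 200, rng)
print(sample)
```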
Difficulties in sampling
Some problems can occur even if you have a proper sampling plan.
• Wrong sampling frame (e.g. using a paper telephone directory to target the
general adult population; problem: only people with a telephone can answer).
• Selected individuals cannot be reached (e.g. housewives are overrepresented in
telephone surveys).
• Volunteer response (e.g. only people with strong opinions tend to respond to
surveys).
Difficulties in sampling - Example

Convenience sample: using the most convenient group available. This usually breaks
the fundamental rule of using data for inference.
Example: failure of the Literary Digest poll of 1936 (Roosevelt vs. Landon):
• questionnaires were mailed to 10 million people: magazine subscribers, car owners, telephone owners.

• only 2.3 million people answered.

• Gallup got both the election result and the Literary Digest's prediction right (surveying
only 50,000 people).
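
The lesson that a huge sample from the wrong frame can lose to a small random sample can be illustrated with a toy simulation. Everything in the sketch below is hypothetical: the population size, the share of people in the frame, and the support rates are invented and are not the 1936 figures.

```python
import random

rng = random.Random(2)

# Hypothetical population: 30% are in the sampling frame (e.g. phone owners),
# and support for candidate A differs between people inside and outside the frame.
population = []
for _ in range(200_000):
    in_frame = rng.random() < 0.3
    supports_a = rng.random() < (0.4 if in_frame else 0.7)
    population.append((in_frame, supports_a))


def proportion(voters: list) -> float:
    """Share of voters in the list who support candidate A."""
    return sum(supports for _, supports in voters) / len(voters)


truth = proportion(population)

# Large sample drawn only from the frame (selection bias).
frame = [voter for voter in population if voter[0]]
biased_estimate = proportion(rng.sample(frame, 50_000))

# Small simple random sample from the whole population.
srs_estimate = proportion(rng.sample(population, 1_000))

print(f"truth: {truth:.3f}")
print(f"large biased sample (n = 50000): {biased_estimate:.3f}")
print(f"small random sample (n = 1000): {srs_estimate:.3f}")
```

Increasing the size of the biased sample does not help, because the error comes from who is in the frame rather than from how many people are asked.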
Pitfalls of Asking Survey Questions
Response bias: the wording of the question influences the answer.

• Deliberate bias in question (e.g. leading questions)

• Unintentional bias in question (meaning can be misinterpreted, e.g. drugs)

• Desire of respondents to please (e.g. sensitive/controversial subjects)

• Asking the uninformed (e.g. fictional group of ’Wisians’)

• Unnecessary complexity (e.g. hard to understand the question)

• Ordering of the questions

• Confidentiality and anonymity concerns (e.g. computer vs. paper survey)

• Words can have different meaning to different people


What exactly is measured

Form of questions:
• Closed question: respondents are given a list of alternatives (easy to summarize
and analyse, might exclude possible important answers)
• Open question: respondents are allowed to answer in their own words (difficult to
summarize; the wording of the question might exclude answers)

The two forms can provide very different results; both have pros and cons.
Advice on questionnaires

• Do not ask too many questions.

• Allow a neutral opinion if possible.

• Instead of specific questions that rely on memory, ask more general ones (e.g. How
much chocolate did you eat in the past month? ⇒ In a typical month, how much
chocolate do you eat?)
• Ordering:
  • Introduction: simple questions
  • Important questions should be asked early (respondents lose interest)
  • Sensitive questions should not be the first ones
Cause and effect

Goal: The goal in scientific studies is to detect causal relationships.

Examples:
• Does smoking cause cancer?

• Does owning a BMW result in violating more traffic rules?

Fundamental problem of causal inference: the outcome of an individual can only
be seen in one group: control group or treatment group (i.e. we can never answer the
question ’What if?’)
Cause or effect?
Statistical analysis
Statistical answer:
• Well designed research experiments and correct statistical analysis

• Evaluating relationships between two phenomena (using rigorous mathematical
tools)

Be careful! False comparison:


• more police ⇒ more crime.

• more doctors ⇒ more deaths in hospitals.

• Aaron Ramsey theory.


Correlation vs Causation
Research Studies
Types of research studies:
• Observational study: Researchers only observe or measure the participants and
do not assign any treatments or conditions.
• Randomized experiment: Participants are randomly assigned to participate in
one condition or another (very few studies are of this type, mainly in medicine)

Terminology:
• Unit/Subject/Participant: single individual or object being measured.

• Explanatory and response variables: in a relationship between two variables, the explanatory
variable is the one that partially explains the value of the response variable for an individual.
• Confounding variable: a variable that affects the response variable and is also related to the
explanatory variable.
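
To see how a confounding variable can create an association without any causal effect, consider a toy simulation. The variables and probabilities below are invented for illustration: a confounder `z` raises both the chance of the explanatory variable `x` and the chance of the response `y`, while `x` itself has no effect on `y`.

```python
import random


def simulate(n: int, rng: random.Random) -> list:
    """Generate (z, x, y) rows where z drives both x and y, and x does NOT affect y."""
    rows = []
    for _ in range(n):
        z = rng.random() < 0.5                   # confounder
        x = rng.random() < (0.8 if z else 0.2)   # explanatory variable, related to z
        y = rng.random() < (0.6 if z else 0.2)   # response, depends on z only
        rows.append((z, x, y))
    return rows


def response_rate(rows: list, x_value: bool, z_value=None) -> float:
    """Share of rows with y true, among rows with the given x (and optionally z)."""
    selected = [y for z, x, y in rows
                if x == x_value and (z_value is None or z == z_value)]
    return sum(selected) / len(selected)


rng = random.Random(0)
rows = simulate(50_000, rng)

# Ignoring z, x looks strongly related to y (expected gap: 0.52 - 0.28 = 0.24).
overall_gap = response_rate(rows, True) - response_rate(rows, False)

# Comparing like with like (same z), the gap is close to zero.
gap_z_true = response_rate(rows, True, True) - response_rate(rows, False, True)
gap_z_false = response_rate(rows, True, False) - response_rate(rows, False, False)

print(f"overall gap: {overall_gap:.3f}")
print(f"gap within z = True: {gap_z_true:.3f}")
print(f"gap within z = False: {gap_z_false:.3f}")
```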
Research Studies - Examples

• Do church visits result in lower blood pressure? (2391 people; people attending
church regularly are 40% less likely to have high blood pressure.)
• Malaria prevention with nets. (Two similar villages: one gets nets, the other does
not. Measure the difference in malaria cases.)
Randomized experiment

Participants are randomly assigned to be in a specific group.


Idea: This makes the groups approximately similar in all respects except for the
explanatory variable.

Pros: strongest evidence of a cause-and-effect relationship.


Cons:
• ethical issues (e.g. oral contraceptive vs placebo for women)

• practical issues (smaller sample size, not enough data, expensive...)


Designing randomized experiments

• Goal: extend results for larger groups (data needs to be representative with
respect to the research question)
• Participants: volunteers. (sometimes problematic ⇒ non-representative
population)
• Randomization:
  • Randomizing the type of treatment (using computer programs)
  • Randomizing the order of treatment
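
Both kinds of randomization can be done with a shuffle. The sketch below is one possible way to do it, not a prescribed procedure. The participant IDs, the even split into two groups, and the treatment labels are assumptions made for the example.

```python
import random


def randomize_treatment(participants: list, rng: random.Random) -> dict:
    """Randomly split participants into a treatment group and a control group."""
    shuffled = participants[:]      # copy, so the original list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}


def randomize_order(treatments: list, rng: random.Random) -> list:
    """Random order in which one participant receives several treatments."""
    order = list(treatments)
    rng.shuffle(order)
    return order


rng = random.Random(3)
participants = [f"P{i:02d}" for i in range(20)]   # hypothetical volunteer IDs
groups = randomize_treatment(participants, rng)
order = randomize_order(["A", "B", "placebo"], rng)

print(groups["treatment"])
print(groups["control"])
print(order)
```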
Designing randomized experiments II

• Replication: a single experiment typically does not provide sufficient evidence.

• Control groups: a group receiving no active treatment, used to check whether the treatment has an effect.

• Placebo: special treatment for the control group. It looks like an actual treatment but has
no active “ingredient”. Placebos have been shown to have significant effects.
• Blinding:
  • single blinding: either the participant or the researcher does not know which treatment
  was assigned.
  • double blinding: neither the participant nor the researcher knows which
  treatment was assigned (not always possible).
Observational studies

Example:
• Case-control study: “cases” (having the attribute) are compared to “controls” (not
having it).
  • Example: Does baldness cause heart attacks? (665 heart-attack patients vs. 772 patients with
  other diseases; higher chance of baldness among the heart-attack patients)
Case control study

Advantages:
• Efficiency. (e.g. heart attacks are rare, so choosing people among people who
already have had a heart attack is cheaper/more efficient)
• Reduced potential confounding variables. (e.g. maybe balding men are less
healthy; hence choosing other patients).
Difficulties
• Confounding variables:
  • Do not have an effect in randomized experiments: cause-and-effect
  relationships can be inferred.
  • In an observational study, they must be taken into account.
• Extending results: only extend if the data can be considered to be representative with regard to the
question of interest.
• Hawthorne effect: participants respond differently because they know they are in an experiment (e.g.
a problem in medical research).
• Experimenter effects: recording the data erroneously, treating subjects
differently, making the subjects aware of the desired outcome.
• Ecological validity: variables are removed from their natural setting (e.g. no social
pressure).
