Acaps Technical Brief Spotting Dubious Data November 2015
SPOTTING DUBIOUS DATA
November 2015
SUMMARY: SPOTTING DUBIOUS DATA. Amid the volumes of information
available on humanitarian crises, only a few statistics are worth remembering and using.
Look out for the following sources of error, scrutinise the data and learn to tell
solid stats from dubious data (adapted from Joel Best).
1. WHO HAS BEEN COUNTING AND WHY?

Example: Headline 14 September 2015, The Daily Mail: Two in every 100 Syrian migrants are IS fighters, according to the Lebanese Minister of Education.
Why is it dubious: It is unlikely that the Lebanese Minister of Education has the expertise to speak to the ratio of IS fighters to individuals fleeing Syria. It is likely that the Daily Mail, described by some as a ‘sensationalist’ newspaper, did not check this fact before publication.

Keep in mind:
Why was the data collected? What is the agenda of the source? Could it be biased?
What is the expertise of those who have collected, reproduced and disseminated the data?
Are they sufficiently knowledgeable to research the matter?
Is there a strong track record of producing accurate information?

2. WHAT HAS BEEN COUNTED?

Example: Colombia has the second highest number of IDPs in the world, after Syria.
Why is it dubious: The concept of an IDP in Colombia is very broadly defined: displacement figures for Colombia commonly count all people who were internally displaced since the 1990s.

Keep in mind:
Look out for concepts that are widely used within the humanitarian community but lack a common definition, such as affected, in need, vulnerable, household, urban.
Consider whether the concepts used could have been defined too narrowly or too broadly. Has something been excluded?
Have definitions remained the same at the different points in time? Has there been domain expansion (definitions that have been broadened over time)?
3. HOW WAS IT COUNTED?

Example: 6.5 million people have been internally displaced in Syria as of October 2015.
Why is it dubious: Data gathering in Syria is severely hampered by the active conflict and lack of access to parts of the country (IDMC 07/2015). Statistics regarding the Syria conflict are therefore broad guesstimates, computed in a politically charged context.

Keep in mind:
Does the data consist of numbers that seem hard to produce: how could anyone calculate that?
Closely scrutinise information on sensitive topics, such as SGBV or informal activities.
Numbers presented without sufficient information about measurement choices or assessment tools?
Unusual units of analysis (e.g. extended families instead of households) that might affect the resulting statistic?
Criticisms of measurement choices by others?
Particular caution is required when reviewing forecasts or estimates about future trends.

4. HOW WAS IT PROCESSED AND ANALYSED?

Example: 7.4 million people are in need in Afghanistan.
Why is it dubious: Double-counting the number of people in need is common, and this example illustrates the underlying thinking error. The numbers of people in need per sector have been combined to total 7.4 million. However, the units of analysis are not mutually exclusive categories: some people who are severely food insecure will have been affected by natural disasters too, etc.

Example: Water shortages for refugees in camps in Jordan have reached emergency levels; the supply is as low as 30 liters per person per day, one-tenth of what the average American uses.
Why is it dubious: A crisis situation is often compared to the reference standards of the audience that organizations hope will provide funding. The United States is one of the countries with the highest per capita water use in the world and is therefore not an appropriate comparison group. Sphere standards put total basic water needs per person at 7.5 to 15 liters a day.

Keep in mind:
Could the calculations be flawed?
Are there any misleading comparisons, timeframes, comparison groups or standards used?
Are there any stated relationships between two variables? (Look out for reports that claim to identify the key cause of complex problems; it is impossible to determine causality through experimental design.)
Calculations that highlight or muffle outliers?

5. HOW WAS IT PACKAGED?

Example: Of the more than 80 million people estimated to have been in need of humanitarian assistance in 2014, over 75% were women and children.
Why is it dubious: 75% of all people in high fertility countries are women and children; it is unclear how this was calculated and it is most likely only included for shock purposes.

Example: UNHCR says most of the Syrians arriving in Greece are students.
Why is it dubious: The results of the survey indicate that ‘student’ was the most frequently mentioned occupation, indicated by 16% of respondents.

Example: Before the outbreak of violence in Burundi following mass protests, global acute malnutrition rates among children under 5 were already at 41%.
Why is it dubious: Global Acute Malnutrition (GAM) rates above 15% are considered critical, the most severe level of the WHO scale. One of the highest levels of GAM recently recorded was in South Sudan, at 22.7% (Generation Nutrition 2014).

Keep in mind:
Dramatic statements that take the form of statistical claims, such as hyperboles, ‘the best’, ‘the most’, ‘myth’, ‘new discovery’?
Words that imply causation (such as leads to, attributable to, caused by, etc.). It is very difficult to determine causality, particularly in an emergency setting.
Unhelpful denominators (x per hour) used for shock purposes?
Have results been misinterpreted? Are visual representations accurate or misleading?
Blunders (numbers that seem surprisingly large or small)?
Are the figures in line with what I know and expect, or surprisingly different? Have decimal points been misplaced?
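
Returning to the double-counting problem in the Afghanistan example under question 4: the following is a minimal Python sketch of why adding per-sector figures overstates the number of people in need when the same person appears in several sectors. The person identifiers, sector names and the helper `people_in_need` are made up for illustration and do not come from the source.

```python
def people_in_need(sector_lists):
    """Return (naive_total, unique_total) for a mapping of sector -> set of person IDs.

    naive_total adds up the sector figures, counting a person once per sector.
    unique_total counts each person once, however many sectors list them.
    """
    naive_total = sum(len(ids) for ids in sector_lists.values())
    unique_total = len(set().union(*sector_lists.values()))
    return naive_total, unique_total


# Hypothetical data: p3 and p4 are both food insecure and disaster affected,
# p1 and p5 each appear in two sectors as well.
sectors = {
    "food_security": {"p1", "p2", "p3", "p4"},
    "natural_disaster": {"p3", "p4", "p5"},
    "health": {"p1", "p5", "p6"},
}

naive, unique = people_in_need(sectors)
print(f"sum of sector figures: {naive}, distinct people: {unique}")
```

When person-level lists are not available, the true total can only be bracketed: it cannot be smaller than the largest single sector figure and cannot be larger than the sum of all sector figures.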
Table of Contents

Introduction
Benchmarks
Who Is Counting and Why?
What Has Been Counted?
How Was It Counted?
How Was It Processed and Analysed?
How Is It Packaged?
Sources and Background Readings

Introduction

“27% of statistics are false”

People often assume that all numbers are hard facts: if it is reported, someone must have calculated and checked the figures. Some available figures are indeed accurately reported findings of sound research. Others are based on flawed research, or intended to mislead the user. Bad numbers often take on a life of their own: they continue being repeated, even after they have been thoroughly debunked. This is particularly true in the Internet age, when it is so easy to circulate information.

The figure itself will not give away its true character: a 9 million looks like a 9 million even if it is used to present dubious data. The context is needed to understand if numbers reflect an accurate statistic, a wild guesstimate or anything in between. This chapter provides practical guidance on how to interpret the context. It provides a list of common problems found in the numbers appearing in humanitarian reports and illustrates these problems with examples.

This note is adapted from Stat-Spotting: A Field Guide to Identifying Dubious Data by Joel Best (2013).

Benchmarks

Knowledge of some basic statistical benchmarks is the most effective method to spot dubious data and recognise questionable statistics. Always be aware of the following statistics for the relevant country:
The total pre-crisis population in affected areas
The demographic profile of the population
Estimated number of people affected or displaced
Humanitarian profile of similar crises
Sector specific pre-crisis facts, such as the price of staple foods, school attendance rates, etc.

Example: By the end of 2013, the UN estimated that 6.5 million people had been displaced in Syria as a result of the civil war. The conflict, which had been ongoing for over two years at that point, had resulted in a widespread shortage of staff, damage to infrastructure, and a lack of inputs such as medicines and water purification tablets. As a result, the health and WASH clusters estimated that 21 million people were in need of humanitarian assistance.

It is generally agreed that an unprecedented number of people in Syria were (and still are) in need of support. However, a quick look at the total population in Syria shows us that the 21 million people in need is most likely an exaggeration. Estimates of the pre-crisis population range from 21 to 24 million people. By the end of 2013, over 2 million Syrians had already registered as refugees in neighbouring countries, with a significant additional number of Syrians estimated to be unregistered. This means that the reported WASH and Health people in need (PIN) numbers actually total at least the whole population in the country. By November 2015, the estimate of the number of people in need of WASH and health support had decreased to around 12 million: still an unprecedentedly high number, but more likely to be a reflection of the situation than the previously used 21 million. (SHARP 12/2013, SHARP 10/2015)

Keep in mind that the most dramatic situations are relatively rare, whereas the most common situations are not especially dramatic. This point is important because media coverage and fundraising campaigns often include extreme examples that are presented to illustrate a humanitarian crisis. These examples are usually atypical.

Example: Most humanitarian crises display this pattern: there are lots of less serious cases, and relatively few very serious ones:
Number of people dying of starvation < number of people borderline food insecure
Number of people killed < number of people displaced
Number of children trafficked < number of children unable to attend school every day

As a ‘rule of thumb’, subject every statistic to the following 5 questions:
Who is counting and why?
What has been counted?
How was it counted?
How was it calculated and analysed?
How has it been packaged?
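
The benchmark check used in the Syria example can be written down as a few lines of arithmetic. The sketch below is one possible way to do it in Python, using only the figures quoted in that example; the function name `check_against_population` and its return keys are assumptions made for illustration, not part of any ACAPS tool.

```python
def check_against_population(claimed, pre_crisis_population, known_departures=0):
    """Compare a claimed count with the largest population it could describe.

    The ceiling is the pre-crisis population minus people known to have left
    (for example registered refugees). A share close to or above 1.0 means the
    claim covers nearly everyone, or more people than exist, and needs scrutiny.
    """
    ceiling = pre_crisis_population - known_departures
    return {
        "ceiling": ceiling,
        "share_of_ceiling": claimed / ceiling,
        "exceeds_ceiling": claimed > ceiling,
    }


# Figures quoted in the Syria example (end of 2013).
claimed_people_in_need = 21_000_000
registered_refugees = 2_000_000

# The pre-crisis population estimates range from 21 to 24 million, so test both ends.
for population_estimate in (21_000_000, 24_000_000):
    result = check_against_population(
        claimed_people_in_need, population_estimate, registered_refugees
    )
    print(population_estimate, result)
```

With the lower population estimate the claim exceeds the number of people left in the country; with the higher one it still covers roughly 95% of them. Either result is a signal to go back to the source and ask how the figure was produced.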
Who Is Counting and Why?

“There are three kinds of lies: lies, damned lies, and statistics” (Disraeli)

Always scrutinise the original source of the information and the entity that has (re)produced the ‘fact’. Start with considering the expertise of the individual or organisation that has collected and disseminated the data. Specific expertise is an asset as well as a handicap. It provides the skills and knowledge to count and analyse complex matters. At the same time, subject matter experts are vulnerable to confirmation bias, seeking only information that is consistent with their worldview. In humanitarian settings, individual agency biases and agendas are a well-known risk to accurate reporting.

Example: Interpreting data in a way that supports a belief: How people interpret scientific reports related to climate change is influenced by their political preferences. Research in 2013 showed that 70% of US Democratic voters saw evidence of man-made climate change in recent weather patterns, whereas only 19% of Republican voters did when reviewing the same set of data. (Economist 28/11/2015)

Therefore, closely review the agenda, interests and motive for bias of the source. Why has this data been collected or quoted? Look out for studies initiated or funded by groups supporting a specific idea or cause.

Example: Deaths in the war in Iraq. During the 2003 Iraq intervention, critics used civilian deaths to prove that the intervention was a mistake, while the Bush administration insisted that the numbers were exaggerated. Suspicions that the administration’s death toll was too low led to new methodologies for counting civilian deaths, notably incident-based reporting and mortality surveys. Wide variations between their estimates show that counting conflict casualties is fraught with difficulties, even without competing interests influencing the results.

What Has Been Counted?

dubious data: Look out for concepts that are widely used within the humanitarian community but lack a common definition, including: affected, in need, vulnerable, household, recently displaced, casualties, injured, etc.

BROAD DEFINITIONS. Be aware of definitions that are too broad. When advocating on social problems, it often seems preferable to have broad definitions. Broad definitions entail larger numbers, and can therefore generate more attention to a problem.

Example: Displacement figures for Colombia commonly count all people who were internally displaced (IDPs) since the 1990s. The figure stood at 5.7 million IDPs by June 2014. If this number is compared with other displacement crises, the figure looks enormous, surpassed only by displacement in Syria. However, the cumulative count of IDPs in Colombia includes people who have since returned to their place of origin, who were displaced only for a very short period of time, who have since died, etc. This method includes too many cases to be used for comparison or to give an accurate representation of the current situation. For 2015, the number of IDPs in Colombia is cited as 224,000 by OCHA. (HDX 03/2015; UNHCR 2015)

Given the advantages of defining problems broadly, definitions might be broadened over time, a phenomenon called domain expansion. The obvious consequence of a broader definition is that statistical estimates for the problem’s size will expand. Bigger numbers generate more attention to the problem.

Example: The death toll for the Syrian civil war has been controversial and hard to verify, with differing estimates given by a number of different actors. The Syrian Observatory for Human Rights, whose data is widely reported in international media, changed its definition of civilian casualties in early 2014. Previously, opposition forces and civilian deaths had been listed separately. This was changed to include both armed opposition and civilians in the category “civilian deaths”. Under the new definition the number of reported civilian casualties increased from around 50,000 to 75,000. (Council on Foreign Relations, Washington Post 10/02/104, SOHR)
Example: If the dataset is large enough, correlations can be found for anything. The website ‘Spurious Correlations’ identified 40,000 correlations that can be made by putting together data from several databases, including the US Census and CDC. The site for instance shows the correlation between the number of people who drowned by falling into a pool and the number of films Nicolas Cage appeared in (r = 0.666). The age of Miss America correlates with murders by steam, hot vapors and hot objects at r = 0.870.

In social sciences, it is impossible to determine causality through experimental design, as it is not possible to control for all factors in people’s lives to isolate the effect of some specific cause. Further, there are many competing explanations for social problems. Look out for reports that claim to identify the key cause of complex problems.

It is for instance likely that some of the severely food insecure have been displaced by the conflict and have unmet shelter needs. Beware that double counting is a common flaw of figures on people in need or displaced.

MUFFLING AVERAGES. Check whether the mean or the median was used as the average, and how the other measure would change the result. The mean is calculated by adding the scores of each of the cases and then dividing by the number of cases. But if there are extreme scores, this method is less useful and can actually hide large variations. The median involves listing cases from lowest to highest value and then identifying the middle score.

Example: How aggregation can hide large variations: [illustration not reproduced]
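
To make the difference between the two averages concrete, here is a minimal Python sketch using the standard library's `statistics` module. The household income values and the helper `summarise` are made up for illustration and are not taken from the source.

```python
from statistics import mean, median


def summarise(values):
    """Return the mean and the median of a list of numbers."""
    return {"mean": mean(values), "median": median(values)}


# Hypothetical weekly household incomes: six low values and one extreme value.
incomes = [10, 12, 12, 15, 18, 20, 400]

summary = summarise(incomes)
print(f"mean:   {summary['mean']:.1f}")
print(f"median: {summary['median']}")
```

The single extreme household pulls the mean far above what any of the other six households earn, while the median stays at the middle household. A report quoting only one of the two figures would give a very different picture of the same data.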
Hyperboles: ‘the greatest’, ‘the largest’, ‘the most’, ‘record setting’. Superlatives imply comparison, suggesting that someone has measured two or more phenomena and determined which one is most significant. However, just as often, this qualification is not based on a comparison of similar examples.

Myths: Watch out when something is called a ‘myth’, which signals a contentious issue on which people disagree about what is true and false. The evidence supporting all parties should be reviewed.

Epidemics: Be wary of announcements of a new “epidemic”. These often involve comparisons between old numbers (when no one was paying close attention) and new figures (collected by people keeping much closer tabs on things).

MISLEADING CALCULATIONS. Every 3.6 seconds one person dies of starvation. Every minute 28 girls younger than 18 are married off. Every hour, more than 10,000 sharks are killed by humans.
Social problems are often presented as occurring every X minutes to increase the shock factor. People who package statistics choose the mathematical format that will make the most powerful impression.

Quantities can be expressed in different ways (percentages, proportions, absolute numbers) and still refer to the same amount mathematically. The choice of format used to present the statistic can influence the reader’s perception of the reality.

Example: “Every two hours a woman dies from an unsafe abortion in India.” This statistic is striking and memorable. However, if the absolute number is presented, 4,380 deaths a year, it is just another figure that will be forgotten as soon as it is read. (The HINDU. Unsafe abortions killing a woman every two hours, 2013)

Big round numbers: Big numbers make big impressions. However, big round numbers are often just estimates. These guesstimates are likely to err on the side of exaggeration.

MISLEADING GRAPHS. Look out for graphs that are hard to decipher and graphs in which the image does not seem to fit the data. The computer revolution has made it vastly easier to create graphs and to produce jazzy, eye-catching displays of data. However, a beautiful graph is not necessarily correct. A graph is no better than the thinking that went into its design. Always carefully review the axes, as playing around with these is a common method to influence the interpretation of data.

Compare: [figures not reproduced]

REPORTING BLUNDERS. Not all dubious reporting is intentional. Innumeracy (the mathematical equivalent of illiteracy) affects most of us to one degree or another, including those who produce figures, others who repeat them, and the audience that hears them. Common blunders include the misplacement of a decimal point, confusion about the denominator, misleading graphs and erroneous calculation. Be aware of possible mistakes that could have slipped into the reporting on otherwise accurate statistics:

Peculiar Percentages: Look for surprisingly large or small percentages.

Example: In Burundi, the National Red Cross reported a chronic malnutrition rate of 58% in February 2015, but without providing any source, explanation or methodology. It is probable that they were quoting the demographic and health survey from 2010. However, the last SMART survey from 2013 showed a 31.5% chronic malnutrition rate among children under 59 months. (Red Cross 2015, DHS 2010, WFP 2014)

The slippery decimal point: Beware of misplaced decimal points. Misplacing a decimal point is an easy mistake to make. If the decimal point is moved just one place to the right, there are ten times as many of whatever you were counting. Move it just one digit to the left and only a tenth as many remain.

A set of statistical benchmarks can lead us to suspect that some number is improbably large (or small), but errors can be harder to spot if one cannot get a good sense of the correct number in the first place.
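
One way to get a sense of the underlying number is to convert "every X seconds" claims back into annual totals, which can then be compared with a benchmark. The sketch below does this in Python for the rate statements quoted in this section; the function `events_per_year` is a hypothetical helper written for illustration, and it assumes a 365-day year.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def events_per_year(events, every_seconds):
    """Convert 'N events every S seconds' into a rounded annual total."""
    return round(events * SECONDS_PER_YEAR / every_seconds)


# Rate statements quoted in this section, as (events, interval in seconds).
claims = {
    "unsafe abortion deaths in India (1 every two hours)": (1, 2 * 60 * 60),
    "starvation deaths (1 every 3.6 seconds)": (1, 3.6),
    "girls under 18 married (28 every minute)": (28, 60),
    "sharks killed (10,000 every hour)": (10_000, 60 * 60),
}

for label, (events, interval) in claims.items():
    print(f"{label}: {events_per_year(events, interval):,} per year")
```

The first line reproduces the 4,380 figure from the unsafe abortion example, which shows that the dramatic per-hour format and the forgettable annual total describe exactly the same quantity.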