(30 Second) Liberty Vittert - 30-Second Data Science - The 50 Key Concepts and Challenges, Each Explained in Half A Minute-Ivy Press - Quarto Publishing (2020)
30-SECOND
DATA SCIENCE
50 KEY CONCEPTS AND CHALLENGES, EACH EXPLAINED IN HALF A MINUTE
Editor
Liberty Vittert
Contributors
Maryam Ahmed
Vinny Davies
Sivan Gamliel
Rafael Irizarry
Robert Mastrodomenico
Stephanie McClellan
Regina Nuzzo
Rupa R. Patel
Aditya Ranganathan
Willy Shih
Stephen Stigler
Scott Tranter
Liberty Vittert
Katrina Westerhof
Illustrator
Steve Rawlings
First published in North America in 2020 by
Ivy Press
An imprint of The Quarto Group
The Old Brewery, 6 Blundell Street
London N7 9BH, United Kingdom
T (0)20 7700 6700
www.QuartoKnows.com
Foreword 7
INTRODUCTION
Liberty Vittert
If God doesn’t exist and I believe in Him then I might have a wasted
life with false belief, but nothing happens.
If God doesn’t exist and I don’t believe in Him then I didn’t waste
my life with false belief, but again, nothing happens.
If God does exist and I do believe in Him then I have a wonderful
eternity in Heaven.
But if God does exist and I don’t believe in Him then it’s eternal
hell-fire for me.
8 g Introduction
Pascal used the data he had to make a decision to optimize his future
happiness and mitigate potential risk. Really, that is what data science is:
taking past and current information in order to predict the likelihood of
future events, or, rather, the closest thing to a crystal ball that the world
has at its disposal. The only difference between us and Pascal is that we
live in a world with far more than four bits of data to analyse; we have
endless amounts.
It is estimated that we produce over 2.5 exabytes of data per day.
A quick calculation shows that this is the same amount of information as
a stack of Harry Potter books reaching from the Earth to the Moon and
back, and then wrapping around the circumference of the Earth 550 times.
And that is simply the amount of data produced per day!
BASICS
GLOSSARY
Enigma code Method of scrambling or encrypting messages employed by the German armed forces during the Second World War, which was cracked by Alan Turing and his colleagues at Bletchley Park.

epidemiology The study of the incidence of health conditions and diseases, which populations are most vulnerable and how the associated risks can be managed.

normal (Gaussian) distribution Bell-shaped curve describing the spread or distribution of data across different values. Data sets that are often normally distributed include exam scores, the heights of humans and blood pressure measurements. Normal distribution shows the probability of a random variable taking different values. Many statistical analyses assume the data is normally distributed.
DATA COLLECTION
the 30-second data
Data science was born as a subject when modern computing advances allowed us to suddenly capture information in huge amounts. Previously, collecting and analysing data was limited to what could be done by hand. Modern advances now mean that information is collected in every part of our lives, from buying groceries to smart watches that track every movement. The vast amount now collected is set to revolutionize every aspect of our lives, and massive companies have emerged that collect data in almost unimaginable quantities. Facebook and Google, to name just a couple, collect so much information about each of us that they could probably work out things about us that even our closest friends and family don't know. Every time we click on a link on Google or like a post on Facebook, this data is collected and these companies gain extra knowledge about us. Combining this knowledge with what they know about other people with similar profiles to ourselves means that these companies can target us with advertising and predict things about us that we would never think possible, such as our political allegiances.

3-SECOND SAMPLE
Since the invention of modern computing, 'big data' has become a new currency, helping companies grow from conception to corporate giants within a decade.

3-MINUTE ANALYSIS
The amount of data that we now collect is so massive that the data itself has its own term – big data. The big data collected in the modern era is so huge that companies and researchers are in a constant race to keep up with the ever-increasing requirements of data storage, analysis and privacy. Facebook supposedly collects 500+ terabytes of data every day – it would take over 15,000 MacBook Pros per day to store it all.

RELATED TOPICS
See also
TOOLS page 22
SURVEILLANCE page 82
REGULATION page 150

3-SECOND BIOGRAPHIES
GOTTFRIED LEIBNIZ 1646–1716
Helped develop the binary number system, the foundation of modern computing.
MARK ZUCKERBERG 1984–
Co-founded Facebook with his college room-mates in 2004, and is now CEO and chairman.

30-SECOND TEXT
Vinny Davies
Why is this the case? Well, data isn't usually simple and nor is summarizing it; I may summarize it one way, you another. But who is right? Therein lies the problem: it is possible to be manipulated by the data summaries we are shown. Even summaries that are true may not provide information that is a fair and accurate representation of the data which that summary represents. For instance, did you know that teenage pregnancies dramatically reduce when girls reach 20 years of age? Technically true, but realistically not a useful summary. The next time you see a summary, think about how it could have been manipulated, and then consider the results of the summary accordingly.

3-MINUTE ANALYSIS
Physically visualizing the massive amounts of complex data collected is a challenge in itself. Most modern data sets are almost impossible to visualize in any sensible way and therefore any visual summaries are usually a very simplified interpretation of the data. This also means that visual summaries can easily be misrepresented, and what is seen isn't always as straightforward as it seems.

3-SECOND BIOGRAPHIES
BENJAMIN DISRAELI 1804–81
Former British Prime Minister to whom the quote 'there are three types of lies: lies, damned lies and statistics' is often attributed.
STEPHAN SHAKESPEARE 1957–
Co-founder and CEO of opinion polls company YouGov, which collects and summarizes data related to world politics.

30-SECOND TEXT
Vinny Davies
…will get arthritis in the next five years. Creating a model with age and gender from previous individuals (knowing whether they got arthritis or not) allows us to predict what could happen to a new individual. As well as simply trying to predict future data, data can also be used to try to establish the cause of a particular outcome. This process is called 'causal inference' and is often used to help understand disease, for example via analysing DNA. However, even though both examples mentioned are trying to predict cases of arthritis, the modelling problems they represent are subtly different and are likely to require vastly different modelling processes. Choosing the best model based on the data and aims associated with a particular project is one of the major skills all data scientists must have.

3-MINUTE ANALYSIS
Learning from data is not a modern phenomenon. In 1854, during an outbreak of cholera in London, Dr John Snow collected and used data to show the source of the disease. He recorded where cholera cases occurred and used the data to map them back to the Broad Street Pump. Residents then avoided the pump, helping to end the outbreak of the disease. The pump remains as a landmark in London to this day.

3-SECOND BIOGRAPHIES
JOHN SNOW 1813–58
British physician considered the founding father of epidemiology, who is known for tracing the source of a cholera outbreak in London in 1854.
ALAN TURING 1912–54
British mathematician who used data from messages to help crack the Enigma code in the Second World War.

30-SECOND TEXT
Vinny Davies

Once gathered, data can be put through modelling processes, which can enhance understanding.
TOOLS
the 30-second data
Dealing with the massive data sets that are collected and the complex processes needed to understand them requires specialist tools. Data scientists use a wide variety of tools to do this, often using multiple different tools depending on the specific problem. Most of these tools are used on a standard computer, but in the modern era of cloud computing, work is beginning to be done on large clusters of computers available via the internet. A lot of large tech companies offer this service, and these tools are often available to data scientists. In terms of the more standard options in a data scientist's toolbox, they can generally be divided into tools for managing data and tools for analysing data. Often, data is simply stored in spreadsheets, but sometimes, when data gets larger and more complex, better solutions are required, normally SQL or Hadoop. There is a much larger variety of tools used for analysing data, as the methods used often come from different communities, for instance statistics, machine learning and AI, with each community tending to use different programming languages. The most common programming languages used to analyse data tend to be R, Python and MATLAB, although often data scientists will know multiple languages.

3-SECOND SAMPLE
Data is big, models are complex, so data scientists have to use all the computational tools at their disposal. But what are these tools?

3-MINUTE ANALYSIS
While not explicitly a tool in the same sense as Python, SQL, etc., parallel computing is an important part of modern data science. When you buy a computer, you will likely have bought either a dual or quad core machine, meaning that your computer is capable of processing two or four things simultaneously. Many data science processes are designed to use multiple cores in parallel (simultaneously), giving faster performance and increased processing capabilities.

RELATED TOPICS
See also
DATA COLLECTION page 16
LEARNING FROM DATA page 20
STATISTICS & MODELLING page 30

3-SECOND BIOGRAPHIES
WES MCKINNEY 1985–
Python software developer who founded multiple companies associated with the development of Python.
HADLEY WICKHAM fl. 2006–
Researcher and Chief Scientist at RStudio, known for the development of a number of key tools within the R programming language.

30-SECOND TEXT
Vinny Davies

Data scientists will choose a tool or programming language to suit the task at hand.
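The parallel computing described in the analysis above can be sketched with Python's standard multiprocessing module (Python being one of the languages the text names). The task and numbers are illustrative stand-ins for real data science workloads.

```python
# A minimal sketch of using multiple cores in parallel, assuming a
# per-item computation that is expensive enough to be worth spreading
# across processes. slow_square is a made-up stand-in for such work.
from multiprocessing import Pool

def slow_square(x):
    # Stand-in for an expensive per-item computation.
    return x * x

if __name__ == "__main__":
    numbers = list(range(10))
    # With 4 worker processes, up to four items are handled simultaneously.
    with Pool(processes=4) as pool:
        results = pool.map(slow_square, numbers)
    print(results)
```

On a quad-core machine this divides the work roughly four ways; the same `map` pattern is what many data science libraries use under the hood.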
REGRESSION
the 30-second data
Regression is a method used to explain the relationship between two or more measurements of interest, for example height and weight. Based on previously collected data, regression can be used to explain how the value observed for one measurement is related to the value observed for another quantity of interest. Generally, regression allows for a simple relationship between the different types of measurements, such that as the value of one measurement changes, then we would expect the other measurement to change proportionally. Regression allows data scientists to do a couple of useful things. Firstly, it enables them to interpret data, potentially providing the chance to understand the cause of the relationship behind the measurements of interest. For instance, a relationship between data related to smoking and cancer could be identified, which would help to identify that smoking increases the risk of cancer. Secondly, it allows for predictions of future measurements…

3-SECOND SAMPLE
Regression predicts values based on the data collected and is one of the most important tasks in data science.

3-MINUTE ANALYSIS
Regression is not always as simple as predicting one measurement from another. Sometimes there are millions of pieces of related data that need to go into the regression model, for example DNA data, and sometimes the different pieces of data have complex relationships with each other. More complex regression methods allow for situations such as this, but…

RELATED TOPICS
See also
DATA COLLECTION page 16
REGRESSION TO THE MEAN page 44
OVERFITTING page 56

3-SECOND BIOGRAPHIES
CARL FRIEDRICH GAUSS 1777–1855
German mathematician who discovered the normal (Gaussian) distribution in 1809, a critical part of most regression methods.
FRANK E. HARRELL fl. 2003–
Professor of Biostatistics at Vanderbilt University, Nashville, and author of renowned textbook Regression Modelling Strategies.
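The height-and-weight example above can be sketched as a simple least-squares regression in pure Python; the numbers are invented for illustration.

```python
# A minimal sketch of simple linear regression (least squares):
# fit a straight line relating height to weight, then predict the
# weight of a new individual. Data values are made up.
heights = [150, 160, 170, 180, 190]   # cm
weights = [52, 58, 66, 74, 80]        # kg

n = len(heights)
mean_h = sum(heights) / n
mean_w = sum(weights) / n

# Slope = covariance(height, weight) / variance(height)
slope = sum((h - mean_h) * (w - mean_w) for h, w in zip(heights, weights)) \
        / sum((h - mean_h) ** 2 for h in heights)
intercept = mean_w - slope * mean_h

def predict(height):
    """Predict weight for a new height using the fitted line."""
    return intercept + slope * height

print(round(slope, 2))          # → 0.72 kg per cm
print(round(predict(175), 1))   # → 69.6 kg
```

The slope is the "proportional change" the text describes: here, each extra centimetre of height predicts 0.72 kg of extra weight.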
FRANCIS GALTON
17 January 1911
Dies in Surrey, England
Francis Galton created the key to modern data analysis: the framework for the study of statistical association. Galton was born into a notable English family in 1822. Subsequently, however, the family would be best known for him and for his cousin Charles Darwin. Galton attended Cambridge, where he learned that formal mathematics was not for him, and while he then studied medicine, that profession, too, did not inspire him. When Galton was 22 years old his father died, leaving him sufficient wealth that he was able to live the rest of his life independent of need. For a few years he travelled, and for a nearly two-year period from 1851, went deep into southwest Africa, where he explored and met the people. At one point he helped negotiate a peace between two tribes.

In 1853 Galton married and settled down to a life in science. At first he wrote about travel, and he invented new forms of weather maps that incorporated glyphs showing details on wind, temperature and barometric readings. From these he discovered the anti-cyclone phenomenon, where a drop of barometric pressure reverses the cyclonic wind motion in the northern hemisphere. With the publication of his cousin Darwin's book The Origin of Species in 1859, Galton's main interest shifted to the study of heredity, anthropology and psychology. His most lasting inventions were the statistical methods he devised in those pursuits. Galton invented correlation and discovered the phenomenon of regression, and he may, with some justice, be credited with taking the first major steps to a real multivariate analysis. His ideas are basic to all proper studies of statistical prediction, and to twentieth-century Bayesian analysis as well. Galton coined the term 'eugenics' and he promoted certain parts of this, but also wrote against others that would lead much later to associating eugenics with genocidal practices in the mid-twentieth century. Galton opposed the practice of creating heritable peerages and he encouraged granting citizenship to talented immigrants and their descendants. Some of his studies of inheritance came close to but did not reach Mendelian genetics, but he did help create the methods that would lead to the explosive development of biology after Mendel's work was rediscovered in 1901. Galton pioneered the use of fingerprints as a method of identification. He died childless in 1911, leaving his moderate fortune to endow a professorship and research at University College London.
Stephen Stigler
CLUSTERING
the 30-second data
Splitting data samples into relevant groups is an important task in data science. When the true categories for collected data are known, then standard regression techniques – often called 'supervised learning' – can be used to understand the relationship between data and associated categories. Sometimes, however, the true categories for collected data are unknown, in which case clustering techniques, or unsupervised learning, can be applied. In unsupervised learning, the aim is to group samples of data into related groups or clusters, usually based on the similarity between measurements. The meaning of these groups is then interpreted, or the groups are used to inform other decisions. A simple example of clustering would be to group animals into types based on characteristics. For instance, by knowing the number of legs/arms an animal has, a basic grouping can be created without knowing the specific type of animal. All the two-legged animals would likely be grouped together, and similarly animals with four and six legs. These groups could then easily be interpreted as birds, mammals and insects respectively, helping us learn more about our animals.

3-SECOND SAMPLE
Sometimes data scientists don't have all the necessary data to carry out regression, but in many cases clustering can be used to extract structure from data.

3-MINUTE ANALYSIS
Netflix users aren't divided into specific categories, but some users have similar film tastes. Based on the films that users have watched or not watched, users can be clustered into groups based on the similarity of their watched/unwatched movies. While trying to interpret the meaning of these groups is difficult, the information can be used to make film recommendations. For instance, a user could be recommended to watch Ironman if they hadn't watched it but everyone in their cluster had.

RELATED TOPICS
See also
LEARNING FROM DATA page 20
REGRESSION page 24
STATISTICS & MODELLING page 30

3-SECOND BIOGRAPHIES
TREVOR HASTIE 1953–
Professor at Stanford University and co-author of The Elements of Statistical Learning.
WILMOT REED HASTINGS JR 1960–
Chairman and CEO of Netflix, who co-founded the company in 1997 as a DVD postage service.
TONY JEBARA 1974–
Director of Machine Learning at Netflix and Professor at Columbia University, USA.

30-SECOND TEXT
Vinny Davies
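The animal-grouping example above can be sketched with a tiny one-dimensional k-means, one common clustering algorithm (the text doesn't name a specific method, so k-means is an assumption here). Leg counts and starting centres are illustrative.

```python
# A minimal sketch of clustering via 1-D k-means: group animals purely
# by leg count, without knowing any animal's actual type.
legs = [2, 2, 2, 4, 4, 4, 6, 6]   # birds (2), mammals (4), insects (6)

def kmeans_1d(values, centres, iterations=10):
    clusters = [[] for _ in centres]
    for _ in range(iterations):
        # Assignment step: each value joins the cluster with nearest centre.
        clusters = [[] for _ in centres]
        for v in values:
            nearest = min(range(len(centres)), key=lambda i: abs(v - centres[i]))
            clusters[nearest].append(v)
        # Update step: each centre moves to the mean of its cluster.
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return centres, clusters

centres, clusters = kmeans_1d(legs, centres=[1.0, 3.0, 7.0])
print(centres)    # → [2.0, 4.0, 6.0]
print(clusters)   # → [[2, 2, 2], [4, 4, 4], [6, 6]]
```

The algorithm never sees the labels "bird", "mammal" or "insect"; interpreting the three recovered groups that way is the human step the text describes.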
…future, helping the system to operate and…

…based around neural networks, but with a much larger number of layers of interconnecting artificial neurons. One of the uses of deep learning is analysing and responding to messages, either in the form of text (customer service chat bots, for example) or speech (such as Alexa or Siri). However, the biggest use of deep learning is in image processing. Deep learning can be used to analyse the images captured by driverless cars, interpreting the results and advising the car to adjust its course as needed. It is also beginning to be applied in medicine, with its ability to analyse images such as MRIs or X-rays, making it a good way of identifying abnormalities, such as tumours.

3-MINUTE ANALYSIS
…supermarket where you don't need to scan items. You just pick up items, put them in your bag and walk out. The supermarket works by videoing everyone as they shop and using deep learning to identify each item customers pick up, noting whether they put it in their bag or back on the shelf. When you walk out, the cost of your items is simply charged to your account.

RELATED TOPICS
See also
… page 148

3-SECOND BIOGRAPHIES
FRANK ROSENBLATT 1928–71
American psychologist famous for developing the first method that resembles a modern-day neural network.
YOSHUA BENGIO 1964–
Canadian computer scientist famous for his work on neural networks and deep learning.

30-SECOND TEXT
Vinny Davies

While deep learning is a highly sophisticated process, its prevalence in the future will depend on the level of trust it can garner.
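The "layers of interconnecting artificial neurons" mentioned above can be sketched as a forward pass through a tiny network in pure Python. The weights here are arbitrary, for illustration only; a real deep network has many more layers and learns its weights from data.

```python
# A minimal sketch of a neural network forward pass: each neuron takes
# a weighted sum of its inputs and squashes it with a sigmoid activation.
import math

def neuron(inputs, weights, bias):
    # Weighted sum followed by a sigmoid activation.
    total = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 1 / (1 + math.exp(-total))

def layer(inputs, weight_rows, biases):
    # One layer = several neurons reading the same inputs.
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

x = [0.5, -1.0]                                   # input features
hidden = layer(x, [[1.0, -0.5], [0.3, 0.8]], [0.0, 0.1])
output = layer(hidden, [[1.2, -0.7]], [0.05])     # single output neuron
print(round(output[0], 3))                        # a value between 0 and 1
```

"Deep" learning simply stacks many such layers, so each layer can build on the patterns detected by the one before it.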
UNCERTAINTY
GLOSSARY
algorithmic bias Algorithms learn how to make decisions by processing examples of humans performing the same task. If this data is taken from a prejudiced source, the model will learn to replicate those prejudices.

automated system Repetitive tasks or calculations carried out by computers, e.g. automated passport gates at airports, self-driving cars and speech-to-text software.

causation If a change in one variable directly causes a change in another variable, causation exists.

Gallup poll A series of regular surveys, conducted by the company Gallup, to gauge public opinion on a range of political, economic and social issues.

natural variation Changes or fluctuations that occur in populations or the natural world over time, e.g. natural variations in a country's birth rate over time.

noise Random variations in data collected or measured from the real world. Minimizing or accounting for the effects of noise in data is a crucial step in many statistical analyses.

p-value The probability that the results observed in an experiment would occur if the null hypothesis was true.

predictive model A mathematical model which predicts the value of an output, given values of an input.

regularization A technique to discourage overfitting in models.

sample A subset of a population, selected for participation in a study, experiment or analysis.

sampling Selecting members of a population as participants in a study or analysis.

selection bias Introduced when samples for a study are selected in a way that does not result in a representative sample.

self-selection bias Introduced when participants assign themselves to a study, or a group within a study. This may lead to a sample that is biased and unrepresentative of the population.

statistically significant A result that is very unlikely to have occurred if the null hypothesis were true. For example, if a study was investigating whether students who drink coffee perform better in exams than students who don't, the null hypothesis would be 'there is no difference in exam performance between students who do and don't drink coffee.' If a study found significant differences in performance between coffee drinking and non-coffee drinking students, the null hypothesis could be rejected.

time series analysis The analysis of a signal or variable that changes over time. This can include identifying seasonal trends or patterns in the data, or forecasting future values of the variable.

training data Many machine learning models are fitted to training data, which consists of inputs and their corresponding outputs. The model 'learns' the relationship between the inputs and outputs, and is then able to predict the output value for a new, unseen input value.

univariate and multivariate time-dependent data Univariate time-dependent data consists of the values of a single variable over time, whereas multivariate time-dependent data consists of the values of more than one variable.
SAMPLING
the 30-second data
'Garbage in, garbage out': data scientists know that the quality of their data determines the quality of their results, so most of them have learned to pay careful attention to measurement collection. When analysts can work with an entire population's data – such as Netflix tracking the film-watching habits of its subscribers – drawing conclusions can be a straightforward matter of just crunching numbers. But that completeness is not always practical. In criminal healthcare fraud investigations, the 'full population' would be health claims records numbering in the trillions. Instead, lawyers might have data scientists strategically choose a subset of records from which to draw conclusions. Other times, as with political polling, all that is available is a sample. If the sample is a randomly chosen one, statistical theories exist to tell us how confident we should be in our generalizations from sample to population. Increasingly, data scientists are relying on what is known as 'non-probability sampling', where the sample is not chosen according to any randomization scheme. So using Twitter data to track the buzz of a candidate or brand will not give a random sample representing the entire population – but it still has worth.

3-SECOND SAMPLE
When the entire population of interest can't be measured or questioned, a sample is taken – but how that is done is as much an art as it is a science.

3-MINUTE ANALYSIS
In 1936, the US was in the Great Depression, and a conservative small-town mayor was challenging President Roosevelt for office. The most influential magazine of the time, Literary Digest, polling 2.4 million voters, predicted a challenger's landslide. Wrong: Roosevelt swept the nation. What happened? The sample was large but biased; the magazine polled its subscribers – car owners and telephone users – all wealthier than average. Within two years Literary Digest had folded, and a new science of statistical sampling was launched.

RELATED TOPICS
See also
DATA COLLECTION page 16
SAMPLING BIAS page 48
VOTE SCIENCE page 90

3-SECOND BIOGRAPHIES
ANDERS NICOLAI KIÆR 1838–1919
First to propose that a representative sample be used rather than surveying every member of a population.
W. EDWARDS DEMING 1900–93
Wrote one of the first books on survey sampling, in 1950, which is still in print.
GEORGE HORACE GALLUP 1901–84
American pioneer of survey sampling techniques and inventor of the Gallup poll.

30-SECOND TEXT
Regina Nuzzo

Statisticians work to find out the accuracy of conclusions even from irregular samples.
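The idea above – that a randomly chosen sample lets you generalize to the population – can be sketched with Python's standard random module. The synthetic "population" of satisfaction ratings is invented for illustration.

```python
# A minimal sketch of random sampling: estimate a population average
# from a randomly chosen subset and compare it with the true average.
import random

random.seed(42)
# Synthetic population: 100,000 satisfaction ratings centred on 86.
population = [random.gauss(86, 5) for _ in range(100_000)]

sample = random.sample(population, 1_000)   # randomly chosen subset
sample_mean = sum(sample) / len(sample)
true_mean = sum(population) / len(population)

# A random sample's mean lands close to the population mean.
print(round(true_mean, 2), round(sample_mean, 2))
```

Because the subset is chosen at random, statistical theory can quantify how far the sample mean is likely to sit from the truth; a non-probability sample (Twitter users, say) offers no such guarantee.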
CORRELATION
the 30-second data
A correlation is a kind of dance – a 'co-relation' – between two features in a data set. A positive correlation means the dancers are moving more or less in the same direction together: when crude oil prices rise, for example, retail petrol prices also tend to rise. A negative correlation means the dancers are still in sync but are moving in opposite directions: longer website loading times are associated with lower customer purchase rates. Correlations can only capture linear relationships, where two features can be visualized on a graph together as a straight line. That means an analysis of business characteristics such as staff cheerfulness and customer satisfaction might return a 'zero correlation' result, hiding a more interesting story underneath: a curvilinear relationship, where customers dislike too little cheerfulness but also too much. Another problem is that correlation is not the same as causation. Sales of ice cream and drowning deaths are positively correlated, but of course that does not mean that banning the sale of ice cream will save lives. The causation culprit is often a third characteristic (daily temperature). It is up to the analyst to intelligently use all available information to figure out whether the apparent cause-and-effect is real.

3-SECOND SAMPLE
At the heart of modern data science lies a surprisingly simple concept: how much do two things move in sync with each other?

3-MINUTE ANALYSIS
In 2014, for a fun project before final exam week, Harvard law student Tyler Vigen purposely set out to find as many coincidental correlations as possible across multiple data sets. His website Spurious Correlations quickly went viral, allowing millions of visitors to view graphs showing the high correlation over time between oddball variable pairs, such as the number of people who have died by becoming tangled in their bedsheets and the per capita cheese consumption in the US.

RELATED TOPICS
See also
REGRESSION TO THE MEAN page 44
OVERFITTING page 56

3-SECOND BIOGRAPHIES
KARL PEARSON 1857–1936
English mathematician who developed Pearson's correlation coefficient, the most common way to measure correlation.
JUDEA PEARL 1936–
Israeli-American computer scientist and philosopher whose work has helped researchers distinguish correlation from causation.

30-SECOND TEXT
Regina Nuzzo

Graphs illustrating dynamic relationships can be a data scientist's most powerful tool.
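Pearson's correlation coefficient, mentioned in the biographies above, can be computed in a few lines of pure Python. The temperature and ice-cream figures are invented to echo the chapter's example.

```python
# A minimal sketch of Pearson's correlation coefficient: +1 means a
# perfect positive linear relationship, -1 a perfect negative one,
# and 0 no linear relationship at all.
from math import sqrt

temperature = [20, 22, 25, 28, 30]           # daily high, °C (made up)
ice_cream_sales = [120, 135, 160, 180, 200]  # units sold (made up)

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson_r(temperature, ice_cream_sales)
print(round(r, 3))    # close to +1: a strong positive correlation
```

Note what this number cannot tell you: a high r between ice-cream sales and drownings says nothing about cause, and a curvilinear relationship can hide behind an r near zero.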
REGRESSION TO THE MEAN
the 30-second data
Can stats explain the strange phenomenon where top rookie athletes fall from glory and go on to a disappointing second season? The usual explanation is that stars hit this slump because they choke under pressure and attention from a stellar debut. But data whizzes know better – it is just a statistical affair called regression to the mean. And it's not unique to sports; you can find examples everywhere. Why do the most intelligent women tend to marry men less intelligent than themselves? Why was a company's surprisingly profitable quarter immediately followed by a downturn? Why do hospital emergency departments get slammed the moment someone remarks, 'Wow, it's quiet today'? It is probably not a cause-and-effect story (or superstitious jinx). Regression to the mean says that extreme events don't stay extreme forever; they tend back towards the average, just on their own. It is not that any true effect in the data disappears – to the contrary, native athletic talent persists, good fiscal management carries on – but the extreme luck that pushed an individual into the top tiers today is likely to fade out tomorrow. Data scientists know to be on guard for this effect, lest they be fooled into spotlighting trends that aren't real.

3-SECOND SAMPLE
'What goes up must come down' – it may seem obvious, but in stats this is easy to miss, and it can lead to some puzzling trends.

3-MINUTE ANALYSIS
Regression to the mean is especially important when analysing data that has been chosen based on a measurement that has exceeded some threshold – for example, patients whose last blood pressure measurement was considered dangerous, or patients with a sudden worsening of depression symptoms. In fact, about a quarter of patients with acute depression get better no matter what – with drugs, therapy, placebo or nothing at all – leading some researchers to question the usefulness of standard depression treatments.

RELATED TOPICS
See also
REGRESSION page 24
CORRELATION page 42

3-SECOND BIOGRAPHIES
FRANCIS GALTON 1822–1911
First coined the concept of regression to the mean in his study of genetics and height.
DANIEL KAHNEMAN 1934–
Nobel Laureate who suggested regression to the mean might explain why punishment seems to improve performance.

30-SECOND TEXT
Regina Nuzzo

Stats can help explain dramatic swings of fortune in sports, as well as in life.
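The rookie-slump story above can be sketched as a simulation: model each player's season score as true talent plus luck, pick the top performers, and watch their second season fall back toward the average. All numbers are invented for illustration.

```python
# A minimal sketch of regression to the mean. Talent persists between
# seasons; only the luck component is redrawn.
import random

random.seed(1)
players = 10_000
talent = [random.gauss(100, 10) for _ in range(players)]

def season(talent_scores):
    # Observed performance = true talent + random luck.
    return [t + random.gauss(0, 10) for t in talent_scores]

year1, year2 = season(talent), season(talent)

# The top 1% of rookies, selected on first-season performance.
top = sorted(range(players), key=lambda i: year1[i], reverse=True)[:100]

avg = lambda xs: sum(xs) / len(xs)
print(round(avg([year1[i] for i in top]), 1))   # extreme first season
print(round(avg([year2[i] for i in top]), 1))   # closer to the average of 100
```

The second-season average drops even though no talent was lost: the stars were selected partly for lucky draws, and that luck does not repeat. It stays above the league average, though, exactly as the text says – the true effect in the data persists.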
CONFIDENCE
INTERVALS
the 30-second data
When you’re lucky enough to get
data on an entire population – all customer
purchases from a web vendor last year, say –
3-SECOND SAMPLE then getting the true average is easy: just RELATED TOPICS
crunch the numbers. But when all you get is a sample of the population – like satisfaction ratings from only 1,000 customers out of 1 million – knowing the true average value is much trickier. You can calculate the average satisfaction rating of your sample, but that's just a summary of these particular 1,000 customers. If you had taken another random 1,000 customers, you would get a different average. So how can we ever talk about the average satisfaction of all million people? That is where confidence intervals come to the rescue – one of the tools statisticians use in pursuit of their ultimate goal of drawing conclusions about the world based on limited information. Statisticians have worked out ingenious maths that takes information from one sample and uses it to come up with a whole range of plausible values for the average of the entire population. So instead of just saying the average satisfaction rating in one sample was 86 per cent, you can say, with some confidence, that the average satisfaction in the entire customer population is between 84 and 88 per cent – which is much more valuable information.

3-SECOND SAMPLE
Confidence intervals are almost magical in their ability to take a piece of limited information and extrapolate it to the entire population.

3-MINUTE ANALYSIS
Beware journalists reporting numbers without confidence intervals. For example, a 2017 Sunday Times article highlighted a reported drop of 56,000 employed people in the UK, saying 'it may signal the start of a significantly weaker trend'. Digging deeper into Office for National Statistics reports, however, reveals a confidence interval for the true change in number employed running from a 202,000 decline to a 90,000 increase. So employment may not have dropped at all – it might have actually improved!

RELATED TOPICS
See also
SAMPLING page 40
STATISTICAL SIGNIFICANCE page 54

3-SECOND BIOGRAPHY
JERZY NEYMAN
1894–1981
Polish mathematician and statistician who introduced confidence intervals in a paper published in 1937.

30-SECOND TEXT
Regina Nuzzo

Making conclusions about the big picture with confidence is where the field of statistics shines.

46 g Uncertainty
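The 84–88 per cent range quoted above can be reproduced with the standard normal-approximation formula for a proportion. This is a sketch of that calculation, not the book's own working; the 86 per cent rating and sample of 1,000 are taken from the text.

```python
import math

def proportion_ci(p_hat, n, z=1.96):
    """95% confidence interval for a population proportion, using the
    normal approximation: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# 86% average satisfaction in a sample of 1,000 customers
lo, hi = proportion_ci(0.86, 1000)
print(f"{lo:.3f} to {hi:.3f}")  # roughly 0.838 to 0.882, i.e. about 84-88%
```

Note how the width shrinks with sample size: with 10,000 customers the same formula gives an interval only about a third as wide.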
SAMPLING BIAS

the 30-second data
Data points are like gold nuggets, so data scientists eagerly scoop up whatever they can find. Smart analysts do something even more valuable: they stop, look around and ask what happened to all the nuggets that aren't lying around in plain sight. Are those left-out data different in any systematic way from the data that were easy to collect? Take, for example, a report's estimate that 10 per cent of men suffer from impotence – results that were based on a survey of patients at an andrology health clinic. This selection bias happens when the participants chosen differ in important ways from the ones not chosen (such as, here, their sexual health). Related to this is self-selection bias, where, for example, service satisfaction ratings can easily skew negatively if only the most irate customers take time to respond. Likewise, there is non-response bias; medical studies, for example, can be misleading if researchers ignore the fact that those participants most likely to drop out are also the ones who are the sickest. Sometimes it is possible to statistically correct for a bias problem, but recognizing the problem in the first place is often the hardest part.

3-SECOND SAMPLE
It's almost a paradox in data science: what's not in a data set can be even more important than what's in it.

3-MINUTE ANALYSIS
In the Second World War the American military gathered data on bullet holes from planes returned from European battles. Where were the highest bullet densities, they asked, so extra armour could be added to spots where planes are shot at the most? Statistician Abraham Wald turned the question on its head. These data only show where planes that managed to make it back home had been hit, he pointed out. Planes were getting shot at in other places, but these planes, hit in other spots, didn't survive. So the armour belonged, he said, where the bullet holes weren't.

RELATED TOPICS
See also
DATA COLLECTION page 16
SAMPLING page 40
OVERFITTING page 56

3-SECOND BIOGRAPHIES
ABRAHAM WALD
1902–50
Hungarian mathematician whose work on Second World War aircraft damage illustrates the concept of survivorship bias.

CORINNA CORTES
1961–
Danish computer scientist and Head of Google Research, works on sample bias correction theory.

30-SECOND TEXT
Regina Nuzzo

A skilled data scientist will seek out gaps in the data collection process and analyse their potential impact.

48 g Uncertainty
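The self-selection effect described above is easy to simulate. In this hypothetical sketch (the numbers are invented, not from the text), every customer's true rating is known, but mostly the irate customers bother to respond, so the respondents' average badly misrepresents the population:

```python
import random

random.seed(1)

# Hypothetical population: 2,000 customers at each rating 1-5 (true mean 3.0)
population = [r for r in range(1, 6) for _ in range(2000)]

# Self-selection: irate customers (rating <= 2) respond 90% of the time,
# satisfied ones only 10% of the time
respondents = [r for r in population
               if random.random() < (0.9 if r <= 2 else 0.1)]

pop_mean = sum(population) / len(population)
resp_mean = sum(respondents) / len(respondents)
print(pop_mean, round(resp_mean, 2))  # 3.0 vs roughly 1.9 - a big downward bias
```

No amount of extra respondents fixes this: collecting more data from the same biased process only makes the wrong answer more precise.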
BIAS IN ALGORITHMS

the 30-second data
Algorithms learn how to make decisions by processing examples of humans performing the same task. An algorithm for sentencing criminals might be trained on thousands of historic decisions made by judges, together with information about the offenders and their crimes. If this training data is taken from judges who give harsher sentences to people of colour, the model will learn to replicate those prejudices. In 2018, the Massachusetts Institute of Technology's (MIT) Media Lab showed that face recognition systems developed by Microsoft, IBM and China's Face++ were all significantly worse at detecting female faces, and performed poorly on images of darker-skinned women. With police forces in the UK and US testing automated facial recognition systems for crime prevention, low accuracies and false alarms could have far-reaching consequences for civil liberties. In 2018 Amazon scrapped an automated CV screening tool due to gender bias. The system was trained on data from previous successful candidates, who were mostly male, due to existing imbalances in the technology industry. This produced a tool that penalized applications containing phrases more likely to appear in women's résumés, such as 'women's football team'. The algorithm learned to equate men's CVs with success, and women's with failure.

3-SECOND SAMPLE
Can a computer be racist, sexist or homophobic? Human biases are often built into automated systems, with serious consequences for the most vulnerable groups in society.

3-MINUTE ANALYSIS
As many machine learning models are developed by private companies, their training data and source code are not open to scrutiny. This poses challenges for journalists investigating algorithmic bias. In 2016, an investigation by the news outlet ProPublica used Freedom of Information requests to reverse-engineer the COMPAS algorithm, used in the US to predict the likelihood of criminals reoffending. They uncovered racial discrimination, raising questions on regulation and transparency in AI.

RELATED TOPICS
See also
SAMPLING BIAS page 48
ARTIFICIAL INTELLIGENCE (AI) page 148
REGULATION page 150

3-SECOND BIOGRAPHY
JOY BUOLAMWINI
fl. 2011–
Computer scientist and digital activist, based at the MIT Media Lab, and founder of the Algorithmic Justice League.

30-SECOND TEXT
Maryam Ahmed

The potential for bias might sound far-fetched, but algorithm bias poses a very real problem requiring creative solutions.

50 g Uncertainty
18 October 1919
Born in Kent, England

1953
Receives his PhD at University College London

1959
Marries Joan Fisher; he later gives her statistical advice as she writes her biography of her father, Ronald A. Fisher

1960
Moves to Madison, Wisconsin, to start a new Department of Statistics

1970
Publishes Time Series Analysis (with Gwilym Jenkins). In subsequent years he also develops forecasting methods, based upon difference equation methods, with other authors

1973
Publishes Bayesian Inference in Statistical Analysis (with George C. Tiao)

1978–9
Serves as President of the American Statistical Association, and of the Institute of Mathematical Statistics

1985
Elected Fellow of the Royal Society of London

28 March 2013
Dies in Madison, Wisconsin, USA
GEORGE BOX

George Box was born in England in 1919. He had studied chemistry before being called up for service during the Second World War, and he gained his first introduction to statistics when, while engaged in war work, he encountered problems with the interpretation of experimental data. Someone suggested he visit British statistician and geneticist Ronald A. Fisher, at the time working from home because his laboratory at Cambridge had been closed for the duration of the war. The visit opened Box's eyes to the world of 'data science' (a then unknown term), and after the war he went to University College London for graduate study. There, as later in life, he plotted his own course, concentrating on understanding the role of statistics in scientific and engineering investigation.

Box's early work was as a statistician at Imperial Chemical Industries, where he was involved with the design of experiments. In one early paper he introduced the word and concept of 'robustness' to statistics: the idea that the validity of some ('robust') statistical procedures could withstand even large departures from conditions thought to be key to their use. After a few years that included time in Princeton (where he met and married Joan Fisher, one of Ronald's daughters), Box moved in 1960 to start a new department of statistics at the University of Wisconsin in Madison, where he spent the rest of his life and did his most influential work.

Box was a great catalyst in collaborative scientific investigations. He ran a famous evening 'Beer seminar' weekly, where a scientist would briefly present a problem and the assembled group would produce innovative solutions, some with great lasting effect. With various co-authors he created new methods of time series analysis for univariate and multivariate time-dependent data, new ideas for the use of Bayesian methods and new approaches to experimental design, including 'evolutionary operation', an approach that permitted bringing experiments to the manufacturing floor and involving line workers in continuously improving processes without interrupting production. He was a great advocate for keeping the scientific question always at the forefront, and for the importance of good experimental design. He employed mathematical models, but famously is quoted as cautioning that 'all models are wrong, but some are useful'. He died in Madison in 2013.

Stephen Stigler

George Box g 53
STATISTICAL SIGNIFICANCE

the 30-second data
It is worth getting to know the p-value, because this tiny number boasts outsized importance when it comes to drawing conclusions from data. The tininess is literal: a p-value is a decimal number between 0 and 1. It is calculated when you have a question about the world but only limited data to answer it. Usually that question is something like, 'Is there something real happening here in the world, or are these results just a random fluke?' If you toss a coin 100 times and it comes up heads every time, you might suspect that the coin is double-headed, but there is still the possibility (however negligible) that the coin is fair. The p-value helps support your scepticism that this event didn't happen by accident. By tradition, results with a p-value smaller than 0.05 get labelled 'statistically significant' (in the case of the coin, getting all heads from five flips). It is this label that people often use for reassurance when making decisions. But there is nothing magical about the 0.05 threshold, and some experts are encouraging researchers to abandon statistical significance altogether and evaluate each p-value on its own sliding scale.

3-SECOND SAMPLE
Are those interesting patterns in a data set just a random fluke? A century-old stats tool can help answer that.

3-MINUTE ANALYSIS
P-values are easy to hack. In 2015, media around the world excitedly reported on a study showing that chocolate leads to weight loss. Then the author revealed the truth: he was a journalist, the data was random and his results just a false-positive fluke. He knew that in 5 per cent of studies p-values will be smaller than 0.05 just by chance. So he ran 18 separate analyses of random data – and then reported only the deliciously statistically significant one.

RELATED TOPICS
See also
STATISTICS & MODELLING page 30
SAMPLING page 40

3-SECOND BIOGRAPHIES
KARL PEARSON
1857–1936
British statistician who first formally introduced the p-value.

SIR RONALD FISHER
1890–1962
British statistician who popularized the p-value in his 1925 book for researchers.

30-SECOND TEXT
Regina Nuzzo

P-values help statisticians work out whether results are a random fluke – or not: the gold standard of statistical evidence has some major flaws.

54 g Uncertainty
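The coin example lends itself to a quick check. Under the null hypothesis of a fair coin, the chance of all heads in five flips is 0.5 raised to the fifth power, just under the traditional 0.05 cut-off. This is a sketch of that one-sided calculation only, not a full significance test:

```python
# Probability of all heads from n flips, assuming the coin is fair (the null)
def p_all_heads(n):
    return 0.5 ** n

p5 = p_all_heads(5)
print(p5)                # 0.03125 - below 0.05, so 'statistically significant'
print(p_all_heads(100))  # about 8e-31 - overwhelming evidence against fairness
```

The same arithmetic explains the chocolate hoax in the 3-minute analysis: run 18 independent analyses of pure noise, each with a 5 per cent false-positive rate, and the chance that at least one comes out 'significant' is 1 - 0.95^18, roughly 60 per cent.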
OVERFITTING

the 30-second data
Building a predictive model involves finding a function that describes the relationship between some input and an output. For example, a data scientist may want to predict a university student's final grade based on their attendance rate in lectures. They would do this by fitting a function to a 'training' set of thousands of data points, where each point represents a single student's attendance and grade. A good model will capture the underlying relationship between grades and attendance, and not the 'noise', or natural variation, in the data. In this simple example, a reliable model may be a linear relationship. When a new student's attendance is added, the model will use it to predict their final grade because it generalizes to the student population as a whole. An overfitted model will involve more parameters than necessary; instead of fitting a straight line to the university data set, an overenthusiastic data scientist might use a very complex model to perfectly fit a contorted, meandering curve to the training data. This will not generalize well, and will perform poorly when presented with data for a new student. Understanding that a complex model is not always better is a crucial part of responsible and thoughtful data science practice.

3-SECOND SAMPLE
Beware of complex models that fit the data perfectly. It is likely they are overfitted, and will predict poorly when presented with new data points.

3-MINUTE ANALYSIS
There are ways to avoid overfitting. Cross-validation gives an estimate of how well a model will work in practice, by training the model on a subset of the training data and testing its performance on the remaining subset. Regularization is a technique that penalizes a model for being too complex; in the university example, a line would be preferred over a curve.

RELATED TOPICS
See also
REGRESSION page 24
STATISTICS & MODELLING page 30
MACHINE LEARNING page 32

30-SECOND TEXT
Maryam Ahmed

If a model's performance seems too good to be true, then it probably is!

56 g Uncertainty
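A tiny numerical sketch (with made-up attendance/grade figures, not data from the text) shows the contrast: a straight line fitted by least squares predicts a new student sensibly, while a degree-4 polynomial that passes through every training point exactly – zero training error, the 'contorted, meandering curve' above – goes badly wrong off the training data.

```python
# Five training points: (attendance, grade), lying roughly on a line
xs = [1, 2, 3, 4, 5]
ys = [52, 55, 54, 58, 59]

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def interpolate(xs, ys, x):
    """Lagrange polynomial through every training point: a maximally
    overfitted model that reproduces the training data perfectly."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

slope, intercept = fit_line(xs, ys)
print(slope * 6 + intercept)   # about 60.7 - the line extrapolates sensibly
print(interpolate(xs, ys, 6))  # about 32.0 - the 'perfect' curve collapses
```

The curve's training error is exactly zero and its prediction for the new point is far worse than the line's, which is the overfitting trade-off in miniature.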
SCIENCE

GLOSSARY
causal relationship If a change in one DNA The genetic code that governs
variable directly causes a change in the development, characteristics and
another variable, a causal relationship functioning of every living organism. DNA
exists between them. is usually found in the nucleus of a cell, and
consists of two long chains of building
correlation Two variables are correlated if blocks called ‘nucleotides’, arranged in a
a change in one is associated with a change double helix shape. In most humans, an
in the other. A positive correlation exists if individual’s genome, or genetic code, is
one variable increases as the other increases, unique. Recent advances in genetic
or if one variable decreases as the other engineering have enabled the insertion,
decreases. A negative correlation exists if one deletion and modification of genetic
variable increases as the other decreases. material in DNA.
60 g Science
experimental design The process of designing robust studies and experiments, to ensure that any conclusions drawn from the results are reliable and statistically significant. This includes careful selection of experimental subjects to avoid sampling bias, deciding on a sample size, and choosing suitable methods for analysing results.

gene editing Process of editing the genome of a living organism by inserting, removing or modifying its DNA.

genome Genetic material, or chromosomes, present in a particular organism. The human genome consists of 23 pairs of chromosomes.

greenhouse gas A gas in the atmosphere which absorbs and radiates energy, contributing to the warming of Earth's surface. This causes the so-called 'greenhouse effect', which is necessary for supporting life on Earth. Human activity has led to an increase in greenhouse gases in the atmosphere, which have amplified the greenhouse effect and contributed to global warming. Greenhouse gases include water vapour, carbon dioxide and methane.

independent replication Validation of a study or experiment by independent researchers. This is done by repeating the procedure followed by the original researchers, to ensure the results can be replicated.

randomized trials Experimental design where participants or subjects are randomly allocated to treatment groups. For example, participants in a randomized drug trial could be randomly allocated to a group where they would either receive a placebo or a drug.

trendlines A way of visualizing the overall direction, or trend, of a variable over time. There are different methods for calculating trendlines, including a moving average, or a line of best fit calculated through linear regression.

Glossary g 61
CERN & THE HIGGS BOSON

the 30-second data
In 1964, Peter Higgs, Francois Englert, Gerald Guralnik, C.R. Hagen and Tom Kibble proposed the Higgs Mechanism to explain how mass was created in the universe. But evidence of the mechanism lay in the (elusive) discovery of an essential particle, dubbed 'Higgs boson', from which other fundamental particles derived their mass. By blasting particles into each other at incredibly high energies and then gathering data on the number of emergent particles as a function of particle energy, scientists hoped to identify spikes (in collisions at particular energy levels), which in turn would point to the creation of a particle, such as the Higgs boson, with that energy. Enter CERN, the world-famous European laboratory. Here, scientists built the Large Hadron Collider (LHC). Even in its infancy (2008), LHC's enormous capability was stunning: it was able to accelerate particles to about 14 billion times their energy at rest. By 2011, CERN had collected enough data – over 500 trillion collision events – for analysis. Not long after, several independent groups caught an energy spike in the very field where the Higgs was predicted to lie. This discovery was soon acknowledged by the scientific community, and both Higgs and Englert won acclaim as joint recipients of the 2013 Nobel Prize for Physics.

3-SECOND SAMPLE
CERN, a laboratory in Switzerland, is synergy of multinational proportions: here, top scientists convene to inspect and decode the constituents of matter via particle colliders, i.e. how the universe works.

3-MINUTE ANALYSIS
The LHC in CERN is connected to four separate detectors into which highly accelerated particles can be slammed. For the Higgs boson experiments, two detectors, ATLAS and CMS, were used. The fact that the same results were observed on both detectors lent significant credibility to the Higgs discovery, once again emphasizing the importance of independent replication in data analysis.

RELATED TOPIC
See also
MACHINE LEARNING page 32

3-SECOND BIOGRAPHIES
PETER HIGGS
1929–
First proposed the Higgs Mechanism.

FRANCOIS ENGLERT
1932–
Also proposed the Higgs mechanism, independently of Higgs.

30-SECOND TEXT
Aditya Ranganathan

The reach of data science knows no bounds, being applied to explain the very workings of the universe.

62 g Science
ASTROPHYSICS

the 30-second data
Astrophysics has become a big user and supplier of data science expertise. Most cosmology experiments involve scanning large amounts of data to make measurements that can only be statistically derived. The data is also searched for rare events. These statistical insights, in turn, elucidate the past – and future – of our universe. One example of a rare cosmological event is the production of a supernova – a star that explodes during its demise. Supernovae were used in the discovery of the accelerating expansion of the universe, for which Saul Perlmutter, Brian Schmidt and Adam Riess won the 2011 Nobel Prize. The discovery hinged on automatically searching the sky for supernovae and collecting enough measurements of supernova brightness and redshift (a measure of how much photons have been stretched) in order to make statistically acceptable conclusions about trendlines. Supernovae have homogeneous brightness, and it is this brightness that indicates how far a supernova is from a telescope, and how long light takes to reach us from that supernova; if light from older supernovae stretched less than from new supernovae, the universe must be stretching more now than before, implying that over time, the universe will continue to stretch ever more rapidly.

3-SECOND SAMPLE
Photons from stars billions of light years away strike Earth, furnishing telescopes with eons-old galactic images – masses of data awaiting analysis.

3-MINUTE ANALYSIS
A major problem in data analysis is the tendency to interpret results as confirmations of pre-existing beliefs, which leads to furious debugging when outcomes clash with expectations and to slackening of error-detection when the two correspond. To decontaminate the debugging, physicists developed blind analysis, wherein all analysis happens before the final experimental results are revealed to the researcher. Blind analysis has gained popularity in areas of physics and may be making a foray into other fields such as psychology.

RELATED TOPIC
See also
CERN & THE HIGGS BOSON page 62

3-SECOND BIOGRAPHIES
EDWIN HUBBLE
1889–1953
American astronomer who discovered the original expansion of the universe.

SAUL PERLMUTTER
1959–
American astrophysicist and Director of the Berkeley Institute for Data Science who won the 2011 Nobel Prize in Physics for the discovery of the accelerating expansion of the universe.

30-SECOND TEXT
Aditya Ranganathan

Data-driven measurements and experiments highlight the importance of data science to cosmology, and vice versa.

64 g Science
CRISPR & DATA

the 30-second data
Scientists are harnessing the power of a gene-editing tool called CRISPR that has revolutionized labs around the world. The precision engineering tool allows scientists to chop and change DNA in a cell's genetic code and could one day correct mutations behind devastating diseases such as Huntington's, cystic fibrosis and some cancers. CRISPR works like a pair of molecular scissors and cuts DNA at target genes to allow scientists to make changes to the genome. This technique has been used by

3-SECOND SAMPLE
Editing the human genome conjures images of science fiction, but it could be closer to reality thanks to the data science that is helping researchers to correct nature's mistakes.

RELATED TOPICS
See also
THE MILLION GENOME PROJECT page 68
CURING CANCER page 74
ETHICS page 152

1975
Invited to join the National Academy of Sciences

17 October 1978
Dies in Durham, North Carolina, USA

GERTRUDE COX

Aditya Ranganathan

Gertrude Cox g 71
CLIMATE CHANGE

the 30-second data
Climate trend predictions ensue after compiling and processing volumes of data: average global temperatures over the years, for example. Average global temperature is a nuanced function of variables. Above-average build-ups of greenhouse gases in the atmosphere trap above-average amounts of heat, creating a barrier to prompt disposal. Other factors that slow down rates of heat emission include rising ocean levels, asphalt levels and decreasing ice. The result of this retardation is an upset of equilibrium – the desired state in which the rate of heat absorption equals the rate of heat emission, and average global temperature stays constant. Even though the disequilibrium is temporary, it is a period when heat lingers. And, when equilibrium returns, rather than catching up to the earlier temperature, we find ourselves in the midst of a new normal. There is a range of new 'normals' we could potentially reach: some mildly uncomfortable, some deadly. In order to understand which of these scenarios we might

3-SECOND SAMPLE
A firm prediction of the future of our planet rests upon the collection and analysis of massive amounts of data on global temperatures and greenhouse gas concentrations.

3-MINUTE ANALYSIS
Anthropogenic contributions, including expanding agricultural and industrial practices, correlate with an increase in global greenhouse gas concentrations and rising global temperatures, also known as global warming or climate change. The more data that is collected on anthropogenic

RELATED TOPIC
See also
CORRELATION page 42

3-SECOND BIOGRAPHIES
JAMES HANSEN
1941–
NASA scientist and climate change advocate.

RICHARD A. MULLER
1944–
Climate sceptic converted to climate change advocate.

AL GORE
1948–
Published what was at the time a controversial documentary on the impacts of climate change called An Inconvenient Truth.

30-SECOND TEXT
Aditya Ranganathan

80 g Society
machine learning Finding a mathematical relationship between input variables and an output. This 'learned' relationship can then be used to output predictions, forecasts or classifications given an input. For example, a machine learning model may be used to predict a patient's risk of developing diabetes given their weight. This would be done by fitting a function to a 'training' set of thousands of historic data points, where each point represents a single patient's weight and whether they developed diabetes. When a new, previously unseen patient's weight is run through the model, this 'learned' function will be used to predict whether they will develop diabetes. Modern computer hardware has enabled the development of powerful machine learning algorithms.

microtargeting Strategy used during political or advertising campaigns in which personalized messaging is delivered to different subsets of customers or voters based on information that has been mined or collected about their views, preferences or behaviours.

randomized experiments Experimental design where participants or subjects are randomly allocated to treatment groups. Participants in a randomized drug trial could be randomly allocated to a group where they would either receive a placebo or a drug.

sensitive information/data Reveals personal details, such as ethnicity, religious and political beliefs, sexual orientation, trade union membership or health-related data.

sniffers Software that intercepts and analyses the data being sent across a network, to or from a phone, computer or other electronic device.

Yellow Vests movement Protest movement originating in France, focused on issues such as rising fuel prices and the cost of living.

Glossary g 81
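The weight-to-diabetes workflow in the machine learning entry – fit a function to historic (weight, outcome) points, then run a new patient's weight through it – can be sketched as a minimal logistic-regression model. All numbers below are invented for illustration; a real application would use a proper library and far more than one input variable.

```python
import math

# Invented training set: (weight in kg, developed diabetes? 0/1)
data = [(60, 0), (65, 0), (70, 0), (75, 0), (80, 1), (85, 1), (90, 1), (95, 1)]

mean = sum(w for w, _ in data) / len(data)
scale = max(w for w, _ in data) - min(w for w, _ in data)

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# 'Learn' the function by stochastic gradient descent on the logistic loss
a, b = 0.0, 0.0  # slope and intercept
for _ in range(5000):
    for w, y in data:
        x = (w - mean) / scale          # rescale weight to a small range
        err = sigmoid(a * x + b) - y    # prediction error on this patient
        a -= 0.1 * err * x
        b -= 0.1 * err

def predict(weight):
    """Predicted probability that a new patient develops diabetes."""
    return sigmoid(a * (weight - mean) / scale + b)

print(round(predict(62), 2), round(predict(93), 2))  # low risk vs high risk
```

Running a new, previously unseen weight through `predict` is exactly the 'learned function' step the glossary describes.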
SURVEILLANCE

the 30-second data
Data surveillance is all around us, and it continues to grow more sophisticated and all-encompassing. From biometric airport security to grocery shopping, online activity and smartphone usage, we are constantly being surveilled, with our actions and choices being documented into spreadsheets. Geospatial surveillance data allows marketers to send you tailored ads based upon your physical, real-time location. Not only that, it can also use your past location behaviour to predict precisely what kind of ads to send you, sometimes without your permission or knowledge. Data surveillance is itself uninteresting; it's the actions taken from analysis of the data that can be both harmful and helpful. Using data surveillance, private and public entities are investigating methods of influencing or 'nudging' individuals to do the 'right' thing, and penalizing us for doing the 'wrong' thing. A health insurance company could raise or lower rates based upon the daily steps a fitness tracker records; a car insurance company could do the same based upon data from a smart car. Data surveillance is not only about the present and analysis of actions; it's also about predicting future action. Who will be a criminal, who will be a terrorist, or simply, what time of the day are you most likely to buy that pair of shoes you have been eyeing while online shopping?

3-SECOND SAMPLE
Eyewitness sketches and background checks might become an archaic relic with the amount of surveillance data we now have the capability of storing and analysing.

3-MINUTE ANALYSIS
While data surveillance can feel negative, there are incredible advances in preventing terrorism, cracking child pornography rings by following images being sourced from the internet, and even aiding the global refugee crisis. The Hive (a data initiative for USA for the UN Refugee Agency) used high-resolution satellite imagery to create a machine-learning algorithm for detecting tents in refugee camps – allowing for better camp planning and field operation.

RELATED TOPICS
See also
SECURITY page 84

3-SECOND BIOGRAPHY
TIM BERNERS-LEE
1955–
Creator of the World Wide Web, coining the internet as the 'world's largest surveillance network'.

30-SECOND TEXT
Liberty Vittert

When put towards a good cause, such as crime prevention, certain types of surveillance can be well justified.

82 g Society
SECURITY

the 30-second data
Data is opening up new opportunities in intelligence processing, dissemination and analysis while improving investigative capacities of security and intelligence organizations at global and community levels. From anomalies (behaviour that doesn't fit a usual pattern) to association (relationships that the human eye couldn't detect) and links (social networks of connections, such as Al-Qaeda), intelligence organizations compile data from online activity, surveillance, social media and so on, to detect patterns, or lack thereof, in individual and group activity. Systems called 'sniffers' – designed to monitor a target user's internet traffic – have been transformed from simple surveillance systems to security systems designed to distinguish between communications that may be lawfully intercepted and those that may not for security purposes. Data can visualize how violence spreads like a virus among communities. The same data can also predict the most likely victims of violence and even, supposedly, the criminals. Police forces are using data to both target and forecast these individuals. For example, police in Chicago identified over 1,400 men to go on a 'heat list' generated by an algorithm that rank-orders potential victims and subjects with the greatest risk of violence.

3-SECOND SAMPLE
Big Data meets Big Brother in the untapped and untried world of data-driven security opportunities. From community policing to preventing terrorism, the possibilities are endless, and untested.

3-MINUTE ANALYSIS
In the case of Chicago (see 'data' text), a higher score means a greater risk of being a victim or perpetrator of violence. In 2016, on Mother's Day weekend, 80 per cent of the 51 people shot over two days had been correctly identified on the list. While proponents say that it allows police to prioritize youth violence by intervening in the lives of those most at risk, naysayers worry that by not identifying what generates the risk score, racial bias and unethical data use might be in practice.

RELATED TOPICS
See also
SURVEILLANCE page 82
ETHICS page 152

3-SECOND BIOGRAPHY
PATRICK W. KELLEY
fl. 1994
FBI Director of Integrity and Compliance, who migrated Carnivore to practice.

30-SECOND TEXT
Liberty Vittert

Carnivore was one of the first systems implemented by the FBI to monitor email and communications from a security perspective.

84 g Society
PRIVACY

the 30-second data
The adage 'if you're not paying for the product, you are the product' remains true in the era of big data. Businesses and governments hold detailed information about our likes, health, finances and whereabouts, and can harness this to serve us personalized advertising. Controversies around targeted political campaigning on Facebook, including alleged data breaches during the 2016 US presidential election, have brought data privacy to the forefront of public debate. For example, medical records are held securely by healthcare providers, but health apps are not subject to the same privacy regulations as hospitals or doctors. A British Medical Journal study found that nearly four in five of these apps routinely share personal data with third parties. Users of menstrual cycle, fitness or mental health tracking apps may be unaware that sensitive information about their health and well-being is up for sale. One strategy for protecting privacy is the removal of identifying variables, such as full names or addresses, from large data sets. But can data ever truly be anonymized? In 2018 the New York Times reviewed a large anonymized phone location data set. Journalists were able to identify and contact two individuals from the data, demonstrating that true anonymization is difficult to achieve.

3-SECOND SAMPLE
Every day we generate thousands of data points describing our lifestyle and behaviour. Who should have access to this information, and how can they use it responsibly?

3-MINUTE ANALYSIS
Governments have taken steps to safeguard privacy. The UK's Information Commissioner's Office fined Facebook £500,000 for failing to protect user data. In the European Union, organizations must ask for consent when collecting personal data and delete it when asked. The US Census Bureau is introducing 'differential privacy' into the 2020 census, a method that prevents individuals being identified from aggregate statistics.

RELATED TOPICS
See also
SURVEILLANCE page 82
REGULATION page 150

3-SECOND BIOGRAPHY
MITCHELL BAKER
1959–
Founder of the Mozilla Foundation, launched in 2003, which works to protect individuals' privacy while keeping the internet open and accessible.

30-SECOND TEXT
Maryam Ahmed

Non-governmental organizations advocate for and support projects relating to greater internet and data privacy.

86 g Society
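The 'differential privacy' idea mentioned in the 3-minute analysis descends from a classic survey technique called randomized response, sketched below. The Census Bureau's actual mechanism is far more sophisticated; this toy version only illustrates the underlying intuition. Each respondent flips a coin: heads, they answer truthfully; tails, they answer with a second coin flip. Any individual reply is deniable, yet the true population rate can still be recovered from the aggregate.

```python
import random

random.seed(7)

def randomized_response(truth):
    """Heads: answer honestly. Tails: answer with a second coin flip."""
    if random.random() < 0.5:
        return truth
    return random.random() < 0.5

# Simulated population in which 30% hold a sensitive attribute
n = 100_000
answers = [randomized_response(random.random() < 0.3) for _ in range(n)]

# E[yes] = 0.5 * p + 0.25, so invert to estimate the true proportion p
estimate = 2 * (sum(answers) / n - 0.25)
print(round(estimate, 2))  # close to 0.30, without trusting any single answer
```

The privacy comes from plausible deniability: a 'yes' from any one person may simply be the second coin talking, so no individual answer reveals their true status.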
12 May 1820
Born in Italy, and is named after the city of her birth

1837
Experiences the first of several 'calls from God', which inform her desire to serve others

1844
Announces intention to pursue a career in nursing, prompting opposition from family

1851
Undertakes medical training in Düsseldorf, Germany

1853
Becomes Superintendent of the Institute for the Care of Sick Gentlewomen, London

1854
Travels to the Scutari Barracks in modern-day Turkey, with a group of 38 female nurses, and oversees the introduction of sanitary reforms

1857
Suffers from intermittent episodes of depression and ill health, which continue until her death

1858
Publishes the data-driven report 'Mortality of the British Army'

1859
Elected to the Royal Statistical Society

13 August 1910
Dies in her sleep, in London

FLORENCE NIGHTINGALE

Maryam Ahmed

Florence Nightingale g 89
VOTE SCIENCE
the 30-second data

Vote Science has been in practice since political outcomes began being decided by votes, dating back to sixth-century BCE Athens. Modern Vote Science evolved rapidly in the US in the 1950s, when campaigns, political parties and special interest groups started keeping large databases of eligible voters, which were later used to build individual voter profiles. Using machine learning and statistical analysis, campaign professionals began using these profiles to make calculated decisions on how to win an election or sway public opinion. Current best practices include maintaining databases of people with hundreds of attributes, from individuals' credit scores to whether they vote early/in-person or even if they are more likely to vote if reminded via phone, text or email. Using this data, campaigns and political parties work to predict voter behaviour, such as whether voters will turn out, when they will vote, how they will vote and – most recently – what will persuade them to change their opinion. Recent campaigns have adopted randomized field experiments to assess the effectiveness of mobilization and persuasion efforts. Vote Science now determines how a campaign chooses to spend its advertising funds as well as which particular messages are shown to specific, individual voters.

3-SECOND SAMPLE
Vote Science is the practice of using modern voter registration lists, consumer and social media data, and polling to influence public opinion and win elections.

3-MINUTE ANALYSIS
George Bush's 2004 re-election was the first political campaign to use political microtargeting – the use of machine-learning algorithms to classify voters on an individual level of how they might vote or if they even would vote. Barack Obama's campaigns in 2008 and 2012 took Vote Science a step further by incorporating randomized field experiments. Elections in the UK, France and India began to use Vote Science techniques such as microtargeting and random field experiments in their campaigns after witnessing the success of the American model.

RELATED TOPICS
See also
LEARNING FROM DATA
page 20
ETHICS
page 152

3-SECOND BIOGRAPHIES
DONALD P. GREEN
1961–
Leader of Vote Science randomized experiments.
SASHA ISSENBERG
fl. 2002–
Chronicler of how data science has been used and evolved in campaigns in the last 20 years.
DAN WAGNER
fl. 2005–
Director of Analytics for 'Obama for America' in 2012; led efforts to expand Vote Science in campaigns to message testing and donor models.

30-SECOND TEXT
Scott Tranter

Modern-day election campaigns are driven by Vote Science, with a vast amount of campaign budget allocated to it.
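A randomized field experiment of the kind campaigns now run reduces to a simple recipe: randomly split voters into a contacted group and a control group, then compare turnout rates between the two. A toy simulation in Python (the voters, the assumed 5-point reminder effect and all numbers are invented for illustration, not taken from any real campaign):

```python
import random

random.seed(1)

# Hypothetical voters: each has a baseline probability of turning out.
voters = [{"base_p": random.uniform(0.2, 0.8)} for _ in range(10000)]

def turns_out(voter, reminded):
    """Simulate one voter's turnout; a reminder adds an assumed 5-point boost."""
    p = min(1.0, voter["base_p"] + (0.05 if reminded else 0.0))
    return random.random() < p

# Randomized assignment: shuffle, then split the list in half.
random.shuffle(voters)
treatment, control = voters[:5000], voters[5000:]

treated_rate = sum(turns_out(v, True) for v in treatment) / len(treatment)
control_rate = sum(turns_out(v, False) for v in control) / len(control)
effect = treated_rate - control_rate  # estimated mobilization effect
```

Because assignment is random, the difference in turnout rates estimates the causal effect of the contact, which is exactly what lets a campaign compare the cost of phone calls, texts and mailers per extra vote produced.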
HEALTH
the 30-second data

Data science develops tools to analyse health information, to improve related services and outcomes. An estimated 30 per cent of the world's electronically stored data comes from the healthcare field. A single patient can generate roughly 80 megabytes of data annually (the equivalent of 260 books' worth of data). This health data can come from a variety of sources, including genetic testing, surveys, wearable devices, social media, clinical trials, medical imaging, clinic and pharmacy information, administrative claim databases and national registries. A common data source is electronic medical record (EMR) platforms, which collect, organize and analyse patient data. For example, EMRs enable doctors and healthcare networks to communicate and coordinate care, thereby reducing inefficiencies and costs. EMR data is used to create decision tools for clinicians which incorporate evidence-based recommendations for patient test results and prevention procedures. Healthcare data science combines the fields of predictive analytics, machine learning and information technology to transform unstructured information into knowledge used to change clinical and public health practice. Data science helps to save lives by predicting patient risk for diseases, personalizing patient treatments and enabling research to cure diseases.

3-SECOND SAMPLE
Data science transforms unstructured health information into knowledge that changes medical practice.

3-MINUTE ANALYSIS
Consumer-grade wearable devices coupled with smartphone technology offer innovative ways to capture continuous health data, improving patient outcomes. For example, heart monitors can be used to diagnose and/or predict abnormal and potentially life-threatening heart rhythms. The data can be assessed by varying time parameters (days to weeks versus months to years), to develop early-warning health scores. Similarly, hearing aids with motion sensors can detect the cause of a fall (slipping versus heart attack), so doctors can respond effectively.

RELATED TOPICS
See also
EPIDEMIOLOGY
page 76
PERSONALIZED MEDICINE
page 138
MENTAL HEALTH
page 140

3-SECOND BIOGRAPHIES
FLORENCE NIGHTINGALE
1820–1910
Championed the use of healthcare statistics.
BILL & MELINDA GATES
1955– & 1964–
Launched in 2000, the Gates Foundation uses data to solve some of the world's biggest health data science problems.
JAMES PARK & ERIC FRIEDMAN
fl. 2007
Founders of Fitbit who applied sensors and wireless tech to health and fitness.

30-SECOND TEXT
Rupa R. Patel

Using data to personalize healthcare helps to save lives.
IBM'S WATSON & GOOGLE'S DEEPMIND
the 30-second data

When IBM's Watson computer defeated the reigning Jeopardy! champion on a nationally televised game show in 2011, it was a demonstration of how computer-based natural language processing and machine learning had advanced sufficiently to take on the complex wordplay, puns and ambiguity that many viewers might struggle with. Google's DeepMind subsidiary did something similar – its AlphaGo program used machine learning and artificial intelligence to beat the world champion at Go, a very complicated strategy board game played with black and white stones; a feat no other computer had ever accomplished. Picking ambitious targets such as beating humans at well-known games serves several purposes. First, it gives data scientists clear goals and benchmarks to target, like 'Win at Jeopardy!'. In IBM's case, they even announced the goal beforehand, which put pressure on the development team to be creative and think outside the box; after all, who would want to be publicly humiliated by a mere human? Second, these sparring matches speak to the public about how far hardware and software are progressing. Go is much more challenging than chess, so if a computer can beat the world champion, we must be making a lot of progress!

3-SECOND SAMPLE
IBM's Watson Jeopardy!-playing computer and Google's DeepMind Go-playing program introduced the world to machine learning and artificial intelligence in ways that were easy to understand.

3-MINUTE ANALYSIS
Computer companies pursue targets such as playing Jeopardy! and Go because to excel at them they have to develop general-purpose capabilities that can be applied to other commercially important problems. The ability to answer a person's question in his or her own language on a broad range of topics, or to train for complicated problems such as robot navigation, will help future computers to perform more sophisticated tasks for people, including their creators.

RELATED TOPICS
See also
MACHINE LEARNING
page 32
NEURAL NETWORKS & DEEP LEARNING
page 34
GAMING
page 130

3-SECOND BIOGRAPHIES
THOMAS WATSON
1874–1956
Chairman and CEO of IBM; the Jeopardy!-playing computer is named after him.
DEEPMIND TECHNOLOGIES
2010–
Acquired by Alphabet (parent of Google) in 2014.

30-SECOND TEXT
Willy Shih

Computers beating humans at ever more complex games is a visible measure of the progress being made in data science.
BUSINESS
GLOSSARY

automated system Some repetitive tasks or calculations can be carried out faster, continuously and more efficiently by computers. Examples of automated systems include automated passport gates at airports, self-driving cars or speech-to-text software.

foot traffic analysis Often used in the retail sector to measure how many customers enter a shop, and their movements and behaviour while browsing.

geolocation data Describes the location of a person or object over time.

natural language-processing algorithms Techniques for analysing written or spoken language. This could include the contents of political speeches, vocal commands given to a smartphone or written customer feedback on an e-commerce website. Common natural language processing techniques include sentiment analysis, where text is labelled as positive or negative depending on its tone, and topic modelling, which aims to identify the overall theme or topic of a piece of text.

probability theory Branch of mathematics concerned with representing probabilities in mathematical terms. The field relies on a set of underlying assumptions, or axioms, including 'the probability of an event is a non-negative, real number.'

prototype Working draft version of a piece of software or hardware, sometimes referred to as a minimum viable product, or MVP.

quantum mechanics Branch of physics concerned with the behaviour of atomic and subatomic particles.

reinforcement learning Branch of machine learning, where algorithms learn to take actions which maximize a specified reward.

tabulating system A machine, developed in the 1800s, designed to store information in the form of hole-punched cards. Its first use was in the 1890s, to store data collected during the 1890 US census.

tracking cookies Pieces of information from a website, stored by a person's web browser, which are shared or tracked across websites, to track a user's online journey. They may be used by third-party advertising providers, to serve personalized adverts based on a user's browsing history.
INDUSTRY 4.0
the 30-second data

Industry 4.0 can be more easily understood as a 'smart factory', where internet-connected systems and machines communicate and cooperate with each other in real time to do the jobs that humans used to do. This relies on the Internet of Things (IoT), the extension of internet connectivity into devices and everyday objects. While Industry 4.0 can have an ominous ring to it in certain circles, it has a vast range of remarkable applications in our daily lives. From robots picking and packing items in a warehouse for delivery, to autonomous cranes and trucks on building sites, to using information collected from these machines to find and correct irregularities in business systems – the possibilities are endless and, as yet, unknown. Business is not the only winner in this industrial revolution: homecare advances such as voice control, or alerts for falls or seizures, provide assistance to elderly or disabled individuals. However, there are large barriers to the full implementation of Industry 4.0, integration being one of the biggest. There are no industry standards for connectivity, and the systems themselves are fragmented between different industries and companies. Privacy concerns are overwhelming: the amount of data collected (personal and otherwise) by these systems needs to be protected, and decisions over ownership remain unresolved.

3-SECOND SAMPLE
'Humankind will be extinct or jobless' is the feared mantra with the fourth industrial revolution in manufacturing, where machines use data to make their own decisions.

3-MINUTE ANALYSIS
Millions of people are employed by the manufacturing industry, and fears over job loss from the data revolution/Industry 4.0 are real and already evident. While this may be very worrisome, many are using it as an opportunity to push for the idea of a 'universal basic income'. This is a periodic monetary compensation given to all citizens as a right, with the only requirement being legal residency. This stipend would be enough for basic bills and living, with the aim that individuals will be free to pursue any interest.

RELATED TOPIC
See also
ARTIFICIAL INTELLIGENCE (AI)
page 148

3-SECOND BIOGRAPHY
HUGH EVERETT
1930–82
First proposed the Many Worlds interpretation for quantum mechanics and operations research.

30-SECOND TEXT
Liberty Vittert

Connectivity and standardization across industries are a major obstacle to the widespread adoption of smart factories.
ENERGY SUPPLY & DISTRIBUTION
the 30-second data

Our energy supply is transitioning from fossil fuels and centralized infrastructure to a renewable, decentralized system, and data analytics eases the challenges of that transition. As the output of wind farms and solar photovoltaic plants is weather-dependent, high-resolution weather forecasting based on predictive analytics has wide applications in improving design and operation of these systems, from optimizing the layout of wind turbines in a field to automatically adjusting the angle of solar panels to maximize power generation despite changing conditions. As electricity is then transmitted to the end customer, analytics is critical to managing the growing complexity of the power grid due to 'distributed energy resources' – controllable devices such as backup generators, home batteries and smart thermostats, often owned by homeowners and businesses. These devices are excellent resources for balancing the grid, and grid operators can use analytics to determine which mix of devices to pull from at any time based on weather, historic energy demand, the performance and tolerances of each device, and grid conditions like voltage. For grid operators, analytics is also useful in planning infrastructure investments, allowing them to predict which parts of the network will be most strained decades into the future.

3-SECOND SAMPLE
Data science is key to managing the growth of renewable and distributed energy sources in the electric power system.

3-MINUTE ANALYSIS
Fossil fuels still make up a large part of global energy consumption, and oil and gas companies make liberal use of analytics as well – in characterizing untapped oil reservoirs below the earth's surface, optimizing drill operations when drilling new wells, forecasting impending equipment failures, deciding which oil streams to blend together, and more.

RELATED TOPIC
See also
INDUSTRY 4.0
page 100

3-SECOND BIOGRAPHY
THOMAS EDISON
1847–1931
Architect of the world's first power grid, which went live in New York City in 1882.

30-SECOND TEXT
Katrina Westerhof

Data like demographics and infrastructure condition can inform decisions about where to increase the capacity of the power grid.
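Deciding which mix of devices to pull from is, at heart, an optimization problem. Real grid operators solve it with far richer models, but the core idea can be sketched as a greedy, cheapest-first dispatch; the device names, capacities and costs below are invented for illustration:

```python
# Hypothetical distributed energy resources: (name, capacity_kw, cost_per_kwh).
devices = [
    ("home_battery_fleet", 400, 0.12),
    ("backup_generators", 900, 0.30),
    ("smart_thermostat_curtailment", 250, 0.05),
]

def dispatch(devices, needed_kw):
    """Greedy dispatch: draw from the cheapest resources first
    until the required power is covered."""
    plan = []
    remaining = needed_kw
    for name, capacity, cost in sorted(devices, key=lambda d: d[2]):
        if remaining <= 0:
            break
        take = min(capacity, remaining)
        plan.append((name, take))
        remaining -= take
    return plan, remaining  # remaining > 0 means demand could not be met

plan, shortfall = dispatch(devices, needed_kw=500)
```

In practice the cost column would itself be a prediction, updated continuously from weather, demand forecasts and each device's measured tolerances, which is where the analytics described above comes in.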
LOGISTICS
the 30-second data

Route optimization – born of both predictive and prescriptive data analytics – has unlocked enormous benefits for the previously low-tech logistics industry, reducing fuel consumption and improving reliability of service. When delivering packages to homes and businesses, logistics companies can now identify the most efficient routes for each driver, each day, across the entire fleet, taking into account delivery deadlines, traffic patterns and weather forecasts. Upstream, in freight shipping, shippers can apply similar techniques to optimize the route from an origin point to a distribution facility, choosing the right combination of sea, air, rail and road transport to get each shipment to its destination most efficiently and on time. In both cases, the tools exist today to make these optimizations dynamic, allowing carriers to reroute parcels in real time as conditions change, and for delivery routes, even recommending the ideal driving speed on each leg of the route to consistently hit green traffic lights. Beyond optimizing how an item gets to its destination, big data and analytics also provide insights into how to structure a global logistics network, such as where to build new hubs, distribution facilities and customer drop sites as transportation constraints and customer demand change.

3-SECOND SAMPLE
Getting an item from Point A to Point B is more efficient and reliable with optimized routing, enabled by data analytics.

3-MINUTE ANALYSIS
In the context of supply-chain management, the value of analytics for logistics is even greater. Predictive analytics will improve inventory management by considering the impacts of factors like geopolitics, weather and climate change, and consumer sentiment on product availability or demand. And integrating data across the supply chain unlocks new opportunities – for example, dynamically rerouting a shipment of ripe fruit to a nearer store or a store where fruit sells more quickly, thereby reducing food waste.

RELATED TOPICS
See also
INDUSTRY 4.0
page 100
SHOPPING
page 118

3-SECOND BIOGRAPHY
JUAN PEREZ
1967–
Chief Engineering and Information Officer at UPS who led the implementation of the company's ORION route optimization project.

30-SECOND TEXT
Katrina Westerhof

Dynamic route optimization enables shippers to be responsive to changing conditions in the supply chain.
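Commercial optimizers such as ORION are proprietary and vastly more sophisticated, but the basic task of sequencing delivery stops to cut distance can be illustrated with a simple nearest-neighbour heuristic; the coordinates below are invented:

```python
import math

def nearest_neighbour_route(depot, stops):
    """Order delivery stops greedily: always drive to the closest
    unvisited stop. A crude heuristic, not an optimal route."""
    route, current = [], depot
    remaining = list(stops)
    while remaining:
        nxt = min(remaining, key=lambda s: math.dist(current, s))
        route.append(nxt)
        remaining.remove(nxt)
        current = nxt
    return route

depot = (0.0, 0.0)
stops = [(5.0, 1.0), (1.0, 1.0), (2.0, 3.0)]
route = nearest_neighbour_route(depot, stops)  # visits (1.0, 1.0) first
```

Production systems replace straight-line distance with predicted drive times from traffic and weather data, add delivery-deadline constraints, and re-run the optimization as conditions change, which is what makes the dynamic rerouting described above possible.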
29 February 1860
Born in Buffalo, New York, USA

1875
Enrols in City College of New York

1879
Receives undergraduate degree in Mining from Columbia University

1880
Serves as assistant to William Trowbridge, his professor who worked on the US Census

1889
Receives patent for punch-card tabulator (Patent No. 395,782)

1890
Gains PhD from Columbia University

1890
Receives the Elliot Cresson Medal

1890–1900
Contracted to supply his machines for the 1890 census count

1911
Begins (alongside several others) the Computing-Tabulating-Recording Company (CTR)

1918
Starts stepping back from day-to-day operations at CTR

1921
Retires

1924
CTR becomes IBM

17 November 1929
Dies in Washington, DC, USA

HERMAN HOLLERITH
Aditya Ranganathan
analytical engine Mechanical computer, designed by Charles Babbage in the early 1800s, intended to carry out arithmetic and logical operations, taking instructions or inputs via hole-punched cards. The machine was not constructed during Babbage's lifetime, but a modified version was built by the London Science Museum in 1991.

digital age Time period beginning in the 1970s and stretching to the present day, characterized by rapid technological advances, including the introduction of the personal computer and the rise of the internet.

digital library Large repository or archive of data, sometimes available to access or download through the internet, for commercial or research purposes. Digital libraries may include images, text or numerical data.

esports Electronic sports in which individuals or teams of players compete to win video games, often in international tournaments and for monetary prizes.

geolocated franchising model Teams of competitive video-game players, based in a specific city, can form a franchise to compete in international or national esports tournaments for a particular game.

live streaming The live broadcast of video or audio content, via the internet. Esports are usually watched through live streaming.

machine learning Finding a mathematical relationship between input variables and an output. This 'learned' relationship can then be used to output predictions, forecasts or classifications given an input.

metrics Quantitative measures of performance. For example, it is important to assess accuracy metrics for automated decision-making algorithms. Similarly, measures such as inflation or the FTSE 100 index could be seen as performance metrics for the economy.

model/modelling Representation of real-world processes or problems in mathematical terms; models can be simple or very complex, and are often used to make predictions or forecasts.

STEM The fields of science, technology, engineering and mathematics.

swipe The act of swiping a finger across a smartphone screen, to interact with an app. Swiping is widely used in dating apps, where users often swipe right or left on a photograph of a potential romantic partner, to signal interest or disinterest.

wearable technology Electronic devices that can be worn on the body, including activity monitors and smart watches.
SHOPPING
the 30-second data

With the internet giving a home to a variety of retailers, the consumer can now buy almost anything from the comfort of their own home. The consequence of this is that retailers have been able to harvest extensive and accurate data relating to customers, which means they are better able to target shoppers based on their habits. An example of this can be seen on Amazon – the biggest online retailer in the world – with its ability to recommend items based on your previous purchases, ratings and wish lists. However, the ability to perform this type of activity is not only the realm of companies the size of Amazon. Services now exist offering artificial intelligence (AI) solutions that allow retailers of many sizes to be able to harness the power of these types of algorithms to drive business, which means that the next time an online retailer suggests a T-shirt to go with your jeans, it could be via AI. Data science isn't restricted to shopping suggestions: it also applies to how goods are purchased. Facial recognition technology combined with smart devices allows payments to be authenticated without the use of credit cards.

3-SECOND SAMPLE
Shopping online has changed shopping as we know it, but how is it possible that websites seem to know what we want before we do?

3-MINUTE ANALYSIS
Ever wondered how a website knows the shoes you were looking at the other day? Well, the answer is cookies. These are small pieces of data that come from a website and are stored in the web browser, allowing websites to remember various nuggets of information including past activity or items in a shopping cart, which explains why that pair of shoes just keeps coming back.

RELATED TOPICS
See also
DATA COLLECTION
page 16
LEARNING FROM DATA
page 20
ARTIFICIAL INTELLIGENCE (AI)
page 148

3-SECOND BIOGRAPHY
JEFF BEZOS
1964–
Tech entrepreneur who is the founder, CEO and president of Amazon.

30-SECOND TEXT
Robert Mastrodomenico
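Amazon's recommendation engine is proprietary, but the underlying 'customers who bought this also bought that' idea can be sketched with simple co-occurrence counts over purchase baskets; the baskets below are invented for illustration:

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories (one basket per customer order).
baskets = [
    {"jeans", "t-shirt", "belt"},
    {"jeans", "t-shirt"},
    {"jeans", "trainers"},
    {"dress", "belt"},
]

# Count how often each ordered pair of items appears in the same basket.
co_counts = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        co_counts[(a, b)] += 1
        co_counts[(b, a)] += 1

def recommend(item, k=2):
    """Return the k items most often bought alongside `item`."""
    scored = [(other, n) for (i, other), n in co_counts.items() if i == item]
    return [other for other, n in sorted(scored, key=lambda x: -x[1])[:k]]

suggestions = recommend("jeans")  # t-shirt co-occurs twice, so it ranks first
```

Real systems scale this idea up with millions of baskets, weight recent purchases more heavily, and blend in ratings and wish-list signals, but the co-occurrence intuition is the same.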
Aditya Ranganathan
in social media. The episode 'Nosedive' depicts a world where 'social credit', deriving from a mixture of in-person and online interactions, dictates where a person can live, what they can buy, who they can talk to and more. China has begun implementing a Social Credit System to determine the trustworthiness of individuals and accept or deny individuals for functions such as receiving loans and travelling.

to learn their interests, social media companies have been able to deliver highly targeted adverts and generate billions of pounds in ad revenue every year. These same machine-learning algorithms can be used to tailor the content each user sees on their screen. From a timeline to suggested friends, social media companies play a prominent role in how users interact with their apps and, subsequently, the world around them. What once started as a way to update friends on one's status has evolved into a public forum, marketplace and news outlet all rolled into one.

3-SECOND BIOGRAPHY
MARK ZUCKERBERG
1984–
Co-founder and CEO of Facebook, and the youngest self-made billionaire, at 23.

30-SECOND TEXT
Scott Tranter

The rapid growth of social media has seen it infiltrate everyday life, with data capture capabilities on an unprecedented scale.
GAMING
the 30-second data

Competitive video gaming, known as esports, is an emerging global phenomenon in which professional players compete in packed stadiums for prize pools reaching millions of pounds. Unlike with traditional sporting events, esports fans engage more directly with content via online live streaming technology on platforms such as Twitch. Esports consumers largely consist of males in the 20 to 30 age range, a prime demographic that companies wish to target. By tracking the fan base's habits and interests using analytical tools and survey methods, companies have been able to tailor content based on the audience they wish to target. However, because of the esports audience's reduced television consumption and tendency to block internet adverts using browser-based ad-blocking technology, companies are looking into non-traditional methods to reach this demographic. For example, due to the digital nature of esports, brands have the ability to display their products directly in the video games, avoiding ad-blockers altogether. Additionally, professional esports players have a large influence on how their fans may view certain products. To take advantage of this, companies often partner with these influencers and utilize their popularity in order to reach target audiences for their products.

3-SECOND SAMPLE
Esports is engaging its young, digital-savvy fans through non-traditional online media, paving the way for data science into the recreational trends of young generations.

3-MINUTE ANALYSIS
Although esports thrived off the back of online live streaming technology, it is also now broadcast on television, with ads displaying during commercial breaks, akin to traditional sports. Esports companies are adopting the geolocated franchising model, which looks to take advantage of television advertising and sponsorship deals for its revenue. With this move, esports has an opportunity to expand its reach, opening up the door for mainstream popularity.

RELATED TOPICS
See also
LEARNING FROM DATA
page 20
SPORTS
page 126

3-SECOND BIOGRAPHIES
JUSTIN KAN
1983–
American internet entrepreneur who co-founded Twitch, formerly Justin.tv, the most popular streaming platform for esports content.
TYLER 'NINJA' BLEVINS
1991–
American Twitch streamer, internet personality and former professional gamer who helped bring mainstream attention to the world of esports.

30-SECOND TEXT
Scott Tranter

As the esports industry grows, top players may soon be able to sign endorsement deals in the ballpark of professional athletes.
GAMBLING
the 30-second data

In gambling, everything from the likelihood that the dealer will bust in blackjack to the placement of specific slot machines at key locations is driven by statistics. And, in the evolving world of data science, those with greater access to it can find themselves at a huge advantage over others. This ranges from the simple approach of an experienced poker player understanding the odds of turning his straight draw into a winning hand – and the correlated risk of pursuing that potential hand – to the more advanced techniques casinos use to turn vast amounts of unstructured data into predictions on the best way to entice players to bet, and to bet more, on lower-odds payouts. Resources exist for both the house and for the player, and they extend well beyond card games and slot machines. Statistical models can impact the payout of sports events – oftentimes adjusting odds in real time and based on the direction that money is moving – in a way that can minimize the risk of the sportsbook (the part of casinos that manages sports betting). By the same token, some gamblers use or create statistical models to make educated decisions on outcomes that are data-driven rather than narrative-driven, giving them an edge on those following their instinct.

3-SECOND SAMPLE
Data science and gambling can blend together with devastating effect, making the adage 'the house always wins' even more true.

3-MINUTE ANALYSIS
There have been reports on the ways in which casinos are utilizing decades' worth of player data (tied back to individual players through their rewards cards), while plenty of 'expert' gamblers have written books designed to 'beat the house'. Those with designs on gambling based on luck are simply playing the wrong game – they should be playing the stats – while hoping that Lady Luck still shines on them.

RELATED TOPICS
See also
LEARNING FROM DATA
page 20
SURVEILLANCE
page 82
SPORTS
page 126

3-SECOND BIOGRAPHIES
RICHARD EPSTEIN
1927–
Game theorist who has served as an influential statistical consultant for casinos.
EDWARD O. THORP
1932–
Mathematician who pioneered successful models used on Wall Street and in casinos.

30-SECOND TEXT
Scott Tranter

Move over Lady Luck: professional gamblers now pit their data skills against those of the house.
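The dealer-bust odds that experienced blackjack players internalize can be estimated by straightforward simulation. A simplified Monte Carlo sketch (it assumes an infinite deck and a dealer who stands on all 17s; real casino rules vary on both points):

```python
import random

random.seed(7)

# Infinite-deck approximation: each draw is an independent card.
# Tens, jacks, queens and kings all count 10; 11 stands for an ace.
CARDS = [2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10, 11]

def dealer_busts(upcard):
    """Play out one dealer hand from the given upcard; return True on bust.
    The dealer hits until reaching 17 or more, counting aces as 11 and
    downgrading them to 1 whenever the total would exceed 21."""
    total = upcard
    aces = 1 if upcard == 11 else 0
    while total < 17:
        card = random.choice(CARDS)
        total += card
        if card == 11:
            aces += 1
        while total > 21 and aces:
            total -= 10  # count one ace as 1 instead of 11
            aces -= 1
    return total > 21

# Estimate P(bust) when the dealer shows a 6, famously a weak upcard.
trials = 50000
p_bust_6 = sum(dealer_busts(6) for _ in range(trials)) / trials
```

Running the same estimate for every upcard reproduces the bust tables printed in strategy books, and the same simulation logic, scaled up, is how both casinos and players stress-test rule variations.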
THE FUTURE
GLOSSARY

self-learning Type of machine learning, commonly used to find patterns or structure in data sets. Also known as 'unsupervised learning'.

smart Refers to internet-connected devices with real-time analysis or machine learning capabilities. Smart watches typically include physical activity monitors and internet connectivity, and smart TVs may include voice recognition.

time series analysis The analysis of a signal or variable that changes over time. This can include identifying seasonal trends or patterns in the data, or forecasting future values of the variable.

topology Branch of mathematics concerned with geometric objects and their properties when they are stretched, twisted or crumpled.
PERSONALIZED MEDICINE
the 30-second data

Humans have always been interested in themselves. So it's no surprise that they want to know what's happening in their bodies – at all times. Consumer demand for personalized health data has fuelled the success of smart watches, fitness trackers and other wearable devices which give real-time feedback. But what does the next generation of wearables look like? And what can the data tell us? With technology so deeply ingrained in our lives, it is easy to imagine a future with technologically advanced clothing, smart skin patches or ingestible nanotechnologies which detect or monitor disease. Instead of a one-off blood test, we could all be wearing a smart patch made of a series of micro-needle sensors that continually track chemical changes under the skin. Or flexible and stretchable sensors resembling tattoos, which could monitor lactate during a workout or sense changes in environmental chemicals and pollutants. And imagine the data – huge amounts of data. Future wearable technology will collect thousands of data points a minute, maybe even a second, which will need powerful algorithms, machine learning and AI to reduce the data into meaning. This will be essential to mine the information, to better understand disease, population-wide health trends and the vital signs to predict a medical emergency.

3-SECOND SAMPLE
Wearable technology could tap into huge amounts of human data, opening up the possibility of real-time healthcare, along with new ways to detect and prevent disease.

3-MINUTE ANALYSIS
As the line between consumer wearables and medical devices blurs, concerns are rising about the reliability and security of data. For example, current smartphone apps for melanoma skin cancer detection have a high failure rate. If people are deciding to change their lifestyle, or medical professionals are making treatment decisions, it is crucial that wearable devices go through any necessary clinical trials and are supported by strong scientific evidence.

RELATED TOPICS
See also
EPIDEMIOLOGY
page 76
ETHICS
page 152

3-SECOND BIOGRAPHIES
JOSEPH WANG
1948–
American researcher and director of the Center for Wearable Sensors at the University of California, San Diego, who is pioneering wearable sensors to monitor disease.
JEFF WILLIAMS
1963–
Apple's chief operating officer, who oversees the Apple Watch and the company's health initiatives.

30-SECOND TEXT
Stephanie McClellan

To attain secure and reliable data, the personal healthcare industry needs to be properly regulated.
MENTAL HEALTH
the 30-second data
Mental health disorders affect over 970 million people worldwide, tend to be under-diagnosed, can have long-term effects and often carry social stigma. Mental health data involves longitudinal behavioural surveys, brain scans, administrative healthcare data and genomics research. Such data is difficult to obtain and of a sensitive nature. Data science facilitates access to this data, and its applications to mental health include virtual counselling, tele-psychiatry, effective use of social media …

3-SECOND SAMPLE
Data science enables digital mental healthcare, to improve access and treatment outcomes.

3-MINUTE ANALYSIS
Mental healthcare has been made more accessible …

RELATED TOPICS
See also
NEURAL NETWORKS & DEEP LEARNING
page 34
HEALTH
page 92
PERSONALIZED MEDICINE
page 138
John W. Tukey was born in 1915, and he showed unusual brilliance from a very early age. He was schooled at home by his parents until he entered Brown University, Rhode Island. He graduated in three years with BA and MS degrees in Chemistry. From Brown he moved on to Princeton, where he started in Chemistry but soon switched to Mathematics, getting a PhD at age 24 in 1939 with a dissertation in Topology, and he then moved directly to a faculty position at the same university. He remained at Princeton until he retired in 1985, adding a part-time appointment at Bell Telephone Laboratories in 1945.

Over a long career, Tukey left his imprint in many fields. In mathematics: Tukey's formulation of the axiom of choice; in time series analysis: the Cooley–Tukey Fast Fourier Transform; in statistics: exploratory data analysis, the jackknife, the one-degree test for non-additivity, projection pursuit (an early form of machine learning), and Tukey's method of guaranteeing the accuracy of a set of simultaneous experimental comparisons. In data analysis alone he created an array of graphical displays that have since become standard, widely used to facilitate the discovery of unsuspected patterns in data and to learn about distributions of data, including the box plot and the stem-and-leaf display. He coined other new terms that also became standard terminology, including 'bit' (short for 'binary digit') for the smallest unit of information transferred, and even more terms that did not catch on (virtually no one remembers his 1947 term for a reciprocal second: a 'whiz').

It was through his teaching and consistent emphasis on the importance of exploratory data analysis as a basic component of scientific investigation that Tukey was established as a founder of what some today call data science, and some credit him with coining the term. From 1960 to 1980, he worked with the television network NBC as part of an election night forecasting team that used early partial results to 'call' the contested races of interest, working with several of his students, including at different times David Wallace and David Brillinger. In 1960 he prevented NBC from prematurely announcing the election of Richard Nixon. Tukey's favourite pastimes were square dancing, bird watching and reading several hundred science fiction novels. He died in 2000 in New Brunswick, New Jersey.

Stephen Stigler
154 g Resources
Memories of My Life
F. Galton
Methuen & Co. (1908)

Naked Statistics: Stripping the Dread from the Data
Charles Wheelan
W.W. Norton & Company (2014)

The Numerati
Stephen Baker
Mariner Books (2009)

Pattern Recognition and Machine Learning
C.M. Bishop
Springer (2006)

The Practice of Data Analysis: Essays in Honour of John W. Tukey
D. Brillinger (Ed)
Princeton Univ. Press (1997)

Statistics Done Wrong: The Woefully Complete Guide
Alex Reinhart
No Starch Press (2015)

WEBSITES
Coursera
www.coursera.org/learn/machine-learning

Data Camp
www.datacamp.com/courses/introduction-to-data

The Gender Shades project
gendershades.org
Uncovered bias in facial recognition algorithms

ProPublica
www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
Investigated the COMPAS algorithm for risk-scoring prisoners

Simply Statistics
simplystatistics.org

Udemy
www.udemy.com/topic/data-science/
Resources g 155
NOTES ON CONTRIBUTORS
Regina Nuzzo has a PhD in Statistics from Stanford University and graduate training in Science Writing from University of California Santa Cruz. Her writings on probability, data and statistics have appeared in the Los Angeles Times, New York Times, Nature, Science News, Scientific American and New Scientist, among others.

Rupa Patel is a physician scientist and is the Founder and Director of the Washington University in St Louis Biomedical HIV Prevention programme. She is also a technical advisor for the World Health Organization. Dr Patel utilizes data science to improve implementation of evidence-based HIV prevention strategies in clinics, health departments and community organizations in the US, Africa and Asia.

Aditya Ranganathan is the chief evangelist for Sense & Sensibility and Science (S&S&S), a UC Berkeley Big Ideas course – founded by Saul Perlmutter – on critical thinking, group decision making and applied rationality. He also serves on the board of Public Editor, a citizen science approach to low-quality news and fake news. Aditya is pursuing his PhD at Harvard University, where he studies collective behaviour (with implications for group dynamics and education).

Stephen M. Stigler is the Ernest DeWitt Burton Distinguished Service Professor of Statistics at the University of Chicago. Among his many published works is 'Stigler's Law of Eponymy' ('No scientific discovery is named after its original discoverer', Trans. N. Y. Acad. Sci. 1980, 39: 147–158). His most recent book on the history of statistics is The Seven Pillars of Statistical Wisdom (2016).

Scott Tranter is the former Director of Data Science for Marco Rubio for President and founder of Øptimus, a data and technology company based in Washington, DC. Tranter has worked in both the political and commercial spaces, where the science of using data to innovate everything from electing our leaders to selling people cars has been evolving over the last several decades.

Katrina Westerhof helps companies develop and adopt emerging technologies, particularly in spaces that are being upended by analytics and the Internet of Things. She has a diverse background in consulting, innovation, engineering and entrepreneurship across the energy, manufacturing and materials industries.
158 g Index
L
Large Hadron Collider (LHC) 62
Lecun, Yann 32
Leibniz, Gottfried 16
logistics 104
Lorentzon, Martin 122
Lovelace, Ada 124–5

M
machine learning 22, 30, 32, 34
  and future trends 138, 140, 142, 145
  and pleasure 128
  and science 66
  and society 90, 92, 94
  and uncertainty 50
McKinney, Wes 22
marketing 30, 82, 108
Massachusetts Institute of Technology (MIT) 50, 107
medicine 68, 138, 140
mental health 140
Microsoft 150
Million Genome Project (MGP) 68
modelling 30, 110, 132
Mojica, Francisco J.M. 66
Muller, Robert A. 72
music 122

N
Netflix 112
neural networks 34
Neyman, Jerzy 46
Ng, Andrew 32
Nightingale, Florence 88–9, 92
Nobel Prize 44, 62, 64, 68
nudge technologies 82

O
overfitting 56

P
p-value 39, 54
Park, James 92
Pascal, Blaise 8, 10
patents 106–7
Pearl, Judea 42
Pearson, Karl 42, 54
Perez, Juan 104
Perlmutter, Saul 64
personalization 68, 81, 86, 92, 99, 138, 140
Pollock, Rufus 152
Prasad, Rohit 142
precision medicine 68
prediction 10, 16, 20, 24, 27
  and business 102, 104, 110
  and future trends 138, 140, 148
  and machine learning 32
  and pleasure 132
  and science 62, 66, 72
  and society 82, 84, 90, 92
  and uncertainty 40, 56
privacy 86, 100, 142, 150
probability 40, 110
product development 112
profiles 16, 81, 90, 108, 146, 152
programming 14, 22, 80, 94, 116, 125, 136
prototyping 112

R
randomized trials 61, 74, 81, 90
regression 24, 26–8, 30, 44
regulation 150
Reid, Toni 142
Reiss, Ada 64
robots 32, 94, 100, 148
robustness 53
Rosenblatt, Frank 34
route optimization 104
Royal Statistical Society 88–9

S
sampling 40, 48
Schmidt, Brian 64
security 84
Shakespeare, Stephan 18
shopping 118
smart devices 16, 32, 34, 48
  and business 100, 102
  and future trends 138, 140, 142
  and pleasure 118, 128
  and society 82, 92
Snapchat 128
sniffer systems 84
Snow, John 20, 76
Snowden, Edward 152
social media 128, 140, 148
Spiegelhalter, David 120
sports 126, 130
Spotify 32, 122
statistics 8, 18, 22, 27–8, 30
  and business 107–8
  and future trends 144–5, 152
  and pleasure 120, 126, 132
  and science 64, 70–1, 74, 76
  and society 82, 86, 88–90
  and uncertainty 40, 44, 46, 48, 52–4
Stephens, Frank 66
streaming 108, 117, 122, 130
supernovae 64
surveillance 82, 84
swipers 120

T
Thorp, Edward O. 110, 132
tools 22
tracking 16, 40, 86, 107–8, 112
  and future trends 138, 140
  and pleasure 126, 128, 130
transparency 50, 140, 150
trustworthiness scoring 146
Tukey, John W. 144–5
Turing, Alan 15, 20, 125, 148
Twitch 130
Twitter 40, 128

U
University College London 27, 52–3
University of Wisconsin 53

V
voice recognition 137, 142, 148
vote science 90

W
Wagner, Dan 90
Wald, Abraham 48
Wang, Joseph 138
Watson computer 94
Watson, James 68
Watson, Thomas 94
wearables 138
weather 26–7, 102, 104, 152
Wickham, Hadley 22
Williams, Jeff 138
Wojcicki, Susan 108

Y
YouGov 18

Z
Zelen, Marvin 74
Zuckerberg, Mark 16, 128
Index g 159
ACKNOWLEDGEMENTS
All images that appear in the montages are from Shutterstock, Inc.
unless stated.
160 g Acknowledgements