Statistical Significance Testing for Natural Language Processing
Rotem Dror, Lotem Peled-Cohen, Segev Shlomov, and Roi Reichart
Reviewed by
Edwin D. Simpson
Like any other science, research in natural language processing (NLP) depends on the
ability to draw correct conclusions from experiments. A key tool for this is statistical sig-
nificance testing: We use it to judge whether a result provides meaningful, generalizable
findings or should be taken with a pinch of salt. When comparing new methods against
others, performance metrics often differ by only small amounts, so researchers turn to
significance tests to show that improved models are genuinely better. Unfortunately,
this reasoning often fails because we choose inappropriate significance tests or carry
them out incorrectly, making their outcomes meaningless. Alternatively, the test we use
may fail to indicate a significant result when a more appropriate test would find one. NLP
researchers must avoid these pitfalls to ensure that their evaluations are sound and
ultimately avoid wasting time and money through incorrect conclusions.
This book guides NLP researchers through the whole process of significance testing,
making it easy to select the right kind of test by matching canonical NLP tasks to specific
significance testing procedures. As well as being a handbook for researchers, the book
provides theoretical background on significance testing, includes new methods that
solve problems with significance tests in the world of deep learning and multidataset
benchmarks, and describes the open research problems of significance testing for NLP.
The book focuses on the task of comparing one algorithm with another. At the
core of this is the p-value: the probability, under the null hypothesis that the two
algorithms perform equally well, of observing a difference at least as extreme as the
one we measured. If the p-value falls below a predetermined
threshold, the result is declared significant. Leaving aside the fundamental limitation
of turning the validity of results into a binary question with an arbitrary threshold, to
be a valid statistical significance test, the p-value must be computed in the right way.
The book describes the two crucial properties of an appropriate significance test: The
test must be both valid and powerful. Validity refers to the avoidance of type 1 errors, in
which the result is incorrectly declared significant. Common mistakes that lead to type 1
errors include deploying tests that make incorrect assumptions, such as independence
between data points. The power of a test refers to its ability to detect a significant result
and therefore to avoid type 2 errors. Here, knowledge of the data and experiment must
be used to choose a test that makes the correct assumptions. There is a trade-off between
validity and power, but for the most common NLP tasks (language modeling, sequence
labeling, translation, etc.), there are clear choices of tests that provide a good balance.
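To make this concrete, the following is a minimal sketch of one widely used
nonparametric option, a paired approximate randomization (permutation) test, which
computes a p-value by repeatedly swapping the two systems' scores on each test
instance; the function name and defaults are my own illustration rather than a
procedure taken from the book.

import numpy as np

def paired_permutation_test(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided paired permutation test for the difference in mean score.

    Under the null hypothesis that the two systems are interchangeable,
    each test instance's pair of scores can be swapped at random without
    changing the distribution of the test statistic.
    """
    rng = np.random.default_rng(seed)
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    observed = abs(scores_a.mean() - scores_b.mean())
    count = 0
    for _ in range(n_resamples):
        # Swap each instance's pair of scores with probability 0.5.
        swap = rng.random(scores_a.shape[0]) < 0.5
        a = np.where(swap, scores_b, scores_a)
        b = np.where(swap, scores_a, scores_b)
        if abs(a.mean() - b.mean()) >= observed:
            count += 1
    # Add-one smoothing keeps the Monte Carlo p-value estimate valid.
    return (count + 1) / (n_resamples + 1)

Given per-instance metric scores for two systems on the same test set, the returned
value estimates the probability of a mean difference at least as large as the observed
one arising by chance; because it makes no distributional assumptions beyond
exchangeability of the paired scores, it illustrates one way of favoring validity.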
Beginning with a detailed background on significance testing, the book then shows
the reader how to carry out tests for specific NLP tasks. There is a mix of styles, with
the first four chapters providing reference material that will be extremely useful to
both new and experienced researchers. Here, it is easy to find the material related to a
given NLP task. The next two chapters discuss more recent research into the application
of significance tests to deep neural networks and to testing across multiple datasets.
Alongside open research questions, these later chapters provide clear guidelines on
how to apply the proposed methods. It is this mix of background material and reference
guidelines that I believe makes this book so compelling and nicely self-contained.
The introduction in Chapter 1 motivates the need for a comprehensive textbook
and outlines challenges that the later chapters address more deeply. The theoretical
background material begins in Chapter 2, which introduces core concepts, including
better alternative. The chapter explains how to use the proposed method, evaluates it in
an empirical case study, and finally analyzes the errors made by each testing approach.
Large NLP models are often tested across a range of datasets, which presents
another problem for standard significance testing. Chapter 6 discusses the challenges
of assessing two questions: (1) On how many datasets does algorithm A outperform
algorithm B? (2) On which datasets does A outperform B? Applying standard significance
tests individually to each dataset and counting the significant results is likely to
overestimate the number of datasets on which one algorithm truly outperforms the other,
as this chapter explains. The authors then present a new framework for replicability
analysis, based on partial conjunction testing, with two variants: a Fisher-based test
for independent datasets and a Bonferroni-based test that remains valid when the
datasets are dependent. They introduce a method based on Benjamini
and Heller (2008) to count the number of datasets where one method outperforms
another, then show how to use the Holm procedure (Holm 1979) to identify which datasets
these are.
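As a rough sketch of these ideas (my own simplified rendering, not the authors' exact
formulation), the functions below compute Bonferroni- and Fisher-style partial
conjunction p-values from per-dataset p-values, estimate how many datasets show a
genuine effect, and apply the Holm procedure to identify which ones; all names and the
example values are hypothetical.

import numpy as np
from scipy.stats import chi2

def pc_bonferroni(pvals, u):
    # Partial conjunction p-value for the null "the effect holds on
    # fewer than u of the n datasets"; the Bonferroni variant remains
    # valid even when the datasets are dependent.
    p = np.sort(pvals)
    n = len(p)
    return min(1.0, (n - u + 1) * p[u - 1])

def pc_fisher(pvals, u):
    # Fisher variant: combines the n - u + 1 largest p-values;
    # assumes the datasets (and hence p-values) are independent.
    p = np.sort(pvals)
    n = len(p)
    stat = -2.0 * np.sum(np.log(p[u - 1:]))
    return chi2.sf(stat, df=2 * (n - u + 1))

def count_datasets_with_effect(pvals, alpha=0.05, method=pc_bonferroni):
    # Largest u whose partial conjunction null is rejected: an estimate
    # of how many datasets show a genuine improvement.
    k = 0
    for u in range(1, len(pvals) + 1):
        if method(pvals, u) <= alpha:
            k = u
    return k

def holm_rejections(pvals, alpha=0.05):
    # Holm (1979) step-down procedure: returns indices of the datasets
    # whose nulls are rejected, controlling the family-wise error rate.
    order = np.argsort(pvals)
    n = len(pvals)
    rejected = []
    for rank, idx in enumerate(order):
        if pvals[idx] <= alpha / (n - rank):
            rejected.append(int(idx))
        else:
            break
    return rejected

# Hypothetical p-values from per-dataset significance tests.
pvals = np.array([0.001, 0.012, 0.03, 0.26, 0.41])
print(count_datasets_with_effect(pvals))  # on how many datasets
print(holm_rejections(pvals))             # on which datasets

The Bonferroni variant trades power for validity under arbitrary dependence between
datasets, mirroring the validity/power trade-off discussed earlier in the book.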
Edwin D. Simpson is a Lecturer in the Department of Computer Science, University of Bristol, UK.
His research focuses on natural language processing (NLP), with particular interest in
applying interactive machine learning and Bayesian techniques to NLP problems such as argu-
mentation, crowdsourced annotation, and text ranking. His e-mail address is
edwin.simpson@bristol.ac.uk.