Better than Average: Paired Evaluation of NLP Systems

Peyrard, Maxime; Zhao, Wei; Eger, Steffen; West, Robert

doi:10.18653/v1/2021.acl-long.179

arXiv:2110.10746 (cs)

[Submitted on 20 Oct 2021]

Abstract: Evaluation in NLP is usually done by comparing the scores of competing systems independently averaged over a common set of test instances. In this work, we question the use of averages for aggregating evaluation scores into a final number used to decide which system is best, since the average, as well as alternatives such as the median, ignores the pairing arising from the fact that systems are evaluated on the same test instances. We illustrate the importance of taking the instance-level pairing of evaluation scores into account and demonstrate, both theoretically and empirically, the advantages of aggregation methods based on pairwise comparisons, such as the Bradley-Terry (BT) model, a mechanism based on the estimated probability that a given system scores better than another on the test set. By re-evaluating 296 real NLP evaluation setups across four tasks and 18 evaluation metrics, we show that the choice of aggregation mechanism matters and yields different conclusions as to which systems are state of the art in about 30% of the setups. To facilitate the adoption of pairwise evaluation, we release a practical tool for performing the full analysis of evaluation scores with the mean, median, BT, and two variants of BT (Elo and TrueSkill), alongside functionality for appropriate statistical testing.
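The contrast between averaging and pairwise aggregation described in the abstract can be illustrated with a minimal sketch. This is not the authors' released tool: the function names `pairwise_wins` and `bradley_terry`, the toy score matrix, and the choice of a standard minorization-maximization (Zermelo-style) fitting loop are all assumptions made for illustration. Ties are simply not counted as wins, and the sketch assumes every system wins at least one comparison.

```python
import numpy as np


def pairwise_wins(scores):
    """wins[i, j] = number of test instances where system i strictly outscores system j.

    `scores` has shape (n_instances, n_systems); ties count for neither side.
    """
    scores = np.asarray(scores, dtype=float)
    # Compare every pair of systems on each instance, then count over instances.
    return (scores[:, :, None] > scores[:, None, :]).sum(axis=0)


def bradley_terry(wins, n_iter=200):
    """Estimate Bradley-Terry strengths from a win-count matrix.

    Hypothetical helper using the classic iterative update
    p_i <- W_i / sum_j n_ij / (p_i + p_j), followed by normalization.
    Assumes every system has at least one win.
    """
    wins = np.asarray(wins, dtype=float)
    n_systems = wins.shape[0]
    comparisons = wins + wins.T      # n_ij: decided comparisons between i and j
    total_wins = wins.sum(axis=1)    # W_i: total wins of system i
    strengths = np.full(n_systems, 1.0 / n_systems)
    for _ in range(n_iter):
        pair_sums = strengths[:, None] + strengths[None, :]
        denom = (comparisons / pair_sums).sum(axis=1)  # diagonal contributes 0
        strengths = total_wins / denom
        strengths /= strengths.sum()
    return strengths


# Toy data (invented): rows are test instances, columns are systems A, B, C.
# System A has one outlier instance that inflates its average.
systems = ["A", "B", "C"]
scores = np.array([
    [1.00, 0.20, 0.15],
    [0.10, 0.20, 0.15],
    [0.10, 0.20, 0.25],
    [0.10, 0.20, 0.15],
    [0.10, 0.20, 0.15],
])

mean_ranking = [systems[i] for i in np.argsort(-scores.mean(axis=0))]
bt_strengths = bradley_terry(pairwise_wins(scores))
bt_ranking = [systems[i] for i in np.argsort(-bt_strengths)]

print("ranking by mean:", mean_ranking)
print("ranking by BT:  ", bt_ranking)
```

In this toy setup the mean ranks A first because of a single high-scoring instance, whereas the pairwise view ranks A last because it loses to both other systems on four of the five shared instances. This is the kind of disagreement between aggregation mechanisms that the abstract refers to.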

Comments:	Published in ACL 2021 (long paper)
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2110.10746 [cs.CL]
	(or arXiv:2110.10746v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2110.10746
Related DOI:	https://doi.org/10.18653/v1/2021.acl-long.179

Submission history

From: Maxime Peyrard
[v1] Wed, 20 Oct 2021 19:40:31 UTC (9,129 KB)