Towards Building More Robust NER datasets: An Empirical Study on NER Dataset Bias from a Dataset Difficulty View

Ruotian Ma, Xiaolei Wang, Xin Zhou, Qi Zhang, Xuanjing Huang


Abstract
Recently, many studies have illustrated the robustness problem of Named Entity Recognition (NER) systems: the NER models often rely on superficial entity patterns for predictions, without considering evidence from the context. Consequently, even state-of-the-art NER models generalize poorly to out-of-domain scenarios when out-of-distribution (OOD) entity patterns are introduced. Previous research attributes the robustness problem to the existence of NER dataset bias, where simpler and regular entity patterns induce shortcut learning. In this work, we bring new insights into this problem by comprehensively investigating the NER dataset bias from a dataset difficulty view. We quantify the entity-context difficulty distribution in existing datasets and explain their relationship with model robustness. Based on our findings, we explore three potential ways to de-bias the NER datasets by altering entity-context distribution, and we validate the feasibility with intensive experiments. Finally, we show that the de-biased datasets can transfer to different models and even benefit existing model-based robustness-improving methods, indicating that building more robust datasets is fundamental for building more robust NER systems.
Anthology ID:
2023.emnlp-main.281
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
4616–4630
URL:
https://aclanthology.org/2023.emnlp-main.281
DOI:
10.18653/v1/2023.emnlp-main.281
Cite (ACL):
Ruotian Ma, Xiaolei Wang, Xin Zhou, Qi Zhang, and Xuanjing Huang. 2023. Towards Building More Robust NER datasets: An Empirical Study on NER Dataset Bias from a Dataset Difficulty View. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4616–4630, Singapore. Association for Computational Linguistics.
Cite (Informal):
Towards Building More Robust NER datasets: An Empirical Study on NER Dataset Bias from a Dataset Difficulty View (Ma et al., EMNLP 2023)
PDF:
https://aclanthology.org/2023.emnlp-main.281.pdf
Video:
https://aclanthology.org/2023.emnlp-main.281.mp4