Towards Lingua Franca Named Entity Recognition with BERT

Moon, Taesun; Awasthy, Parul; Ni, Jian; Florian, Radu

arXiv:1912.01389 (cs)

[Submitted on 19 Nov 2019 (v1), last revised 12 Dec 2019 (this version, v2)]

Abstract: Information extraction is an important task in NLP, enabling the automatic extraction of data for populating relational databases. Historically, research and data were produced for English text, followed in subsequent years by datasets in Arabic, Chinese (ACE/OntoNotes), Dutch, Spanish, German (CoNLL evaluations), and many others. The natural tendency has been to treat each language as a different dataset and build optimized models for each. In this paper we investigate a single Named Entity Recognition model, based on a multilingual BERT, that is trained jointly on many languages simultaneously, and is able to decode these languages with better accuracy than models trained only on one language. To improve the initial model, we study the use of regularization strategies such as multitask learning and partial gradient updates. In addition to being a single model that can tackle multiple languages (including code-switching), the model could be used out of the box to make zero-shot predictions on new languages, even ones for which no training data is available. The results show that this model not only performs competitively with monolingual models, but also achieves state-of-the-art results on the CoNLL02 Dutch and Spanish datasets and the OntoNotes Arabic and Chinese datasets. Moreover, it performs reasonably well on unseen languages, achieving state-of-the-art zero-shot results on three CoNLL languages.
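The joint-training idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it only shows one plausible way to pool per-language NER corpora into a single shuffled training stream with a shared tag set, which a single multilingual model could then consume. The function names (`pool_languages`, `batches`), the toy sentences, and the BIO tags are all assumptions made for illustration.

```python
import random
from typing import Dict, Iterator, List, Tuple

# A sentence is a pair (tokens, BIO tags); this representation is an assumption.
Sentence = Tuple[List[str], List[str]]


def pool_languages(
    corpora: Dict[str, List[Sentence]], seed: int = 0
) -> List[Tuple[str, Sentence]]:
    """Merge per-language corpora into one shuffled list of (lang, sentence)."""
    pooled = [(lang, sent) for lang, sents in corpora.items() for sent in sents]
    random.Random(seed).shuffle(pooled)  # deterministic shuffle for a given seed
    return pooled


def batches(
    pooled: List[Tuple[str, Sentence]], batch_size: int
) -> Iterator[List[Tuple[str, Sentence]]]:
    """Yield consecutive mini-batches; a batch may mix several languages."""
    for start in range(0, len(pooled), batch_size):
        yield pooled[start:start + batch_size]


# Toy, hand-written examples (hypothetical data, not taken from any dataset).
corpora = {
    "en": [(["Anna", "lives", "in", "New", "York"],
            ["B-PER", "O", "O", "B-LOC", "I-LOC"])],
    "es": [(["Madrid", "es", "grande"], ["B-LOC", "O", "O"])],
    "nl": [(["Amsterdam", "is", "mooi"], ["B-LOC", "O", "O"])],
}

pooled = pool_languages(corpora)

# One tag inventory shared by every language, so a single output layer suffices.
label_set = sorted({tag for _, (_, tags) in pooled for tag in tags})

batch_sizes = [len(b) for b in batches(pooled, batch_size=2)]
```

Mixing languages inside each mini-batch is one simple design choice; sampling languages with explicit weights would be an equally reasonable alternative and is not implied by the abstract.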

Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Cite as:	arXiv:1912.01389 [cs.CL]
	(or arXiv:1912.01389v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.1912.01389

Submission history

From: Taesun Moon
[v1] Tue, 19 Nov 2019 19:48:02 UTC (295 KB)
[v2] Thu, 12 Dec 2019 18:23:41 UTC (295 KB)
