Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training

Agarwal, Oshin; Ge, Heming; Shakeri, Siamak; Al-Rfou, Rami

arXiv:2010.12688 (cs)

[Submitted on 23 Oct 2020 (v1), last revised 13 Mar 2021 (this version, v2)]

Abstract: Prior work on Data-To-Text Generation, the task of converting knowledge graph (KG) triples into natural text, focused on domain-specific benchmark datasets. In this paper, however, we verbalize the entire English Wikidata KG, and discuss the unique challenges associated with a broad, open-domain, large-scale verbalization. We further show that verbalizing a comprehensive, encyclopedic KG like Wikidata can be used to integrate structured KGs and natural language corpora. In contrast to the many architectures that have been developed to integrate these two sources, our approach converts the KG into natural text, allowing it to be seamlessly integrated into existing language models. It carries the further advantages of improved factual accuracy and reduced toxicity in the resulting language model. We evaluate this approach by augmenting the retrieval corpus in a retrieval language model and showing significant improvements on the knowledge-intensive tasks of open-domain QA and the LAMA knowledge probe.
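
The idea of turning KG triples into text and adding that text to a retrieval corpus can be illustrated with a minimal sketch. This is not the authors' method: the abstract does not specify the verbalization model, so a naive string template stands in for it here. The triples, the function names (`group_by_subject`, `verbalize`, `augment_corpus`), and the example corpus are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical toy triples (subject, relation, object); not taken from the paper.
TRIPLES = [
    ("Marie Curie", "occupation", "physicist"),
    ("Marie Curie", "place of birth", "Warsaw"),
    ("Warsaw", "country", "Poland"),
]


def group_by_subject(triples):
    """Collect all (relation, object) pairs that share a subject entity."""
    groups = defaultdict(list)
    for subj, rel, obj in triples:
        groups[subj].append((rel, obj))
    return dict(groups)


def verbalize(subject, facts):
    """Naive template stand-in for a learned data-to-text model (an assumption)."""
    clauses = [f"{rel} is {obj}" for rel, obj in facts]
    return f"{subject}: " + "; ".join(clauses) + "."


def augment_corpus(corpus, triples):
    """Append one verbalized passage per subject entity to an existing text corpus."""
    kg_passages = [verbalize(s, f) for s, f in group_by_subject(triples).items()]
    return list(corpus) + kg_passages


# Hypothetical one-document retrieval corpus, extended with KG-derived passages.
corpus = ["Warsaw is the capital of Poland."]
augmented = augment_corpus(corpus, TRIPLES)
for passage in augmented:
    print(passage)
```

Because the KG facts end up as ordinary text passages, a retriever can index them alongside the original documents without any change to the model architecture, which is the integration property the abstract emphasizes.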

Comments:	Accepted at NAACL 2021
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2010.12688 [cs.CL]
	(or arXiv:2010.12688v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2010.12688

Submission history

From: Oshin Agarwal
[v1] Fri, 23 Oct 2020 22:14:50 UTC (151 KB)
[v2] Sat, 13 Mar 2021 18:25:01 UTC (186 KB)
