Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia

Alves, Diego; Thakkar, Gaurish; Amaral, Gabriel; Kuculo, Tin; Tadić, Marko

Computer Science > Computation and Language

arXiv:2212.07429 (cs)

[Submitted on 14 Dec 2022]

Abstract: With the ever-growing popularity of NLP, the demand for datasets in low-resourced languages follows suit. Following a previously established framework, we present the UNER dataset, a multilingual and hierarchical parallel corpus annotated for named entities. We describe in detail the procedure developed to create this type of dataset in any language available on Wikipedia with DBpedia information. The three-step procedure extracts entities from Wikipedia articles, links them to DBpedia, and maps the DBpedia sets of classes to the UNER labels. This is followed by a post-processing procedure that significantly increases the number of identified entities in the final results. The paper concludes with a statistical and qualitative analysis of the resulting dataset.
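The three-step procedure named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the wikilink regex, the `CLASS_TO_LABEL` table, the label names, and the `classes_by_uri` lookup (which stands in for a real DBpedia query) are all assumptions made for illustration. The actual UNER hierarchy and mapping rules are defined in the paper, not here.

```python
import re
from typing import Dict, List, Optional, Set, Tuple

# Matches [[Target]] and [[Target|anchor text]]; a simplification of wikitext.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|([^\[\]]+))?\]\]")

# Assumption: English DBpedia resource URIs mirror Wikipedia page titles.
DBPEDIA_PREFIX = "http://dbpedia.org/resource/"

# Hypothetical mapping table; the label names are placeholders, not UNER tags.
CLASS_TO_LABEL: Dict[str, str] = {
    "dbo:Person": "PERSON",
    "dbo:Place": "LOCATION",
    "dbo:Organisation": "ORGANIZATION",
}


def extract_links(wikitext: str) -> List[Tuple[str, str]]:
    """Step 1: return (surface text, page title) pairs for each wikilink."""
    links = []
    for match in WIKILINK.finditer(wikitext):
        title = match.group(1).strip()
        surface = (match.group(2) or title).strip()
        links.append((surface, title))
    return links


def to_dbpedia_uri(title: str) -> str:
    """Step 2: derive a DBpedia resource URI from a page title."""
    return DBPEDIA_PREFIX + title.replace(" ", "_")


def map_classes(classes: Set[str]) -> Optional[str]:
    """Step 3: map a set of DBpedia classes to one label, or None if unmapped."""
    for cls in sorted(classes):  # sorted for a deterministic choice
        if cls in CLASS_TO_LABEL:
            return CLASS_TO_LABEL[cls]
    return None


def annotate(
    wikitext: str, classes_by_uri: Dict[str, Set[str]]
) -> List[Tuple[str, str, Optional[str]]]:
    """Run all three steps; classes_by_uri replaces a live DBpedia lookup."""
    result = []
    for surface, title in extract_links(wikitext):
        uri = to_dbpedia_uri(title)
        label = map_classes(classes_by_uri.get(uri, set()))
        result.append((surface, uri, label))
    return result


if __name__ == "__main__":
    toy_classes = {
        DBPEDIA_PREFIX + "Marie_Curie": {"dbo:Person", "dbo:Agent"},
        DBPEDIA_PREFIX + "Paris": {"dbo:Place"},
    }
    text = "[[Marie Curie]] worked in [[Paris|the French capital]]."
    for row in annotate(text, toy_classes):
        print(row)
```

Entities whose classes have no entry in the table come back with a `None` label. In this sketch that would be the natural point at which a post-processing pass, as mentioned in the abstract, could recover additional entities.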

Comments:	arXiv admin note: substantial text overlap with arXiv:2212.07162
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2212.07429 [cs.CL]
	(or arXiv:2212.07429v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2212.07429

Submission history

From: Gaurish Thakkar Mr
[v1] Wed, 14 Dec 2022 11:38:48 UTC (360 KB)
