Neural Text Sanitization with Privacy Risk Indicators: An Empirical Analysis

Papadopoulou, Anthi; Lison, Pierre; Anderson, Mark; Øvrelid, Lilja; Pilán, Ildikó

Abstract: Text sanitization is the task of redacting a document to mask all occurrences of (direct or indirect) personal identifiers, with the goal of concealing the identity of the individual(s) referred to in it. In this paper, we consider a two-step approach to text sanitization and provide a detailed analysis of its empirical performance on two recently published datasets: the Text Anonymization Benchmark (Pilán et al., 2022) and a collection of Wikipedia biographies (Papadopoulou et al., 2022). The text sanitization process starts with a privacy-oriented entity recognizer that seeks to determine the text spans expressing identifiable personal information. This privacy-oriented entity recognizer is trained by combining a standard named entity recognition model with a gazetteer populated by person-related terms extracted from Wikidata. The second step of the text sanitization process consists of assessing the privacy risk associated with each detected text span, either in isolation or in combination with other text spans. We present five distinct indicators of the re-identification risk, based respectively on language model probabilities, text span classification, sequence labelling, perturbations, and web search. We provide a contrastive analysis of each privacy indicator and highlight their benefits and limitations, notably in relation to the available labeled data.
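The two-step structure described in the abstract (detect candidate spans, then decide per span whether the re-identification risk warrants masking) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the substring-matching gazetteer lookup stands in for the trained entity recognizer, and `toy_risk` is a hypothetical placeholder for any one of the five risk indicators. The names `Span`, `detect_spans`, `sanitize`, and `toy_risk`, the threshold, and the example sentence are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Span:
    start: int  # character offset where the span begins
    end: int    # character offset just past the span
    text: str


def detect_spans(text: str, gazetteer: List[str]) -> List[Span]:
    """Step 1 stand-in: find every occurrence of each gazetteer term.

    A real system would use a trained entity recognizer; plain substring
    matching is an assumption made to keep the sketch self-contained.
    """
    spans = []
    for term in gazetteer:
        pos = text.find(term)
        while pos != -1:
            spans.append(Span(pos, pos + len(term), term))
            pos = text.find(term, pos + len(term))
    return sorted(spans, key=lambda s: s.start)


def sanitize(text: str, spans: List[Span], risk: Callable[[Span], float],
             threshold: float = 0.5, mask: str = "***") -> str:
    """Step 2 stand-in: mask each span whose risk score reaches the threshold.

    Spans are replaced right to left so earlier offsets stay valid.
    Overlapping spans are not handled in this sketch.
    """
    out = text
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        if risk(span) >= threshold:
            out = out[:span.start] + mask + out[span.end:]
    return out


def toy_risk(span: Span) -> float:
    """Hypothetical risk indicator: multi-word or numeric spans are risky."""
    has_digit = any(ch.isdigit() for ch in span.text)
    return 1.0 if (" " in span.text or has_digit) else 0.0


text = "Jane Doe, born in 1975, works as a teacher in Oslo."
gazetteer = ["Jane Doe", "1975", "teacher", "Oslo"]  # illustrative terms only
spans = detect_spans(text, gazetteer)
sanitized = sanitize(text, spans, toy_risk)
print(sanitized)
```

Keeping detection and risk assessment as separate functions mirrors the separation in the abstract: a different indicator can be swapped in for `toy_risk` without touching the span detector.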

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2310.14312 [cs.CL]
	(or arXiv:2310.14312v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2310.14312
