Frequency-Guided Word Substitutions for Detecting Textual Adversarial Examples

Mozes, Maximilian; Stenetorp, Pontus; Kleinberg, Bennett; Griffin, Lewis D.

arXiv:2004.05887 (cs)

[Submitted on 13 Apr 2020 (v1), last revised 26 Jan 2021 (this version, v2)]

Abstract: Recent efforts have shown that neural text processing models are vulnerable to adversarial examples, but the nature of these examples is poorly understood. In this work, we show that adversarial attacks against CNN, LSTM and Transformer-based classification models perform word substitutions that are identifiable through frequency differences between replaced words and their corresponding substitutions. Based on these findings, we propose frequency-guided word substitutions (FGWS), a simple algorithm exploiting the frequency properties of adversarial word substitutions for the detection of adversarial examples. FGWS achieves strong performance by accurately detecting adversarial examples on the SST-2 and IMDb sentiment datasets, with F1 detection scores of up to 91.4% against RoBERTa-based classification models. We compare our approach against a recently proposed perturbation discrimination framework and show that we outperform it by up to 13.0% F1.
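
The abstract does not spell out the detection procedure, so the following is only an illustrative sketch of how a frequency-guided detector of this kind might work, not the authors' implementation. It assumes that rare words are swapped for their most frequent synonym and that an input is flagged when the classifier's confidence in its original prediction drops by more than a threshold. The names `fgws_transform`, `is_adversarial`, `toy_predict`, the thresholds `delta` and `gamma`, and the toy frequency and synonym tables are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical toy resources; a real setup would use corpus counts and a thesaurus.
FREQ: Dict[str, int] = {"a": 5000, "film": 1000, "bad": 700, "good": 800, "lousy": 3}
SYNONYMS: Dict[str, List[str]] = {"lousy": ["bad"], "bad": ["lousy"]}


def fgws_transform(
    tokens: Sequence[str],
    freq: Dict[str, int],
    synonyms: Dict[str, List[str]],
    delta: int,
) -> List[str]:
    """Replace each token rarer than `delta` with its most frequent synonym,
    provided that synonym is more frequent than the token itself."""
    out = []
    for tok in tokens:
        f = freq.get(tok, 0)
        if f < delta:
            best = max(synonyms.get(tok, []), key=lambda w: freq.get(w, 0), default=tok)
            if freq.get(best, 0) > f:
                tok = best
        out.append(tok)
    return out


def is_adversarial(
    tokens: Sequence[str],
    predict_proba: Callable[[Sequence[str]], List[float]],
    freq: Dict[str, int],
    synonyms: Dict[str, List[str]],
    delta: int,
    gamma: float,
) -> bool:
    """Flag the input if confidence in the originally predicted class drops
    by more than `gamma` after the frequency-guided substitutions."""
    probs = predict_proba(tokens)
    label = max(range(len(probs)), key=probs.__getitem__)
    new_probs = predict_proba(fgws_transform(tokens, freq, synonyms, delta))
    return probs[label] - new_probs[label] > gamma


def toy_predict(tokens: Sequence[str]) -> List[float]:
    """Stand-in classifier returning [p_negative, p_positive]."""
    if "bad" in tokens:
        return [0.9, 0.1]
    if "good" in tokens:
        return [0.1, 0.9]
    return [0.4, 0.6]  # unknown words: weakly positive


perturbed = ["a", "lousy", "film"]  # rare substitute for "bad"
clean = ["a", "bad", "film"]
print(is_adversarial(perturbed, toy_predict, FREQ, SYNONYMS, delta=10, gamma=0.3))
print(is_adversarial(clean, toy_predict, FREQ, SYNONYMS, delta=10, gamma=0.3))
```

In this toy setting the perturbed sentence is flagged because restoring the frequent synonym changes the classifier's output sharply, while the clean sentence contains no rare words and is left unchanged.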

Comments:	EACL 2021 camera-ready
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2004.05887 [cs.CL]
	(or arXiv:2004.05887v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2004.05887

Submission history

From: Maximilian Mozes
[v1] Mon, 13 Apr 2020 12:11:36 UTC (1,482 KB)
[v2] Tue, 26 Jan 2021 09:55:19 UTC (1,027 KB)
