Explanations of artificial intelligence models can be used to launch targeted adversarial attacks on text classification algorithms: understanding the reasoning behind a model's decisions makes it easier to prepare such samples. Most current text-based adversarial attacks rely on brute force. Instead, we use the SHAP approach to identify the importance of tokens in a sample and modify the crucial ones to prepare targeted attacks. We base our results on experiments with 5 datasets. Our approach outperforms TextBugger on 4 out of 5 datasets and TextFooler on 3 out of 5 datasets, while minimizing the perturbation introduced to the texts. In particular, we outperform the efficacy of TextFooler by over 3100% and TextBugger by over 420% on the WikiPL dataset, while keeping a high cosine similarity between the original text sample and the adversarial example. The evaluation was additionally supported by a survey assessing the quality of the adversarial examples and ensuring that the text perturbations did not change the intended class according to subjective human classification.
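The core idea, ranking tokens by SHAP importance and perturbing only the most influential ones, can be illustrated with a minimal Python sketch. This is not the authors' exact pipeline: the victim model, the class index, and the final perturbation step (e.g. a synonym swap or character edit) are illustrative assumptions.

```python
import shap
from transformers import pipeline

# Assumed victim model: any HuggingFace text-classification pipeline works here.
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    top_k=None,  # return scores for all classes so SHAP can attribute each output
)

# SHAP's generic Explainer handles tokenization of raw strings for pipelines.
explainer = shap.Explainer(classifier)

sample = "The film was a genuine delight from start to finish."
shap_values = explainer([sample])

# shap_values.data[0] holds the tokens, shap_values.values[0] their per-class attributions.
tokens = shap_values.data[0]
scores = shap_values.values[0][:, 0]  # contribution toward class index 0 (assumption)

# Rank tokens by the magnitude of their contribution to the targeted class,
# then perturb only the top few to keep the overall change to the text minimal.
ranked = sorted(zip(tokens, scores), key=lambda t: abs(t[1]), reverse=True)
for token, score in ranked[:3]:
    print(f"candidate for perturbation: {token!r} (importance {score:+.3f})")
```

Restricting edits to the highest-attribution tokens is what keeps the perturbation small and the cosine similarity between the original and adversarial texts high, in contrast to brute-force token substitution.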