Authors:
Štefan Pócoš, Iveta Bečková and Igor Farkaš
Affiliation:
Faculty of Mathematics, Physics and Informatics, Comenius University Bratislava, Mlynská dolina F1, 842 48 Bratislava, Slovakia
Keyword(s):
Attention, Transformer, Recurrence, Adversarial Examples, Robustness, Heatmap.
Abstract:
We propose and analyse a novel neural network architecture, the recurrent vision transformer (RecViT). Building upon the popular vision transformer (ViT), we add a biologically inspired top-down connection, letting the network ‘reconsider’ its initial prediction. Moreover, the recurrent connection creates space for feeding multiple similar, yet slightly modified or augmented inputs into the network in a single forward pass. As it has been shown that a top-down connection can increase accuracy in the case of convolutional networks, we analyse our architecture, combined with multiple training strategies, in the adversarial examples (AEs) scenario. Our results show that some versions of RecViT indeed exhibit more robust behaviour than the baseline ViT, yielding ≈18 % and ≈22 % absolute improvements in robustness on the tested datasets, while the accuracy drop was only ≈1 %. We also leverage the fact that transformer networks have a certain level of inherent explainability. By visualising attention maps of various input images, we gain insight into the inner workings of our network. Finally, using annotated segmentation masks, we numerically compare the quality of attention maps on original and adversarial images.
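The paper itself defines the exact feedback pathway; as a rough illustration of the idea sketched in the abstract, the following minimal PyTorch snippet shows one plausible reading of a ViT-like encoder with a top-down recurrence, where the class-token representation from one pass is projected back onto the patch tokens before the next pass. All names, dimensions, and the specific feedback mechanism (RecViTSketch, top_down, steps) are assumptions for illustration only, not the authors' implementation.

```python
# Hypothetical sketch of a top-down recurrent ViT forward pass.
# Sizes and the feedback mechanism are illustrative assumptions.
import torch
import torch.nn as nn

class RecViTSketch(nn.Module):
    def __init__(self, dim=192, depth=4, heads=3, num_classes=10,
                 patch=4, img_size=32):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        self.patch = patch
        self.embed = nn.Linear(3 * patch * patch, dim)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.top_down = nn.Linear(dim, dim)  # feedback projection (assumed)
        self.head = nn.Linear(dim, num_classes)

    def patchify(self, x):
        # (B, 3, H, W) -> (B, N, 3*p*p) non-overlapping patches
        p = self.patch
        B, C, H, W = x.shape
        x = x.unfold(2, p, p).unfold(3, p, p)
        return x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)

    def forward(self, x, steps=2):
        tokens = self.embed(self.patchify(x))
        cls = self.cls.expand(tokens.size(0), -1, -1)
        feedback = torch.zeros_like(tokens)
        for _ in range(steps):
            # Top-down connection: re-inject the previous pass's class
            # representation into the patch tokens before re-encoding.
            z = torch.cat([cls, tokens + feedback], dim=1) + self.pos
            z = self.encoder(z)
            cls_out = z[:, :1]
            feedback = self.top_down(cls_out).expand_as(tokens)
        return self.head(cls_out.squeeze(1))

logits = RecViTSketch()(torch.randn(2, 3, 32, 32))  # -> shape (2, 10)
```

Running several steps over the same (or slightly augmented) input is what creates room for the network to ‘reconsider’ its prediction within a single forward pass, as described in the abstract.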