Evolving Attention with Residual Convolutions

Yujing Wang, Yaming Yang, Jiangang Bai, Mingliang Zhang, Jing Bai, Jing Yu, Ce Zhang, Gao Huang, Yunhai Tong
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:10971-10980, 2021.

Abstract

Transformer is a ubiquitous model for natural language processing and has attracted wide attention in computer vision. The attention maps are indispensable for a transformer model to encode the dependencies among input tokens. However, they are learned independently in each layer and sometimes fail to capture precise patterns. In this paper, we propose a novel and generic mechanism based on evolving attention to improve the performance of transformers. On one hand, the attention maps in different layers share common knowledge, so those in preceding layers can guide the attention in succeeding layers through residual connections. On the other hand, low-level and high-level attentions vary in their level of abstraction, so we adopt convolutional layers to model the evolutionary process of attention maps. The proposed evolving attention mechanism achieves significant performance improvements over various state-of-the-art models on multiple tasks, including image classification, natural language understanding and machine translation.
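
The sketch below illustrates the core idea from the abstract in PyTorch: each attention layer returns its attention map, and the next layer refines its own attention logits with a convolution applied to that map (heads treated as channels) before the softmax. This is a minimal illustration written for this page, not the authors' released implementation; the mixing weight alpha, the 3x3 kernel, and the exact point at which the convolved residual is added are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class EvolvingAttention(nn.Module):
    """One attention layer whose attention map is refined by the map of the
    preceding layer (illustrative sketch, not the authors' code).

    The previous layer's per-head attention maps are treated as image channels
    and passed through a 2D convolution; the result is mixed with the current
    layer's scaled dot-product logits before normalization.
    """

    def __init__(self, d_model, num_heads, alpha=0.5, kernel_size=3):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.alpha = alpha  # mixing weight between residual and current logits (assumed value)
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # Convolution over attention maps: attention heads act as channels.
        self.attn_conv = nn.Conv2d(num_heads, num_heads,
                                   kernel_size=kernel_size,
                                   padding=kernel_size // 2)

    def forward(self, x, prev_attn=None):
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(t):  # (B, N, d_model) -> (B, heads, N, d_head)
            return t.view(B, N, self.num_heads, self.d_head).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        logits = q @ k.transpose(-2, -1) / self.d_head ** 0.5  # (B, heads, N, N)

        if prev_attn is not None:
            # Residual connection: convolve the previous attention map and
            # add it to the current logits before the softmax.
            logits = (1 - self.alpha) * logits + self.alpha * self.attn_conv(prev_attn)

        attn = F.softmax(logits, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.out(out), attn  # return attn so the next layer can reuse it

# Example: thread the attention map from one layer into the next.
layer1 = EvolvingAttention(d_model=64, num_heads=4)
layer2 = EvolvingAttention(d_model=64, num_heads=4)
x = torch.randn(2, 16, 64)              # (batch, tokens, d_model)
h, attn1 = layer1(x)                    # first layer: no residual map yet
h, attn2 = layer2(h, prev_attn=attn1)   # second layer: evolves the first map

Stacking such layers and passing each returned attention map to the next layer realizes the residual, convolution-based evolution of attention maps described above.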

Cite this Paper


BibTeX
@InProceedings{pmlr-v139-wang21ab,
  title     = {Evolving Attention with Residual Convolutions},
  author    = {Wang, Yujing and Yang, Yaming and Bai, Jiangang and Zhang, Mingliang and Bai, Jing and Yu, Jing and Zhang, Ce and Huang, Gao and Tong, Yunhai},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {10971--10980},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/wang21ab/wang21ab.pdf},
  url       = {https://proceedings.mlr.press/v139/wang21ab.html}
}
Endnote
%0 Conference Paper
%T Evolving Attention with Residual Convolutions
%A Yujing Wang
%A Yaming Yang
%A Jiangang Bai
%A Mingliang Zhang
%A Jing Bai
%A Jing Yu
%A Ce Zhang
%A Gao Huang
%A Yunhai Tong
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-wang21ab
%I PMLR
%P 10971--10980
%U https://proceedings.mlr.press/v139/wang21ab.html
%V 139
APA
Wang, Y., Yang, Y., Bai, J., Zhang, M., Bai, J., Yu, J., Zhang, C., Huang, G. & Tong, Y. (2021). Evolving Attention with Residual Convolutions. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:10971-10980. Available from https://proceedings.mlr.press/v139/wang21ab.html.
