Temporally Efficient Vision Transformer for Video Instance Segmentation

Yang, Shusheng; Wang, Xinggang; Li, Yu; Fang, Yuxin; Fang, Jiemin; Liu, Wenyu; Zhao, Xun; Shan, Ying

arXiv:2204.08412 (cs)

[Submitted on 18 Apr 2022]

Abstract: Recently, vision transformers have achieved tremendous success on image-level visual recognition tasks. To effectively and efficiently model the crucial temporal information within a video clip, we propose a Temporally Efficient Vision Transformer (TeViT) for video instance segmentation (VIS). Unlike previous transformer-based VIS methods, TeViT is nearly convolution-free: it consists of a transformer backbone and a query-based video instance segmentation head. In the backbone stage, we propose a nearly parameter-free messenger shift mechanism for early temporal context fusion. In the head stages, we propose a parameter-shared spatiotemporal query interaction mechanism to build the one-to-one correspondence between video instances and queries. Thus, TeViT fully utilizes both frame-level and instance-level temporal context information and obtains strong temporal modeling capacity with negligible extra computational cost. On three widely adopted VIS benchmarks, i.e., YouTube-VIS-2019, YouTube-VIS-2021, and OVIS, TeViT obtains state-of-the-art results and maintains high inference speed, e.g., 46.6 AP with 68.9 FPS on YouTube-VIS-2019. Code is available at this https URL.
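The abstract does not spell out how the messenger shift works, so the following is only an illustrative sketch of one way a parameter-free temporal shift could look, not the authors' implementation. It assumes each frame carries a small set of extra "messenger" tokens, that these tokens are split into contiguous groups, and that each group is rolled circularly along the time axis by a fixed offset. The function name `messenger_shift`, the `offsets` parameter, and the grouping rule are all hypothetical.

```python
from typing import List, Sequence

Token = List[float]   # one messenger token (a feature vector)
Frame = List[Token]   # all messenger tokens belonging to one frame


def messenger_shift(frames: List[Frame],
                    offsets: Sequence[int] = (0, 1, -1)) -> List[Frame]:
    """Circularly shift groups of messenger tokens along the time axis.

    Hypothetical scheme (an assumption, not taken from the abstract):
    the tokens of every frame are split into len(offsets) contiguous
    groups, and group g at frame t is replaced by the same group taken
    from frame (t - offsets[g]) modulo the number of frames.
    No learned parameters are involved; tokens are only re-indexed.
    """
    if not frames:
        return []
    num_frames = len(frames)
    num_tokens = len(frames[0])
    shifted: List[Frame] = []
    for t in range(num_frames):
        frame_out: Frame = []
        for i in range(num_tokens):
            # Map token index i to its group index.
            group = i * len(offsets) // num_tokens
            # Frame that supplies token i for output frame t.
            source = (t - offsets[group]) % num_frames
            frame_out.append(list(frames[source][i]))
        shifted.append(frame_out)
    return shifted


if __name__ == "__main__":
    # Three frames, three 1-D messenger tokens each; value = 10 * frame + index.
    clip = [[[float(10 * t + i)] for i in range(3)] for t in range(3)]
    for frame in messenger_shift(clip):
        print(frame)
```

After the shift, each frame holds some messenger tokens that originated in neighbouring frames, so a subsequent per-frame attention layer can mix temporal context without any additional weights. The actual grouping and shift pattern used by TeViT may differ from this sketch.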

Comments:	To appear in CVPR 2022
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2204.08412 [cs.CV]
	(or arXiv:2204.08412v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2204.08412

Submission history

From: Shusheng Yang [view email]
[v1] Mon, 18 Apr 2022 17:09:20 UTC (1,041 KB)
