Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation

Feng, Guang; Zhang, Lihe; Hu, Zhiwei; Lu, Huchuan

arXiv:2203.15969 (cs)

[Submitted on 30 Mar 2022]

Abstract: Referring video segmentation aims to segment the video object described by a language expression. To address this task, we first design a two-stream encoder that hierarchically extracts CNN-based visual features and transformer-based linguistic features, and we insert a vision-language mutual guidance (VLMG) module into the encoder multiple times to promote hierarchical and progressive fusion of multi-modal features. Compared with existing multi-modal fusion methods, this two-stream encoder takes multi-granularity linguistic context into account and realizes deep interleaving between modalities with the help of VLMG. To promote temporal alignment between frames, we further propose a language-guided multi-scale dynamic filtering (LMDF) module to strengthen temporal coherence; it uses language-guided spatial-temporal features to generate a set of position-specific dynamic filters that update the features of the current frame more flexibly and effectively. Extensive experiments on four datasets verify the effectiveness of the proposed model.
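The idea of position-specific dynamic filtering mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' LMDF module: the function name `position_specific_filtering`, the linear projection `proj`, the softmax normalization, the single 3x3 scale, and the sharing of one kernel across channels are all assumptions made for illustration, since the abstract does not specify these details.

```python
import numpy as np

def position_specific_filtering(feat, guide, proj, k=3):
    """Filter `feat` with a different k x k kernel at every spatial position.

    feat:  (C, H, W) features of the current frame (hypothetical input).
    guide: (D, H, W) guidance features, standing in for language-guided
           spatial-temporal features (hypothetical input).
    proj:  (k*k, D) linear map from guidance features to kernel logits
           (an assumption; the paper may generate filters differently).
    """
    C, H, W = feat.shape
    # Predict k*k kernel logits at each position from the guidance features.
    logits = np.einsum("kd,dhw->khw", proj, guide)
    # Softmax over the k*k taps so each position's kernel sums to 1.
    logits -= logits.max(axis=0, keepdims=True)
    kernels = np.exp(logits)
    kernels /= kernels.sum(axis=0, keepdims=True)  # shape (k*k, H, W)

    # Pad borders by replicating edge values so the output keeps shape (C, H, W).
    pad = k // 2
    padded = np.pad(feat, ((0, 0), (pad, pad), (pad, pad)), mode="edge")

    # Accumulate each shifted copy of the input, weighted per position.
    out = np.zeros_like(feat)
    for i in range(k):
        for j in range(k):
            out += kernels[i * k + j] * padded[:, i:i + H, j:j + W]
    return out
```

Unlike an ordinary convolution, whose kernel is fixed across the image, the kernel here varies with position because it is computed from the guidance features at that position. A multi-scale variant would repeat the same step at several resolutions or kernel sizes, which this sketch omits.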

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2203.15969 [cs.CV]
	(or arXiv:2203.15969v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2203.15969

Submission history

From: Guang Feng
[v1] Wed, 30 Mar 2022 01:06:13 UTC (20,105 KB)
