Multi-grained encoding and joint embedding space fusion for video and text cross-modal retrieval

X Cui, J Xiao, Y Cao, J Zhu - Multimedia Tools and Applications, 2022 - Springer
Abstract
Video-text cross-modal retrieval is significant to computer vision. Most existing works focus on exploring the global similarity between modalities but ignore the influence of details on retrieval results. How to explore the correlation between different forms of data from multiple angles is a key issue. In this paper, we propose Multi-grained Encoding and Joint Embedding Spaces Fusion (MEJESF) for video-text cross-modal retrieval. Specifically, we propose a novel dual encoding network that explores not only the coarse-grained features but also the fine-grained features of both modalities. At the same time, taking multi-grained encoding and hard sample mining into consideration, a modified pairwise ranking loss function is introduced. We then build two joint embedding spaces and fuse their similarity scores at retrieval time. Experiments on two public benchmark datasets (MSR-VTT, MSVD) demonstrate that our method obtains promising performance compared to state-of-the-art methods in video-text cross-modal retrieval. Furthermore, our network model achieves outstanding performance in zero-example video retrieval.
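The abstract mentions a modified pairwise ranking loss with hard sample mining but does not give its formulation here. As a point of reference, the sketch below shows a standard max-margin ranking loss with hardest-negative mining (in the style of VSE++), which such losses commonly build on; all names (`video_emb`, `text_emb`, `margin`) are illustrative assumptions, not the paper's actual code.

```python
# Minimal sketch (not the authors' implementation) of a pairwise ranking
# loss with hardest-negative mining, a common basis for losses like the
# "modified pairwise ranking loss" the abstract describes.
import torch

def ranking_loss_hard_negative(video_emb, text_emb, margin=0.2):
    """video_emb, text_emb: (B, D) L2-normalized embeddings of matched
    video/text pairs, where row i of each tensor is a positive pair."""
    # Cosine similarity matrix; the diagonal holds positive-pair scores.
    scores = video_emb @ text_emb.t()                      # (B, B)
    diag = scores.diag().view(-1, 1)                       # (B, 1)

    # Margin violation of each negative against its positive pair.
    cost_v2t = (margin + scores - diag).clamp(min=0)       # video query, text negatives
    cost_t2v = (margin + scores - diag.t()).clamp(min=0)   # text query, video negatives

    # Zero out the positives on the diagonal.
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_v2t = cost_v2t.masked_fill(mask, 0)
    cost_t2v = cost_t2v.masked_fill(mask, 0)

    # Hard sample mining: keep only the hardest negative per query.
    return cost_v2t.max(dim=1)[0].mean() + cost_t2v.max(dim=0)[0].mean()
```

For the score fusion step, the abstract only states that the two joint embedding spaces' scores are fused at retrieval time; a simple weighted sum such as `s = alpha * s1 + (1 - alpha) * s2` is one common realization, though the paper's exact fusion rule is not given in this snippet.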