Trust your partner's friends: Hierarchical cross-modal contrastive pre-training for video-text retrieval

Y Xiang, K Liu, S Tang, L Bai, F Zhu, R Zhao, X Lin
ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing, 2023 - ieeexplore.ieee.org
Video-text retrieval has benefited greatly from massive amounts of web video in recent years, yet performance remains limited by the weak supervision provided by uncurated data. In this work, we propose to leverage the well-represented information of each original modality and to exploit complementary information between two views of the same video, i.e., video clips and captions, by using one view to obtain positive samples from the neighboring samples of the other. Respecting the hierarchical organization of real-world data, we further design a hierarchical cross-modal pre-training method (HCP) to learn good representations in the common embedding space. We evaluate the pre-trained model on three downstream tasks, i.e., text-to-video retrieval, action step localization, and video question answering, and our method outperforms previous works under the same setting.
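The core idea of using one view to mine positives from the other can be sketched as follows. This is a minimal, hypothetical illustration (not the authors' released code): for each clip, its caption's nearest-neighbor captions nominate extra positive video-text pairs, which are then treated as positives in an InfoNCE-style contrastive loss. All function names and the `k`/`temp` parameters are assumptions for the sketch.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Unit-normalize embeddings so dot products equal cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cross_view_positives(video_emb, text_emb, k=2):
    """For each sample, use caption-space nearest neighbors (the 'other view')
    to nominate k extra positive indices. Hypothetical sketch of the idea."""
    t = l2_normalize(text_emb)
    sim = t @ t.T                           # caption-caption cosine similarity
    np.fill_diagonal(sim, -np.inf)          # exclude self from neighbor search
    return np.argsort(-sim, axis=1)[:, :k]  # top-k neighbor indices per sample

def nce_loss_with_neighbors(video_emb, text_emb, nbrs, temp=0.07):
    """InfoNCE-style loss where each clip's positives are its own caption
    plus the captions of its cross-view neighbors."""
    v = l2_normalize(video_emb)
    t = l2_normalize(text_emb)
    logits = v @ t.T / temp                 # clip-to-caption similarities
    n = logits.shape[0]
    pos_mask = np.eye(n, dtype=bool)
    pos_mask[np.arange(n)[:, None], nbrs] = True   # neighbors as extra positives
    # Average negative log-probability over all positive pairs per anchor.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss = -(log_prob * pos_mask).sum(axis=1) / pos_mask.sum(axis=1)
    return loss.mean()
```

In practice the embeddings would come from the video and text encoders during pre-training, and the neighbor mining would run over a memory bank or the current batch; here random vectors stand in for both.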