Trust your partner's friends: Hierarchical cross-modal contrastive pre-training for video-text retrieval

Y Xiang, K Liu, S Tang, L Bai, F Zhu, R Zhao, X Lin
ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing, 2023 - ieeexplore.ieee.org
Video-text retrieval has benefited greatly from massive amounts of web video in recent years, yet performance remains limited by the weak supervision provided by uncurated data. In this work, we propose to leverage the well-represented information of each original modality and to exploit complementary information between two views of the same video, i.e., video clips and captions, by using one view to obtain positive samples from the neighboring samples of the other. Respecting the hierarchical organization of real-world data, we further design a hierarchical cross-modal pre-training method (HCP) to learn good representations in the common embedding space. We evaluate the pre-trained model on three downstream tasks, i.e., text-to-video retrieval, action step localization, and video question answering, and our method outperforms previous works under the same setting.
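The core idea of using one view to mine positives from the other can be sketched as follows. This is a minimal, hypothetical illustration (not the authors' released code): for each clip, its caption's nearest-neighbor captions nominate extra positive video-text pairs, which are then treated as positives in an InfoNCE-style contrastive loss. All function names and the `k`/`temp` parameters are assumptions for the sketch.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Unit-normalize embeddings so dot products equal cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cross_view_positives(video_emb, text_emb, k=2):
    """For each sample, use caption-space nearest neighbors (the 'other view')
    to nominate k extra positive indices. Hypothetical sketch of the idea."""
    t = l2_normalize(text_emb)
    sim = t @ t.T                           # caption-caption cosine similarity
    np.fill_diagonal(sim, -np.inf)          # exclude self from neighbor search
    return np.argsort(-sim, axis=1)[:, :k]  # top-k neighbor indices per sample

def nce_loss_with_neighbors(video_emb, text_emb, nbrs, temp=0.07):
    """InfoNCE-style loss where each clip's positives are its own caption
    plus the captions of its cross-view neighbors."""
    v = l2_normalize(video_emb)
    t = l2_normalize(text_emb)
    logits = v @ t.T / temp                 # clip-to-caption similarities
    n = logits.shape[0]
    pos_mask = np.eye(n, dtype=bool)
    pos_mask[np.arange(n)[:, None], nbrs] = True   # neighbors as extra positives
    # Average negative log-probability over all positive pairs per anchor.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss = -(log_prob * pos_mask).sum(axis=1) / pos_mask.sum(axis=1)
    return loss.mean()
```

In practice the embeddings would come from the video and text encoders during pre-training, and the neighbor mining would run over a memory bank or the current batch; here random vectors stand in for both.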