Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions

Xue, Hongwei; Hang, Tiankai; Zeng, Yanhong; Sun, Yuchong; Liu, Bei; Yang, Huan; Fu, Jianlong; Guo, Baining

Computer Science > Computer Vision and Pattern Recognition

arXiv:2111.10337 (cs)

[Submitted on 19 Nov 2021 (v1), last revised 8 Jul 2022 (this version, v2)]

Abstract: We study joint video and language (VL) pre-training to enable cross-modality learning and benefit a wide range of downstream VL tasks. Existing works either extract low-quality video features or learn limited text embeddings, neglecting that high-resolution videos and diversified semantics can significantly improve cross-modality learning. In this paper, we propose a novel High-resolution and Diversified VIdeo-LAnguage pre-training model (HD-VILA) for many visual tasks. In particular, we collect a large dataset with two distinct properties: 1) it is the first high-resolution dataset, including 371.5k hours of 720p videos, and 2) it is the most diversified dataset, covering 15 popular YouTube categories. To enable VL pre-training, we jointly optimize the HD-VILA model with a hybrid Transformer that learns rich spatiotemporal features and a multimodal Transformer that enforces interactions between the learned video features and diversified texts. Our pre-training model achieves new state-of-the-art results on 10 VL understanding tasks and 2 novel text-to-visual generation tasks. For example, we outperform SOTA models with relative increases in R@1 of 40.4% on the zero-shot MSR-VTT text-to-video retrieval task and 55.4% on the high-resolution dataset LSMDC. The learned VL embedding is also effective in generating visually pleasing and semantically relevant results in text-to-visual editing and super-resolution tasks.
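
The abstract reports gains as relative increases in R@1. As a reading aid, the minimal sketch below shows how Recall@K is conventionally computed for text-to-video retrieval and how a relative increase differs from an absolute one. It is not the authors' evaluation code: the function names, the assumption that the matching video for query i sits at index i, and the toy scores are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int = 1) -> float:
    """Fraction of text queries whose matching video ranks in the top k.

    Assumes similarity[i][j] scores text query i against video j, and that
    the ground-truth video for query i sits at index i (hypothetical layout).
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank = number of videos scored strictly higher than the true match.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)


def relative_increase(new: float, baseline: float) -> float:
    """Gain of `new` over `baseline`, expressed as a fraction of baseline."""
    return (new - baseline) / baseline


# Toy scores for 4 text queries against 4 videos (made-up numbers).
toy_scores: List[List[float]] = [
    [0.9, 0.1, 0.2, 0.3],  # query 0: true video ranked first
    [0.8, 0.5, 0.1, 0.2],  # query 1: true video ranked second
    [0.1, 0.2, 0.7, 0.3],  # query 2: true video ranked first
    [0.6, 0.5, 0.4, 0.3],  # query 3: true video ranked fourth
]

r1 = recall_at_k(toy_scores, k=1)
r2 = recall_at_k(toy_scores, k=2)
print(f"R@1 = {r1:.2f}, R@2 = {r2:.2f}")
```

Under this definition, a relative increase of 40.4% means the new R@1 is 1.404 times the baseline R@1, which is a different quantity from a gain of 40.4 percentage points.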

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2111.10337 [cs.CV]
	(or arXiv:2111.10337v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2111.10337
Journal reference:	published in CVPR 2022

Submission history

From: Bei Liu
[v1] Fri, 19 Nov 2021 17:36:01 UTC (6,961 KB)
[v2] Fri, 8 Jul 2022 08:46:43 UTC (14,939 KB)
