UniVTG: Towards Unified Video-Language Temporal Grounding

Lin, Kevin Qinghong; Zhang, Pengchuan; Chen, Joya; Pramanick, Shraman; Gao, Difei; Wang, Alex Jinpeng; Yan, Rui; Shou, Mike Zheng

[Submitted on 31 Jul 2023 (v1), last revised 18 Aug 2023 (this version, v2)]

Abstract: Video Temporal Grounding (VTG), which aims to ground target clips from videos (such as consecutive intervals or disjoint shots) according to custom language queries (e.g., sentences or words), is key for video browsing on social media. Most methods in this direction develop task-specific models that are trained with type-specific labels, such as moment retrieval (time interval) and highlight detection (worthiness curve), which limits their ability to generalize to various VTG tasks and labels. In this paper, we propose to Unify the diverse VTG labels and tasks, dubbed UniVTG, along three directions: Firstly, we revisit a wide range of VTG labels and tasks and define a unified formulation. Based on this, we develop data annotation schemes to create scalable pseudo supervision. Secondly, we develop an effective and flexible grounding model capable of addressing each task and making full use of each label. Lastly, thanks to the unified framework, we are able to unlock temporal grounding pretraining from large-scale diverse labels and develop stronger grounding abilities, e.g., zero-shot grounding. Extensive experiments on three tasks (moment retrieval, highlight detection and video summarization) across seven datasets (QVHighlights, Charades-STA, TACoS, Ego4D, YouTube Highlights, TVSum, and QFVS) demonstrate the effectiveness and flexibility of our proposed framework. The code is available at this https URL.

Comments:	Accepted by ICCV 2023. 16 pages, 10 figures, 13 tables. Code: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2307.16715 [cs.CV]
	(or arXiv:2307.16715v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2307.16715

Submission history

From: Qinghong Lin
[v1] Mon, 31 Jul 2023 14:34:49 UTC (5,545 KB)
[v2] Fri, 18 Aug 2023 07:56:32 UTC (5,547 KB)
