Local-Global Video-Text Interactions for Temporal Grounding

Mun, Jonghwan; Cho, Minsu; Han, Bohyung

arXiv:2004.07514 (cs)

[Submitted on 16 Apr 2020]

Abstract: This paper addresses the problem of text-to-video temporal grounding, which aims to identify the time interval in a video that is semantically relevant to a text query. We tackle this problem using a novel regression-based model that learns to extract a collection of mid-level features for semantic phrases in a text query, which correspond to important semantic entities described in the query (e.g., actors, objects, and actions), and to reflect bi-modal interactions between the linguistic features of the query and the visual features of the video at multiple levels. The proposed method effectively predicts the target time interval by exploiting contextual information from local to global during bi-modal interactions. Through in-depth ablation studies, we find that incorporating both local and global context in video and text interactions is crucial to accurate grounding. Our experiments show that the proposed method outperforms the state of the art on the Charades-STA and ActivityNet Captions datasets by large margins, 7.44 and 4.61 percentage points in the Recall@tIoU=0.5 metric, respectively. Code is available at this https URL.
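
For readers unfamiliar with the Recall@tIoU=0.5 metric quoted above, the following is a minimal sketch of how it is commonly computed in the temporal grounding literature: the fraction of queries whose top-1 predicted interval overlaps the ground-truth interval with a temporal intersection-over-union (tIoU) of at least 0.5. The abstract does not spell out the evaluation protocol, so the function names (`temporal_iou`, `recall_at_tiou`), the `>=` comparison, and the toy intervals are assumptions made for illustration, not the authors' evaluation code.

```python
from typing import Sequence, Tuple

# An interval is (start_sec, end_sec) on the video timeline (illustrative type).
Interval = Tuple[float, float]


def temporal_iou(pred: Interval, gt: Interval) -> float:
    """Temporal IoU: overlap length divided by the length of the union."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_tiou(
    preds: Sequence[Interval],
    gts: Sequence[Interval],
    threshold: float = 0.5,
) -> float:
    """Fraction of queries whose single predicted interval reaches the threshold.

    Assumes one (top-1) prediction per query, paired by position with its
    ground-truth interval.
    """
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)


if __name__ == "__main__":
    # Toy intervals, made up for illustration only (not from any dataset).
    predictions = [(0.0, 10.0), (2.0, 8.0)]
    ground_truth = [(5.0, 15.0), (0.0, 10.0)]
    print(recall_at_tiou(predictions, ground_truth))
```

Under this definition, a gain of 7.44 percentage points means that roughly 7 more queries per 100 are localized with a tIoU of at least 0.5.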

Comments:	CVPR 2020; code available in this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2004.07514 [cs.CV]
	(or arXiv:2004.07514v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2004.07514

Submission history

From: Jonghwan Mun
[v1] Thu, 16 Apr 2020 08:10:41 UTC (1,992 KB)
