Leveraging Temporal Contextualization for Video Action Recognition

Kim, Minji; Han, Dongyoon; Kim, Taekyung; Han, Bohyung

arXiv:2404.09490 (cs)

[Submitted on 15 Apr 2024 (v1), last revised 24 Jul 2024 (this version, v2)]

Abstract: We propose a novel framework for video understanding, called Temporally Contextualized CLIP (TC-CLIP), which leverages essential temporal information through global interactions in a spatio-temporal domain within a video. To be specific, we introduce Temporal Contextualization (TC), a layer-wise temporal information infusion mechanism for videos, which 1) extracts core information from each frame, 2) connects relevant information across frames for the summarization into context tokens, and 3) leverages the context tokens for feature encoding. Furthermore, the Video-conditional Prompting (VP) module processes context tokens to generate informative prompts in the text modality. Extensive experiments in zero-shot, few-shot, base-to-novel, and fully-supervised action recognition validate the effectiveness of our model. Ablation studies for TC and VP support our design choices. Our project page with the source code is available at this https URL
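
The three TC steps named in the abstract can be pictured with a small NumPy sketch. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how tokens are scored, how they are summarized, or how context tokens enter the encoder. Here the token score is the L2 norm, the summarization is a few k-means iterations, and the context tokens are appended to each frame's keys and values. The function names `select_seed_tokens`, `summarize_context`, and `attend_with_context` are hypothetical.

```python
import numpy as np


def select_seed_tokens(frames, k):
    """Step 1 (assumed form): keep the k highest-scoring patch tokens per frame.

    frames has shape (T, N, D). The score is the token L2 norm, a stand-in
    for whatever informativeness measure a real model would use.
    """
    scores = np.linalg.norm(frames, axis=-1)                     # (T, N)
    idx = np.argsort(-scores, axis=1)[:, :k]                     # (T, k)
    return np.take_along_axis(frames, idx[:, :, None], axis=1)   # (T, k, D)


def summarize_context(seeds, num_context, iters=5):
    """Step 2 (assumed form): pool seeds from all frames, then summarize
    them into num_context tokens with a few k-means iterations."""
    pooled = seeds.reshape(-1, seeds.shape[-1])                  # (T*k, D)
    centers = pooled[:num_context].copy()                        # simple init
    for _ in range(iters):
        dist = ((pooled[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dist.argmin(axis=1)                             # (T*k,)
        for c in range(num_context):
            members = pooled[assign == c]
            if len(members) > 0:                                 # skip empty clusters
                centers[c] = members.mean(axis=0)
    return centers                                               # (C, D)


def attend_with_context(frames, context):
    """Step 3 (assumed form): tokens of each frame attend over that frame's
    own tokens plus the shared context tokens (single head, no projections)."""
    T, N, D = frames.shape
    ctx = np.broadcast_to(context, (T,) + context.shape)         # (T, C, D)
    kv = np.concatenate([frames, ctx], axis=1)                   # (T, N+C, D)
    logits = frames @ kv.transpose(0, 2, 1) / np.sqrt(D)         # (T, N, N+C)
    logits = logits - logits.max(axis=-1, keepdims=True)         # stable softmax
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ kv                                          # (T, N, D)


# Toy input: 4 frames, 16 patch tokens per frame, 8-dimensional features.
rng = np.random.default_rng(0)
video = rng.normal(size=(4, 16, 8))
seeds = select_seed_tokens(video, k=4)
context = summarize_context(seeds, num_context=3)
encoded = attend_with_context(video, context)
print(seeds.shape, context.shape, encoded.shape)
```

In this sketch the context tokens are the only path by which one frame's content reaches another frame's tokens, which is one plausible reading of "global interactions in a spatio-temporal domain".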

Comments:	26 pages, 11 figures, 16 tables. To be presented at ECCV'24
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2404.09490 [cs.CV]
	(or arXiv:2404.09490v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2404.09490

Submission history

From: Taekyung Kim
[v1] Mon, 15 Apr 2024 06:24:56 UTC (2,345 KB)
[v2] Wed, 24 Jul 2024 05:08:08 UTC (3,762 KB)
