SIGMA: Sinkhorn-Guided Masked Video Modeling

Salehi, Mohammadreza; Dorkenwald, Michael; Thoker, Fida Mohammad; Gavves, Efstratios; Snoek, Cees G. M.; Asano, Yuki M.

arXiv:2407.15447 (cs)

[Submitted on 22 Jul 2024]

Abstract: Video-based pretraining offers immense potential for learning strong visual representations on an unprecedented scale. Recently, masked video modeling methods have shown promising scalability, yet they fall short in capturing higher-level semantics because they reconstruct predefined low-level targets such as pixels. To tackle this, we present Sinkhorn-guided Masked Video Modeling (SIGMA), a novel video pretraining method that jointly learns the video model and a target feature space using a projection network. However, this simple modification means that the regular L2 reconstruction loss will lead to trivial solutions, as both networks are jointly optimized. As a solution, we distribute features of space-time tubes evenly across a limited number of learnable clusters. By posing this as an optimal transport problem, we enforce high entropy in the generated features across the batch, infusing semantic and temporal meaning into the feature space. The resulting cluster assignments are used as targets for a symmetric prediction task in which the video model predicts the cluster assignments of the projection network and vice versa. Experimental results on ten datasets across three benchmarks validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations, improving upon state-of-the-art methods. Our project website with code is available at: this https URL.
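
The even distribution of features across clusters described in the abstract can be illustrated with a generic Sinkhorn-Knopp normalization. The block below is a minimal sketch, not the authors' implementation: the function name `sinkhorn_assignments`, the temperature `epsilon`, the iteration count `n_iters`, and the assumption that the inputs are bounded similarity scores (for example cosine similarities between tube features and cluster prototypes) are all illustrative choices that do not come from the abstract.

```python
import math


def sinkhorn_assignments(scores, epsilon=0.05, n_iters=3):
    """Map an N x K matrix of feature-to-cluster scores to soft assignments.

    Hypothetical helper for illustration only. Rows are samples (e.g. one
    space-time tube each), columns are clusters. The result gives every
    sample a probability distribution over clusters while pushing all
    clusters toward equal total usage across the batch.
    Assumes bounded scores so that exp() does not underflow to zero.
    """
    n = len(scores)
    k = len(scores[0])

    # Subtract the global maximum before exponentiating for numerical stability.
    top = max(max(row) for row in scores)
    q = [[math.exp((s - top) / epsilon) for s in row] for row in scores]

    # Normalize so that the whole matrix sums to 1 (a joint distribution).
    total = sum(sum(row) for row in q)
    q = [[v / total for v in row] for row in q]

    for _ in range(n_iters):
        # Column step: every cluster receives total mass 1 / K.
        col_sums = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[q[i][j] / (col_sums[j] * k) for j in range(k)] for i in range(n)]

        # Row step: every sample carries total mass 1 / N.
        row_sums = [sum(row) for row in q]
        q = [[v / (row_sums[i] * n) for v in row] for i, row in enumerate(q)]

    # Rescale so that each row sums to 1 and can serve as a prediction target.
    return [[v * n for v in row] for row in q]
```

Calling `sinkhorn_assignments` on a batch in which every sample initially prefers the same cluster still yields roughly equal column sums, which is the property that rules out the trivial solution of assigning all features to one cluster.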

Comments:	Accepted at ECCV 2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2407.15447 [cs.CV]
	(or arXiv:2407.15447v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2407.15447

Submission history

From: Mohammadreza Salehi Dehnavi
[v1] Mon, 22 Jul 2024 08:04:09 UTC (3,144 KB)
