Deep ViT Features as Dense Visual Descriptors

Amir, Shir; Gandelsman, Yossi; Bagon, Shai; Dekel, Tali

arXiv:2112.05814 (cs)

[Submitted on 10 Dec 2021 (v1), last revised 15 Oct 2022 (this version, v3)]

Abstract: We study the use of deep features extracted from a pretrained Vision Transformer (ViT) as dense visual descriptors. We observe and empirically demonstrate that such features, when extracted from a self-supervised ViT model (DINO-ViT), exhibit several striking properties, including: (i) the features encode powerful, well-localized semantic information, at high spatial granularity, such as object parts; (ii) the encoded semantic information is shared across related, yet different object categories, and (iii) positional bias changes gradually throughout the layers. These properties allow us to design simple methods for a variety of applications, including co-segmentation, part co-segmentation and semantic correspondences. To distill the power of ViT features from convoluted design choices, we restrict ourselves to lightweight zero-shot methodologies (e.g., binning and clustering) applied directly to the features. Since our methods require no additional training or data, they are readily applicable across a variety of domains. We show by extensive qualitative and quantitative evaluation that our simple methodologies achieve competitive results with recent state-of-the-art supervised methods, and outperform previous unsupervised methods by a large margin. Code is available in this http URL.
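
The abstract names clustering and correspondence applied directly to per-patch features but does not spell out the algorithms. The following is a minimal sketch of what such zero-shot processing could look like, not the authors' released code. It assumes the descriptors are already available as plain NumPy arrays of shape (num_patches, dim); the function names `spherical_kmeans` and `mutual_nearest_neighbors`, the cosine-similarity metric, and the farthest-point initialisation are all illustrative assumptions.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each descriptor (last axis) to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def spherical_kmeans(desc, k, iters=10):
    """Cluster descriptors by cosine similarity; returns one label per row.

    Hypothetical helper: rows of `desc` stand in for per-patch ViT features.
    """
    desc = l2_normalize(desc)
    # Farthest-point initialisation (an assumption of this sketch): start from
    # the first descriptor, then repeatedly add the descriptor that is least
    # similar to every centroid chosen so far.
    chosen = [0]
    for _ in range(k - 1):
        sim_to_chosen = (desc @ desc[chosen].T).max(axis=1)
        chosen.append(int(sim_to_chosen.argmin()))
    centroids = desc[chosen].copy()
    for _ in range(iters):
        # Assign each descriptor to its most similar centroid.
        labels = (desc @ centroids.T).argmax(axis=1)
        # Move each centroid to the normalized mean of its members.
        for c in range(k):
            members = desc[labels == c]
            if len(members) > 0:
                centroids[c] = l2_normalize(members.mean(axis=0))
    return (desc @ centroids.T).argmax(axis=1)


def mutual_nearest_neighbors(desc_a, desc_b):
    """Return index pairs (i, j) where a_i and b_j are each other's nearest
    neighbor under cosine similarity."""
    sim = l2_normalize(desc_a) @ l2_normalize(desc_b).T
    nn_ab = sim.argmax(axis=1)  # best match in B for each row of A
    nn_ba = sim.argmax(axis=0)  # best match in A for each row of B
    idx_a = np.arange(len(desc_a))
    keep = nn_ba[nn_ab] == idx_a  # keep only pairs that agree in both directions
    return np.stack([idx_a[keep], nn_ab[keep]], axis=1)
```

In this sketch, clustering the descriptors of several images jointly gives a label per patch that can be reshaped to the patch grid as a rough co-segmentation map, while the mutual-nearest-neighbor pairs serve as candidate correspondences between two images. Both steps need no training, which matches the zero-shot setting the abstract describes.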

Comments:	Revised version - high res figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2112.05814 [cs.CV]
	(or arXiv:2112.05814v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2112.05814

Submission history

From: Shir Amir
[v1] Fri, 10 Dec 2021 20:15:03 UTC (32,719 KB)
[v2] Sun, 4 Sep 2022 16:24:40 UTC (2,019 KB)
[v3] Sat, 15 Oct 2022 21:18:49 UTC (40,231 KB)
