Choosing Wisely and Learning Deeply: Selective Cross-Modality Distillation via CLIP for Domain Generalization

Leng, Jixuan; Li, Yijiang; Wang, Haohan

arXiv:2311.15145 (cs)

[Submitted on 26 Nov 2023 (v1), last revised 22 Apr 2024 (this version, v3)]

Abstract: Domain Generalization (DG), a crucial research area, seeks to train models across multiple domains and test them on unseen ones. In this paper, we introduce a novel approach, namely, Selective Cross-Modality Distillation for Domain Generalization (SCMD). SCMD leverages the capabilities of large vision-language models, specifically CLIP, to train a more efficient model, ensuring it acquires robust generalization capabilities across unseen domains. Our primary contribution is a unique selection framework strategically designed to identify hard-to-learn samples for distillation. In parallel, we introduce a novel cross-modality module that seamlessly combines the projected features of the student model with the text embeddings from CLIP, ensuring the alignment of similarity distributions. We assess SCMD's performance on various benchmarks, where it empowers a ResNet50 to deliver state-of-the-art performance, surpassing existing domain generalization methods. Furthermore, we provide a theoretical analysis of our selection strategy, offering deeper insight into its effectiveness and potential in the field of DG.
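The two ideas named in the abstract, selecting hard-to-learn samples and aligning similarity distributions against CLIP text embeddings, can be illustrated with a rough NumPy sketch. This is not the authors' implementation. The abstract does not state the selection criterion, so ranking by the student's per-sample cross-entropy is an assumption here, as are the KL-divergence form of the alignment loss, the linear projection `proj`, and every function and parameter name below. Random arrays would stand in for real CLIP embeddings.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def select_hard_samples(student_logits, labels, k):
    """Indices of the k samples with the highest student cross-entropy.

    Assumption: "hard-to-learn" is approximated by per-sample loss.
    """
    probs = softmax(student_logits)
    ce = -np.log(probs[np.arange(len(labels)), labels] + 1e-12)
    return np.argsort(-ce)[:k]

def cross_modal_kd_loss(student_feats, proj, text_emb, teacher_img_emb,
                        temperature=1.0):
    """Mean KL(teacher || student) over image-to-text similarity distributions.

    student_feats:   (N, d_student) student backbone features
    proj:            (d_student, d_clip) hypothetical linear projection
    text_emb:        (C, d_clip) one text embedding per class
    teacher_img_emb: (N, d_clip) teacher image embeddings
    """
    s = l2_normalize(student_feats @ proj)
    t = l2_normalize(teacher_img_emb)
    txt = l2_normalize(text_emb)
    p_teacher = softmax(t @ txt.T / temperature)
    p_student = softmax(s @ txt.T / temperature)
    kl = np.sum(
        p_teacher * (np.log(p_teacher + 1e-12) - np.log(p_student + 1e-12)),
        axis=1,
    )
    return kl.mean()
```

In a training loop under these assumptions, `select_hard_samples` would pick a subset of each batch, and `cross_modal_kd_loss` would be evaluated on that subset only and added to the usual classification loss.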

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2311.15145 [cs.CV]
	(or arXiv:2311.15145v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2311.15145

Submission history

From: Jixuan Leng
[v1] Sun, 26 Nov 2023 00:06:12 UTC (1,066 KB)
[v2] Sun, 17 Dec 2023 07:06:31 UTC (1,066 KB)
[v3] Mon, 22 Apr 2024 03:32:18 UTC (1,683 KB)
