Learning to Adapt Category Consistent Meta-Feature of CLIP for Few-Shot Classification
Abstract
Recent CLIP-based methods have shown promising zero-shot and few-shot performance on image classification tasks. Existing approaches such as CoOp and Tip-Adapter focus only on high-level visual features that are fully aligned with textual features and represent the “summary” of the image. However, the goal of few-shot learning is to classify unseen images of the same category from a few labeled samples. In contrast to high-level representations, low-level local representations (LRs) are more consistent between seen and unseen samples. Based on this observation, we propose the Meta-Feature Adaption method (MF-Adapter), which combines the complementary strengths of LRs and high-level semantic representations. Specifically, we introduce the Meta-Feature Unit (MF-Unit), a simple yet effective local similarity metric that measures category-consistent local context in an inductive manner. We then train an MF-Adapter to map image features to MF-Units so that intra-class knowledge generalizes between unseen images and the support set. Extensive experiments show that our method outperforms state-of-the-art CLIP-based few-shot classification methods and shows stronger performance on a set of challenging visual classification tasks.
1 Introduction
Visual understanding tasks, including image classification [50, 21, 23, 10, 33, 31], object detection [42, 3, 55, 4, 1], semantic segmentation [49, 41], and video action recognition [12, 13, 47, 41], have achieved great success owing to innovations in model structure, large amounts of annotated data [43], and many training iterations over massive model parameters [31]. Traditionally, the standard paradigm is to fine-tune downstream understanding tasks from models pre-trained on public datasets such as ImageNet [50]. Recently, self-supervised pre-training methods such as CLIP (Contrastive Language-Image Pretraining) [39] and ALIGN [24] were proposed to learn high-quality visual representations by contrastive learning on hundreds of millions of noisy text-image pairs [44]. By matching textual and visual contexts across modalities in the high-level semantic feature space, this framework also learns visual representations of images.
Further, many recent follow-up works have demonstrated that fine-tuning such a pre-trained vision-language model gives better results than traditional models [41, 32, 30, 14, 54, 57, 9], owing to the ultra-large-scale data. However, in few-shot classification, where the seen images are insufficient, simply adopting the “pre-training, fine-tuning” paradigm does not produce better results [39]. To exploit the pre-training knowledge and generalize better to the downstream task at the same time, the common practice is to fix the CLIP pre-trained weights while learning a lightweight module [14, 54, 37]. For example, prompt-based methods such as CoOp [57] and CoCoOp [56] learn continuous text prompts, replacing manually set prompts with trainable parameters, and achieve large improvements over intensively hand-tuned prompts. Instead of tuning prompts, adapter-based methods such as CLIP-Adapter [14] and Tip-Adapter [54] fine-tune only a small number of additional linear layers on top of CLIP. By fine-tuning a trainable query-key cache layer [36, 16, 17, 51], Tip-Adapter [54] achieves large performance gains over both prompt-based methods and CLIP-Adapter on several classification datasets.
In our study, however, we identify a critical problem shared by prompt-based and adapter-based methods in few-shot classification: they focus only on high-level visual features that, because of CLIP's training target, are fully aligned with textual features and represent the “summary” of images. The goal of few-shot learning is to classify unseen images of the same category from a few labeled samples, and local representations (LRs) are more consistent between seen and unseen images than high-level representations. Thus we ask the following question: can we achieve the best of both worlds, taking advantage of CLIP’s powerful high-level semantic representation while also learning a low-level consistent representation of categories in the few-shot task?
In this paper, we introduce a novel method named Meta-Feature Adapter (MF-Adapter), which learns similarity metrics from multi-scale features at multiple layers. The key idea is that CLIP models, with either ResNet or ViT encoders, learn high-level semantic features during self-supervised contrastive learning while also producing low-level features that contain details and edges [21, 10]. Therefore, we propose a simple yet effective multi-scale, multi-layer local similarity metric, termed the Meta-Feature Unit (MF-Unit). In detail, an MF-Unit is an inductive representation of local feature maps at different scales, obtained with sliding windows of different receptive fields, and it contains rich local information that represents category-consistent characteristics. As a visual example, given a picture of a dog, the MF-Unit inductively encodes the dog’s category characteristics in a local view (such as paws, mouth and shape). This information is more useful for unseen samples than a global feature representation (e.g., the 1024-d features from the final layer of CLIP’s ResNet-50). Before model training, the images in the support set are first encoded into features at different levels by the fixed CLIP visual encoder, and the low-level feature maps are then unfolded into Meta-Features by a list of sliding windows with different scales. To compress parameters and reduce redundancy, we finally compute the MF-Unit by applying inductive operations to each window; the result is regarded as a category-consistent feature containing inductive knowledge. In the training phase, we introduce a lightweight learnable MF-Adapter that adapts Meta-Features to MF-Units, replacing simple induction, for knowledge generalization within categories.
During inference, the proposed model first maps test images into MF-Units at different levels and scales, and then retrieves the few-shot knowledge in the MF-Unit space of the support set to obtain local logits. The final prediction is then obtained by combining these local logits with the original CLIP prediction at the high level (similar to Tip-Adapter [54]). With this MF-Unit design, our method can simultaneously exploit the global semantic context stored in the original CLIP and learn category-consistent knowledge from few-shot samples in an inductive manner. In summary, the contributions of our paper are as follows:
- We propose a novel method named Meta-Feature Adapter to combine the complementary strengths of both low-level and high-level semantic representations, which is the first work to utilize the local similarity of CLIP in few-shot learning.
- We introduce the multi-scale MF-Unit at multiple layers, which inductively measures category-consistent local context for knowledge generalization between seen and unseen samples.
- We present comprehensive experiments on 11 widely adopted datasets for few-shot classification and achieve state-of-the-art performance compared with previous CLIP-based methods.
2 Related Work
2.1 Pre-training Vision-language Model
The “pre-training, fine-tuning” paradigm has been successfully applied to both Natural Language Processing (NLP) [8, 40] and Computer Vision (CV) [21, 31] in the past decade. Specifically, there are two paradigms in CV for pre-training on large-scale datasets (e.g., ImageNet [50] and Kinetics [5]): supervision by labels [21, 31, 29, 46], or self-supervision without labels, such as MoCo [20], BYOL [18] and the recent MAE [19] in image classification tasks. Further, CLIP [39] and ALIGN [24] are representative recent frameworks that leverage hundreds of millions of image-text pairs collected from the internet to align the embedding space of images with raw texts. These frameworks have demonstrated the power of vision-language contrastive representation learning on zero-shot image classification tasks [54, 14, 57, 56, 53].
2.2 Few-shot Adaptation
Few-shot learning aims at transferring knowledge from a small dataset to a full classification task, which relies more on well pre-trained models. On top of CLIP, there are roughly two types of methods proposed to improve the training strategy in recent few-shot learning tasks. One is the prompt-based method and the other is the adapter-based method.
Prompt-based methods. The concept of prompt design first comes from NLP. Prompt learning aims to automate the process of generating proper prompts without manual design [26, 45, 15]. In the computer vision field, Visual Prompt Tuning (VPT) [25] achieved significant performance gains by introducing a small amount of task-specific learnable parameters in input space while freezing the entire pre-trained transformer backbone during downstream training. Besides, several efforts have started to find efficient strategies to transfer the visual-language model to few-shot classification tasks. As a landmark, Context Optimization (CoOp) [57] is the first to apply prompt learning for the adaptation of CLIP model in image classification task, which proposes to model the prompt’s context words with learnable vectors while the entire pre-trained parameters are kept fixed. Conditional CoOp (CoCoOp) [56] extends CoOp by learning an input-conditional token for each input image which provides better generalization than CoOp for unseen samples.
Adapter-based methods. In contrast, adapter-based methods fine-tune lightweight feature adapters instead of performing soft prompt tuning on text inputs. Specifically, CLIP-Adapter [14] introduces feature adapters on either the visual or the textual branch and fine-tunes them on the few-shot classification task, achieving better few-shot classification performance with a much simpler design. Zhang et al. [54] further propose a training-free adaption method that constructs a key-value cache model from the few-shot training set. Other works, such as CAVPT [53] and SVL-Adapter [37], further extend CLIP by introducing class-aware visual prompts in a self-supervised representation learning manner.
Like the prompt-based methods, most of these approaches apply a well-aligned semantic representation to downstream tasks while ignoring local category consistency in few-shot learning. In contrast, our method benefits from both CLIP’s powerful high-level semantic representation and low-level consistent knowledge, which is obtained through a learnable adapter.
3 Method

In this paper, we are the first to exploit the local similarity of CLIP in few-shot learning, building on the capability of CLIP at both low-level and high-level layers. In Section 3.2, we introduce our similarity metric, the Meta-Feature Unit (MF-Unit). Then in Section 3.3, we discuss the adaption module, a learnable adapter that maps low-level features to MF-Units and inductively summarizes the consistent knowledge between seen and unseen samples.
3.1 Method Overview
The capability of deep neural networks is known to benefit from large-scale, high-quality data. However, it is difficult and expensive to collect and label a clean and balanced dataset like ImageNet [7]. Therefore, contrastive learning paradigms such as CLIP [39] and DeCLIP [27], which are based on large-scale cross-modal data, have brought large improvements in computer vision and are widely used as pre-trained feature extractors. This leads to our hypothesis: since high-level features benefit from the capability of this cross-modal contrastive paradigm, low-level features are equally distinguishable. Based on this hypothesis, our method is designed as shown in Fig. 1. Following CoOp [57], CLIP-Adapter [14] and Tip-Adapter [54], our method is built on CLIP pre-trained encoders, and the text prompts are also taken from these works. First, the pre-trained CLIP encoders extract low-level support feature maps and high-level support features from all images of the support set, and textual features from the label texts. Second, we inductively cache the Meta-Feature Units (Section 3.2) of the low-level features in the support set to construct the Meta-Feature Unit space. During model training, we use the Meta-Feature Adapter (Section 3.3) to adapt the consistent low-level context for local representation. Finally, the classification result is obtained by combining the local representation at the low level, the semantic context at the high level, and the embedding similarity at the text level. In this way, our method benefits from low-level consistent knowledge through the learnable adapter as well as from CLIP’s powerful high-level semantic representations.
3.2 Meta-Feature Unit
Existing downstream methods often process the features on top of CLIP, which represent high-level semantics aligned with text context. However, in the high-level embedding space, features are aggregated according to text captions like “A photo of a dog”, which are so general that the category is difficult to induce from a few samples. In other words, the seen samples cannot cover most of the class characteristics, especially for fine-grained datasets with little variation among classes. Therefore, it is necessary to mine the category consistency of local features at as fine a granularity as possible. In few-shot classification, the dataset has K-shot N-class training samples, which means there are K annotated images in each of the N categories. Encoding the support set with the pre-trained visual and textual models yields three kinds of features, as shown in Fig. 1: low-level feature maps taken at different layers of ResNet-50, each of which keeps its spatial shape; high-level features of the support set; and text-level features of the labels, which have the original embedding dimension of CLIP.
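The K-shot N-class setting described above can be illustrated with a short sketch. It is not taken from the paper's code: the helper name `sample_support_set`, the fixed seed, and the use of integer class labels are assumptions made only for illustration.

```python
import random


def sample_support_set(labels, shots, seed=0):
    """Pick `shots` sample indices per class (hypothetical helper).

    `labels[i]` is the class of training image i. The result lists the
    indices of a K-shot N-class support set, grouped by class.
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    support = []
    for label in sorted(by_class):
        # random.sample raises ValueError if a class has fewer than `shots` images.
        support.extend(rng.sample(by_class[label], shots))
    return support


labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
support = sample_support_set(labels, shots=2)
```

With N = 3 classes and K = 2 shots, the support set holds N x K = 6 images, two per class.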
To extract the local consistency between the support set and query images, we define a similarity metric named the Meta-Feature Unit (MF-Unit). This unit is built on the low-level feature map, which retains spatial information. Because the interference of object sizes across images of the same category must be excluded, we apply multi-scale sliding-window operations to extract different meta-features at a given layer. It can be formulated as:
(1) |
Here the sliding-window operation has no learnable parameters and is controlled by the window’s kernel size and by the dilation, which varies across scales. After sliding, we obtain a list of meta-features that represent local context at different scales. We concatenate the meta-features of all scales along the last dimension to get the combined meta-feature, whose two dimensions are the window channel and the concatenated dimension of all meta-features.
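The multi-scale sliding-window step can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: it assumes a single-channel feature map, stride 1, no padding, and a kernel size of 3, and the function names are hypothetical.

```python
def unfold_2d(fmap, kernel=3, dilation=1):
    """Slide a kernel x kernel window (stride 1, no padding) over a 2-D map.

    Returns one flattened window per valid position. The window has no
    learnable parameters; `dilation` spaces out the sampled entries.
    """
    height, width = len(fmap), len(fmap[0])
    span = dilation * (kernel - 1)  # distance between first and last sampled entry
    windows = []
    for top in range(height - span):
        for left in range(width - span):
            windows.append([
                fmap[top + i * dilation][left + j * dilation]
                for i in range(kernel)
                for j in range(kernel)
            ])
    return windows


def multi_scale_meta_features(fmap, kernel=3, scale=2):
    """Concatenate the windows for dilations 1..scale along the position axis."""
    combined = []
    for dilation in range(1, scale + 1):
        combined.extend(unfold_2d(fmap, kernel, dilation))
    return combined


fmap = [[row * 7 + col for col in range(7)] for row in range(7)]
meta = multi_scale_meta_features(fmap, kernel=3, scale=2)
```

On this 7 x 7 map, dilation 1 yields 25 windows and dilation 2 yields 9, so the combined meta-feature holds 34 windows of 9 entries each. In a tensor library the same step would typically be expressed with an unfold (im2col) operation.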
Specifically, to reduce computation and extract the underlying common features, we condense the combined meta-feature with an inductive approach that applies two induction operations, which can be formulated as:
(2) |
where all induction and concatenation operations are applied along the channel dimension. In this way, we obtain the multi-scale MF-Unit of the support set, which inductively aggregates the most common consistency of a single category.
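The induction step can be illustrated as follows. The two statistics used here, mean and max over each window, are placeholders chosen for illustration only and are not claimed to be the operations of Equation 2. The sketch only shows the shape of the computation: every window is reduced to two values, giving a 2-channel unit.

```python
def mf_unit(meta_features):
    """Reduce each window to two statistics (assumed: mean and max).

    `meta_features` is a list of flattened windows. The result has two
    channels, each with one value per window position.
    """
    mean_channel = [sum(window) / len(window) for window in meta_features]
    max_channel = [max(window) for window in meta_features]
    return [mean_channel, max_channel]


windows = [
    [1.0, 2.0, 3.0, 6.0],
    [0.0, 0.0, 4.0, 4.0],
    [5.0, 5.0, 5.0, 5.0],
]
unit = mf_unit(windows)
```

The resulting unit has 2 channels and as many positions as there are windows, far fewer values than the raw meta-feature.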
However, as mentioned above, the MF-Unit is meant to describe local contexts such as color, shape, and edge, which may not be encoded in the same feature layer. Thus, we apply the above sliding-window operations at different layers; that is, the layer index takes the values 3 and 4. In this way, the final multi-scale, multi-layer MF-Units are able to exploit local representations of the few-shot seen samples. Collecting all the MF-Units of the support set yields the embedding space shown in Fig. 1, which guides the training of the Meta-Feature Adapter described next.
3.3 Meta-Feature Adapter
The critical problem of few-shot classification is that only a small portion of the data is seen, so the locally consistent information obtained by feature induction needs to be generalized to the unseen images of the same category. Therefore, it is not enough to predict the category by simply comparing the inductive similarity of MF-Units between seen and unseen images. We need a bridge between the low-level feature maps and the locally consistent information, through which we can generate the MF-Units of unseen samples. Thus, we propose a trainable adapter named Meta-Feature Adapter (MF-Adapter) to adaptively learn the consistency of MF-Units.
In detail, MF-Adapter is implemented as a single 1-D convolution, which is the only layer of the whole framework that requires gradient back-propagation during training. We compute the meta-features of each batch of training samples with the same sliding-window operation as in Equation 1. Then we concatenate the meta-features of all scales along the last dimension to get the combined meta-feature of the training samples. For adapter training, this 1-D convolution adapts the meta-features to 2-channel vectors, which have the same feature shape as the MF-Units of the support set. Our goal is to reduce the distance between samples of the same category. Therefore, we apply L2 normalization to both the training samples and the support set before computing their similarity, and obtain the local logits by matrix-vector multiplication as one part of the final predicted logits.
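A 1-D convolution that maps window channels to two output channels can be sketched as below. The kernel size of 1, which makes the layer a per-position linear map, is an assumption made for illustration, as are the function name and the toy weights.

```python
def pointwise_conv1d(meta, weight, bias):
    """Apply a kernel-size-1 1-D convolution (assumed configuration).

    meta:   C_in x L   (window channels x window positions)
    weight: C_out x C_in
    bias:   C_out
    Returns C_out x L; with C_out = 2 the output matches the 2-channel
    shape of an MF-Unit.
    """
    length = len(meta[0])
    return [
        [
            bias[out] + sum(weight[out][c] * meta[c][pos] for c in range(len(meta)))
            for pos in range(length)
        ]
        for out in range(len(weight))
    ]


meta = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 window channels, 2 positions
weight = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0]]   # 2 output channels
bias = [0.0, 1.0]
adapted = pointwise_conv1d(meta, weight, bias)
```

Only `weight` and `bias` would receive gradients in training; the meta-features themselves come from the frozen encoder.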
(3) |
where the one-hot labels of the support set are used for retrieval. Thus, the local logits take advantage of category consistency through label retrieval in the support set. Following Tip-Adapter [54], the prediction of our method contains three terms. The first one, described above, generalizes the local contexts between seen and unseen data and carries a rich low-level consistent representation. The last two terms summarize semantic information and preserve the prior knowledge from CLIP’s classifier, which can be formulated as:
(4) |
(5) |
The final logits combine these three terms, where the local branch is computed on layer3 and layer4 of ResNet-50. Because MF-Adapter is trainable, it can greatly boost CLIP by incorporating new knowledge from the few-shot training set. More specifically, we unfreeze the MF-Adapter layer but keep the MF-Units of the support set and the two encoders of the pre-trained CLIP frozen. The intuition is that mapping the low-level features to the cached MF-Units improves generalization, so that the cosine similarities between test and training images are computed more accurately in the same semantic space.
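The retrieval of local logits and their combination with the other two terms can be sketched as follows. This is a minimal illustration under stated assumptions: MF-Units are flattened to vectors, no sharpening function is applied to the affinities, and the combination weights `w_local` and `w_cache` are hypothetical placeholders rather than the paper's hyperparameters.

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def local_logits(query_unit, support_units, support_labels, num_classes):
    """Cosine affinity to every cached support unit, summed per class.

    Adding each affinity to the slot of its label is equivalent to
    multiplying the affinity vector by the one-hot label matrix.
    """
    query = l2_normalize(query_unit)
    logits = [0.0] * num_classes
    for unit, label in zip(support_units, support_labels):
        support = l2_normalize(unit)
        logits[label] += sum(a * b for a, b in zip(query, support))
    return logits


def combine_logits(local, cache, clip, w_local=1.0, w_cache=1.0):
    """Weighted sum of the local, cache, and CLIP terms (weights assumed)."""
    return [w_local * l + w_cache * c + z for l, c, z in zip(local, cache, clip)]


def predict(logits):
    return max(range(len(logits)), key=logits.__getitem__)


support_units = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
support_labels = [0, 1, 1]
local = local_logits([2.0, 0.0], support_units, support_labels, num_classes=2)
final = combine_logits(local, cache=[0.5, 0.1], clip=[1.0, 0.8])
label = predict(final)
```

Because both sides are L2-normalized, each dot product is a cosine similarity, so a query that points in the same direction as a cached unit contributes the maximum affinity of 1 to that unit's class.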












4 Experiments
4.1 Training Settings
To verify the performance of our proposed method, we compare MF-Adapter with state-of-the-art methods through extensive experiments on 11 widely used image classification datasets: ImageNet [7], StanfordCars [28], UCF101 [48], Caltech101 [11], Flowers102 [35], SUN397 [52], DTD [6], EuroSAT [22], FGVCAircraft [34], OxfordPets [38], and Food101 [2]. For a fair comparison, we follow Tip-Adapter [54] for the few-shot settings, training with 1, 2, 4, 8, and 16 shots and testing on the full test sets. As for the model structure, our method is built on the pre-trained vision-language model CLIP. For the textual branch, we adopt a prompt ensemble on ImageNet and a single handcrafted prompt on the other 10 datasets, the same as Tip-Adapter. The label textual embeddings of all datasets only need to be computed once before training. For the visual branch, we conduct the comparison on the ResNet-50 backbone and freeze it during training. Images are preprocessed following the CLIP protocol with random cropping, resizing, and random horizontal flipping. As the proposed adapter is trained from scratch, we use the Adam optimizer with initial learning rate and compute the cross-entropy (CE) loss between predicted logits and one-hot targets for back-propagation. All experiments use a batch size of 256 and are trained for 100 epochs on a single A10 GPU with 24 GB of memory.
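The CE loss between predicted logits and a one-hot target can be written out for a single sample. The sketch below is a generic, numerically stable formulation given for illustration; it is not taken from the paper's training code.

```python
import math


def cross_entropy(logits, target_index):
    """Cross-entropy of softmax(logits) against a one-hot target.

    For a one-hot target the loss reduces to log-sum-exp(logits) minus the
    logit of the true class. Subtracting the maximum avoids overflow.
    """
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target_index]


uniform_loss = cross_entropy([0.0, 0.0], 0)
confident_loss = cross_entropy([10.0, 0.0], 0)
wrong_loss = cross_entropy([10.0, 0.0], 1)
```

Uniform logits over two classes give a loss of log 2, and the loss shrinks as the logit of the true class grows relative to the others.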
The compared methods are Zero-shot CLIP [39], CoOp [57], CLIP-Adapter [14], and Tip-Adapter [54]. Zero-shot CLIP performs downstream classification directly with the pre-trained models. CoOp trains learnable prompts that replace the manual class tokens. CLIP-Adapter and Tip-Adapter are both adapter-based methods that use a lightweight learnable layer for few-shot learning. For a fair comparison, all scores are taken from the official papers.


4.2 Comparison on Public Datasets
Fig. 2 shows the performance comparison on the 11 datasets listed in Section 4.1, together with the average improvement of our method. MF-Adapter significantly boosts classification accuracy over Zero-shot CLIP and surpasses Tip-Adapter with fine-tuning (Tip-Adapter-F) on most datasets. The superiority of MF-Adapter is stable and consistent on both generic datasets (UCF101, etc.) and fine-grained datasets (Caltech101, OxfordPets, etc.). On average, our method achieves an absolute gain of more than 1% at every shot setting, especially for fine-grained images, which are more similar across different categories. MF-Adapter also achieves a leading margin of 6% on UCF101 with 16-shot samples. These results demonstrate the effectiveness and category consistency of the proposed MF-Units.
In Fig. 3, we show the absolute accuracy improvement of MF-Adapter over the SoTA Tip-Adapter-F on the 11 classification datasets under the 4-shot and 8-shot settings. It is worth noting that Tip-Adapter-F is fine-tuned on global semantics with few seen images. When the number of shots is reduced from 8 to 4, the benefit of Tip-Adapter-F decays noticeably, whereas our method brings relatively larger improvements on 7 datasets (marked in red) by exploiting the locally consistent units for better representation. For example, on the StanfordCars dataset the absolute boost grows from 1.56 (8-shot) to 2.9 (4-shot).
4.3 Ablation Studies
In this section, we conduct several ablation studies of MF-Adapter on both a generic dataset (UCF101) and a fine-grained dataset (Caltech101). All experiments adopt the 16-shot setting.
Dataset | Window scale 1 | Window scale 2 | Window scale 3 | Window scale 4 | Window scale 5
---|---|---|---|---|---
Caltech101 | 92.97 | 93.15 | 93.03 | 93.09 | 93.03
UCF101 | 84.08 | 84.19 | 84.13 | 84.08 | 83.92
MF-Unit Scale. The MF-Unit scale controls which dilations we use to unfold the low-level features into the list of meta-features in Equation 1. As formulated above, a larger scale means using more windows, with dilations ranging from 1 to the scale, and a smaller scale means using fewer. For example, when the scale is 2, we use sliding windows with dilations 1 and 2 and then concatenate them to get the combined meta-features. For both the generic and the fine-grained dataset in Table 1, classification accuracy is best when the scale is 2, reaching 93.15 on Caltech101 and 84.19 on UCF101. This indicates that the low-level context contains not only category-consistent knowledge but also useless redundancy, so it is important to induce low-level features at appropriate scales.
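The relation between scale, dilation, and window coverage can be made concrete with a small calculation. The kernel size of 3, the 14 x 14 map size, stride 1, and no padding are assumptions chosen for illustration; the formula for the extent of a dilated window is the standard one.

```python
def dilations_for_scale(scale):
    """A scale of s uses sliding windows with dilations 1, 2, ..., s."""
    return list(range(1, scale + 1))


def window_extent(kernel, dilation):
    """Side length of the region covered by a dilated window."""
    return dilation * (kernel - 1) + 1


def positions_per_side(size, kernel, dilation):
    """Number of valid window positions along one side (stride 1, no padding)."""
    return max(size - window_extent(kernel, dilation) + 1, 0)


# (dilation, covered extent, positions per side) on a 14 x 14 map
summary = [
    (d, window_extent(3, d), positions_per_side(14, 3, d))
    for d in dilations_for_scale(5)
]
```

Larger dilations cover a wider region with the same number of sampled entries but leave fewer valid positions, which is one way to see why adding scales eventually contributes little new information.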
Global | Layer3 | Layer4 | Caltech101 | UCF101
---|---|---|---|---
✓ | ✓ |  | 92.90 | 82.76
✓ |  | ✓ | 92.78 | 82.13
✓ | ✓ | ✓ | 93.15 | 84.19
MF-Unit layer. We explore the influence of the low-level layer used in MF-Adapter. Given the ResNet-50 pre-trained model in CLIP, we take the outputs of the third and fourth blocks for the original image. Taking the first row of Table 2 as an example, we apply MF-Adapter to the third block output, called layer3, on the local branch and then combine it with the global logits for the final prediction. The results in Table 2 show that combining layer3 and layer4 to adapt the local context between seen and unseen samples achieves higher accuracy. This comparison also supports our assumption that different local contexts such as color, shape and edge are not encoded in the same feature layer. It is necessary to exploit local knowledge to enrich global representations.
Dataset | Backbone | SP | EP | Score
---|---|---|---|---
Caltech101 | RN50 | ✓ |  | 93.15
Caltech101 | RN50 |  | ✓ | 93.69
Caltech101 | RN101 | ✓ |  | 94.60
Caltech101 | RN101 |  | ✓ | 94.77
UCF101 | RN50 | ✓ |  | 84.19
UCF101 | RN50 |  | ✓ | 84.04
UCF101 | RN101 | ✓ |  | 85.14
UCF101 | RN101 |  | ✓ | 84.98
Analysis on pre-settings of CLIP. We use a single prompt (SP) such as “a photo of a [CLASS].” and ensemble prompts (EP) of 7 templates from CLIP [39], with different visual encoders, on both the generic and the fine-grained dataset. As shown in Table 3, the score differences between the two prompt settings remain small regardless of which pre-trained encoder is used. This comparison indicates that our method adapts to semantic variations through the proposed MF-Units, without extra adjustment, for adequate generalization within categories.
5 Conclusion
In this paper, we propose a novel method named Meta-Feature Adapter for few-shot learning. The proposed MF-Adapter exploits local category-consistent representations using multi-scale MF-Units at multiple layers for knowledge generalization between seen and unseen samples. Before training, we inductively construct the MF-Unit space to measure the underlying context of each category. MF-Adapter then achieves the best of both worlds: the low-level consistent representation of categories given by MF-Units and the strong semantic representation brought by CLIP. We evaluate MF-Adapter on generic few-shot classification benchmarks and on more challenging fine-grained few-shot benchmarks, and achieve competitive results compared with several state-of-the-art methods.
References
- [1] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020.
- [2] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. In European conference on computer vision, pages 446–461. Springer, 2014.
- [3] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European conference on computer vision, pages 213–229. Springer, 2020.
- [4] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European conference on computer vision, pages 213–229. Springer, 2020.
- [5] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
- [6] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3606–3613, 2014.
- [7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
- [8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- [9] Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. Cogview: Mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems, 34:19822–19835, 2021.
- [10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- [11] Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In 2004 conference on computer vision and pattern recognition workshop, pages 178–178. IEEE, 2004.
- [12] Christoph Feichtenhofer, Axel Pinz, and Richard P Wildes. Spatiotemporal multiplier networks for video action recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4768–4777, 2017.
- [13] Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1933–1941, 2016.
- [14] Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters. arXiv preprint arXiv:2110.04544, 2021.
- [15] Tianyu Gao, Adam Fisch, and Danqi Chen. Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723, 2020.
- [16] Edouard Grave, Moustapha M Cisse, and Armand Joulin. Unbounded cache model for online language modeling with open vocabulary. Advances in neural information processing systems, 30, 2017.
- [17] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems, 33:21271–21284, 2020.
- [18] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems, 33:21271–21284, 2020.
- [19] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022.
- [20] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020.
- [21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- [22] Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019.
- [23] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
- [24] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021.
- [25] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. arXiv preprint arXiv:2203.12119, 2022.
- [26] Zhengbao Jiang, Frank F Xu, Jun Araki, and Graham Neubig. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438, 2020.
- [27] Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535–547, 2019.
- [28] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3D object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 554–561, 2013.
- [29] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, 2017.
- [30] Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems, 34:9694–9705, 2021.
- [31] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
- [32] Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning. Neurocomputing, 508:293–304, 2022.
- [33] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
- [34] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
- [35] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages 722–729. IEEE, 2008.
- [36] Emin Orhan. A simple cache model for image recognition. Advances in Neural Information Processing Systems, 31, 2018.
- [37] Omiros Pantazis, Gabriel Brostow, Kate Jones, and Oisin Mac Aodha. SVL-Adapter: Self-supervised adapter for vision-language pretrained models. arXiv preprint arXiv:2210.03794, 2022.
- [38] Omkar M. Parkhi, Andrea Vedaldi, Andrew Zisserman, and C. V. Jawahar. Cats and dogs. In IEEE Conference on Computer Vision and Pattern Recognition, 2012.
- [39] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- [40] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
- [41] Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu. DenseCLIP: Language-guided dense prediction with context-aware prompting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18082–18091, 2022.
- [42] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28, 2015.
- [43] Tal Ridnik, Emanuel Ben-Baruch, Asaf Noy, and Lihi Zelnik-Manor. ImageNet-21K pretraining for the masses. arXiv preprint arXiv:2104.10972, 2021.
- [44] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022.
- [45] Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980, 2020.
- [46] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- [47] Lucas Smaira, João Carreira, Eric Noland, Ellen Clancy, Amy Wu, and Andrew Zisserman. A short note on the Kinetics-700-2020 human action dataset. arXiv preprint arXiv:2010.10864, 2020.
- [48] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
- [49] Weiwei Sun and Ruisheng Wang. Fully convolutional networks for semantic segmentation of very high resolution remotely sensed images combined with DSM. IEEE Geoscience and Remote Sensing Letters, 15(3):474–478, 2018.
- [50] Andrea Vedaldi and Karel Lenc. MatConvNet: Convolutional neural networks for MATLAB. In Proceedings of the 23rd ACM International Conference on Multimedia, pages 689–692, 2015.
- [51] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. Advances in Neural Information Processing Systems, 29, 2016.
- [52] Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 3485–3492. IEEE, 2010.
- [53] Yinghui Xing, Qirui Wu, De Cheng, Shizhou Zhang, Guoqiang Liang, and Yanning Zhang. Class-aware visual prompt tuning for vision-language pre-trained model. arXiv preprint arXiv:2208.08340, 2022.
- [54] Renrui Zhang, Rongyao Fang, Peng Gao, Wei Zhang, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-Adapter: Training-free CLIP-Adapter for better vision-language modeling. arXiv preprint arXiv:2111.03930, 2021.
- [55] Minghang Zheng, Peng Gao, Renrui Zhang, Kunchang Li, Xiaogang Wang, Hongsheng Li, and Hao Dong. End-to-end object detection with adaptive clustering transformer. arXiv preprint arXiv:2011.09315, 2020.
- [56] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16816–16825, 2022.
- [57] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022.