
1 Institute of Information Science, Beijing Jiaotong University; 2 Peng Cheng Laboratory; 3 Georgia Institute of Technology; 4 Picsart AI Research (PAIR)

Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation

Siyu Jiao1,2,* (ORCID 0000-0002-0795-8401), Hongguang Zhu1,2,* (ORCID 0000-0002-1356-5153), Jiannan Huang1,3 (ORCID 0009-0002-2447-9928), Yao Zhao1,2 (ORCID 0000-0002-8581-9554), Yunchao Wei1,2 (ORCID 0000-0002-2812-8781), Humphrey Shi3,4 (ORCID 0000-0002-2922-5663)
* Equal contribution

Abstract

Pre-trained vision-language models, e.g. CLIP, have been increasingly used to address the challenging Open-Vocabulary Segmentation (OVS) task, benefiting from their well-aligned vision-text embedding space. Typical solutions either freeze CLIP during training to unilaterally maintain its zero-shot capability, or fine-tune the CLIP vision encoder to achieve perceptual sensitivity to local regions. However, few of them incorporate vision-text collaborative optimization. Motivated by this, we propose the Content-Dependent Transfer, which adaptively enhances each text embedding by interacting with the input image and offers a parameter-efficient way to optimize the text representation. In addition, we introduce a Representation Compensation strategy that reviews the original CLIP-V representation as compensation to maintain the zero-shot capability of CLIP. In this way, the vision and text representations of CLIP are optimized collaboratively, enhancing the alignment of the vision-text feature space. To the best of our knowledge, we are the first to establish a collaborative vision-text optimization mechanism within the OVS field. Extensive experiments demonstrate that our method achieves superior performance on popular OVS benchmarks. In open-vocabulary semantic segmentation, our method outperforms the previous state-of-the-art approaches by +0.5, +2.3, +3.4, +0.4 and +1.1 mIoU on A-847, A-150, PC-459, PC-59 and PAS-20, respectively. Furthermore, in a panoptic setting on ADE20K, we achieve 27.1 PQ, 73.5 SQ, and 32.9 RQ. Code will be available at MAFT-Plus.

Keywords: Open-Vocabulary Segmentation, Fine-tuning

1 Introduction

Segmentation is one of the most fundamental topics in computer vision. Traditional segmentation models [4, 45, 14, 15, 10] can only segment a few predefined categories within a closed vocabulary [9, 3], which covers far fewer categories than humans use to describe the real world. Therefore, open-vocabulary segmentation (OVS) [32, 2, 12, 13] was introduced to segment objects of arbitrary categories described by text.

Recently, large-scale vision-language pre-training models (e.g. CLIP [27] and ALIGN [17]) have learned representations with cross-modal alignment and shown strong zero-shot capability, leading to their increased adoption for the challenging OVS task [8, 38, 22, 26]. A mainstream solution follows the "decoupling" paradigm, which performs open-vocabulary segmentation in two steps: 1) employing a Proposal Generator to produce class-agnostic mask proposals and 2) leveraging a pre-trained CLIP to classify each mask proposal via similarity matching in the aligned image-text feature space. Methods under this paradigm fall into two groups depending on whether CLIP is frozen during training, as depicted in Fig. 1a, b.

Figure 1: Different learning frameworks for open-vocabulary segmentation, from the perspective of whether to freeze CLIP. (a) The "frozen CLIP" paradigm [22, 38, 26, 39]. (b) Fine-tuning CLIP-V [18]. (c) Our MAFT+ framework optimizes both CLIP-V and CLIP-T.

In order to retain the strong zero-shot capability of CLIP when classifying mask proposals, most previous works [38, 22, 26, 39] choose to freeze the pre-trained CLIP model (Fig. 1a). They rely on either masked crops or masked attention when processing images and masks within CLIP-V. Given the domain gap between the image-level pre-training of CLIP and the pixel-level application of segmentation, these approaches compromise the representational ability of CLIP and fail to fit the distribution of segmentation tasks well. The recent work MAFT [18] highlights that the frozen CLIP is insensitive to different mask proposals and often yields similar predictions for them. It designs a mask-aware fine-tuning strategy to enhance the sensitivity of CLIP-V to local regions (Fig. 1b). While MAFT partially addresses the insensitivity issue, it introduces new problems: 1) updating only CLIP-V constrains the overall optimization space, thereby limiting the alignment of the vision and text representations; 2) fine-tuning CLIP-V on downstream datasets degrades its generalization ability.

To address the aforementioned problems, we introduce a collaborative vision-text representation fine-tuning framework as an enhanced version of MAFT, named MAFT+ (Fig. 1c). Specifically, to enhance the alignment of the vision-text representation, we incorporate CLIP-T into the fine-tuning process to concurrently optimize the text representation. This vision-text joint optimization alleviates the training complexity and enhances the vision-text alignment. Considering the demanding GPU memory requirements of fine-tuning CLIP-T, we introduce a Content-Dependent Transfer (CDT) following CLIP-T to optimize the text representation in a parameter-efficient way. CDT uses Transformer Layers to condition the text embeddings on each input image, rather than keeping them fixed once generated by CLIP-T, mitigating the computational burden while preserving the effectiveness of fine-tuning. Moreover, to maintain the zero-shot capability during CLIP-V fine-tuning, we draw inspiration from methods that prevent Catastrophic Forgetting [24] in continual learning and devise a Representation Compensation (RC) strategy. This strategy aims to preserve CLIP’s zero-shot capability by reviewing the pre-trained representation of an original CLIP-V as a form of compensation.

Overall, our contributions are summarized as follows:

  • Our MAFT+ represents the first collaborative framework to jointly optimize vision-text representation in OVS. This collaborative design mitigates training complexity and enhances alignment in the vision-text feature space.

  • The Content-Dependent Transfer is proposed to unleash the optimization potential of CLIP-T through parameter-efficient fine-tuning. The Representation Compensation achieves effective CLIP-V fine-tuning while maintaining the original zero-shot capability.

We evaluate MAFT+ on the commonly used open-vocabulary semantic and panoptic segmentation benchmarks: Pascal-Context [25], Pascal-VOC [9], and ADE20K [47]. Compared with prior open-vocabulary semantic segmentation results, MAFT+ improves performance on the A-847 [47], A-150 [47], PC-459 [25], PC-59 [25] and PAS-20 [9] datasets by +0.5, +2.3, +3.4, +0.4 and +1.1 mIoU, respectively. Furthermore, we conduct experiments in a panoptic setting, where MAFT+ achieves 27.1 PQ, 73.5 SQ, and 32.9 RQ on the ADE20K dataset. Notably, our approach outperforms existing OVS methods and establishes new state-of-the-art results across all evaluated datasets.

2 Related Work

Open-Vocabulary Segmentation [29] aims to break category restrictions and perform segmentation across arbitrary categories. Earlier works [32, 2, 12, 21, 34] use large pre-trained vision-language models to perform open-vocabulary segmentation, leveraging the rich alignment features learned from image-text pairs. Recent approaches [8, 38, 22, 26, 11, 37, 35, 5, 39, 18, 36] decouple open-vocabulary segmentation into mask proposal generation and mask proposal classification: they first generate a series of mask proposals and then use CLIP [27] or ALIGN [17] for classification. Specifically, Zegformer [8] first uses mask-and-crop to obtain sub-images based on mask proposals and feeds them into CLIP for mask classification. The subsequent approaches ZSSeg [38] and OVSeg [22] train CLIP adapters to boost performance. To improve the classification ability of vision-language models, OpenSeg [11] uses extra image-caption pairs to scale up the training data. FreeSeg [26] unifies semantic, instance, and panoptic tasks and performs fusion training. ODISE [35] utilizes a strong text-to-image diffusion model [28] to obtain a well-aligned image-text feature space. SAN [37] and FC-CLIP [39] design end-to-end frameworks that exploit a single frozen CLIP as the backbone. Recently, MAFT [18] introduced a CLIP-V fine-tuning strategy that makes CLIP-V sensitive to different mask proposals.

Pre-trained model fine-tuning is widely used to fit pre-trained models to the distribution of downstream tasks. For segmentation specifically, traditional closed-set methods [4, 45, 14, 15] typically use a lower learning rate (e.g. $\frac{1}{10}$) to fine-tune the image encoder, transferring pre-trained knowledge to segmentation tasks. However, this strategy may be suboptimal for data-limited scenarios such as few-shot, zero-shot and incremental segmentation due to severe overfitting. To tackle this, SVF [30] fine-tunes only a subset of the parameters in the pre-trained image encoder, adapting pre-trained knowledge to few-shot segmentation. OVSeg [22] applies prompt tuning to learn image prompts from annotated data, adapting CLIP-V to masked images. Some continual segmentation approaches use techniques such as contrastive learning [44, 43, 42], distillation [40] and EMA [33] to avoid catastrophic forgetting.

In a recent development, MAFT [18] conducts a mask-aware CLIP fine-tuning strategy by aligning CLIP’s classification score with the IoU score. Although this approach partially adapts CLIP-V to segmentation tasks, it exclusively optimizes CLIP-V representation, potentially amplifying the training difficulty and risking overfitting on fixed text embeddings. This observation motivates our exploration of collaborative optimization strategies for both vision and text representation.

3 Preliminary

Problem Setting. Open-vocabulary segmentation aims to train a segmentation model capable of segmenting arbitrary objects using text descriptions. We are given two category sets, $C_{train}$ and $C_{test}$, which differ in their object categories ($C_{train} \neq C_{test}$). The model is trained on $C_{train}$ and directly tested on $C_{test}$. Typically, $C_{train}$ and $C_{test}$ are described by nouns (e.g. sky, sea, mount…).

Mask-aware Loss Function. MAFT [18] proposes a mask-aware loss ($\mathcal{L}_{ma}$) to fine-tune CLIP-V for sensitivity to local regions. The primary objective of $\mathcal{L}_{ma}$ is to assign high classification scores to high-quality proposals and low scores to low-quality proposals. This is achieved by using the Intersection over Union (IoU) score $S^{IoU}$ derived from the ground truth as supervision and aligning it with the CLIP classification score $S^{cls}$ to induce mask awareness. The mask-aware loss is calculated using the $\mathrm{SmoothL1}$ function:

$$\mathcal{L}_{ma} = \mathrm{SmoothL1}(S^{cls}, S^{IoU}) \tag{1}$$
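As a minimal sketch of Eq. 1, the snippet below computes a SmoothL1 loss between per-proposal classification scores and IoU targets. The function names, the mean reduction and the toy scores are assumptions made for illustration; they are not taken from the authors' code.

```python
def smooth_l1(x, y):
    """Element-wise SmoothL1 with threshold 1 (the standard piecewise form)."""
    diff = abs(x - y)
    return 0.5 * diff * diff if diff < 1.0 else diff - 0.5


def mask_aware_loss(s_cls, s_iou):
    """Mean SmoothL1 between classification scores and IoU targets.

    s_cls, s_iou: lists of N rows (one per mask proposal), one value per class.
    The mean reduction is an assumption of this sketch.
    """
    terms = [smooth_l1(c, t)
             for row_c, row_t in zip(s_cls, s_iou)
             for c, t in zip(row_c, row_t)]
    return sum(terms) / len(terms)


# Two proposals, two classes (toy numbers).
s_cls = [[0.9, 0.1], [0.8, 0.2]]  # CLIP classification scores
s_iou = [[0.9, 0.0], [0.3, 0.0]]  # IoU of each proposal with the ground truth
loss = mask_aware_loss(s_cls, s_iou)
```

The second proposal is scored 0.8 although its IoU is only 0.3, so it contributes most of the loss; this is the behaviour the mask-aware objective is meant to penalize.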

In this paper, we use $\mathcal{L}_{ma}$ to fit the distribution of CLIP to OVS. Furthermore, we delve into CLIP fine-tuning techniques and propose a novel CLIP fine-tuning strategy that collaboratively optimizes the distribution of CLIP-V and CLIP-T.

4 Methodology

Figure 2: Overview of MAFT+. We use CLIP-V as the backbone to extract image features. A Proposal Generator is trained to generate mask proposals. The Representation Compensation strategy reviews the vision representation to preserve the zero-shot capability of CLIP (red part); the Content-Dependent Transfer conditions the text embeddings on the input image and optimizes the text representation in a parameter-efficient way (blue part).

We introduce MAFT+, a method for collaboratively optimizing CLIP’s vision and text representations. The complete framework of MAFT+ is shown in Fig. 2, using the ConvNeXt-Large CLIP model for illustration. Within MAFT+, CLIP-V serves as the vision backbone, and a Proposal Generator is trained to generate class-agnostic mask proposals (Sec. 4.1). Simultaneously, the representations of CLIP-V and CLIP-T are collaboratively optimized. We introduce the Representation Compensation (RC) strategy for CLIP-V fine-tuning (Sec. 4.2), and propose the Content-Dependent Transfer (CDT) for parameter-efficient CLIP-T fine-tuning (Sec. 4.3). Finally, we outline the loss functions in Sec. 4.4.

4.1 Feature Extraction & Proposal Generator

Feature Extraction. We utilize a pre-trained convolutional CLIP-V to extract features from an input image $I$. We denote the outputs of the CLIP-V stages as $F=\{F^i\}, i\in[0,1,2,3]$. $F^0$, $F^1$, $F^2$, $F^3$ have strides of {4, 8, 16, 32} with respect to the input image.

Proposal Generator. We follow the common design [8, 38, 22, 26, 41, 39, 18] and use MaskFormer [7, 6] as the Proposal Generator. Since Hungarian matching [20] is used in the training process, only a subset of the mask proposals is optimized. This matching strategy enhances the generalizability of the Proposal Generator, enabling it to segment masks of novel categories. Given the image features $F$, the Proposal Generator generates a set of $N$ mask proposals $M=\{m_i\}_{i=1}^{N}\in\mathbb{R}^{N\times H\times W}$.
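To illustrate why matching optimizes only a subset of the proposals, the toy sketch below assigns ground-truth masks to proposals one-to-one. It makes two loud assumptions: the cost is simply 1 - IoU (the actual matching cost used by the Proposal Generator is not specified here), and an exhaustive search stands in for the Hungarian algorithm, which reaches the same optimum far more efficiently. It also assumes there are at least as many proposals as ground-truth masks.

```python
from itertools import permutations


def mask_iou(a, b):
    """IoU of two binary masks given as flat 0/1 lists."""
    inter = sum(x & y for x, y in zip(a, b))
    union = sum(x | y for x, y in zip(a, b))
    return inter / union if union else 0.0


def match_proposals(proposals, gt_masks):
    """Return {gt_index: proposal_index} minimizing the total (1 - IoU) cost.

    Exhaustive search over assignments; only practical for toy sizes.
    """
    best_cost, best = float("inf"), None
    for perm in permutations(range(len(proposals)), len(gt_masks)):
        cost = sum(1.0 - mask_iou(proposals[p], gt_masks[g])
                   for g, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best = cost, perm
    return dict(enumerate(best))


proposals = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 0, 0]]  # N = 3 proposals
gt_masks = [[0, 0, 1, 1], [1, 1, 0, 0]]                 # 2 annotated regions
matches = match_proposals(proposals, gt_masks)          # {0: 1, 1: 0}
```

Proposal 2 is left unmatched, so in this sketch it would receive no mask supervision for these ground-truth regions.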

During the training process, we stop the gradient flow from CLIP-V to the Proposal Generator. This measure is taken to avoid the potential overfitting of CLIP-V on the training categories.

4.2 Representation Compensation

Figure 3: Details of Representation Compensation.

The Representation Compensation (RC) strategy aims to review the original representation of CLIP as compensation during the training phase. Details of Representation Compensation are shown in Fig. 3. Within RC, we use a frozen CLIP-V (denoted as CLIP-V*) to generate the original CLIP-V features during training. We extract the last-stage outputs of CLIP-V* ($\hat{F}^3$) and of the fine-tuned CLIP-V ($F^3$); $\hat{F}^3$ and $F^3$ are expected to be similar to avoid Catastrophic Forgetting. However, direct per-pixel alignment is not feasible, as it would result in the loss of region-level differences. Therefore, we devise multiple grids of average pooling ($\mathrm{AvgPooling}$) to generate multi-scale features, and enforce the consistency of the features after pooling.

Given an arbitrary feature $f\in\mathbb{R}^{d\times h\times w}$, an $\mathrm{AvgPooling}$ operation with a grid size of $k\times k$ can be formulated as:

$$f^{pool}=\mathrm{AvgPooling}(f,k),\quad f^{pool}\in\mathbb{R}^{d\times k\times k}. \tag{2}$$

In our default design, we use $\mathrm{AvgPooling}$ with $K=\{1,2,4\}$ to pool $\hat{F}^3$ and $F^3$ into $\{1\times 1, 2\times 2, 4\times 4\}$ grids, denoted as $\hat{F}^p$ and $F^p$. Specifically, $\hat{F}^p=\mathrm{AvgPooling}(\hat{F}^3,K)$ and $F^p=\mathrm{AvgPooling}(F^3,K)$. Then, we use the $\mathrm{SmoothL1}$ loss to minimize the difference as follows:

$$\mathcal{L}_{rc}=\mathrm{SmoothL1}(F^p,\hat{F}^p), \tag{3}$$

$$\mathrm{SmoothL1}(F^p,\hat{F}^p)=\begin{cases}0.5\cdot(F^p-\hat{F}^p)^2, & \mathrm{if}~|F^p-\hat{F}^p|<1\\ |F^p-\hat{F}^p|-0.5, & \mathrm{otherwise}\end{cases} \tag{4}$$
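The pooling and loss in Eqs. 2 to 4 can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' implementation: it assumes the spatial size is divisible by every grid size, and it averages the SmoothL1 terms over all pooled entries of all grids, a reduction the text does not specify.

```python
def smooth_l1(x, y):
    """Element-wise SmoothL1 as in Eq. 4."""
    diff = abs(x - y)
    return 0.5 * diff * diff if diff < 1.0 else diff - 0.5


def avg_pool_grid(f, k):
    """Average-pool a d x h x w feature (nested lists) into a d x k x k grid.

    Assumes h and w are divisible by k to keep the sketch short.
    """
    h, w = len(f[0]), len(f[0][0])
    bh, bw = h // k, w // k
    pooled = []
    for channel in f:
        grid = []
        for gy in range(k):
            row = []
            for gx in range(k):
                cell = [channel[y][x]
                        for y in range(gy * bh, (gy + 1) * bh)
                        for x in range(gx * bw, (gx + 1) * bw)]
                row.append(sum(cell) / len(cell))
            grid.append(row)
        pooled.append(grid)
    return pooled


def rc_loss(f_tuned, f_frozen, grids=(1, 2, 4)):
    """Mean SmoothL1 over the pooled entries of every grid size (assumed reduction)."""
    terms = []
    for k in grids:
        p_tuned = avg_pool_grid(f_tuned, k)
        p_frozen = avg_pool_grid(f_frozen, k)
        for ch_t, ch_f in zip(p_tuned, p_frozen):
            for row_t, row_f in zip(ch_t, ch_f):
                terms.extend(smooth_l1(a, b) for a, b in zip(row_t, row_f))
    return sum(terms) / len(terms)


# Toy 1 x 4 x 4 features: the fine-tuned map drifts by +0.2 everywhere.
f_frozen = [[[float(y * 4 + x) for x in range(4)] for y in range(4)]]
f_tuned = [[[v + 0.2 for v in row] for row in f_frozen[0]]]
loss = rc_loss(f_tuned, f_frozen)
```

Because the loss only compares grid averages, two feature maps that differ at individual pixels but agree on every pooled cell incur no penalty, which leaves room for region-level adaptation.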

With RC compensating $F^3$ with the original CLIP representation, CLIP-V maintains its zero-shot capability during fine-tuning. We apply $\mathrm{Mask~Pooling}$ [39] on $F^3$ to generate vision embeddings ($V\in\mathbb{R}^{N\times d}$) for each mask proposal.

4.3 Content-Dependent Transfer

Given a set of class names $C=\{C_1, C_2, \dots, C_n\}$, we use predefined templates [38, 37, 35, 39, 18] to generate sentences for these class names, e.g., "a photo of a {$C_i$}; There is a {$C_i$} in the scene…". These sentences are then fed into CLIP-T to generate an embedding for each sentence. The embeddings belonging to the same class are averaged to obtain the text embeddings ($T\in\mathbb{R}^{d\times|C|}$), where $d$ is the embedding dimension and $|C|$ is the number of class names.
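The template-averaging step can be sketched as below. The two templates are the ones quoted in the text; `toy_text_encoder` is a hypothetical stand-in for CLIP-T that only produces deterministic vectors, so the example runs without any model.

```python
TEMPLATES = ["a photo of a {}", "There is a {} in the scene"]


def toy_text_encoder(sentence, dim=4):
    """Hypothetical stand-in for CLIP-T: maps a sentence to a dim-vector."""
    vec = [0.0] * dim
    for i, ch in enumerate(sentence):
        vec[i % dim] += ord(ch) / 1000.0
    return vec


def class_text_embeddings(class_names, encoder=toy_text_encoder):
    """Return one embedding per class: the mean over all template sentences."""
    embeddings = []
    for name in class_names:
        vecs = [encoder(t.format(name)) for t in TEMPLATES]
        mean = [sum(col) / len(vecs) for col in zip(*vecs)]
        embeddings.append(mean)
    # |C| rows of length d; the text writes T with shape d x |C| (the transpose).
    return embeddings


T = class_text_embeddings(["sky", "sea"])
```

Since class names are only inserted into sentences, a new category can be added at test time by appending its name to the list, without any retraining of the encoder.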

Figure 4: Details of Content-Dependent Transfer.

To optimize the CLIP-T representation $T$, we propose the Content-Dependent Transfer (CDT), which consists of a sequence of Transformer Layers performing cross-attention with the vision feature $F^3$. Details of the CDT are illustrated in Fig. 4. We take the last-stage feature of CLIP-V ($F^3$) and the text embeddings $T$ as the inputs to CDT. $F^3$ is first flattened along the spatial dimensions, denoted as $F^3_{flat}\in\mathbb{R}^{d\times hw}$. Then, we use $l$ sequential Transformer Layers to process $T$ and $F^3_{flat}$, each with a shortcut connection. This process can be formulated as:

$$T_{i+1}=\mathrm{TransLayer}_i(T_i, F^3_{flat})+T_i,\quad i=1,2,\dots,l. \tag{5}$$

In our default setting, $l$ is set to 2. The resulting output of the CDT is denoted as the conditioned text embeddings ($\hat{T}$). Specifically,

$$\mathrm{TransLayer}(a,b)=\mathrm{Softmax}\left(\frac{\mathrm{Que}(a)\cdot\mathrm{Key}(b)}{\sqrt{d}}\right)\cdot\mathrm{Val}(b), \tag{6}$$

where $\mathrm{Que}(\cdot)$, $\mathrm{Key}(\cdot)$, and $\mathrm{Val}(\cdot)$ represent linear projections and $d$ is the dimension of the input vectors; we assume all vectors have the same dimension $d$ by default. Eq. 6 is a simplified expression that omits the Multihead Attention and LayerNorm details of the Transformer. Note that CLIP-T remains frozen during training, and only the Transformer Layers are trained to optimize the CLIP-T representation. Parameter-efficient CLIP-T fine-tuning is thus established, with $\hat{T}$ conditioned on the input images.
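A single-head, dependency-free sketch of Eqs. 5 and 6 is given below. It follows the simplified form in the text (no multi-head split, no LayerNorm). Two choices are assumptions of this sketch: embeddings are stored row-wise (|C| x d and hw x d, the transpose of the layout in the text), and the projections are set to identity matrices only to keep the toy example small.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text, vision, w_q, w_k, w_v):
    """Eq. 6, single head: text rows are queries, vision rows are keys and values."""
    d = len(text[0])
    q, k, v = matmul(text, w_q), matmul(vision, w_k), matmul(vision, w_v)
    scores = matmul(q, transpose(k))                        # |C| x hw
    attn = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(attn, v)                                  # |C| x d


def content_dependent_transfer(text, vision, layers):
    """Eq. 5: apply each layer and add the shortcut connection."""
    for w_q, w_k, w_v in layers:
        update = cross_attention(text, vision, w_q, w_k, w_v)
        text = [[t + u for t, u in zip(t_row, u_row)]
                for t_row, u_row in zip(text, update)]
    return text


identity = [[1.0, 0.0], [0.0, 1.0]]
T = [[1.0, 0.0], [0.0, 1.0]]                    # 2 classes, d = 2
F_flat = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]   # hw = 3 positions, d = 2
layers = [(identity, identity, identity)] * 2   # l = 2 layers
T_hat = content_dependent_transfer(T, F_flat, layers)
```

The output keeps the shape of `T`, so the conditioned embeddings can replace the static ones in the classification step; only the layer weights would be trainable, while the encoder that produced `T` stays frozen.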

We investigate various designs to optimize the CLIP-T representation ($T$), including fine-tuning CLIP-T, training an additional MLP, incorporating description guidance, etc. Further details are presented in Sec. 5.3.

4.4 Objective

After obtaining the conditioned text embeddings $\hat{T}$, we perform matrix multiplication between $\hat{T}$ and $V$ to derive the classification score $S^{cls}$ for the mask proposals. Subsequently, we multiply $S^{cls}$ with $M$ to obtain the final output.
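The two multiplications described above can be sketched as follows. The shapes follow the text ($V$: N x d, $M$: N x H x W), with $\hat{T}$ stored row-wise as |C| x d. Taking a per-pixel arg-max at the end is an assumption about how the final output is turned into labels, and any score normalization is omitted.

```python
def classify_proposals(V, T_hat):
    """S_cls[n][c] = dot(V[n], T_hat[c]) for N proposals and |C| classes."""
    return [[sum(v * t for v, t in zip(v_row, t_row)) for t_row in T_hat]
            for v_row in V]


def combine_with_masks(S_cls, M):
    """out[c][y][x] = sum over proposals n of S_cls[n][c] * M[n][y][x]."""
    n_props, n_cls = len(S_cls), len(S_cls[0])
    H, W = len(M[0]), len(M[0][0])
    return [[[sum(S_cls[n][c] * M[n][y][x] for n in range(n_props))
              for x in range(W)]
             for y in range(H)]
            for c in range(n_cls)]


def argmax_labels(out):
    """Per-pixel class index with the highest score (assumed post-processing)."""
    n_cls, H, W = len(out), len(out[0]), len(out[0][0])
    return [[max(range(n_cls), key=lambda c: out[c][y][x]) for x in range(W)]
            for y in range(H)]


V = [[1.0, 0.0], [0.0, 1.0]]          # 2 proposals, d = 2
T_hat = [[1.0, 0.0], [0.0, 1.0]]      # 2 classes, d = 2
M = [[[1.0, 0.0]], [[0.0, 1.0]]]      # 2 proposals on a 1 x 2 image
S_cls = classify_proposals(V, T_hat)
out = combine_with_masks(S_cls, M)
labels = argmax_labels(out)
```

In this toy case each proposal matches exactly one class and covers one pixel, so the left pixel is labelled with class 0 and the right pixel with class 1.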

We use the mask-aware loss [18] ($\mathcal{L}_{ma}$, Eq. 1) on $S^{cls}$ to optimize the representation of both CLIP-V and CLIP-T. Considering that $\mathcal{L}_{ma}$ may induce overfitting on the training categories and reduce the transferability of CLIP, we introduce $\mathcal{L}_{rc}$ (Sec. 4.2) to compensate CLIP’s representation during training. Meanwhile, we follow Mask2Former [6] and adopt the same loss functions ($\mathcal{L}_P$) to train the Proposal Generator without any special design. Therefore, the final loss function ($\mathcal{L}$) can be formulated as $\mathcal{L}=\mathcal{L}_P+\lambda_1\mathcal{L}_{ma}+\lambda_2\mathcal{L}_{rc}$, where $\lambda_1=1$ and $\lambda_2=0.1$.

Note that we stop the gradient between the Proposal Generator and CLIP-V, so CLIP-V is not optimized by $\mathcal{L}_P$.

Modifications in the panoptic setting. $\mathcal{L}_{ma}$ is tailored for semantic segmentation and lacks the ability to capture instance-level information. We adapt $\mathcal{L}_{ma}$ to panoptic segmentation with the following modification: when a mask contains multiple instances, we use the binary ground truth (GT) to mask out redundant instances, retaining only the instance with the highest IoU score with the GT. This change allows CLIP-V to learn instance-level knowledge, making $\mathcal{L}_{ma}$ applicable to panoptic segmentation.

5 Experiments

5.1 Setting

Dataset. We conduct experiments on popular open-vocabulary segmentation benchmarks, including COCO-Stuff, COCO-Panoptic, Pascal-VOC, Pascal-Context and ADE20K. We train MAFT+ on COCO-Stuff and test it on ADE20K (A-847, A-150), Pascal-Context (PC-459, PC-59), and Pascal-VOC (PAS-20) to evaluate open-vocabulary semantic segmentation performance. Then, we evaluate MAFT+ in the open-vocabulary panoptic setting [35, 5, 39], i.e., training on COCO-Panoptic and testing on ADE20K.
More details of the dataset settings are provided in the Appendix.

Evaluation Metrics. To quantitatively evaluate the performance, we follow standard practice [8, 38, 22, 37, 35, 39]. Semantic segmentation results are evaluated with mean Intersection over Union (mIoU) [9]. Panoptic segmentation results are evaluated with the panoptic quality (PQ), segmentation quality (SQ) and recognition quality (RQ) [19].

Implementation details. We employ the ConvNeXt-Large CLIP from OpenCLIP [16]. The Proposal Generator is built following the default settings of Mask2Former [6]. We set the number of class-agnostic mask proposals to 100 ($N=100$). During training, the model is optimized with the AdamW optimizer with a weight decay of 0.05. The learning rate is set to $1\times 10^{-5}$ for CLIP-V and $1\times 10^{-4}$ for other modules. We use a crop size of $1024\times 1024$. The model is trained for 60,000 iterations on COCO with 4 NVIDIA A100 GPUs.
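The two learning rates above can be expressed as optimizer parameter groups. The sketch below only builds the group descriptions from parameter names; the names and the `clip_v.` prefix are hypothetical, chosen for illustration. The list-of-dicts layout mirrors the per-parameter-group options accepted by PyTorch optimizers such as `torch.optim.AdamW`, where `params` would hold tensors rather than names.

```python
def build_param_groups(param_names, clip_v_prefix="clip_v.",
                       clip_v_lr=1e-5, base_lr=1e-4, weight_decay=0.05):
    """Split parameters into a CLIP-V group and an 'other modules' group."""
    clip_v = [n for n in param_names if n.startswith(clip_v_prefix)]
    others = [n for n in param_names if not n.startswith(clip_v_prefix)]
    return [
        {"params": clip_v, "lr": clip_v_lr, "weight_decay": weight_decay},
        {"params": others, "lr": base_lr, "weight_decay": weight_decay},
    ]


# Hypothetical parameter names, for illustration only.
names = [
    "clip_v.stem.weight",
    "cdt.layers.0.q_proj.weight",
    "proposal_generator.head.bias",
]
groups = build_param_groups(names)
```

Parameters of the frozen CLIP-T and of the frozen CLIP-V* copy would simply be left out of `names`, since they receive no updates.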

5.2 Comparisons with State-of-the-art Methods

Table 1: Open-vocabulary semantic segmentation performance. mIoU is used to evaluate the performance. * denotes additional ensemble operation [39] used during testing.
| Method | VLM | A-847 | A-150 | PC-459 | PC-59 | PAS-20 |
|---|---|---|---|---|---|---|
| OpenSeg [ECCV22][11] | ALIGN | 8.8 | 28.6 | 12.2 | 48.2 | 72.2 |
| OVSeg [CVPR23][22] | ViT-L | 9.0 | 29.6 | 12.4 | 55.7 | 94.5 |
| SAN [CVPR23][37] | ViT-L | 12.4 | 32.1 | 15.7 | 57.7 | 94.6 |
| ODISE [CVPR23][35] | ViT-L | 11.1 | 29.9 | 14.5 | 57.3 | - |
| FC-CLIP [NeurIPS23][39] | ConvNeXt-L | 11.2 | 26.6 | 12.7 | 42.4 | 89.5 |
| FC-CLIP* [NeurIPS23][39] | ConvNeXt-L | 14.8 | 34.0 | 18.2 | 58.4 | 95.4 |
| MAFT [NeurIPS23][18] | ViT-L | 12.7 | 33.0 | 16.2 | 59.0 | 92.1 |
| MAFT [NeurIPS23][18] | ConvNeXt-L | 13.1 | 34.4 | 17.0 | 57.5 | 93.0 |
| MAFT+ (ours) | ConvNeXt-L | 15.1 | 36.1 | 21.6 | 59.4 | 96.5 |
 
Table 2: Open-vocabulary panoptic segmentation performance on ADE20K. PQ, SQ, and RQ are used for evaluation. The best results are highlighted with red.
      PQ     SQ       RQ
FreeSeg [CVPR23][26] 16.3 - -
ODISE [CVPR23][35] 22.6 - -
MaskCLIP [ICML23][46] 15.1 70.4 19.2
OPSNet [ICCV23][5] 19.0 52.4 23.0
FC-CLIP [NeurIPS23][39] 21.9 71.5 26.4
FC-CLIP* [NeurIPS23][39] 26.8 71.5 32.2
MAFT+ (ours) 27.1 73.5 32.9
 

In this section, we compare our proposed MAFT+ with the state-of-the-art open-vocabulary semantic segmentation methods and open-vocabulary panoptic segmentation methods.

Comparisons in the semantic setting. In Tab. 1, we present the performance of MAFT+ on various benchmarks. MAFT+ improves over existing open-vocabulary segmentation models, achieving a performance boost of +0.5, +2.3, +3.4, +0.4, and +1.1 mIoU on A-847, A-150, PC-459, PC-59, and PAS-20, respectively. Moreover, compared to MAFT [18], our MAFT+ eliminates the need for an additional fine-tuned CLIP-V. MAFT+ applies an end-to-end pipeline, which simplifies both the training and testing processes.

Comparisons in the panoptic setting. In Tab. 2, we evaluate our MAFT+ on ADE20K, the main evaluation dataset for open-vocabulary panoptic segmentation. Our approach achieves new state-of-the-art performance. Compared to FC-CLIP without the ensemble strategy (third-to-last row), our MAFT+ outperforms it by +5.2 PQ, +2.0 SQ and +6.5 RQ. Although the ensemble strategy greatly improves FC-CLIP’s performance, our model still outperforms FC-CLIP* across all evaluation metrics.
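For reference when reading Tab. 2, the sketch below computes PQ, SQ, and RQ for a single class following the standard definition [19]. The function name and the zero-return convention for a class with no segments are illustrative assumptions. Benchmark-level numbers average each quantity over classes, so the reported PQ need not equal the product of the reported SQ and RQ.

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """Return (PQ, SQ, RQ) for one class.

    matched_ious: IoU of each predicted/ground-truth segment pair counted
    as a true positive (pairs with IoU > 0.5 in the standard definition).
    num_fp, num_fn: unmatched predicted and ground-truth segments.
    """
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    if denom == 0:
        return 0.0, 0.0, 0.0  # assumption: class has no segments at all
    sq = sum(matched_ious) / tp if tp else 0.0  # mean IoU of matched pairs
    rq = tp / denom                             # F1-style recognition term
    return sq * rq, sq, rq
```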

Analysis of the ensemble strategy in FC-CLIP. FC-CLIP ensembles the classification scores of Mask2Former and CLIP, using two hyper-parameters to balance these scores. As shown in Tab. 1 and Tab. 2, the ensemble operation significantly improves FC-CLIP’s performance, e.g., 42.4 → 58.4 mIoU on PC-59 and 21.9 → 26.8 PQ on ADE20K. However, this improvement stems from the overlap of categories between the training and testing datasets. Moreover, determining the two critical hyper-parameters requires numerous repeated experiments. For these reasons, our default setting removes the ensemble operation and uses only CLIP for classification.
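To make the role of the two hyper-parameters concrete, the following sketch blends two per-class score lists, assuming a geometric-mean form with one exponent for categories that overlap the training set and another for the remaining ones. This form, the function name, and the argument names `alpha` and `beta` are assumptions made for illustration; the exact formulation and values are given in [39] and are not reproduced here.

```python
def ensemble_scores(p_in, p_out, seen, alpha, beta):
    """Blend two per-class probability lists with a geometric mean.

    p_in:  scores from the trained (in-vocabulary) classifier.
    p_out: scores from the frozen CLIP classifier.
    seen:  booleans, True where the class overlaps the training categories.
    alpha, beta: weights on p_out for seen and unseen classes, in [0, 1].
    """
    fused = []
    for pi, po, is_seen in zip(p_in, p_out, seen):
        w = alpha if is_seen else beta
        fused.append((pi ** (1.0 - w)) * (po ** w))
    return fused
```

Under this assumed form, setting both weights to 1 reduces the blend to the CLIP scores alone, which corresponds to classifying with CLIP only. The sketch also shows that the blend needs a per-class `seen` flag, i.e., knowledge of which test categories overlap the training set.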

5.3 Ablation Study

We conduct ablation studies on the design choices of MAFT+ and report their contributions to the final results in Tab. 3, 4, and 5. The baseline model freezes CLIP-V and removes the Content-Dependent Transfer (i.e., it uses the representation of a frozen CLIP).

Component-wise ablations. To understand the effect of each component of MAFT+, namely the Representation Compensation (RC) strategy and the Content-Dependent Transfer (CDT), we start with a frozen CLIP as the baseline model and gradually add each design (Tab. 3). The frozen CLIP yields inferior performance due to CLIP’s region-unaware property (1st row). Adding Content-Dependent Transfer optimizes the CLIP text representation and promotes the alignment of vision and text embeddings, resulting in an improvement of +5.8 mIoU on A-150 and +12.8 mIoU on PC-59 (2nd row). Using only Representation Compensation to fine-tune CLIP-V also produces decent performance (3rd row): 26.6 → 34.8 mIoU on A-150 and 42.4 → 57.1 mIoU on PC-59. Finally, combining CDT and RC learns the vision and text representations collaboratively, adapting CLIP from image-level to segmentation tasks and yielding the best results on all five benchmarks (last row).

Table 3: Ablation on components of MAFT+. Here RC and CDT denote Representation Compensation and Content-Dependent Transfer. Note that “tune CLIP-T” represents optimizing the distribution of text-embeds, not directly fine-tuning CLIP-T.
  A-847 A-150 PC-459 PC-59 PAS-20
frozen CLIP (baseline) 11.2 26.6 12.7 42.4 89.5
+ CDT (tune CLIP-T) 13.3 (+2.1) 32.4 (+5.8) 17.2 (+4.5) 55.2 (+12.8) 94.7 (+5.2)
+ RC (tune CLIP-V) 14.6 (+3.4) 34.8 (+8.2) 18.2 (+5.5) 57.1 (+14.7) 95.3 (+5.8)
+ CDT & RC 15.1 (+3.9) 36.1 (+9.5) 21.6 (+8.9) 59.4 (+17.0) 96.5 (+7.0)
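The gains listed in Tab. 3 can be re-derived from the absolute scores. The short check below does so with values copied from the table; the dictionary layout and the helper name `gains` are only illustrative.

```python
# Absolute mIoU values copied from Tab. 3.
BASELINE = {"A-847": 11.2, "A-150": 26.6, "PC-459": 12.7,
            "PC-59": 42.4, "PAS-20": 89.5}
VARIANTS = {
    "+ CDT": {"A-847": 13.3, "A-150": 32.4, "PC-459": 17.2,
              "PC-59": 55.2, "PAS-20": 94.7},
    "+ RC": {"A-847": 14.6, "A-150": 34.8, "PC-459": 18.2,
             "PC-59": 57.1, "PAS-20": 95.3},
    "+ CDT & RC": {"A-847": 15.1, "A-150": 36.1, "PC-459": 21.6,
                   "PC-59": 59.4, "PAS-20": 96.5},
}


def gains(row, base=BASELINE):
    """Per-benchmark improvement of one row over the frozen-CLIP baseline."""
    return {k: round(row[k] - base[k], 1) for k in base}


if __name__ == "__main__":
    for name, row in VARIANTS.items():
        print(name, gains(row))
```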
 

Effect of Content-Dependent Transfer. Optimizing the CLIP text representation is an essential design of MAFT+. We investigate various designs to optimize the CLIP-T representation in Fig. 5, including direct fine-tuning of CLIP-T parameters, training with an additional MLP, training with class-description sentences generated by GPT, and training with class-description embeddings generated by Llama-2. Tab. 4 presents the results of these designs. Here, we remove the Representation Compensation strategy and keep CLIP-V frozen for analysis.

Figure 5: Comparisons between CLIP-T tuning strategies.
Table 4: Ablation of diverse designs of CLIP-Text optimization. We remove the Representation Compensation strategy and freeze CLIP-V for analysis. Note that fine-tuning CLIP-T requires excessive GPU memory, and thus it is infeasible (denoted as N/A) for the setting in the 2nd row.
  A-847 A-150 PC-459 PC-59 PAS-20
frozen CLIP (baseline) 11.2 26.6 12.7 42.4 89.5
+ fine-tune CLIP-T N/A N/A N/A N/A N/A
+ MLP 4.1 20.2 11.2 51.4 89.4
+ GPT-Description 11.9 28.2 13.3 42.6 90.6
+ Llama-Description 9.6 26.1 11.5 40.8 90.9
+ Content-Dependent Transfer 13.3 32.4 17.2 55.2 94.7
 
  • a. fine-tuning CLIP-T We explore fine-tuning the CLIP-T parameters to optimize the CLIP text representation. The category name ($\{\mathrm{CLS}\}$) is first augmented into sentences by templates [38, 37, 35, 39, 18] and fed into CLIP-T. However, fine-tuning CLIP-T (2nd row in Tab. 4) requires excessive GPU memory (more than 8 NVIDIA A100 GPUs), which is unaffordable in our experiments.

  • b. MLP An MLP layer is added after CLIP-T and learns to project the text embeddings to fit the segmentation distribution. Within this design, CLIP-T is frozen, greatly reducing GPU memory consumption compared with fine-tuning CLIP-T. According to the 3rd row in Tab. 4, the performance drops significantly on ADE20K (11.2 → 4.1 on A-847, 26.6 → 20.2 on A-150), while increasing on PC-59 (42.4 → 51.4). This could be attributed to the MLP layer losing CLIP’s zero-shot capability and thus failing to perceive novel categories effectively.

  • c. GPT-Description We assume that a detailed description of $\{\mathrm{CLS}\}$ contains additional valuable information that helps to optimize the CLIP-T distribution. To explore this, we leverage GPT-3.5 [1] to generate description sentences for each $\{\mathrm{CLS}\}$. For example, if the instruction provided to GPT is [Instruct] = “Please describe the appearance of cat.”, GPT responds with description sentences of cat: [Response] = “[-a rounded head; -a short snout; -triangular ears …]”. Then we use a frozen CLIP-T to generate the corresponding text embeddings, followed by an MLP layer to project the embeddings. Within this design, the performance is slightly improved: +0.7 on A-847, +2.0 on A-150 (4th row in Tab. 4).

  • d. Llama-Description In view of the powerful text representation capability of Large Language Models (LLMs), we explore using the open-source LLM Llama-2 [31] to generate descriptive text embeddings. After obtaining Llama and CLIP-T embeddings, we average them and train an MLP layer to project the Llama embeddings into the CLIP-T embedding space. Our experimental results demonstrate that this design does not benefit the performance (5th row in Tab. 4). The mIoU drops from 11.2 to 9.6 on A-847 and from 12.7 to 11.5 on PC-459. This decrease may be due to the fact that the LLM’s feature space is not aligned with CLIP-V’s visual feature space.

  • Content-Dependent Transfer We propose the Content-Dependent Transfer to enhance CLIP text embeddings conditioned on the input images. Details can be found in Sec. 4.3. As shown in the last row of Tab. 4, the Content-Dependent Transfer improves the performance on all five datasets: 11.2 → 13.3, 26.6 → 32.4, 12.7 → 17.2, 42.4 → 55.2, and 89.5 → 94.7, respectively.

Analysis of why LLMs do not work. OVS focuses on data-limited settings and examines the model’s ability to segment categories specified by arbitrary text after seeing only a few classes. Therefore, the effective image-text alignment of prior models (e.g., CLIP) is crucial. Despite LLMs’ strong text processing capabilities, their potential is not fully realized with limited data, resulting in incomplete image-text alignment. Thus, simply adapting LLMs to OVS is unsuitable, and doing so effectively may require further research. Note: The descriptions of all categories in the training set can be obtained in a single pre-processing step. Therefore, in c. & d., the additional computational cost during training can be ignored. More details of the templates and the designs for GPT and Llama-2 are provided in the Appendix.

Table 5: Ablations of the Representation Compensation strategy. The Content-Dependent Transfer is removed. The best results are highlighted with red, and the default settings are highlighted with gray background.
  A-847 A-150 PC-59
None 14.6 34.8 57.1
Freeze {S0, 1} 14.6 34.7 57.0
Freeze {S0, 1, 2} 14.0 34.6 55.3
Freeze {S0, 1, 2, 3} 13.6 33.6 54.7
 
(a) Ablation of the frozen stages in CLIP-V.
  A-847 A-150 PC-59
Grid {1} 13.8 33.9 56.5
Grid {1, 2} 14.0 34.6 56.6
Grid {1, 2, 4} 14.6 34.8 57.1
Grid {1, 3, 6} 14.5 34.6 55.7
 
(b) Ablation of the $\mathrm{AvgPooling}$ grid in $\mathcal{L}_{rc}$.

Effect of Representation Compensation. We conduct ablation studies on the Representation Compensation strategy in Tab. 5. Here, we remove the Content-Dependent Transfer for analysis.

  • Frozen stages in CLIP-V: We explore the impact of the fine-tuned units within CLIP-V. CLIP-V consists of 4 ConvNeXt stages {S0, S1, S2, S3}, which downsample the image features from $\frac{1}{4}$ to $\frac{1}{32}$ of the input resolution. We start with fine-tuning the entire CLIP-V and then freeze the stages sequentially, as detailed in Tab. 5(a). Compared to fine-tuning the entire CLIP-V, freezing any stage causes performance degradation. Freezing S0-1, S0-2, and S0-3 brings -0.1, -1.8, and -2.4 mIoU performance degradation respectively on PC-59, indicating that freezing S2 and S3 (the deeper ConvNeXt stages) has the most significant impact on the performance.

  • Effect of $\mathrm{AvgPooling}$ grids: In Tab. 5(b), we investigate how different multi-scale $\mathrm{AvgPooling}$ grids ($\{1\}$, $\{1,2\}$, $\{1,2,4\}$, $\{1,3,6\}$) in $\mathcal{L}_{rc}$ impact performance. Results show that the $\{1,2,4\}$ grids achieve the best performance, reaching 34.8 mIoU on A-150. Using the $\{1,3,6\}$ grids results in a drop of -1.6 on PC-59, suggesting that overly large $\mathrm{AvgPooling}$ grids compromise the model’s ability to learn region-level differences.
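To illustrate what a multi-scale pooling grid such as $\{1,2,4\}$ computes, the sketch below pools a single-channel 2D feature map into $g\times g$ cells for each grid size $g$ and compares the pooled summaries of a fine-tuned and a frozen feature map. The bin boundaries follow the usual adaptive-pooling convention; the mean absolute difference used as the distance, the single-channel simplification, and both function names are assumptions for illustration and do not restate the definition of $\mathcal{L}_{rc}$ given in the method section.

```python
import math


def adaptive_avg_pool(feat, grid):
    """Average a 2D feature map (list of rows) into grid x grid cells."""
    h, w = len(feat), len(feat[0])
    pooled = []
    for i in range(grid):
        r0, r1 = (i * h) // grid, math.ceil((i + 1) * h / grid)
        for j in range(grid):
            c0, c1 = (j * w) // grid, math.ceil((j + 1) * w / grid)
            cell = [feat[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            pooled.append(sum(cell) / len(cell))
    return pooled


def multi_scale_distance(tuned, frozen, grids=(1, 2, 4)):
    """Mean absolute difference between pooled maps, averaged over grids."""
    total = 0.0
    for g in grids:
        a = adaptive_avg_pool(tuned, g)
        b = adaptive_avg_pool(frozen, g)
        total += sum(abs(x - y) for x, y in zip(a, b)) / len(a)
    return total / len(grids)
```

A grid of 1 compares only the global average of the two maps, while larger grid values compare progressively more local regions.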

Table 6: Extending MAFT+ with ConvNeXt-Base CLIP. The best results are highlighted with red.
   A-847  A-150  PC-459  PC-59  PAS-20
FC-CLIP*[39] 12.7 31.1 12.5 54.3 93.8
MAFT+ 13.2 33.6 14.2 55.9 93.9
 

Extending MAFT+ with ConvNeXt-Base CLIP. To showcase the efficacy and robustness of MAFT+, we conduct experiments using ConvNeXt-Base CLIP. The results are shown in Tab. 6, where we also include the results of FC-CLIP* for comparison. MAFT+ outperforms its FC-CLIP* counterpart on all five datasets, with the largest gain on A-150 (+2.5 mIoU). This suggests that MAFT+ can easily transfer to other CLIP models.

5.4 Qualitative Study

Visualizations of similarity map.

Figure 6: Qualitative results. Normalized cosine similarity between the text embeddings and image embeddings of 59 classes in PC59. Text & image embeddings are generated by frozen CLIP (left). Text & image embeddings are generated by our MAFT+ fine-tuned CLIP (right). The high similarity scores are highlighted in yellow, low similarity scores are shown in blue.

Fig. 6 presents the normalized similarity maps between text and image embeddings, generated by frozen CLIP embeddings (left) and by fine-tuned CLIP embeddings (right). The high similarity values of the fine-tuned CLIP are mainly located on the diagonal of the similarity map, indicating that the collaborative optimization of CLIP-V and CLIP-T achieves better alignment of the vision-text representation.
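A similarity map of this kind can be sketched as follows, given one text embedding and one image embedding per class. The min-max normalization to [0, 1], the helper names, and the diagonal hit rate used as a summary statistic are assumptions made for illustration; embeddings are assumed to be non-zero vectors.

```python
import math


def _unit(v):
    """Scale a vector to unit L2 norm (assumes a non-zero vector)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def cosine_similarity_map(text_embs, image_embs):
    """S[i][j] = cos(text_embs[i], image_embs[j]), min-max scaled to [0, 1]."""
    t = [_unit(v) for v in text_embs]
    m = [_unit(v) for v in image_embs]
    sim = [[sum(a * b for a, b in zip(ti, mj)) for mj in m] for ti in t]
    lo = min(min(row) for row in sim)
    hi = max(max(row) for row in sim)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant map
    return [[(s - lo) / span for s in row] for row in sim]


def diagonal_hit_rate(sim):
    """Fraction of rows whose largest entry lies on the diagonal."""
    hits = 0
    for i, row in enumerate(sim):
        best = max(range(len(row)), key=row.__getitem__)
        hits += int(best == i)
    return hits / len(sim)
```

A well-aligned embedding space gives a map whose large entries concentrate on the diagonal, i.e., a hit rate close to 1.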

Qualitative analysis.

Figure 7: Qualitative results. The results with the frozen CLIP and our MAFT+ fine-tuned CLIP are shown for comparison.

We show some visual examples in Fig. 7. Even in simple cases, the frozen CLIP results may contain background noise and tend to classify multiple objects into one single class (e.g., the 1st row, “bicycle”). The frozen CLIP is also prone to misclassification when there are many categories in one image (the 3rd row, “streetlight”, “sidewalk”, “hill”). Our fine-tuned CLIP collaboratively learns the vision-text representation for segmentation tasks, which significantly improves the segmentation results. In addition, the 2nd row shows that our fine-tuned CLIP successfully segments “balcony”, which is a reasonable outcome even though “balcony” does not appear in the ground-truth annotations. More visual samples are shown in the Appendix.

6 Conclusion

In this paper, we rethink the issues in the frozen CLIP paradigm and the CLIP-V fine-tuning paradigm and propose a collaborative vision-text optimizing structure, MAFT+, for OVS. We introduce the Representation Compensation strategy, which reviews the original CLIP representation to maintain the zero-shot capability of CLIP-V, and propose the Content-Dependent Transfer to optimize the text representation in a parameter-efficient way. Extensive experiments demonstrate that MAFT+ achieves superior performance on multiple open-vocabulary segmentation datasets.

Limitations. While the proposed MAFT+ optimizes the vision-text representation space of CLIP to fit the distribution of OVS, it is important to acknowledge that the optimization upper bound is constrained by the capabilities of the pre-trained CLIP model. Addressing this limitation constitutes our future research focus.

Acknowledgements

This work was supported in part by the National Key R&D Program of China (No. 2021ZD0112100) and the National NSF of China (No. U23A20314).

References

  • [1] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020)
  • [2] Bucher, M., Vu, T.H., Cord, M., Pérez, P.: Zero-shot semantic segmentation. Advances in Neural Information Processing Systems 32 (2019)
  • [3] Caesar, H., Uijlings, J., Ferrari, V.: Coco-stuff: Thing and stuff classes in context. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1209–1218 (2018)
  • [4] Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40(4), 834–848 (2017)
  • [5] Chen, X., Li, S., Lim, S.N., Torralba, A., Zhao, H.: Open-vocabulary panoptic segmentation with embedding modulation. arXiv preprint arXiv:2303.11324 (2023)
  • [6] Cheng, B., Choudhuri, A., Misra, I., Kirillov, A., Girdhar, R., Schwing, A.G.: Mask2former for video instance segmentation. arXiv preprint arXiv:2112.10764 (2021)
  • [7] Cheng, B., Schwing, A., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. Advances in Neural Information Processing Systems 34 (2021)
  • [8] Ding, J., Xue, N., Xia, G.S., Dai, D.: Decoupling zero-shot semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11583–11592 (2022)
  • [9] Everingham, M., Eslami, S.A., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes challenge: A retrospective. International journal of computer vision 111, 98–136 (2015)
  • [10] Fang, Y., Zhu, F., Cheng, B., Liu, L., Zhao, Y., Wei, Y.: Locating noise is halfway denoising for semi-supervised segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 16612–16622 (2023)
  • [11] Ghiasi, G., Gu, X., Cui, Y., Lin, T.Y.: Scaling open-vocabulary image segmentation with image-level labels. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXVI. pp. 540–557. Springer (2022)
  • [12] Gu, Z., Zhou, S., Niu, L., Zhao, Z., Zhang, L.: Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia. pp. 1921–1929 (2020)
  • [13] Han, K., Liu, Y., Liew, J.H., Ding, H., Wei, Y., Liu, J., Wang, Y., Tang, Y., Yang, Y., Feng, J., et al.: Global knowledge calibration for fast open-vocabulary segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2023)
  • [14] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 603–612 (2019)
  • [15] Huang, Z., Wei, Y., Wang, X., Liu, W., Huang, T.S., Shi, H.: Alignseg: Feature-aligned segmentation networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 44(1), 550–557 (2021)
  • [16] Ilharco, G., Wortsman, M., Wightman, R., Gordon, C., Carlini, N., Taori, R., Dave, A., Shankar, V., Namkoong, H., Miller, J., et al.: Openclip (July 2021)
  • [17] Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q., Sung, Y.H., Li, Z., Duerig, T.: Scaling up visual and vision-language representation learning with noisy text supervision. In: International Conference on Machine Learning. pp. 4904–4916. PMLR (2021)
  • [18] Jiao, S., Wei, Y., Wang, Y., Zhao, Y., Shi, H.: Learning mask-aware clip representations for zero-shot segmentation. Advances in Neural Information Processing Systems 36 (2023)
  • [19] Kirillov, A., He, K., Girshick, R., Rother, C., Dollár, P.: Panoptic segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9404–9413 (2019)
  • [20] Kuhn, H.W.: The hungarian method for the assignment problem. Naval research logistics quarterly 2(1-2), 83–97 (1955)
  • [21] Li, B., Weinberger, K.Q., Belongie, S., Koltun, V., Ranftl, R.: Language-driven semantic segmentation. In: International Conference on Learning Representations (2022), https://openreview.net/forum?id=RriDjddCLN
  • [22] Liang, F., Wu, B., Dai, X., Li, K., Zhao, Y., Zhang, H., Zhang, P., Vajda, P., Marculescu, D.: Open-vocabulary semantic segmentation with mask-adapted clip. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7061–7070 (2023)
  • [23] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. pp. 740–755. Springer (2014)
  • [24] McCloskey, M., Cohen, N.J.: Catastrophic interference in connectionist networks: The sequential learning problem. In: Psychology of learning and motivation, vol. 24, pp. 109–165. Elsevier (1989)
  • [25] Mottaghi, R., Chen, X., Liu, X., Cho, N.G., Lee, S.W., Fidler, S., Urtasun, R., Yuille, A.: The role of context for object detection and semantic segmentation in the wild. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 891–898 (2014)
  • [26] Qin, J., Wu, J., Yan, P., Li, M., Yuxi, R., Xiao, X., Wang, Y., Wang, R., Wen, S., Pan, X., et al.: Freeseg: Unified, universal and open-vocabulary image segmentation. arXiv preprint arXiv:2303.17225 (2023)
  • [27] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
  • [28] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)
  • [29] Shaban, A., Bansal, S., Liu, Z., Essa, I., Boots, B.: One-shot learning for semantic segmentation. arXiv preprint arXiv:1709.03410 (2017)
  • [30] Sun, Y., Chen, Q., He, X., Wang, J., Feng, H., Han, J., Ding, E., Cheng, J., Li, Z., Wang, J.: Singular value fine-tuning: Few-shot segmentation requires few-parameters fine-tuning. arXiv preprint arXiv:2206.06122 (2022)
  • [31] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
  • [32] Xian, Y., Choudhury, S., He, Y., Schiele, B., Akata, Z.: Semantic projection network for zero-and few-label semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8256–8265 (2019)
  • [33] Xiao, J.W., Zhang, C.B., Feng, J., Liu, X., van de Weijer, J., Cheng, M.M.: Endpoints weight fusion for class incremental semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7204–7213 (2023)
  • [34] Xu, J., De Mello, S., Liu, S., Byeon, W., Breuel, T., Kautz, J., Wang, X.: Groupvit: Semantic segmentation emerges from text supervision. CVPR (2022)
  • [35] Xu, J., Liu, S., Vahdat, A., Byeon, W., Wang, X., De Mello, S.: Open-vocabulary panoptic segmentation with text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2955–2966 (2023)
  • [36] Xu, J., Chen, W., Zhao, Y., Wei, Y.: Transferable and principled efficiency for open-vocabulary segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15814–15824 (2024)
  • [37] Xu, M., Zhang, Z., Wei, F., Hu, H., Bai, X.: Side adapter network for open-vocabulary semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2945–2954 (2023)
  • [38] Xu, M., Zhang, Z., Wei, F., Lin, Y., Cao, Y., Hu, H., Bai, X.: A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXIX. pp. 736–753. Springer (2022)
  • [39] Yu, Q., He, J., Deng, X., Shen, X., Chen, L.C.: Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional clip. Advances in Neural Information Processing Systems 36 (2023)
  • [40] Zhang, C.B., Xiao, J.W., Liu, X., Chen, Y.C., Cheng, M.M.: Representation compensation networks for continual semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7053–7064 (2022)
  • [41] Zhang, G., Navasardyan, S., Chen, L., Zhao, Y., Wei, Y., Shi, H., et al.: Mask matching transformer for few-shot segmentation. Advances in Neural Information Processing Systems 35, 823–836 (2022)
  • [42] Zhang, G., Wang, L., Kang, G., Chen, L., Wei, Y.: Slca: Slow learner with classifier alignment for continual learning on a pre-trained model. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 19148–19158 (2023)
  • [43] Zhang, Z., Gao, G., Fang, Z., Jiao, J., Wei, Y.: Mining unseen classes via regional objectness: A simple baseline for incremental segmentation. Advances in Neural Information Processing Systems 35, 24340–24353 (2022)
  • [44] Zhang, Z., Gao, G., Jiao, J., Liu, C.H., Wei, Y.: Coinseg: Contrast inter-and intra-class representations for incremental segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2023)
  • [45] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2881–2890 (2017)
  • [46] Ding, Z., Wang, J., Tu, Z.: Open-vocabulary universal image segmentation with maskclip. In: International Conference on Machine Learning (2023)
  • [47] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ade20k dataset. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 633–641 (2017)

Appendix

We first introduce the dataset settings in Sec. 0.A. Then, the prompt engineering technique is introduced in detail in Sec. 0.B, including the template-based prompt and the description-based prompt. Moreover, we provide additional qualitative results in Sec. 0.C.

Appendix 0.A Dataset

We follow [37, 39, 35, 18] to conduct experiments on the popular benchmarks of the open-vocabulary semantic and panoptic settings, namely COCO-Stuff [3], COCO-Panoptic [23], Pascal-VOC [9], ADE20K [47], and Pascal-Context [25], to evaluate the performance of MAFT+.

  • COCO-Stuff: COCO-Stuff is a large-scale semantic segmentation dataset that contains 164K images with 171 annotated classes, which are divided into the training set (118K images), validation set (5K images), and testing set (41K images). In our experiments, we use the full 118K training set as the training data to train the semantic models.

  • COCO-Panoptic: COCO-Panoptic shares the same training images with COCO-Stuff. These images are labeled into 133 categories. In our experiments, we use COCO-Panoptic to train the panoptic models.

  • Pascal-VOC: Pascal-VOC includes 1,449 images for testing with 20 annotated classes. In the open-vocabulary semantic segmentation, all 20 classes are used for evaluation (dubbed as PAS-20).

  • ADE20K: ADE20K is a large-scale scene understanding dataset comprising 2k images for validation with two types of annotations: one with 150 classes featuring panoptic annotations and another with 847 classes featuring semantic annotations. For the open-vocabulary semantic segmentation, we evaluate our method on two settings of ADE20K: 150 classes (dubbed as A-150) and 847 classes (dubbed as A-847). In the open-vocabulary panoptic segmentation, we use the setting with 150 class annotations for evaluation.

  • Pascal-Context: Pascal-Context is a dataset for semantic understanding which contains 5K validation images. Two versions are used for open-vocabulary semantic segmentation, one with 59 frequently used classes (dubbed as PC-59) and another with the whole 459 classes (dubbed as PC-459).

Appendix 0.B Template-based Prompt & Description-based Prompt

Prompt engineering has been proven to be beneficial for open-vocabulary segmentation. In our default setting, we follow the common practice of using the template-based prompt to augment class names into sentences. In addition, we also explore using GPT [1] or Llama [31] to apply description-based prompts.

Template-based Prompt. Following established approaches [37, 39, 35, 18], we use multiple templates to integrate the class names into sentences. These sentences are then fed into CLIP-T, and the resulting outputs are averaged to generate the text embedding for each class. The templates are listed in Tab. 7.
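The template-averaging step can be sketched as below. The callable `encode_text` stands in for CLIP-T and is a hypothetical interface that maps one sentence to a list of floats; the three templates are a subset of Tab. 7. Whether embeddings are normalized before or after averaging is not specified here, so the sketch performs a plain arithmetic mean.

```python
# A subset of the templates in Tab. 7; "{}" is replaced by the class name.
TEMPLATES = [
    "a photo of a {}.",
    "There is a {} in the scene",
    "a photo of a large {}.",
]


def class_text_embedding(class_name, encode_text, templates=TEMPLATES):
    """Average the embeddings of all templated sentences for one class.

    encode_text: hypothetical stand-in for CLIP-T (sentence -> list of floats).
    """
    sentences = [t.format(class_name) for t in templates]
    embs = [encode_text(s) for s in sentences]
    dim = len(embs[0])
    return [sum(e[d] for e in embs) / len(embs) for d in range(dim)]
```

Since the class names of a benchmark are fixed, these averaged embeddings can be computed once and cached.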

Description-based Prompt. We assume that detailed descriptions of a class name contain additional valuable information that helps to optimize CLIP-T. To investigate this, we design description-based prompts that leverage Large Language Models (LLMs) to generate descriptions: we use GPT-3.5 [1] to generate description sentences, and the open-source LLM Llama-2 [31] to generate descriptive text embeddings. Through experimental verification, we selected a few prompts suitable for LLMs to generate descriptions. The prompts and responses are shown in Tab. 8(a) and Tab. 8(b), respectively.

The results indicate that some descriptions provide valuable visual attributes, facilitating the alignment of vision-text representations in the CLIP feature space. However, they may also introduce noise, e.g., the descriptions of both cat and chair include the phrase “four legs”.
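The pre-processing step can be sketched as follows: each class name is inserted into the description prompts, and the returned attribute lists can then be scanned for attributes shared across classes, which is one source of the noise mentioned above. The two prompts are taken from Tab. 8(a); both function names and the dictionary layout are hypothetical, and the call to the LLM itself is omitted.

```python
# Two of the description prompts from Tab. 8(a).
DESCRIPTION_PROMPTS = [
    "Please describe the appearance of {}. Please characterize it briefly.",
    "Briefly outline the visual traits of the category of {}.",
]


def build_description_prompts(class_names, prompts=DESCRIPTION_PROMPTS):
    """Map each class name to its list of filled-in LLM instructions."""
    return {name: [p.format(name) for p in prompts] for name in class_names}


def shared_attributes(descriptions):
    """Return attributes that are listed for more than one class.

    descriptions: dict mapping class name -> list of attribute strings.
    """
    owners = {}
    for name, attrs in descriptions.items():
        for attr in set(attrs):
            owners.setdefault(attr, []).append(name)
    return {a: sorted(ns) for a, ns in owners.items() if len(ns) > 1}
```

Because the training categories are fixed, the prompts only need to be issued once, and the responses can be stored for reuse during training.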

Appendix 0.C Similarity Maps & Visualization Results

We provide more qualitative results, including similarity maps (Fig. 8) and visualization results on the Pascal-VOC, COCO-Stuff, and ADE20K datasets (Fig. 9, 10).

Similarity map. Fig. 8 presents the normalized similarity maps between text and image embeddings in A-150 and A-847 datasets. We choose 200 categories in A-847 for visualization. It is evident that the elevated similarity values of fine-tuned CLIP’s similarity map are mainly located on the main diagonal, indicating the fine-tuned CLIP achieves a better alignment of vision-text representation.

Qualitative Analysis. Fig. 9 and Fig. 10 show segmentation results on Pascal-VOC, COCO-Stuff, and ADE20K. The frozen CLIP results may contain background noise (1st and 2nd rows in Fig. 9) or misclassifications when there are many objects in one image (3rd row in Fig. 9). The fine-tuned CLIP generates better results than the frozen CLIP and can even correct mislabeled areas in the ground truth (e.g., the fence in the 4th row of Fig. 9).

Table 7: Prompt templates used in our method.
  Templates
“a photo of a {}.”
“This is a photo of a {}”
“There is a {} in the scene”
“There is the {} in the scene”
“a photo of a {} in the scene”
“a photo of a small {}.”
“a photo of a medium {}.”
“a photo of a large {}.”
“This is a photo of a small {}.”
“This is a photo of a medium {}.”
“This is a photo of a large {}.”
“There is a small {} in the scene.”
“There is a medium {} in the scene.”
“There is a large {} in the scene.”
 
Table 8: Description-based prompts and responses
  Description prompts
“Please describe the appearance of {}. Please characterize it briefly.”
“Describe the physical attributes of {}. Please characterize it briefly.”
“What can you tell me about the appearance of the category of {}? Please characterize it briefly.”
“Tell me about the outward features of the category of {}. Please characterize it briefly.”
“Briefly outline the visual traits of the category of {}.”
“Can you provide details about what the category of {} looks like? Please characterize it briefly.”
“I’m curious about the visual characteristics of the category of {}. Please characterize it briefly.”
“Provide a description of the visual aspects of {}. Please characterize it briefly.”
“Q: What are visual features of distinguishing a smartphone? A: - a touchscreen
Q: What are features for distinguishing a {}? A: -”
 
(a) Description prompts used in our method.
[Uncaptioned image]
(b) LLMs responses of category cat, airplane, chair, and river.
Figure 8: Normalized cosine similarity on A-150 and A-847. We choose 200 categories in A-847 for visualization.
Figure 9: Qualitative results.
Figure 10: Qualitative results.