
Structure-Aware Residual-Center Representation for
Self-Supervised Open-Set 3D Cross-Modal Retrieval

Abstract

Existing 3D cross-modal retrieval methods lean heavily on category distribution priors within the training set, which diminishes their efficacy on unseen categories in open-set environments. To tackle this problem, we propose the Structure-Aware Residual-Center Representation (SRCR) framework for self-supervised open-set 3D cross-modal retrieval. To address the center deviation caused by category distribution differences, we learn a Residual-Center Embedding (RCE) for each object with nested auto-encoders, rather than directly mapping objects to the modality or category centers. In addition, we apply Hierarchical Structure Learning (HSL) to leverage the high-order correlations among objects for generalization, by constructing a heterogeneous hypergraph structure based on hierarchical inter-modality, intra-object, and implicit-category correlations. Extensive experiments and ablation studies on four benchmarks demonstrate the superiority of the proposed framework over state-of-the-art methods.

Index Terms—  3D Object Retrieval, Cross-Modal Retrieval, Open-Set Learning, Self-Supervised Learning, Hypergraph

1 Introduction

The proliferation of multimedia data on the Internet, including videos, images, text, and more, has sparked growing interest within the community in the field of cross-modal retrieval tasks. Among them, 3D cross-modal retrieval (3DCMR) has garnered growing attention due to the inherent diverse modalities of 3D data [1] and its relevance across crucial domains such as robotics, medicine, and other significant fields.

A typical 3D cross-modal retrieval task aims to retrieve 3D data in one modality given queries from a different modality. To bridge the heterogeneity gap [2] between modalities, a widely adopted strategy is to seek a function that maps data samples from diverse modalities into a unified global representation space [1, 3], which is called the center.

Fig. 1: Illustration of the proposed SRCR. Given 3D objects of unseen categories represented by different modalities, our method generates the residual-center embeddings for each modality of each object. Then unified center representations are generated via hierarchical structure learning for cross-modal retrieval with unseen categories generalization.

Current methods of constructing such mapping can be broadly categorized into two approaches. One straightforward solution for this task is to construct complex nonlinear transformations [4, 5] that map two types of pre-trained features into a shared space. The alternative approach employs adversarial loss to learn category-related central embedding [1, 3] through end-to-end training. However, both methods exhibit a pronounced dependence on the prior distribution of category spaces within the training sets, which leads to substantial representational biases when confronted with objects of unseen categories. Furthermore, dependency on training labels in adversarial loss also complicates the deployment of 3D cross-modal retrieval.

To overcome the aforementioned challenges, we propose the Structure-Aware Residual-Center Representation (SRCR) framework for the self-supervised open-set 3D cross-modal retrieval task, as shown in Fig. 1. On one hand, to overcome the center deviation caused by category distribution differences, we learn a residual-center embedding for each object with nested auto-encoders, rather than directly mapping objects to the modality or category center. On the other hand, we adopt a hierarchical structure learning approach that exploits the high-order correlations among objects for generalization, by constructing a heterogeneous hypergraph structure based on hierarchical inter-modality, intra-object, and implicit-category correlations. Our contributions are summarized as follows:

  • We introduce a practical open-set setting for 3D cross-modal retrieval and generate four datasets for benchmarking of downstream 3D cross-modal tasks.

  • We propose the Structure-Aware Residual-Center Representation (SRCR) framework for open-set 3D cross-modal retrieval, including the Residual-Center Embedding (RCE) and Hierarchical Structure Learning (HSL) modules, which are designed to overcome the modality divergence accentuated by the distribution of unseen categories.

  • We propose a hierarchical hypergraph structure to capture the high-order correlations among objects, under the guidance of hierarchical inter-modality, intra-object, and implicit-category correlations.

  • The proposed framework significantly outperforms the state-of-the-art 3D cross-modal retrieval methods under the open-set setting.

2 Related Work

2.1 Cross-Modal Retrieval

Existing methods usually construct a mapping function into a unified common space to overcome the heterogeneity gap [2] between modalities. These approaches can be roughly classified into projection-based [4] and discrimination-based [1, 3] methods. While such methods excel under the closed-set assumption, their reliance on the training category distribution limits their generalization in real-world, open-set environments.

2.2 Open-Environment Learning

Most current methods of open-environment learning are designed for open-set recognition [6, 7], which is usually used to detect whether the sample belongs to the seen categories or not. While some methods have succeeded in open-set 3D multi-modal retrieval [8], the complexity and inherent disparities between different modalities still present considerable challenges in open-set cross-modal retrieval.

3 Methodology

3.1 Problem Setup

Given $N$ 3D objects $\{o_{i}\}=\{o^{r}_{i}\}_{r=1}^{M}$ represented by $M$ modalities, the goal of 3D cross-modal retrieval (3DCMR) is to develop a model using the training set $\mathcal{D}_{trn}=\{(o_{i},y_{i})\}^{L}_{i=1}$, and then employ it to retrieve, for each object in the query set $\mathcal{D}_{q}=\{(o^{q}_{i},\hat{y}_{i})\}^{Q}_{i=1}$, similar objects from the target set $\mathcal{D}_{t}=\{(o^{t}_{i},\hat{y}_{i})\}^{T}_{i=1}$, where the query and target objects are represented in different modalities ($t\neq q$). Here, $L$, $Q$, and $T$ denote the number of samples in the training, query, and target sets, respectively. The query set and target set together form the testing set $\mathcal{D}_{tes}=\{\mathcal{D}_{q},\mathcal{D}_{t}\}$. The labels $y_{i}\in\mathcal{Y}=\{c_{j}\}^{Y}_{j=1}$ and $\hat{y}_{i}\in\hat{\mathcal{Y}}=\{\hat{c}_{j}\}^{\hat{Y}}_{j=1}$ take values in the category spaces of the training set and testing set, where $Y$ and $\hat{Y}$ are the numbers of categories in the training and testing sets, respectively.

The traditional 3DCMR task is based on the closed-set assumption, which means that all categories of objects in the testing set $\mathcal{D}_{tes}=\{\mathcal{D}_{q},\mathcal{D}_{t}\}$ have been seen in the training set $\mathcal{D}_{trn}$. The category spaces of the training set and testing set are therefore the same, i.e., $\mathcal{Y}=\hat{\mathcal{Y}}$.

Different from the traditional closed-set assumption, we consider a more practical condition in which the testing set consists entirely of categories not encountered in the training set. We term this task Open-Set 3D Cross-Modal Retrieval. Under this circumstance, $\mathcal{D}_{trn}$ and $\mathcal{D}_{tes}$ follow their own distributions, which means $\mathcal{Y}\neq\hat{\mathcal{Y}}$. This task seeks to minimize the expected risk:

$$
\begin{split}
f^{*}=\mathop{\arg\min}\limits_{f\in\mathcal{H}}\;&\mathbb{E}_{(D_{i},D_{j})\sim(\mathcal{D}_{q},\mathcal{D}_{t})}\left[\mathbb{I}(\hat{y}_{i}\neq\hat{y}_{j})\,e^{-\mathbb{D}(f(o^{q}_{i}),f(o^{t}_{j}))}\right.\\
&\left.+\,\mathbb{I}(\hat{y}_{i}=\hat{y}_{j})\left(1-e^{-\mathbb{D}(f(o^{q}_{i}),f(o^{t}_{j}))}\right)\right],
\end{split}
\tag{1}
$$

where $D_{i}=(o^{q}_{i},\hat{y}_{i})$ and $D_{j}=(o^{t}_{j},\hat{y}_{j})$ are samples drawn from the query set $\mathcal{D}_{q}$ and target set $\mathcal{D}_{t}$. $\mathbb{I}(\cdot)$ is the indicator function, which returns $1$ if the expression is true and $0$ otherwise. $f: o^{r}_{i}\rightarrow z_{i}$ is the function that maps the 3D object $o^{r}_{i}$ represented in different modalities into the same embedding $z_{i}\in\mathbb{R}^{d}$. $\mathcal{H}$ is the hypothesis space of the function $f(\cdot)$ and $\mathbb{D}(z_{i},z_{j})$ is a distance metric function.
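To make the risk in Eq. 1 concrete, the following is a minimal NumPy sketch of its empirical counterpart over all query/target pairs. It is illustrative only: the function name `empirical_open_set_risk` is hypothetical, the Euclidean distance stands in for the unspecified metric $\mathbb{D}$, and the expectation is approximated by a uniform average over pairs.

```python
import numpy as np

def empirical_open_set_risk(zq, zt, yq, yt):
    """Average of the Eq. 1 integrand over all query/target pairs.

    zq: (Q, d) query embeddings, zt: (T, d) target embeddings,
    yq: (Q,) query labels, yt: (T,) target labels.
    Assumption: D is the Euclidean distance.
    """
    # Pairwise distances D(f(o_i^q), f(o_j^t)), shape (Q, T).
    dist = np.linalg.norm(zq[:, None, :] - zt[None, :, :], axis=-1)
    closeness = np.exp(-dist)
    same = yq[:, None] == yt[None, :]
    # Same-category pairs are penalised for being far apart,
    # different-category pairs for being close.
    per_pair = np.where(same, 1.0 - closeness, closeness)
    return per_pair.mean()

# Toy example: one query, two targets.
zq = np.array([[0.0, 0.0]])
zt = np.array([[0.0, 0.0], [3.0, 4.0]])
good = empirical_open_set_risk(zq, zt, np.array([0]), np.array([0, 1]))
bad = empirical_open_set_risk(zq, zt, np.array([0]), np.array([1, 0]))
```

In the toy example the first labelling places the matching target at distance zero and the non-matching one far away, so `good` is close to zero, while swapping the labels makes `bad` close to one.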

Fig. 2: An overview of the proposed structure-aware residual-center representation framework (SRCR). Our framework comprises two main modules: Residual-Center Embedding (RCE) and Hierarchical Structure Learning (HSL), which are used for residual embedding generation and structure-aware feature alignment, respectively.

3.2 Framework Architecture

The architecture of SRCR, as illustrated in Fig. 2, is composed of two modules: Residual-Center Embedding (RCE) and Hierarchical Structure Learning (HSL). Given basic features of different modalities extracted by commonly used networks, the Residual-Center Embedding module generates the residual-center embeddings for each object, rather than directly mapping them to the modality or category center. Then, in the Hierarchical Structure Learning stage, the hierarchical hypergraph structure is constructed based on the inter-modality, intra-object, and implicit-category correlations. Guided by this structure, the combination of hypergraph convolution and a memory bank leverages the high-order correlations between seen and unseen categories and across modalities. Finally, the aligned embedding of each modality is generated for cross-modal retrieval or other downstream tasks.

3.3 Residual-Center Embedding

In order to improve category generalization while projecting into the unified space, the residual-center embedding module is developed. Specifically, the RCE consists of two nested auto-encoders and takes the basic features of different modalities as input. The outer auto-encoder $\mathcal{A}^{r}_{out}$ encodes the basic features into a latent space and pulls them together into a unified embedding. The inner auto-encoder $\mathcal{A}^{r}_{in}$ encodes the modality embeddings from the outer auto-encoder to the residual space, which transforms the embedding between the modality space and the unified space.

3.3.1 Residual Learning

Consider $N$ 3D objects $\{o_{i}\}^{N}_{i=1}$ and the basic features $\{f^{r}_{i}\}^{M}_{r=1}$ of each object. As shown in Fig. 2, the outer auto-encoder $\mathcal{A}^{r}_{out}=\{\Psi^{r}_{out},\Phi^{r}_{out}\}_{r=1}^{M}$ compresses the basic features into a unified space $\mathbb{S}_{u}$ and reconstructs them from it for better representation, which is defined as follows:

$$
\left\{\begin{aligned}
u^{r}_{i}&=\Psi^{r}_{out}(f^{r}_{i})\\
\hat{f}^{r}_{i}&=\Phi^{r}_{out}(u^{r}_{i})
\end{aligned}\right.,
\tag{2}
$$

where $\Psi^{r}_{out}:\mathbb{S}_{r}\rightarrow\mathbb{S}_{u}$ is the encoder that maps the $r$-th modality space $\mathbb{S}_{r}$ into the unified space $\mathbb{S}_{u}$, and $\Phi^{r}_{out}:\mathbb{S}_{u}\rightarrow\mathbb{S}_{r}$ is the decoder that maps the features from the unified space $\mathbb{S}_{u}$ back to $\mathbb{S}_{r}$. $u^{r}_{i}\in\mathbb{R}^{d_{u}}$ and $\hat{f}^{r}_{i}\in\mathbb{R}^{d_{0}}$ denote the compressed features and reconstructed features of each modality.

An aggregation function $\mathcal{U}$ is adopted to generate the unified embedding $u_{i}=\mathcal{U}(\{u^{r}_{i}\}_{r=1}^{M})$, $u_{i}\in\mathbb{R}^{d_{u}}$, of object $o_{i}$, which is treated as the semantic center. The inner auto-encoder $\mathcal{A}^{r}_{in}=\{\Psi^{r}_{in},\Phi^{r}_{in}\}_{r=1}^{M}$ aims to estimate the semantic center $u_{i}$ of each object from each modality, together with the residual-center embedding of each modality embedding $\hat{f}^{r}_{i}$. Specifically, we construct learnable parameter encodings $e^{r}\in\mathbb{R}^{d_{u}}$ for each modality, and $\mathcal{A}^{r}_{in}$ takes them together with $\hat{f}^{r}_{i}$ to obtain the intermediate embeddings:

$$
\left\{\begin{aligned}
&\delta^{r}_{i}=\Psi^{r}_{in}(\hat{f}^{r}_{i}+e^{r})\\
&c^{r}_{i}=\Phi^{r}_{in}(\hat{f}^{r}_{i}+\delta^{r}_{i})
\end{aligned}\right.,
\tag{3}
$$

where $\Psi^{r}_{in}$ and $\Phi^{r}_{in}$ denote the encoder and decoder mapping functions between the modality and residual spaces, and $\delta^{r}_{i}$ denotes the residual-center embedding of the $r$-th modality of object $o_{i}$.
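The forward pass described by Eqs. 2 and 3 can be sketched in NumPy as follows. This is a toy illustration under loudly stated assumptions, not the authors' implementation: each encoder and decoder is replaced by a hypothetical single linear map, the aggregation function $\mathcal{U}$ is taken to be the mean over modalities, and all feature sizes are set equal ($d_0 = d_u = d$) so that the sums inside Eq. 3 are well defined.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, d = 3, 5, 8  # modalities, objects, feature size (assumes d_0 = d_u = d)

def linear(d_in, d_out):
    """Hypothetical single-layer map standing in for an encoder or decoder."""
    W = rng.normal(scale=0.1, size=(d_in, d_out))
    return lambda x: x @ W

# One outer and one inner auto-encoder per modality r.
psi_out = [linear(d, d) for _ in range(M)]
phi_out = [linear(d, d) for _ in range(M)]
psi_in = [linear(d, d) for _ in range(M)]
phi_in = [linear(d, d) for _ in range(M)]
e = rng.normal(scale=0.1, size=(M, d))  # modality encodings e^r (learnable in practice)

def rce_forward(f):
    """f: (M, N, d) basic features f_i^r. Returns u_r, f_hat, u, delta, c."""
    u_r = np.stack([psi_out[r](f[r]) for r in range(M)])      # Eq. 2, compressed features
    f_hat = np.stack([phi_out[r](u_r[r]) for r in range(M)])  # Eq. 2, reconstructed features
    u = u_r.mean(axis=0)                                      # aggregation U (assumed: mean)
    delta = np.stack([psi_in[r](f_hat[r] + e[r]) for r in range(M)])   # Eq. 3, residual
    c = np.stack([phi_in[r](f_hat[r] + delta[r]) for r in range(M)])   # Eq. 3, center estimate
    return u_r, f_hat, u, delta, c

f = rng.normal(size=(M, N, d))
u_r, f_hat, u, delta, c = rce_forward(f)
```

Each modality gets its own pair of auto-encoders, while the semantic center `u` is shared across the modalities of an object, which is what allows the later losses to compare every `c[r]` against the same target.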

3.3.2 Loss Function for RCE

To obtain a better representation of the modality embedding and the residual-center embedding, the Residual-Center Loss $\mathcal{L}_{rc}$ and the Cross-Reconstruction Loss $\mathcal{L}_{cr}$ are adopted. The constraints for each loss are derived from the data of different modalities of the same object, rather than from class labels.

Residual-Center Loss. This loss is designed to pull the estimated embeddings $\{u^{r}_{i}\}_{r=1}^{M}$ from different modalities closer together, and is defined as follows:

$$
\mathcal{L}_{rc}=\frac{1}{M}\sum\nolimits_{r=1}^{M}\left(\lVert u^{r}_{i}-u_{i}\rVert_{2}+\lVert c^{r}_{i}-u_{i}\rVert_{2}\right),
\tag{4}
$$

where $\lVert\cdot\rVert_{2}$ is the $\ell_{2}$ norm function.

Cross-Reconstruction Loss. To promote the generalization ability of the RCE, we propose the Cross-Reconstruction Loss. Motivated by [9], $\mathcal{L}_{cr}$ is defined as the distance between the results obtained by exchanging the decoders of the inner auto-encoders:

$$
\mathcal{L}_{cr}=\frac{1}{M(M-1)}\sum_{k=1}^{M}\sum_{l\neq k}\left(\lVert\Phi^{l}_{in}(\Psi^{k}_{in}(\hat{f}^{k}_{i}+\delta^{k}_{i}))-c^{k}_{i}\rVert_{2}\right),
\tag{5}
$$

where $\lVert\cdot\rVert_{2}$ is the $\ell_{2}$ norm function, as in Eq. 4.

Joint Optimization. In the residual-center embedding stage, the overall loss function is obtained by combining Eq. 4 and Eq. 5:

$$
\mathcal{L}_{RCE}=\alpha\mathcal{L}_{rc}+(1-\alpha)\mathcal{L}_{cr},
\tag{6}
$$

where $\alpha$ is the trade-off hyper-parameter.
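As a rough illustration of Eqs. 4 to 6, the losses can be written in NumPy as below. The function names are hypothetical, the per-object losses are additionally averaged over the objects in a batch (an assumption, since the equations are stated for a single object $o_i$), and the encoders and decoders are passed in as plain callables.

```python
import numpy as np

def residual_center_loss(u_r, c, u):
    """Eq. 4, averaged over objects. u_r, c: (M, N, d); u: (N, d)."""
    per_term = np.linalg.norm(u_r - u, axis=-1) + np.linalg.norm(c - u, axis=-1)  # (M, N)
    return per_term.mean()

def cross_reconstruction_loss(f_hat, delta, c, psi_in, phi_in):
    """Eq. 5, averaged over objects: decode modality k's code with every other decoder l."""
    M = f_hat.shape[0]
    total = 0.0
    for k in range(M):
        code = psi_in[k](f_hat[k] + delta[k])
        for l in range(M):
            if l != k:
                total += np.linalg.norm(phi_in[l](code) - c[k], axis=-1).mean()
    return total / (M * (M - 1))

def rce_loss(l_rc, l_cr, alpha=0.5):
    """Eq. 6: convex combination of the two losses (alpha value is an assumption)."""
    return alpha * l_rc + (1.0 - alpha) * l_cr

# Toy check with identity maps, M = 2 modalities, N = 1 object, d = 2.
identity = [lambda x: x, lambda x: x]
zeros = np.zeros((2, 1, 2))
c_toy = np.array([[[3.0, 4.0]], [[0.0, 0.0]]])
l_rc = residual_center_loss(zeros, c_toy, np.zeros((1, 2)))
l_cr = cross_reconstruction_loss(zeros, zeros, c_toy, identity, identity)
total = rce_loss(l_rc, l_cr, alpha=0.3)
```

Neither loss takes category labels as input: both compare quantities computed from different modalities of the same object, which is what makes the stage self-supervised.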

3.4 Hierarchical Structure Learning

Although the RCE module generates the residual-center embedding of different modalities, the distribution gaps between seen and unseen categories still affect retrieval under the open-set setting. As shown in Fig. 2, we propose the hierarchical structure learning module for generalization across modalities and categories. Specifically, the hierarchical hypergraph is constructed to capture the hierarchical correlations. Then, hypergraph convolution and a memory bank are adopted for embedding smoothing and distilling.

3.4.1 Hierarchical Hypergraph Construction

We adopt a hierarchical hypergraph to take full advantage of the high-order correlations between modalities, objects, and categories. A hypergraph can be represented as $\mathcal{G}=\{\mathcal{V},\mathcal{E}\}$, where $\mathcal{V}$ and $\mathcal{E}$ are the vertex set and the hyperedge set, respectively.
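For readers less familiar with hypergraphs, a common way to store $\mathcal{G}=\{\mathcal{V},\mathcal{E}\}$ is a $|\mathcal{V}|\times|\mathcal{E}|$ incidence matrix whose entry is 1 when a vertex belongs to a hyperedge. The sketch below shows only this generic representation; the helper name and the toy hyperedges are made up for illustration and are not the construction rule of the proposed method.

```python
import numpy as np

def incidence_matrix(num_vertices, hyperedges):
    """Build the |V| x |E| incidence matrix of a hypergraph.

    hyperedges: list of vertex-index collections, one per hyperedge.
    """
    H = np.zeros((num_vertices, len(hyperedges)), dtype=np.float32)
    for e_idx, members in enumerate(hyperedges):
        H[list(members), e_idx] = 1.0  # mark every vertex joined by this hyperedge
    return H

# Toy hypergraph: 6 vertices, 3 hyperedges of different sizes.
toy_edges = [{0, 1, 2}, {2, 3}, {0, 3, 4, 5}]
H = incidence_matrix(6, toy_edges)
vertex_degree = H.sum(axis=1)  # number of hyperedges each vertex belongs to
edge_degree = H.sum(axis=0)    # number of vertices in each hyperedge
```

Unlike an ordinary graph edge, a single hyperedge can join any number of vertices, which is why it suits correlations that involve more than two items at once.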

Heterogeneous Vertices. For the vertices, we first construct the centralized embedding of each modality by aligning $\hat{f}^{r}_{i}$ with the residual feature $\delta^{r}_{i}$. We then treat the centralized embeddings of each object as the heterogeneous vertices:

$$\begin{aligned} &v^{r}_{i}=\tau\hat{f}^{r}_{i}+(1-\tau)\delta^{r}_{i}\\ &\mathcal{V}=\bigcup\nolimits_{r=1}^{M}\{v^{r}_{i}\}_{i=1}^{N}\end{aligned}, \tag{7}$$

where $\tau$ denotes the hyper-parameter for centralized fusion, $\hat{f}^{r}_{i}$ and $\delta^{r}_{i}$ denote the modality embedding and residual-center embedding of object $o_{i}$ in the $r$-th modality, and $M$ and $N$ denote the numbers of modalities and object samples, respectively.
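The fusion in Eq. 7 is a per-vertex convex combination, which can be sketched as below. This is a minimal illustration under stated assumptions: the function name `build_vertices` and the nested-list layout `f_hat[r][i]` (modality $r$, object $i$) are hypothetical choices, not part of the paper.

```python
def build_vertices(f_hat, delta, tau=0.75):
    """Eq. 7: fuse modality and residual-center embeddings into vertices.

    f_hat[r][i] and delta[r][i] are equal-length lists of floats for object i
    in modality r. Returns a list of (r, i, v) tuples, one per vertex, so the
    result has M * N entries.
    """
    vertices = []
    for r, (f_mod, d_mod) in enumerate(zip(f_hat, delta)):
        for i, (f, d) in enumerate(zip(f_mod, d_mod)):
            v = [tau * a + (1.0 - tau) * b for a, b in zip(f, d)]
            vertices.append((r, i, v))
    return vertices
```

A larger $\tau$ keeps the vertex closer to the modality embedding, while a smaller $\tau$ gives more weight to the residual-center embedding.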

Hierarchical Hyperedges. The hierarchical hypergraph is composed of three types of hyperedges, including inter-modality, intra-object, and implicit-category, which can be defined as follows:

$$\begin{aligned} &\mathcal{E}_{m}=\{\mathcal{M}_{v}(r)\mid r\in\{1,\cdots,M\}\}\\ &\mathcal{E}_{o}=\{\mathcal{N}_{v}(i)\mid i\in\{1,\cdots,N\}\}\\ &\mathcal{E}_{c}=\{\mathcal{N}_{\mathrm{KNN}_{k}}(v)\mid v\in\mathcal{V}\}\end{aligned}, \tag{8}$$

where $\mathcal{M}_{v}(r)$ denotes the vertex subset that belongs to the same modality $r$, $\mathcal{N}_{v}(i)$ denotes the vertex subset that belongs to the same object $o_{i}$, and $\mathcal{N}_{\mathrm{KNN}_{k}}(v)$ denotes the $k$-nearest neighbors of vertex $v$.

In this way, $M$ inter-modality hyperedges, $N$ intra-object hyperedges, and $M\times N$ implicit-category hyperedges are constructed. Finally, we combine these three hyperedge groups to get the total hyperedges: $\mathcal{E}=\mathcal{E}_{m}\cup\mathcal{E}_{o}\cup\mathcal{E}_{c}$.
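The three hyperedge groups of Eq. 8 and the resulting incidence matrix can be sketched as follows. Several details here are assumptions made for illustration only: vertex $(r, i)$ is stored at flat index `r * N + i`, neighbors are ranked by Euclidean distance, and each implicit-category hyperedge contains the vertex itself plus its `k` nearest other vertices. The function names are hypothetical.

```python
import math


def build_hyperedges(vertices, M, N, k):
    """Build the M + N + M*N hyperedges as sets of flat vertex indices.

    vertices: list of M * N embedding vectors, vertex (r, i) at index r * N + i.
    """
    # One hyperedge per modality: all vertices of modality r.
    inter_modality = [set(r * N + i for i in range(N)) for r in range(M)]
    # One hyperedge per object: the M modality views of object i.
    intra_object = [set(r * N + i for r in range(M)) for i in range(N)]
    # One hyperedge per vertex: the vertex and its k nearest other vertices
    # (Euclidean distance is an assumption; ties are broken by index).
    implicit_category = []
    for u, vec_u in enumerate(vertices):
        others = sorted(
            (w for w in range(len(vertices)) if w != u),
            key=lambda w: (math.dist(vec_u, vertices[w]), w),
        )
        implicit_category.append({u} | set(others[:k]))
    return inter_modality + intra_object + implicit_category


def incidence_matrix(hyperedges, num_vertices):
    """H[v][e] = 1 if vertex v belongs to hyperedge e, else 0."""
    return [[1 if v in e else 0 for e in hyperedges] for v in range(num_vertices)]
```

The hyperedges are kept in a list rather than a set so that the count stays at $M + N + M\times N$ even when two vertices happen to share the same neighborhood.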

3.4.2 Hypergraph Convolution and Alignment

To leverage the high-order correlation between objects and modalities, we utilize the hypergraph convolution [10] to smooth the embedding under the hierarchical structure, which is formulated as:

$$\tilde{\mathbf{V}}=\sigma\left(\mathbf{D}^{-\frac{1}{2}}_{v}\mathbf{H}\mathbf{W}\mathbf{D}^{-1}_{e}\mathbf{H}^{\top}\mathbf{D}^{-\frac{1}{2}}_{v}\mathbf{V}\mathbf{\Theta}\right), \tag{9}$$

where $\mathbf{H}$ denotes the incidence matrix of the hypergraph, and $\mathbf{D}_{v}$ and $\mathbf{D}_{e}$ are the diagonal degree matrices for vertices and hyperedges, respectively. Following [10], $\mathbf{W}$ is the diagonal hyperedge weight matrix, $\mathbf{\Theta}$ is the learnable parameter matrix, $\mathbf{V}$ stacks the vertex embeddings, and $\sigma(\cdot)$ is a nonlinear activation function.
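A single hypergraph convolution layer in the form of Eq. 9 can be sketched in plain Python as below. This is an illustrative sketch, not the authors' code: ReLU is assumed for $\sigma$, hyperedge weights default to 1, matrices are nested lists, and the names `matmul` and `hypergraph_conv` are hypothetical. It also assumes every vertex lies in at least one hyperedge and every hyperedge is non-empty, so that no degree is zero.

```python
def matmul(A, B):
    """Dense matrix product of two nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def hypergraph_conv(H, X, Theta, w=None):
    """One layer of Eq. 9 with ReLU as the activation (assumption).

    H     : n x m incidence matrix (0/1 entries)
    X     : n x d vertex feature matrix
    Theta : d x d_out parameter matrix
    w     : optional list of m hyperedge weights (defaults to all ones)
    """
    n, m = len(H), len(H[0])
    w = w or [1.0] * m
    # Vertex degree: weighted number of hyperedges containing the vertex.
    d_v = [sum(w[e] * H[v][e] for e in range(m)) for v in range(n)]
    # Hyperedge degree: number of vertices in the hyperedge.
    d_e = [sum(H[v][e] for v in range(n)) for e in range(m)]
    # Left factor D_v^{-1/2} H W D_e^{-1}, shape n x m.
    left = [[H[v][e] * w[e] / (d_e[e] * d_v[v] ** 0.5) for e in range(m)]
            for v in range(n)]
    # Right factor H^T D_v^{-1/2}, shape m x n.
    right = [[H[v][e] / d_v[v] ** 0.5 for v in range(n)] for e in range(m)]
    out = matmul(matmul(matmul(left, right), X), Theta)
    return [[max(0.0, x) for x in row] for row in out]
```

For two vertices joined by one hyperedge, the layer averages their features before applying $\mathbf{\Theta}$, which is the smoothing effect the text refers to.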

After obtaining the structure-aware embedding $\tilde{v}^{r}_{i}$ of the 3D object $o^{r}_{i}$, we construct a memory bank $\mathcal{B}$ that contains $L$ invariant memory anchors. Following [8], we compute the activation score for each memory anchor in the memory bank by $s^{r}_{ij}=\mathcal{D}_{m}(\tilde{v}^{r}_{i},a^{r}_{j})$, where $a^{r}_{j}$ denotes the anchor and $\mathcal{D}_{m}(\cdot,\cdot)$ denotes the distance metric function. We rebuild the aligned embedding of each object by $z^{r}_{i}=\sum\nolimits_{j=1}^{L}\hat{s}^{r}_{ij}a^{r}_{j}$, $z^{r}_{i}\in\mathbb{R}^{d_{z}}$, where $\hat{s}^{r}_{ij}$ denotes the normalized activation score.

3.4.3 Loss Function for HSL

To train the hypergraph convolution and learnable memory anchors under the hierarchical structure, we adopt the self-supervised Memory Reconstruction Loss $\mathcal{L}_{mr}$ for HSL:

$$\mathcal{L}_{HSL}=\mathcal{L}_{mr}=\big\lVert\tilde{v}^{r}_{i}-z^{r}_{i}\big\rVert_{2}, \tag{10}$$

where $\lVert\cdot\rVert_{2}$ is the $\mathcal{L}_{2}$ norm function.
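The memory-bank alignment and the reconstruction loss of Eq. 10 can be sketched for one embedding as follows. The text does not specify the distance metric or the normalization, so this sketch makes two explicit assumptions: $\mathcal{D}_{m}$ is the Euclidean distance, and the scores are normalized with a softmax over negated distances so that nearer anchors receive larger weights. The function names are hypothetical.

```python
import math


def align_with_memory(v, anchors):
    """Rebuild an embedding as a weighted sum of memory anchors.

    v       : structure-aware embedding (list of floats)
    anchors : list of L anchor vectors with the same length as v
    Returns (z, weights), where weights are the normalized activation scores.
    """
    dists = [math.dist(v, a) for a in anchors]
    # Softmax over negated distances (assumption); subtracting the minimum
    # distance keeps the exponentials numerically stable.
    m = min(dists)
    exps = [math.exp(-(d - m)) for d in dists]
    total = sum(exps)
    weights = [e / total for e in exps]
    z = [sum(w * a[t] for w, a in zip(weights, anchors)) for t in range(len(v))]
    return z, weights


def memory_reconstruction_loss(v, z):
    """Eq. 10: L2 distance between the embedding and its reconstruction."""
    return math.dist(v, z)
```

In training, this loss would be minimized with respect to both the convolution parameters and the anchors; the sketch only evaluates it for fixed values.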

Table 1: Experimental results of Image2Point retrieval on the OCAB, OCNT, OCES, and OCMN datasets.

| Method | OCAB mAP↑ | OCAB NDCG↑ | OCAB ANMRR↓ | OCNT mAP↑ | OCNT NDCG↑ | OCNT ANMRR↓ | OCES mAP↑ | OCES NDCG↑ | OCES ANMRR↓ | OCMN mAP↑ | OCMN NDCG↑ | OCMN ANMRR↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SDML | 0.1489 | 0.1061 | 0.8824 | 0.0465 | 0.0316 | 0.9657 | 0.0942 | 0.0442 | 0.9486 | 0.0578 | 0.0248 | 0.9735 |
| CMCL | 0.1702 | 0.1520 | 0.8565 | 0.0623 | 0.0332 | 0.9665 | 0.0991 | 0.0477 | 0.9444 | 0.1175 | 0.0917 | 0.9001 |
| MMSAE | 0.1218 | 0.0802 | 0.9093 | 0.0410 | 0.0191 | 0.9817 | 0.0810 | 0.0362 | 0.9567 | 0.0571 | 0.0235 | 0.9746 |
| PROSER | 0.1119 | 0.0446 | 0.9386 | 0.0426 | 0.0171 | 0.9752 | 0.0968 | 0.0402 | 0.9641 | 0.0523 | 0.0133 | 0.9806 |
| HGM2R | 0.1367 | 0.0925 | 0.8978 | 0.1812 | 0.1072 | 0.8184 | 0.2184 | 0.1126 | 0.8215 | 0.0988 | 0.0789 | 0.9282 |
| Ours | 0.2220 | 0.1714 | 0.7947 | 0.2861 | 0.1585 | 0.7292 | 0.4004 | 0.1835 | 0.6378 | 0.1549 | 0.1488 | 0.8625 |

Table 2: Experimental results of Point2Image retrieval on the OCAB, OCNT, OCES, and OCMN datasets.

| Method | OCAB mAP↑ | OCAB NDCG↑ | OCAB ANMRR↓ | OCNT mAP↑ | OCNT NDCG↑ | OCNT ANMRR↓ | OCES mAP↑ | OCES NDCG↑ | OCES ANMRR↓ | OCMN mAP↑ | OCMN NDCG↑ | OCMN ANMRR↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SDML | 0.1636 | 0.1367 | 0.8629 | 0.0393 | 0.0191 | 0.9820 | 0.0811 | 0.0413 | 0.9654 | 0.0682 | 0.0475 | 0.9512 |
| CMCL | 0.1628 | 0.1343 | 0.8594 | 0.0394 | 0.0203 | 0.9786 | 0.0815 | 0.0420 | 0.9598 | 0.1219 | 0.1116 | 0.9074 |
| MMSAE | 0.0821 | 0.0657 | 0.9211 | 0.0347 | 0.0191 | 0.9846 | 0.0891 | 0.0413 | 0.9496 | 0.0460 | 0.0278 | 0.9699 |
| PROSER | 0.0708 | 0.0555 | 0.9579 | 0.0387 | 0.0182 | 0.9834 | 0.0885 | 0.0406 | 0.9492 | 0.0693 | 0.0336 | 0.9653 |
| HGM2R | 0.1553 | 0.1613 | 0.8593 | 0.1452 | 0.0742 | 0.8996 | 0.2260 | 0.1265 | 0.8430 | 0.1006 | 0.0747 | 0.9305 |
| Ours | 0.3013 | 0.3202 | 0.7048 | 0.2811 | 0.1377 | 0.7316 | 0.4471 | 0.1914 | 0.5936 | 0.1277 | 0.1117 | 0.8994 |

4 Experiments

4.1 Experimental Settings

OCMR Datasets. We generate four open-set 3D cross-modal retrieval (OCMR) datasets, OCAB, OCNT, OCES, and OCMN, based on the public datasets ABO [11], NTU [12], ESB [13], and ModelNet40 [14], respectively. Each dataset is split into seen and unseen categories, and each object has three modalities: multi-view, voxel, and point cloud.

Implementation Details. In our experiments, we use all three modalities of 3D objects. We set $\alpha=0.5$ in Eq. 6 and $\tau=0.75$ in Eq. 7. The two modules are trained separately, with 40 epochs at learning rate $lr=0.1$ and 120 epochs at $lr=0.001$. The random seed is fixed to 2022 for all experiments.

4.2 Retrieval Performance

Compared Methods. As no methods have been specifically designed for open-set 3D cross-modal retrieval to date, we adapt current state-of-the-art methods from two tasks for comparison: closed-set 3D cross-modal retrieval (SDML [15], CMCL [1], MMSAE [16]) and open-set multi-modal recognition or retrieval (PROSER [17], HGM2R [8]).

Evaluation Metrics. For a fair comparison, we employ commonly used retrieval metrics, including Mean Average Precision (mAP), Normalized Discounted Cumulative Gain (NDCG), Average Normalized Modified Retrieval Rank (ANMRR), and the Precision-Recall Curve (PR-Curve). For mAP and NDCG, higher scores are better; for ANMRR, lower scores are better. We construct $6$ query-target types for cross-modal retrieval from the three modalities: Image2Point (I2P), Image2Voxel (I2V), Point2Image (P2I), Point2Voxel (P2V), Voxel2Image (V2I), and Voxel2Point (V2P).
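For reference, mAP and NDCG can be computed from ranked relevance lists as sketched below. These are the common textbook definitions and only an assumption about the evaluation protocol; the exact variant used for the reported numbers (for example, cut-off depth or gain function) is not specified here. ANMRR is omitted, and the function names are hypothetical.

```python
import math


def average_precision(relevance):
    """AP of one ranked list; relevance[t] is 1 if the item at rank t+1 is relevant."""
    hits, score = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / rank  # precision at this rank
    return score / hits if hits else 0.0


def mean_average_precision(rankings):
    """mAP: mean of AP over all queries."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


def ndcg(relevance):
    """NDCG with a log2 rank discount, normalized by the ideal ordering."""
    dcg = sum(rel / math.log2(rank + 1)
              for rank, rel in enumerate(relevance, start=1))
    ideal = sorted(relevance, reverse=True)
    idcg = sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0
```

For a query type such as Image2Point, each ranked list would hold the relevance of the retrieved point clouds for one image query.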

Comparison Analysis. We evaluate open-set 3D cross-modal retrieval on the four datasets. Quantitative results of the SRCR framework and the compared methods are provided in Tab. 1 and Tab. 2. The proposed method outperforms the other methods on all four datasets. We also provide the Precision-Recall (PR) curves of the proposed SRCR framework and the compared methods in Fig. 3, where a larger area under the curve indicates better performance. Here too, our method outperforms all compared methods. This indicates that, through the residual-center embedding and hierarchical structure learning, the proposed method can overcome modality gaps while handling open-set categories.

Table 3: Ablation studies on the OCNT dataset.

| Group | Variant | mAP↑ | NDCG↑ | ANMRR↓ |
|---|---|---|---|---|
| On RCE | Direct Center | 0.0362 | 0.0194 | 0.9801 |
| | Category Center | 0.0433 | 0.0230 | 0.9772 |
| On HSL | HSL w/o $\mathcal{E}_{m}$ | 0.1511 | 0.1081 | 0.8626 |
| | HSL w/o $\mathcal{E}_{m}$&$\mathcal{E}_{o}$ | 0.1474 | 0.1084 | 0.8636 |
| | GCN-based HSL | 0.2575 | 0.1640 | 0.7374 |
| | MLP-based HSL | 0.1553 | 0.1101 | 0.8380 |
| | RCE+HSL | 0.2861 | 0.1585 | 0.7292 |

Fig. 3: Precision-recall curve comparison of Image2Point retrieval on the four datasets: (a) OCAB, (b) OCNT, (c) OCES, (d) OCMN.

4.3 Ablation Study

We conduct ablation studies to verify the effectiveness of the proposed modules. For the residual-center embedding module, we compare the proposed RCE with the Direct Center and Category Center variants. Direct Center denotes a network that uses an auto-encoder to generate the center embedding directly instead of residually, and Category Center denotes a network that generates the category center rather than the semantic center of each object. For the hierarchical structure learning module, we compare the proposed HSL with naive structures (HSL w/o $\mathcal{E}_{m}$ and HSL w/o $\mathcal{E}_{m}$&$\mathcal{E}_{o}$), where “w/o” denotes “without”. We also replace the hypergraph-based correlation learning with an MLP and a GCN. As shown in Tab. 3, the combination of RCE and HSL yields the best mAP and ANMRR, while the GCN-based variant reaches a slightly higher NDCG (0.1640 vs. 0.1585); every other substitution of the embedding or the learning approach leads to a decline on all three metrics. These results indicate that the proposed modules can effectively obtain the center embedding of objects and generalize it to unseen categories.

5 Conclusion

In this paper, we propose the Structure-Aware Residual-Center Representation (SRCR) framework for self-supervised open-set 3D cross-modal retrieval. To address the center deviation caused by category distribution differences, we learn a Residual-Center Embedding (RCE) for each object with nested auto-encoders, rather than directly mapping objects to the modality or category centers. In addition, we construct a heterogeneous hypergraph structure based on hierarchical inter-modality, intra-object, and implicit-category correlations, and perform Hierarchical Structure Learning (HSL) to leverage the high-order correlations among objects for generalization. Extensive experiments and ablation studies on four benchmarks demonstrate the superiority of our proposed framework compared to state-of-the-art methods.

References

  • [1] Longlong Jing, Elahe Vahdani, Jiaxing Tan, and Yingli Tian, “Cross-modal Center Loss for 3D Cross-Modal Retrieval,” in CVPR, 2021, pp. 3142–3151.
  • [2] Bokun Wang, Yang Yang, Xing Xu, Alan Hanjalic, and Heng Tao Shen, “Adversarial Cross-modal Retrieval,” in ACMMM, 2017, pp. 154–162.
  • [3] Yanglin Feng, Hongyuan Zhu, Dezhong Peng, Xi Peng, and Peng Hu, “RONO: Robust Discriminative Learning With Noisy Labels for 2D-3D Cross-Modal Retrieval,” in CVPR, 2023, pp. 11610–11619.
  • [4] Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu, “Deep Canonical Correlation Analysis,” in ICML. PMLR, 2013, pp. 1247–1255.
  • [5] Weiran Wang, Raman Arora, Karen Livescu, and Jeff Bilmes, “On Deep Multi-View Representation Learning,” in ICML. PMLR, 2015, pp. 1083–1092.
  • [6] Sagar Vaze, Kai Han, Andrea Vedaldi, and Andrew Zisserman, “Open-Set Recognition: A Good Closed-Set Classifier is All You Need?,” arXiv preprint arXiv:2110.06207, 2021.
  • [7] Guangyao Chen, Peixi Peng, Xiangqian Wang, and Yonghong Tian, “Adversarial Reciprocal Points Learning for Open Set Recognition,” TPAMI, vol. 44, no. 11, pp. 8065–8081, 2021.
  • [8] Yifan Feng, Shuyi Ji, Yu-Shen Liu, Shaoyi Du, Qionghai Dai, and Yue Gao, “Hypergraph-based Multi-Modal Representation for Open-Set 3D Object Retrieval,” TPAMI, no. 01, pp. 1–18, 2023.
  • [9] Fangxiang Feng, Xiaojie Wang, and Ruifan Li, “Cross-Modal Retrieval with Correspondence AutoEncoder,” in ACMMM, 2014, pp. 7–16.
  • [10] Yue Gao, Yifan Feng, Shuyi Ji, and Rongrong Ji, “HGNN+: General Hypergraph Neural Networks,” TPAMI, vol. 45, no. 3, pp. 3181–3199, 2022.
  • [11] Jasmine Collins, Shubham Goel, Kenan Deng, Achleshwar Luthra, Leon Xu, Erhan Gundogdu, Xi Zhang, Tomas F Yago Vicente, Thomas Dideriksen, Himanshu Arora, et al., “ABO: Dataset and Benchmarks for Real-World 3D Object Understanding,” in CVPR, 2022, pp. 21126–21136.
  • [12] Ding-Yun Chen, Xiao-Pei Tian, Yu-Te Shen, and Ming Ouhyoung, “On Visual Similarity based 3D Model Retrieval,” in Computer graphics forum. Wiley Online Library, 2003, pp. 223–232.
  • [13] Subramaniam Jayanti, Yagnanarayanan Kalyanaraman, Natraj Iyer, and Karthik Ramani, “Developing an Engineering Shape Benchmark for CAD Models,” Computer-Aided Design, vol. 38, no. 9, pp. 939–953, 2006.
  • [14] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao, “3D Shapenets: A Deep Representation for Volumetric Shapes,” in CVPR, 2015, pp. 1912–1920.
  • [15] Peng Hu, Liangli Zhen, Dezhong Peng, and Pei Liu, “Scalable Deep Multimodal Learning for Cross-Modal Retrieval,” in SIGIR, 2019, pp. 635–644.
  • [16] Yiling Wu, Shuhui Wang, and Qingming Huang, “Multi-Modal Semantic AutoEncoder for Cross-Modal Retrieval,” Neurocomputing, vol. 331, pp. 165–175, 2019.
  • [17] Da-Wei Zhou, Han-Jia Ye, and De-Chuan Zhan, “Learning Placeholders for Open-Set Recognition,” in CVPR, 2021, pp. 4401–4410.