Structure-Aware Residual-Center Representation for
Self-Supervised Open-Set 3D Cross-Modal Retrieval

Abstract

Existing methods of 3D cross-modal retrieval heavily lean on category distribution priors within the training set, which diminishes their efficacy when tasked with unseen categories under open-set environments. To tackle this problem, we propose the Structure-Aware Residual-Center Representation (SRCR) framework for self-supervised open-set 3D cross-modal retrieval. To address the center deviation due to category distribution differences, we utilize the Residual-Center Embedding (RCE) for each object by nested auto-encoders, rather than directly mapping them to the modality or category centers. Besides, we perform the Hierarchical Structure Learning (HSL) approach to leverage the high-order correlations among objects for generalization, by constructing a heterogeneous hypergraph structure based on hierarchical inter-modality, intra-object, and implicit-category correlations. Extensive experiments and ablation studies on four benchmarks demonstrate the superiority of our proposed framework compared to state-of-the-art methods.

Index Terms— 3D Object Retrieval, Cross-Modal Retrieval, Open-Set Learning, Self-Supervised Learning, Hypergraph

1 Introduction

The proliferation of multimedia data on the Internet, including videos, images, and text, has sparked growing interest in cross-modal retrieval. Within this area, 3D cross-modal retrieval (3DCMR) has garnered increasing attention due to the inherently diverse modalities of 3D data [1] and its relevance to domains such as robotics and medicine.

A typical 3D cross-modal retrieval task aims to retrieve 3D data in one modality given queries from a different modality. To address the heterogeneity gap [2] between modalities, a widely adopted strategy is to seek a function that maps data samples from diverse modalities into a unified global representation space [1, 3], which is called the center.

Fig. 1: Illustration of the proposed SRCR. Given 3D objects of unseen categories represented by different modalities, our method generates the residual-center embeddings for each modality of each object. Then unified center representations are generated via hierarchical structure learning for cross-modal retrieval with unseen categories generalization.

Current methods of constructing such mapping can be broadly categorized into two approaches. One straightforward solution for this task is to construct complex nonlinear transformations [4, 5] that map two types of pre-trained features into a shared space. The alternative approach employs adversarial loss to learn category-related central embedding [1, 3] through end-to-end training. However, both methods exhibit a pronounced dependence on the prior distribution of category spaces within the training sets, which leads to substantial representational biases when confronted with objects of unseen categories. Furthermore, dependency on training labels in adversarial loss also complicates the deployment of 3D cross-modal retrieval.

To overcome the aforementioned challenges, we propose the Structure-Aware Residual-Center Representation (SRCR) framework for the self-supervised open-set 3D cross-modal retrieval task, as shown in Fig. 1. On the one hand, to overcome the center deviation due to category distribution differences, we utilize the residual-center embedding for each object by nested auto-encoders, rather than directly mapping them to the modality or category center. On the other hand, we perform a hierarchical structure learning approach to utilize the high-order correlations among objects for generalization, by constructing a heterogeneous hypergraph structure based on hierarchical inter-modality, intra-object, and implicit-category correlations. Our contributions are summarized as follows:

• We introduce a practical open-set setting for 3D cross-modal retrieval and generate four datasets for benchmarking of downstream 3D cross-modal tasks.

• We propose the Structure-Aware Residual-Center Representation (SRCR) framework for the open-set 3D cross-modal retrieval, including the Residual-Center Embedding (RCE) and Hierarchical Structure Learning (HSL) modules, which are designed to overcome the modality diversion accentuated by unseen categories distribution.

• We propose a hierarchical hypergraph structure to capture the high-order correlations among objects, under the guidance of hierarchical inter-modality, intra-object, and implicit-category correlations.

• The proposed framework significantly outperforms the state-of-the-art 3D cross-modal retrieval methods under the open-set setting.

2 Related Work

2.1 Cross-Modal Retrieval

Existing methods usually construct a mapping function into a unified common space to overcome the heterogeneity gap [2] from different modalities. These approaches could be roughly classified into projection-based [4] and discrimination-based [1, 3] methods. While such methods excel under the closed-set assumption, their reliance on training category distribution limits their generalization in real-world, open-set environments.

2.2 Open-Environment Learning

Most current methods of open-environment learning are designed for open-set recognition [6, 7], which is usually used to detect whether the sample belongs to the seen categories or not. While some methods have succeeded in open-set 3D multi-modal retrieval [8], the complexity and inherent disparities between different modalities still present considerable challenges in open-set cross-modal retrieval.

3 Methodology

3.1 Problem Setup

Given $N$ 3D objects $\{o_{i}\}_{i=1}^{N}$, each represented by $M$ modalities $o_{i}=\{o^{r}_{i}\}_{r=1}^{M}$, the goal of 3D cross-modal retrieval (3DCMR) is to develop a model using the training set $\mathcal{D}_{trn}=\{(o_{i},y_{i})\}^{L}_{i=1}$, and then employ it to retrieve, for each object in the query set $\mathcal{D}_{q}=\{(o^{q}_{i},\hat{y}_{i})\}^{Q}_{i=1}$, similar objects from the target set $\mathcal{D}_{t}=\{(o^{t}_{i},\hat{y}_{i})\}^{T}_{i=1}$, where the query and target objects are represented in different modalities ($t\neq q$). Here, $L$, $Q$, and $T$ denote the number of samples in the training, query, and target sets, respectively. The query set and target set together form the testing set $\mathcal{D}_{tes}=\{\mathcal{D}_{q},\mathcal{D}_{t}\}$. The labels $y_{i}\in\mathcal{Y}=\{c_{j}\}^{Y}_{j=1}$ and $\hat{y}_{i}\in\mathcal{\hat{Y}}=\{\hat{c}_{j}\}^{\hat{Y}}_{j=1}$ belong to the category spaces of the training set and testing set, where $Y$ and $\hat{Y}$ are the numbers of categories in the training and testing sets, respectively.

The traditional 3DCMR task is based on the closed-set assumption, which means that all categories of objects in the testing set $\mathcal{D}_{tes}=\{\mathcal{D}_{q},\mathcal{D}_{t}\}$ have been seen in the training set $\mathcal{D}_{trn}$. The category spaces of the training set and testing set are therefore the same, i.e., $\mathcal{Y}=\mathcal{\hat{Y}}$.

Different from the traditional closed-set assumption, we consider a more practical condition that the testing set consists entirely of categories not encountered in the training set. We term this task as Open-Set 3D Cross-Modal Retrieval. Under this circumstance, $\mathcal{D}_{trn}$ and $\mathcal{D}_{tes}$ have their individual distributions, which means $\mathcal{Y}\neq\mathcal{\hat{Y}}$ . This task seeks to minimize the expected risk:

\begin{split}f^{*}=\mathop{\arg\min}\limits_{f\in\mathcal{H}}\;&\mathbb{E}_{(D_{i},D_{j})\sim(\mathcal{D}_{q},\mathcal{D}_{t})}\left[\mathbb{I}(\hat{y}_{i}\neq\hat{y}_{j})\,e^{-\mathbb{D}(f(o^{q}_{i}),f(o^{t}_{j}))}\right.\\ &\left.+\,\mathbb{I}(\hat{y}_{i}=\hat{y}_{j})\left(1-e^{-\mathbb{D}(f(o^{q}_{i}),f(o^{t}_{j}))}\right)\right]\end{split},

(1)

where $D_{i}=(o^{q}_{i},\hat{y}_{i})$ and $D_{j}=(o^{t}_{j},\hat{y}_{j})$ are samples drawn from the query set $\mathcal{D}_{q}$ and target set $\mathcal{D}_{t}$ . $\mathbb{I}(\cdot)$ is the indicator function, which returns $1$ if the expression is true and $0$ otherwise. $f:=o^{r}_{i}\rightarrow z_{i}$ is the function that maps the 3D object $o^{r}_{i}$ represented in different modalities into the same embedding $z_{i}\in\mathbb{R}^{d}$ . $\mathcal{H}$ is the hypothesis space of function $f(\cdot)$ and $\mathbb{D}(z_{i},z_{j})$ is a distance metric function.
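To make the two terms of Eq. 1 concrete, the following minimal sketch averages the per-pair risk over a finite query set and target set. It assumes the embeddings $f(o)$ have already been computed and uses the Euclidean distance for $\mathbb{D}$; the function names and the toy data are illustrative assumptions, not part of the proposed method.

```python
import math

def euclidean(a, b):
    """Euclidean distance, an assumed choice for the metric D in Eq. (1)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def empirical_risk(query, target, dist=euclidean):
    """Average the per-pair term of Eq. (1) over all query-target pairs.

    query, target: lists of (embedding, label) pairs, where each embedding
    stands for f(o) produced by some model.
    """
    total = 0.0
    for zq, yq in query:
        for zt, yt in target:
            closeness = math.exp(-dist(zq, zt))  # in (0, 1], equals 1 at distance 0
            # Different categories: being close is penalized.
            # Same category: being far apart is penalized.
            total += closeness if yq != yt else 1.0 - closeness
    return total / (len(query) * len(target))

# Toy embeddings (hypothetical values) for two unseen categories.
query = [([0.0, 0.0], "chair"), ([5.0, 5.0], "lamp")]
target_good = [([0.0, 0.1], "chair"), ([5.0, 5.1], "lamp")]  # matches lie nearby
target_bad = [([5.0, 5.1], "chair"), ([0.0, 0.1], "lamp")]   # matches lie far away

risk_good = empirical_risk(query, target_good)
risk_bad = empirical_risk(query, target_bad)
```

In this toy case the well-aligned target set yields a lower risk than the misaligned one, which is the behavior the objective rewards.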

3.2 Framework Architecture

The architecture of SRCR, as illustrated in Fig. 2, is composed of two modules: Residual-Center Embedding (RCE) and Hierarchical Structure Learning (HSL). Given basic features of different modalities extracted by commonly used networks, the Residual-Center Embedding module generates the residual-center embeddings for each object, rather than directly mapping them to the modality or category center. Then, in the Hierarchical Structure Learning stage, the hierarchical hypergraph structure is constructed based on the inter-modality, intra-object, and implicit-category correlations. Guided by this structure, the combination of hypergraph convolution and memory bank leverages the high-order correlations between seen and unseen categories and different modalities. Finally, the aligned embedding of each modality is generated for cross-modal retrieval or other downstream tasks.

3.3 Residual-Center Embedding

In order to improve category generalization while projecting into the unified space, the residual-center embedding module is developed. Specifically, the RCE consists of two nested auto-encoders and takes the basic features of different modalities as input. The outer auto-encoder $\mathcal{A}^{r}_{out}$ encodes the basic features into a latent space and pulls them together into a unified embedding. The inner auto-encoder $\mathcal{A}^{r}_{in}$ encodes the modality embeddings from the outer auto-encoder to the residual space, which transforms the embedding between the modality space and the unified space.

3.3.1 Residual Learning

Given $N$ 3D objects $\{o_{i}\}^{N}_{i=1}$ and the basic features $\{f^{r}_{i}\}^{M}_{r=1}$ of each object, the outer auto-encoder $\mathcal{A}^{r}_{out}=\{\Psi^{r}_{out},\Phi^{r}_{out}\}_{r=1}^{M}$, as shown in Fig. 2, compresses the basic features into a unified space $\mathbb{S}_{u}$ and reconstructs them in reverse for better representation, which can be defined as follows:

\left\{\begin{aligned} u^{r}_{i}=&\Psi^{r}_{out}(f^{r}_{i})\\ \hat{f}^{r}_{i}=&\Phi^{r}_{out}(u^{r}_{i})\\ \end{aligned}\right.,

(2)

where $\Psi^{r}_{out}:=\mathbb{S}_{r}\rightarrow\mathbb{S}_{u}$ is the encoder that maps the $r$-th modality space $\mathbb{S}_{r}$ into the unified space $\mathbb{S}_{u}$, and $\Phi^{r}_{out}:=\mathbb{S}_{u}\rightarrow\mathbb{S}_{r}$ is the decoder that maps the features from the unified space $\mathbb{S}_{u}$ back to $\mathbb{S}_{r}$. $u^{r}_{i}\in\mathbb{R}^{d_{u}}$ and $\hat{f}^{r}_{i}\in\mathbb{R}^{d_{0}}$ denote the compressed features and reconstructed features of each modality, respectively.

An aggregation function $\mathcal{U}$ is adopted to generate the unified embedding $u_{i}=\mathcal{U}(\{u^{r}_{i}\}_{r=1}^{M}),u_{i}\in\mathbb{R}^{d_{u}}$ of object $o_{i}$, which is treated as the semantic center. The inner auto-encoder $\mathcal{A}^{r}_{in}=\{\Psi^{r}_{in},\Phi^{r}_{in}\}_{r=1}^{M}$ aims to estimate the semantic center $u_{i}$ of each object and the residual-center embedding of each modality embedding $\hat{f}^{r}_{i}$. Specifically, we construct learnable parameter encodings $e^{r}\in\mathbb{R}^{d_{u}}$ for each modality, and $\mathcal{A}^{r}_{in}$ combines them with $\hat{f}^{r}_{i}$ to obtain the intermediate embeddings:

\left\{\begin{aligned} &\delta^{r}_{i}=\Psi^{r}_{in}(\hat{f}^{r}_{i}+e^{r})\\ &c^{r}_{i}=\Phi^{r}_{in}(\hat{f}^{r}_{i}+\delta^{r}_{i})\end{aligned}\right.,

(3)

where $\Psi^{r}_{in}$ and $\Phi^{r}_{in}$ denote the encoder and decoder mapping functions between the modality and residual spaces, $\delta^{r}_{i}$ denotes the residual-center embedding of the $r$-th modality of object $o_{i}$, and $c^{r}_{i}$ denotes the corresponding center estimate decoded from the $r$-th modality.

3.3.2 Loss Function for RCE

To obtain a better representation of the modality embedding and residual-center embedding, the Residual-Center Loss $\mathcal{L}_{rc}$ and Cross-Reconstruction Loss $\mathcal{L}_{cr}$ are adopted here. The constraints for each loss are derived from the data of different modalities of the same object, rather than from class labels.

Residual-Center Loss. The loss is designed to pull the estimated embeddings $\{u^{r}_{i}\}_{r=1}^{M}$ from different modalities closer together, and is defined as follows:

\mathcal{L}_{rc}=\frac{1}{M}\sum\nolimits_{r=1}^{M}\left(\lVert u^{r}_{i}-u_{i}\rVert_{2}+\lVert c^{r}_{i}-u_{i}\rVert_{2}\right),

(4)

where $\lVert\cdot\rVert_{2}$ denotes the $\ell_{2}$ norm.

Cross-Reconstruction Loss. To promote the generalization ability of the RCE, we propose the cross-reconstruction loss. Motivated by [9], $\mathcal{L}_{cr}$ is defined as the distance between the results obtained by exchanging the decoders of the inner auto-encoder:

\mathcal{L}_{cr}=\frac{1}{M(M-1)}\sum_{k=1}^{M}\sum_{l\neq k}\lVert\Phi^{l}_{in}(\Psi^{k}_{in}(\hat{f}^{k}_{i}+\delta^{k}_{i}))-c^{k}_{i}\rVert_{2},

(5)

where $\lVert\cdot\rVert_{2}$ denotes the $\ell_{2}$ norm.

Joint Optimization. In the residual-center embedding stage, the overall loss function combines Eq. 4 and Eq. 5:

\mathcal{L}_{RCE}=\alpha\mathcal{L}_{rc}+(1-\alpha)\mathcal{L}_{cr},

(6)

where $\alpha$ is the hyper-parameter for trade-off.
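The RCE forward pass and losses of Eqs. 2 to 6 can be traced for a single object with the sketch below. Several points are assumptions made only for illustration: every space shares one dimension so that the additions in Eq. 3 are well defined, the aggregation $\mathcal{U}$ is taken to be the mean, and the encoders and decoders are passed in as arbitrary callables (identity maps in the toy call). The function name `rce_losses` is hypothetical, and Eq. 5 is implemented exactly as printed.

```python
import math

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def l2(a, b):
    """l2 norm of the difference a - b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def rce_losses(feats, psi_out, phi_out, psi_in, phi_in, enc, alpha=0.5):
    """Return (L_rc, L_cr, L_RCE) for one object with M modalities.

    feats[r] is the basic feature f^r_i, psi_*/phi_* are lists of
    per-modality callables, and enc[r] is the modality encoding e^r.
    """
    M = len(feats)
    u_r = [psi_out[r](feats[r]) for r in range(M)]                  # Eq. (2), encoder
    f_hat = [phi_out[r](u_r[r]) for r in range(M)]                  # Eq. (2), decoder
    u = mean(u_r)                                                   # assumed aggregation U
    delta = [psi_in[r](add(f_hat[r], enc[r])) for r in range(M)]    # Eq. (3), residual
    c = [phi_in[r](add(f_hat[r], delta[r])) for r in range(M)]      # Eq. (3), center

    # Eq. (4): pull modality embeddings and center estimates toward u.
    l_rc = sum(l2(u_r[r], u) + l2(c[r], u) for r in range(M)) / M

    # Eq. (5): decode modality k's inner code with every other decoder l.
    l_cr = 0.0
    for k in range(M):
        code = psi_in[k](add(f_hat[k], delta[k]))
        for l in range(M):
            if l != k:
                l_cr += l2(phi_in[l](code), c[k])
    l_cr /= M * (M - 1)

    return l_rc, l_cr, alpha * l_rc + (1 - alpha) * l_cr            # Eq. (6)

# Toy call: two modalities with identical features and identity maps.
identity = lambda v: list(v)
maps = [identity, identity]
feats = [[3.0, 4.0], [3.0, 4.0]]
zero_enc = [[0.0, 0.0], [0.0, 0.0]]
l_rc, l_cr, l_rce = rce_losses(feats, maps, maps, maps, maps, zero_enc, alpha=0.5)
```

With trained networks in place of the identity maps, the same function would give the per-object value of $\mathcal{L}_{RCE}$; in practice it would be averaged over a batch.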

3.4 Hierarchical Structure Learning

Although the RCE module generates the residual-center embedding of different modalities, the distribution gaps between seen and unseen categories still affect the retrieval under the open-set setting. As shown in Fig. 2, we propose the hierarchical structure learning module for generalization across modalities and categories. Specifically, the hierarchical hypergraph is constructed to capture the hierarchical correlations. Then, the hypergraph convolution and memory bank are adopted for embedding smoothing and distilling.

3.4.1 Hierarchical Hypergraph Construction

We adopt a hierarchical hypergraph to take the most advantage of high-order correlations between modalities, objects, and categories. A hypergraph can be represented as $\mathcal{G}=\{\mathcal{V},\mathcal{E}\}$ , where $\mathcal{V}$ and $\mathcal{E}$ are the vertex set and the hyperedge set, respectively.

Heterogeneous Vertices. For the vertices, we first construct the centralized embedding of each modality by aligning $\hat{f}^{r}_{i}$ with the residual feature $\delta^{r}_{i}$ , then we treat centralized embeddings of each object as the heterogeneous vertices.

\begin{aligned} v^{r}_{i}&=\tau\hat{f}^{r}_{i}+(1-\tau)\delta^{r}_{i}\\ \mathcal{V}&=\bigcup\nolimits_{r=1}^{M}\{v^{r}_{i}\}_{i=1}^{N}\end{aligned},

(7)

where $\tau$ denotes the hyper-parameter for centralized fusion, $\hat{f}^{r}_{i}$ and $\delta^{r}_{i}$ denote the modality embedding and residual-center embedding of object $o_{i}$ in the $r$-th modality, and $M$ and $N$ denote the numbers of modalities and object samples, respectively.

Hierarchical Hyperedges. The hierarchical hypergraph is composed of three types of hyperedges, including inter-modality, intra-object, and implicit-category, which can be defined as follows:

\begin{aligned} \mathcal{E}_{m}&=\{\mathcal{M}_{v}(r)\mid r\in\{1,\cdots,M\}\}\\ \mathcal{E}_{o}&=\{\mathcal{N}_{v}(i)\mid i\in\{1,\cdots,N\}\}\\ \mathcal{E}_{c}&=\{\mathcal{N}_{\mathrm{KNN}_{k}}(v)\mid v\in\mathcal{V}\}\end{aligned},

(8)

where $\mathcal{M}_{v}(r)$ denotes the vertex subset that belongs to the same modality $r$, $\mathcal{N}_{v}(i)$ denotes the vertex subset that belongs to the same object $o_{i}$, and $\mathcal{N}_{\mathrm{KNN}_{k}}(v)$ denotes the $k$-nearest neighbors of vertex $v$.

In this way, $M$ inter-modality hyperedges, $N$ intra-object hyperedges and $M\times N$ implicit-category hyperedges are constructed. Finally, we combine these three hyperedge groups to get the total hyperedges: $\mathcal{E}=\mathcal{E}_{m}\cup\mathcal{E}_{o}\cup\mathcal{E}_{c}$ .
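The construction in Eqs. 7 and 8 can be sketched as an incidence matrix with $M\times N$ rows and $M+N+M\times N$ columns. The vertex ordering (`r * N + i`), the Euclidean distance for the nearest-neighbor search, and the choice to include each vertex in its own implicit-category hyperedge are assumptions of this sketch; the helper names `fuse` and `build_incidence` are hypothetical.

```python
import math

def fuse(f_hat, delta, tau=0.75):
    """Eq. (7): centralized embedding v = tau * f_hat + (1 - tau) * delta."""
    return [tau * a + (1 - tau) * b for a, b in zip(f_hat, delta)]

def build_incidence(vertices, M, N, k):
    """Build H (|V| x |E|), where vertices[r * N + i] holds v^r_i."""
    n_v = M * N
    edges = []
    # Inter-modality hyperedges: one per modality, linking all its vertices.
    for r in range(M):
        edges.append({r * N + i for i in range(N)})
    # Intra-object hyperedges: one per object, linking its M modality vertices.
    for i in range(N):
        edges.append({r * N + i for r in range(M)})
    # Implicit-category hyperedges: each vertex with its k nearest neighbors.
    for v in range(n_v):
        others = sorted((w for w in range(n_v) if w != v),
                        key=lambda w: math.dist(vertices[v], vertices[w]))
        edges.append({v, *others[:k]})
    return [[1 if v in e else 0 for e in edges] for v in range(n_v)]

# Toy data (hypothetical values): 2 modalities, 3 objects, 2-D embeddings.
M, N, k = 2, 3, 1
f_hat = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0],   # modality 0, objects 0..2
         [0.2, 0.0], [1.2, 0.0], [5.2, 5.0]]   # modality 1, objects 0..2
delta = [[0.0, 0.4]] * (M * N)                 # toy residual-center embeddings
vertices = [fuse(f, d) for f, d in zip(f_hat, delta)]
H = build_incidence(vertices, M, N, k)
```

Each column of `H` is one hyperedge, so the column sums give the hyperedge degrees: $N$ for inter-modality, $M$ for intra-object, and $k+1$ for implicit-category hyperedges under the assumptions above.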

3.4.2 Hypergraph Convolution and Alignment

To leverage the high-order correlation between objects and modalities, we utilize the hypergraph convolution [10] to smooth the embedding under the hierarchical structure, which is formulated as:

\tilde{\mathbf{V}}=\sigma\left(\mathbf{D}_{v}^{-\frac{1}{2}}\mathbf{H}\mathbf{W}\mathbf{D}_{e}^{-1}\mathbf{H}^{\top}\mathbf{D}_{v}^{-\frac{1}{2}}\mathbf{V}\mathbf{\Theta}\right),

(9)

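where $\mathbf{H}$ denotes the incidence matrix of the hypergraph, and $\mathbf{D}_{v}$ and $\mathbf{D}_{e}$ are the diagonal degree matrices for vertices and hyperedges, respectively. Following the standard formulation of hypergraph convolution [10], $\mathbf{V}$ is the matrix of vertex embeddings, $\mathbf{W}$ is the diagonal hyperedge weight matrix, $\mathbf{\Theta}$ is the learnable parameter matrix, and $\sigma(\cdot)$ is a nonlinear activation function.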
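A minimal sketch of the propagation in Eq. 9 is given below, written with plain Python lists for clarity rather than speed. It assumes unit hyperedge weights when none are supplied and uses ReLU as the activation $\sigma$; both are illustrative choices, and the function name `hypergraph_conv` is hypothetical. Every vertex and every hyperedge must have a nonzero degree.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def hypergraph_conv(H, V, Theta, w=None):
    """One hypergraph convolution layer following Eq. (9).

    H: incidence matrix (|V| x |E|), V: vertex embeddings (|V| x d_in),
    Theta: parameter matrix (d_in x d_out), w: optional hyperedge weights.
    """
    n_v, n_e = len(H), len(H[0])
    w = w or [1.0] * n_e                                                  # assumed W = I
    d_v = [sum(w[e] * H[v][e] for e in range(n_e)) for v in range(n_v)]  # vertex degrees
    d_e = [sum(H[v][e] for v in range(n_v)) for e in range(n_e)]         # hyperedge degrees
    # Normalized propagation matrix P = Dv^-1/2 H W De^-1 H^T Dv^-1/2.
    P = [[sum(H[a][e] * w[e] / d_e[e] * H[b][e] for e in range(n_e))
          / (d_v[a] ** 0.5 * d_v[b] ** 0.5)
          for b in range(n_v)] for a in range(n_v)]
    out = matmul(matmul(P, V), Theta)
    return [[max(0.0, x) for x in row] for row in out]                    # ReLU as sigma

# Toy hypergraph (hypothetical): hyperedge 0 = {0, 1}, hyperedge 1 = {1, 2}.
H = [[1, 0],
     [1, 1],
     [0, 1]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Theta = [[1.0, 0.0], [0.0, 1.0]]
smoothed = hypergraph_conv(H, V, Theta)
```

Each output row mixes a vertex's embedding with those of the vertices that share a hyperedge with it, which is the smoothing effect described above.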

After obtaining the structure-aware embedding $\tilde{v}^{r}_{i}$ of the 3D object $o^{r}_{i}$, we construct a memory bank $\mathcal{B}$ that contains $L$ invariant memory anchors. Following [8], we compute the activation score for each memory anchor in the memory bank by $s^{r}_{ij}=\mathcal{D}_{m}(\tilde{v}^{r}_{i},a^{r}_{j})$, where $a^{r}_{j}$ denotes the anchor and $\mathcal{D}_{m}(\cdot,\cdot)$ denotes the distance metric function. We rebuild the aligned embedding of each object by $z^{r}_{i}=\sum\nolimits_{j=1}^{L}\hat{s}^{r}_{ij}a^{r}_{j},z^{r}_{i}\in\mathbb{R}^{d_{z}}$, where $\hat{s}^{r}_{ij}$ denotes the normalized activation score.

3.4.3 Loss Function for HSL

To train the hypergraph convolution and learnable memory anchors under hierarchical structure, we adopt the self-supervised Memory Reconstruction Loss $\mathcal{L}_{mr}$ for HSL:

\mathcal{L}_{HSL}=\mathcal{L}_{mr}=\big\lVert\tilde{v}^{r}_{i}-z^{r}_{i}\big\rVert_{2},

(10)

where $\lVert\cdot\rVert_{2}$ denotes the $\ell_{2}$ norm.
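The memory-bank alignment and the reconstruction loss of Eq. 10 can be illustrated for a single embedding as follows. The text does not fix the form of the activation score or of its normalization, so this sketch assumes a negative Euclidean distance as the score and a softmax as the normalization; both choices, and the function name `align_with_memory`, are hypothetical.

```python
import math

def align_with_memory(v, anchors):
    """Rebuild v as a weighted sum of memory anchors; return (z, L_mr)."""
    # Assumed activation score: negative Euclidean distance to each anchor,
    # so that closer anchors receive larger scores.
    scores = [-math.dist(v, a) for a in anchors]
    # Assumed normalization: numerically stable softmax over the anchors.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Aligned embedding z = sum_j s_hat_j * a_j.
    z = [sum(w * a[d] for w, a in zip(weights, anchors)) for d in range(len(v))]
    # Memory reconstruction loss of Eq. (10).
    return z, math.dist(v, z)

# Toy memory bank (hypothetical values) with two anchors.
anchors = [[0.0, 0.0], [10.0, 0.0]]
z_near, loss_near = align_with_memory([0.0, 0.0], anchors)  # coincides with an anchor
z_mid, loss_mid = align_with_memory([5.0, 3.0], anchors)    # lies between the anchors
```

Under these assumptions, an embedding that coincides with an anchor is reconstructed almost exactly, while one far from every anchor incurs a larger loss, which is the signal used to train the anchors and the convolution.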

Table 1: Experimental results of Image2Point retrieval on the OCAB, OCNT, OCES, and OCMN datasets.

Dataset	OCAB			OCNT			OCES			OCMN
Method	mAP $\uparrow$	NDCG $\uparrow$	ANMRR $\downarrow$	mAP $\uparrow$	NDCG $\uparrow$	ANMRR $\downarrow$	mAP $\uparrow$	NDCG $\uparrow$	ANMRR $\downarrow$	mAP $\uparrow$	NDCG $\uparrow$	ANMRR $\downarrow$
SDML	0.1489	0.1061	0.8824	0.0465	0.0316	0.9657	0.0942	0.0442	0.9486	0.0578	0.0248	0.9735
CMCL	0.1702	0.1520	0.8565	0.0623	0.0332	0.9665	0.0991	0.0477	0.9444	0.1175	0.0917	0.9001
MMSAE	0.1218	0.0802	0.9093	0.0410	0.0191	0.9817	0.0810	0.0362	0.9567	0.0571	0.0235	0.9746
PROSER	0.1119	0.0446	0.9386	0.0426	0.0171	0.9752	0.0968	0.0402	0.9641	0.0523	0.0133	0.9806
HGM²R	0.1367	0.0925	0.8978	0.1812	0.1072	0.8184	0.2184	0.1126	0.8215	0.0988	0.0789	0.9282
Ours	0.2220	0.1714	0.7947	0.2861	0.1585	0.7292	0.4004	0.1835	0.6378	0.1549	0.1488	0.8625

Table 2: Experimental results of Point2Image retrieval on the OCAB, OCNT, OCES, and OCMN datasets.

Dataset	OCAB			OCNT			OCES			OCMN
Method	mAP $\uparrow$	NDCG $\uparrow$	ANMRR $\downarrow$	mAP $\uparrow$	NDCG $\uparrow$	ANMRR $\downarrow$	mAP $\uparrow$	NDCG $\uparrow$	ANMRR $\downarrow$	mAP $\uparrow$	NDCG $\uparrow$	ANMRR $\downarrow$
SDML	0.1636	0.1367	0.8629	0.0393	0.0191	0.9820	0.0811	0.0413	0.9654	0.0682	0.0475	0.9512
CMCL	0.1628	0.1343	0.8594	0.0394	0.0203	0.9786	0.0815	0.0420	0.9598	0.1219	0.1116	0.9074
MMSAE	0.0821	0.0657	0.9211	0.0347	0.0191	0.9846	0.0891	0.0413	0.9496	0.0460	0.0278	0.9699
PROSER	0.0708	0.0555	0.9579	0.0387	0.0182	0.9834	0.0885	0.0406	0.9492	0.0693	0.0336	0.9653
HGM²R	0.1553	0.1613	0.8593	0.1452	0.0742	0.8996	0.2260	0.1265	0.8430	0.1006	0.0747	0.9305
Ours	0.3013	0.3202	0.7048	0.2811	0.1377	0.7316	0.4471	0.1914	0.5936	0.1277	0.1117	0.8994

4 Experiments

4.1 Experimental Settings

OCMR Datasets. We generate four open-set 3D cross-modal retrieval (OCMR) datasets, namely OCAB, OCNT, OCES, and OCMN, based on the public datasets ABO [11], NTU [12], ESB [13], and ModelNet40 [14], respectively. These datasets are split into seen and unseen categories, and each object has three modalities: multi-view, voxel, and point cloud.

Implementation Details. In our experiments, we use all three modalities of the 3D objects. We set the hyper-parameters $\alpha=0.5$ in Eq. 6 and $\tau=0.75$ in Eq. 7. The two modules are trained separately, with 40 epochs at learning rate $lr=0.1$ and 120 epochs at $lr=0.001$. The random seed is fixed to 2022 for all experiments.

4.2 Retrieval Performance

Compared Methods. As no methods have been specifically designed for open-set 3D cross-modal retrieval to date, we adapt current state-of-the-art methods from two tasks for comparison: closed-set 3D cross-modal retrieval (SDML [15], CMCL [1], MMSAE [16]), and open-set multi-modal recognition or retrieval (PROSER [17], HGM²R [8]).

Evaluation Metrics. For a fair comparison, we employ commonly used retrieval metrics, including Mean Average Precision (mAP), Normalized Discounted Cumulative Gain (NDCG), Average Normalized Modified Retrieval Rank (ANMRR), and the Precision-Recall Curve (PR-Curve). For the mAP and NDCG metrics, higher scores are better. For the ANMRR metric, a lower score is better. We construct $6$ query-target types for cross-modal retrieval from the three modalities: Image2Point (I2P), Image2Voxel (I2V), Point2Image (P2I), Point2Voxel (P2V), Voxel2Image (V2I), and Voxel2Point (V2P).
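As an illustration of the first metric, the sketch below computes mAP for one query-target type using the common definition of average precision over a full ranking. The exact evaluation protocol (for example, any cutoff depth) is not specified in the text, so the Euclidean ranking distance, the function names, and the toy data are assumptions.

```python
import math

def average_precision(ranked_labels, query_label):
    """Average of precision@rank taken at the rank of every relevant item."""
    hits, precisions = 0, []
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0

def mean_average_precision(queries, targets, dist=math.dist):
    """queries, targets: lists of (embedding, label) from two different modalities."""
    aps = []
    for zq, yq in queries:
        # Rank all targets by increasing distance to the query embedding.
        ranked = sorted(targets, key=lambda t: dist(zq, t[0]))
        aps.append(average_precision([y for _, y in ranked], yq))
    return sum(aps) / len(aps)

# Toy example (hypothetical values): one image query against three point-cloud targets.
queries = [([0.0], "a")]
targets = [([0.1], "a"), ([0.2], "b"), ([0.3], "a")]
map_score = mean_average_precision(queries, targets)
```

Here the relevant targets sit at ranks 1 and 3, so the average precision is the mean of 1/1 and 2/3.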

Comparison Analysis. We evaluate open-set 3D cross-modal retrieval on four datasets. Quantitative results of the SRCR framework and other state-of-the-art methods are provided in Tab. 1 and Tab. 2. The proposed method outperforms the other methods on all four datasets and on all three metrics. We also provide the Precision-Recall (PR) Curve to evaluate the performance of the proposed SRCR framework and the compared methods, as illustrated in Fig. 3. A larger area under the curve indicates better performance, and our method again outperforms all compared methods. These results indicate that, with the residual-center embedding and hierarchical structure learning, the proposed method is able to overcome modality gaps while generalizing to open-set categories.

Table 3: Ablation studies on OCNT dataset.

Ablation	Variant	mAP $\uparrow$	NDCG $\uparrow$	ANMRR $\downarrow$
On RCE	Direct Center	0.0362	0.0194	0.9801
On RCE	Category Center	0.0433	0.0230	0.9772
On HSL	HSL w/o $\mathcal{E}_{m}$	0.1511	0.1081	0.8626
	HSL w/o $\mathcal{E}_{m}$ & $\mathcal{E}_{o}$	0.1474	0.1084	0.8636
	GCN-based HSL	0.2575	0.1640	0.7374
	MLP-based HSL	0.1553	0.1101	0.8380
	RCE+HSL	0.2861	0.1585	0.7292

4.3 Ablation Study

We conduct ablation studies to verify the effectiveness of the proposed modules. For the residual-center embedding module, we compare the proposed RCE with the Direct Center and Category Center. The Direct Center denotes the network that uses an auto-encoder to generate the center embedding directly instead of residually, and the Category Center denotes the network that generates the category center rather than the semantic center of each object. For the hierarchical structure learning module, we compare the proposed HSL with naive structures (HSL w/o $\mathcal{E}_{m}$ and HSL w/o $\mathcal{E}_{m}$ & $\mathcal{E}_{o}$), where “w/o” denotes “without”. We also replace the hypergraph-based correlation learning with MLP and GCN. As shown in Tab. 3, the combination of RCE and HSL yields the best mAP and ANMRR, while the GCN-based variant attains a slightly higher NDCG (0.1640 vs. 0.1585). Substituting either the embedding or the learning approach otherwise leads to a decline in performance. These results indicate that the proposed modules can effectively obtain the center embedding of objects and generalize it to unseen categories.

5 Conclusion

In this paper, we propose the Structure-Aware Residual-Center Representation (SRCR) framework for self-supervised open-set 3D cross-modal retrieval. We utilize the Residual-Center Embedding (RCE) for each object by nested auto-encoders to address the center deviation due to category distribution differences, rather than directly mapping them to the modality or category centers. Besides, we construct a heterogeneous hypergraph structure based on hierarchical inter-modality, intra-object, and implicit-category correlations, and perform the Hierarchical Structure Learning (HSL) approach to leverage the high-order correlations among objects for generalization. Extensive experiments and ablation studies on four benchmarks demonstrate the superiority of our proposed framework compared to state-of-the-art methods.

References

[1] Longlong Jing, Elahe Vahdani, Jiaxing Tan, and Yingli Tian, “Cross-modal Center Loss for 3D Cross-Modal Retrieval,” in CVPR, 2021, pp. 3142–3151.
[2] Bokun Wang, Yang Yang, Xing Xu, Alan Hanjalic, and Heng Tao Shen, “Adversarial Cross-modal Retrieval,” in ACMMM, 2017, pp. 154–162.
[3] Yanglin Feng, Hongyuan Zhu, Dezhong Peng, Xi Peng, and Peng Hu, “RONO: Robust Discriminative Learning With Noisy Labels for 2D-3D Cross-Modal Retrieval,” in CVPR, 2023, pp. 11610–11619.
[4] Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu, “Deep Canonical Correlation Analysis,” in ICML. PMLR, 2013, pp. 1247–1255.
[5] Weiran Wang, Raman Arora, Karen Livescu, and Jeff Bilmes, “On Deep Multi-View Representation Learning,” in ICML. PMLR, 2015, pp. 1083–1092.
[6] Sagar Vaze, Kai Han, Andrea Vedaldi, and Andrew Zisserman, “Open-Set Recognition: A Good Closed-Set Classifier is All You Need?,” arXiv preprint arXiv:2110.06207, 2021.
[7] Guangyao Chen, Peixi Peng, Xiangqian Wang, and Yonghong Tian, “Adversarial Reciprocal Points Learning for Open Set Recognition,” TPAMI, vol. 44, no. 11, pp. 8065–8081, 2021.
[8] Yifan Feng, Shuyi Ji, Yu-Shen Liu, Shaoyi Du, Qionghai Dai, and Yue Gao, “Hypergraph-based Multi-Modal Representation for Open-Set 3D Object Retrieval,” TPAMI, no. 01, pp. 1–18, 2023.
[9] Fangxiang Feng, Xiaojie Wang, and Ruifan Li, “Cross-Modal Retrieval with Correspondence AutoEncoder,” in ACMMM, 2014, pp. 7–16.
[10] Yue Gao, Yifan Feng, Shuyi Ji, and Rongrong Ji, “HGNN+: General Hypergraph Neural Networks,” TPAMI, vol. 45, no. 3, pp. 3181–3199, 2022.
[11] Jasmine Collins, Shubham Goel, Kenan Deng, Achleshwar Luthra, Leon Xu, Erhan Gundogdu, Xi Zhang, Tomas F Yago Vicente, Thomas Dideriksen, Himanshu Arora, et al., “ABO: Dataset and Benchmarks for Real-World 3D Object Understanding,” in CVPR, 2022, pp. 21126–21136.
[12] Ding-Yun Chen, Xiao-Pei Tian, Yu-Te Shen, and Ming Ouhyoung, “On Visual Similarity based 3D Model Retrieval,” in Computer graphics forum. Wiley Online Library, 2003, pp. 223–232.
[13] Subramaniam Jayanti, Yagnanarayanan Kalyanaraman, Natraj Iyer, and Karthik Ramani, “Developing an Engineering Shape Benchmark for CAD Models,” Computer-Aided Design, vol. 38, no. 9, pp. 939–953, 2006.
[14] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao, “3D Shapenets: A Deep Representation for Volumetric Shapes,” in CVPR, 2015, pp. 1912–1920.
[15] Peng Hu, Liangli Zhen, Dezhong Peng, and Pei Liu, “Scalable Deep Multimodal Learning for Cross-Modal Retrieval,” in SIGIR, 2019, pp. 635–644.
[16] Yiling Wu, Shuhui Wang, and Qingming Huang, “Multi-Modal Semantic AutoEncoder for Cross-Modal Retrieval,” Neurocomputing, vol. 331, pp. 165–175, 2019.
[17] Da-Wei Zhou, Han-Jia Ye, and De-Chuan Zhan, “Learning Placeholders for Open-Set Recognition,” in CVPR, 2021, pp. 4401–4410.
