Learning Point-Language Hierarchical Alignment for 3D Visual Grounding

Chen, Jiaming; Luo, Weixin; Song, Ran; Wei, Xiaolin; Ma, Lin; Zhang, Wei


arXiv:2210.12513 (cs)

[Submitted on 22 Oct 2022 (v1), last revised 9 Jun 2023 (this version, v4)]

Abstract: This paper presents a novel hierarchical alignment model (HAM) that learns multi-granularity visual and linguistic representations in an end-to-end manner. We extract key points and proposal points to model 3D contexts and instances, and propose a point-language alignment with context modulation (PLACM) mechanism, which learns to gradually align word-level and sentence-level linguistic embeddings with visual representations, while the modulation with the visual context captures latent informative relationships. To further capture both global and local relationships, we propose a spatially multi-granular modeling scheme that applies PLACM to both global and local fields. Experimental results demonstrate the superiority of HAM, and visualizations show that it can dynamically model fine-grained visual and linguistic representations. HAM outperforms existing methods by a significant margin, achieves state-of-the-art performance on two publicly available datasets, and won the ECCV 2022 ScanRefer challenge. Code is available at this https URL.
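
As a rough illustration of the idea in the abstract (aligning point features first with word-level and then with sentence-level embeddings, with a modulation step driven by visual context), the toy sketch below may help. It is not the authors' implementation: it uses generic scaled dot-product cross-attention, and the function names (`cross_attend`, `score_points`), the residual fusion, the sigmoid gate, and the 2-dimensional toy features are all assumptions made for illustration only.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(points, words):
    """Each point feature attends over the word embeddings (scaled dot-product)."""
    d = len(words[0])
    attended = []
    for p in points:
        weights = softmax([dot(p, w) / math.sqrt(d) for w in words])
        attended.append([sum(wt * w[i] for wt, w in zip(weights, words))
                         for i in range(d)])
    return attended

def score_points(points, words, sentence, context):
    """Hypothetical pipeline: word-level alignment, context gate, sentence-level score."""
    d = len(sentence)
    # Word level: residual fusion of each point with its attended word summary (assumed).
    fused = [[pi + ai for pi, ai in zip(p, a)]
             for p, a in zip(points, cross_attend(points, words))]
    # Context modulation: a scalar sigmoid gate per point (assumed form).
    modulated = []
    for f in fused:
        gate = 1.0 / (1.0 + math.exp(-dot(f, context) / math.sqrt(d)))
        modulated.append([gate * x for x in f])
    # Sentence level: normalized matching score of every point against the sentence.
    return softmax([dot(f, sentence) / math.sqrt(d) for f in modulated])

# Toy inputs: two proposal points, two words, one sentence vector, one context vector.
points = [[1.0, 0.0], [0.0, 1.0]]
words = [[1.0, 0.0], [0.5, 0.5]]
sentence = [1.0, 0.0]
context = [0.5, 0.5]
scores = score_points(points, words, sentence, context)
print(scores)
```

The returned scores form a distribution over proposal points; in this toy setup the point whose feature is closest to the sentence vector receives the larger score. A real model would use learned projections and far higher-dimensional features.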

Comments:	Champion on ECCV 2022 ScanRefer Challenge
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2210.12513 [cs.CV]
	(or arXiv:2210.12513v4 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2210.12513

Submission history

From: Jiaming Chen
[v1] Sat, 22 Oct 2022 18:02:10 UTC (5,929 KB)
[v2] Sun, 30 Oct 2022 09:22:05 UTC (6,328 KB)
[v3] Mon, 5 Jun 2023 10:09:54 UTC (6,950 KB)
[v4] Fri, 9 Jun 2023 04:06:39 UTC (7,022 KB)
