Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion Approach for 3D VQA

Mo, Wentao; Liu, Yang

Computer Science > Computer Vision and Pattern Recognition

arXiv:2402.15933 (cs)

[Submitted on 24 Feb 2024]

Abstract: In 3D Visual Question Answering (3D VQA), the scarcity of fully annotated data and limited visual content diversity hamper generalization to novel scenes and 3D concepts (e.g., only around 800 scenes are utilized in the ScanQA and SQA datasets). Current approaches resort to supplementing 3D reasoning with 2D information. However, these methods face challenges: either they use top-down 2D views that introduce overly complex and sometimes question-irrelevant visual clues, or they rely on globally aggregated scene/image-level representations from 2D VLMs, losing the fine-grained vision-language correlations. To overcome these limitations, our approach utilizes a question-conditional 2D view selection procedure, pinpointing semantically relevant 2D inputs for crucial visual clues. We then integrate this 2D knowledge into the 3D-VQA system via a two-branch Transformer structure. This structure, featuring a Twin-Transformer design, compactly combines the 2D and 3D modalities and captures fine-grained correlations between them, allowing them to mutually augment each other. Integrating the mechanisms proposed above, we present BridgeQA, which offers a fresh perspective on multi-modal transformer-based architectures for 3D-VQA. Experiments validate that BridgeQA achieves state-of-the-art results on 3D-VQA datasets and significantly outperforms existing solutions. Code is available at this https URL.
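
To make the idea of question-conditional 2D view selection more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the question and each candidate 2D view have already been embedded into a shared vector space (for example by some 2D vision-language model) and that relevance is scored by cosine similarity; the abstract specifies neither. The names `cosine_similarity` and `select_views` and the toy embeddings are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_views(
    question_emb: Sequence[float],
    view_embs: Sequence[Sequence[float]],
    k: int = 1,
) -> List[int]:
    """Return the indices of the k views most similar to the question, best first.

    Assumption: relevance is plain cosine similarity between precomputed
    embeddings; the scoring used in the paper may differ.
    """
    scores = [cosine_similarity(question_emb, v) for v in view_embs]
    ranked = sorted(range(len(view_embs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy, made-up embeddings for illustration only.
question = [1.0, 0.0, 0.5]
views = [
    [0.0, 1.0, 0.0],  # view 0: orthogonal to the question
    [0.9, 0.1, 0.4],  # view 1: closely aligned with the question
    [0.5, 0.5, 0.5],  # view 2: partially aligned
]

print(select_views(question, views, k=2))  # prints [1, 2]
```

In a full pipeline, the features of the selected views would then be passed, together with the 3D scene features, to the two-branch Transformer described in the abstract; that fusion stage is not sketched here.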

Comments:	To be published in AAAI 24
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2402.15933 [cs.CV]
	(or arXiv:2402.15933v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2402.15933

Submission history

From: Wentao Mo
[v1] Sat, 24 Feb 2024 23:31:34 UTC (470 KB)
