Structured scene memory for vision-language navigation

H Wang, W Wang, W Liang… - Proceedings of the IEEE/CVF Conference on Computer Vision and …, 2021 - openaccess.thecvf.com
Abstract
Recently, numerous algorithms have been developed to tackle the problem of vision-language navigation (VLN), i.e., requiring an agent to navigate 3D environments by following linguistic instructions. However, current VLN agents simply store their past experiences/observations as latent states in recurrent networks, failing to capture environment layouts and to make long-term plans. To address these limitations, we propose a crucial architecture, called Structured Scene Memory (SSM). It is compartmentalized enough to accurately memorize the percepts during navigation. It also serves as a structured scene representation, which captures and disentangles visual and geometric cues in the environment. SSM has a collect-read controller that adaptively collects information to support current decision making and mimics iterative algorithms for long-range reasoning. As SSM provides a complete action space, i.e., all the navigable places on the map, a frontier-exploration based navigation decision-making strategy is introduced to enable efficient and global planning. Experimental results on two VLN datasets (i.e., R2R and R4R) show that our method achieves state-of-the-art performance on several metrics.
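The core ideas in the abstract — a topological memory over visited viewpoints, a frontier of navigable-but-unvisited places as a global action space, and frontier selection for planning — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: node names, the scalar scores standing in for the agent's learned instruction-conditioned features, and the greedy `plan` heuristic are all assumptions for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class SceneMemory:
    """Minimal topological scene-memory sketch: nodes are viewpoints,
    edges encode navigability, and unvisited neighbors of visited
    nodes form the frontier used as a global action space."""
    edges: dict = field(default_factory=dict)     # node -> set of neighbors
    visited: set = field(default_factory=set)
    features: dict = field(default_factory=dict)  # node -> scalar stand-in score

    def observe(self, node, neighbors, score):
        """Record a visit: store the node's feature (a scalar here, in
        place of learned visual features) and link all navigable
        neighbors seen from this viewpoint."""
        self.visited.add(node)
        self.features[node] = score
        self.edges.setdefault(node, set()).update(neighbors)
        for n in neighbors:
            self.edges.setdefault(n, set()).add(node)

    def frontier(self):
        """Navigable-but-unvisited nodes: the complete action space."""
        reachable = {n for v in self.visited for n in self.edges[v]}
        return reachable - self.visited

    def plan(self):
        """Greedy frontier selection: pick the frontier node whose best
        visited neighbor scores highest (a stand-in for the paper's
        learned instruction-conditioned scoring)."""
        front = self.frontier()
        if not front:
            return None
        def value(f):
            return max(self.features[v] for v in self.edges[f] if v in self.visited)
        return max(front, key=value)

# Usage: after two steps, the memory exposes two frontier nodes and
# plans toward the one adjacent to the higher-scoring viewpoint.
m = SceneMemory()
m.observe("start", {"a", "b"}, score=0.2)
m.observe("a", {"start", "c"}, score=0.9)
print(m.frontier())  # {"b", "c"}
print(m.plan())      # "c" (its visited neighbor "a" scores 0.9 > 0.2)
```

Unlike a recurrent latent state, this graph retains the full layout explored so far, so the agent can jump planning attention to any frontier node rather than only the locally adjacent ones.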