License: CC BY 4.0
arXiv:2309.12188v2 [cs.RO] 24 Mar 2024

SG-Bot: Object Rearrangement via
Coarse-to-Fine Robotic Imagination on Scene Graphs

Guangyao Zhai$^{1}$, Xiaoni Cai$^{1}$, Dianye Huang$^{1}$, Yan Di$^{1,\dagger}$
Fabian Manhardt$^{2}$, Federico Tombari$^{1,2}$, Nassir Navab$^{1}$ and Benjamin Busam$^{1}$
https://sites.google.com/view/sg-bot
$^{1}$ Technical University of Munich. $^{2}$ Google. $^{\dagger}$ Corresponding author.
Abstract

Object rearrangement is pivotal in robotic-environment interactions, representing a significant capability in embodied AI. In this paper, we present SG-Bot, a novel rearrangement framework that utilizes a coarse-to-fine scheme with a scene graph as the scene representation. Unlike previous methods that rely on either known goal priors or zero-shot large models, SG-Bot exemplifies lightweight, real-time, and user-controllable characteristics, seamlessly blending the consideration of commonsense knowledge with automatic generation capabilities. SG-Bot employs a three-fold procedure (observation, imagination, and execution) to address the task. Initially, objects are discerned and extracted from a cluttered scene during the observation. These objects are first coarsely organized and depicted within a scene graph, guided by either commonsense or user-defined criteria. This scene graph then informs a generative model, which forms a fine-grained goal scene considering the shape information from the initial scene and object semantics. Finally, for execution, the initial and envisioned goal scenes are matched to formulate robotic action policies. Experimental results demonstrate that SG-Bot outperforms competitors by a large margin.

I Introduction

Object rearrangement is an essential but challenging task in robot-environment interaction, marking a crucial capability in embodied AI [1]. This interactive ability attains its zenith of automation by synergizing vision [2, 3, 4, 5], textual insights from sources [6, 7, 8], and strategic motion planning [9, 10]. Together, these elements culminate in a sophisticated physical embodiment for robots.

Robotic rearrangement refers to the process wherein a robotic agent, starting from an initial configuration within a scene, re-positions objects according to specific rules or instructions. The purpose is to achieve desired goal states, relying solely on sensory data and onboard perceptions. Recently proposed vision-based solutions to this task can be categorized into three approaches: utilizing known geometric and semantic goal states, sequential object pose estimation, and zero-shot rearrangement with large models. Typically, for goal-guided methods [11, 12], the quality of such priors significantly affects the accuracy of the rearrangement. When the goal state is unavailable, such methods become inapplicable for real-world use. Moreover, for pose estimation based approaches [13], while the sequential design aligns well with robotic manipulations, it can be affected by cumulative errors in autoregressive predictions. The last type of methods [14, 15, 16, 17, 18] tap into commonsense knowledge stored in zero-shot models. They necessitate either intricate post-filter procedures or prompt template designs, which tend to overlook scene-specific contextual cues and result in diverse undesired outcomes.

Orthogonal to the above methodologies, we explore a novel rearrangement routine embodied as SG-Bot, using goal imagination on scene graphs and goal-guided object matching as shown in Fig. 1. SG-Bot stacks three stages for the task: observation, imagination, and execution. Specifically, in the first stage, it processes initial scenes to extract objects by semantic instance segmentation. The imagination stage follows a coarse-to-fine solution, where objects are first treated as semantic nodes in a constructed goal scene graph. This graph is directed either by commonsense reasoning or by user-defined rules, serving as coarse goal states. For a finer generation, the goal scene graph can already be decoded to an actual scene using a scene generative model, Graph-to-3D [19]. However, as is typical of generative models, Graph-to-3D can produce diverse generation results inconsistent with the observation, potentially affecting the precision of subsequent object matching. We control the generation process by enriching the graph with shape priors to make a shape-aware graph that carries the initial shape knowledge. Next, SG-Bot performs finer goal scene imagination conditioned on this graph, ensuring that the imagined shapes are coherent with the initial observation. Finally, in the execution stage, the imagined objects serve as anchors to guide the object matching by point cloud registration during the scene transformation. At each transformation step, we check occupancy between objects in the current observation and the imagination for safe rearrangement. The uniqueness of SG-Bot manifests in three aspects: First, SG-Bot does not need known goal priors but can self-generate goal scenes exclusively for the initial scenes, compared to the goal-required methods, e.g., [11, 12]. Second, SG-Bot decouples the transformation policy using per-object matching to decrease the risk of error accumulation, compared to autoregressive methods, e.g., [13]. Third, the concrete goal states and the closed-loop rearrangement strategy guarantee the rearrangement performance, compared to the loosely coupled zero-shot methods, e.g., [16].

Our contributions are summarized as:

  • We present SG-Bot, a new paradigm for object rearrangement. Goal states are generated coarse-to-fine from rules represented as scene graphs, and goal-guided matching against these states defines our motion policies.

  • Ambiguous goal scene generation is alleviated by extracting shape priors from the initial observation. This leads to improved rearrangement performance.

  • Experimental results in simulation show that SG-Bot can achieve competitive performance with state-of-the-art methods. Moreover, the rearrangement performance remains consistent in real-world scenarios.

II Related Work

II-A Scene Graph

Scene graphs offer a rich symbolic and semantic representation of scenes [20, 21]. They can reason about objects and their relationships more explicitly than language [22]. This compact relationship description can be obtained through spatial grounding [23, 24], predicted from images [25, 26, 27], or even a GUI [28]. Scene graphs have applications in numerous computer vision areas such as 2D image generation [29, 22], manipulation [25], caption generation [30], camera localization [31], and 3D scene synthesis [32, 19, 33]. Recent robotics manipulation research also leverages scene graphs in planning [34, 35, 36]. In the context of this work, scene graphs serve to generate scenes, acting as anchors that guide the rearrangement.

II-B Object Rearrangement

The task requires an embodied agent to transition from initial states to goal states, adhering to specific rules based on perception and planning [1], as indicated by earlier works [37, 38, 39, 40, 41]. By leveraging the development of visual perception [42, 43, 44, 45, 46], robotic grasping [47, 48, 49], motion planning [50, 51, 52], and research platforms [53, 54, 55, 56, 57, 58], a number of related methods have emerged. Solutions for this task fall into two categories. In the first, the goal states are given to the embodied agent, which then solves the problem by object matching, for example, using optical flow [11] or feature cosine similarity [12]. However, deriving such configurations can be challenging in real-world scenarios. In the second, the goal states are generated conditioned on the initial states. These goal states can be implicitly represented, such as by gradient fields [59], scene distributions [60], or sequential reasoning on the observation [13]. Alternatively, goals can be explicit in various formats, such as images [14] generated from prompts, bounding boxes [61] or poses [62] derived from descriptions, and direct language instructions [15, 63, 64], leveraging recent off-the-shelf large language models [65, 6, 66]. More powerful models even treat the initial-goal transformation as an end-to-end problem [67, 17], at the cost of large resource consumption. In this work, we generate the goal in a two-stage fashion, where coarse relationships are symbolized as a scene graph and finer concrete goals are given by the scene imagined from that graph.

III Preliminary

Figure 2: SG-Bot pipeline. a) SG-Bot segments the input RGB image via MaskRCNN [68] to obtain individual object nodes $v_i$. Then, the corresponding point cloud of $v_i$ is obtained via back-projecting the depth map with camera intrinsics $K$. b) Coarse: the graph constructor connects each pair of nodes according to commonsense or user-defined rules, yielding scene graph $\mathcal{G}$. Fine: $\mathcal{G}$ is embedded and enhanced to $\mathcal{G}_z^{\beta}$ by combining estimated shape priors $\beta^{*}$ extracted from the normalized point clouds using the trained encoder $\mathcal{B}_E$ and latent code $z$ sampled from the learned layout-shape distribution. $\mathcal{G}_z^{\beta}$ then informs $\mathit{\Phi}_D$ and $\mathcal{L}_D$ of Graph-to-3D [19] to generate shape codes $\alpha^{*}$ and the scene layout respectively. $\alpha^{*}$ are decoded as shapes via $\mathcal{A}_D$, which are then populated in the layouts to form the goal scene. c) SG-Bot matches the initial and envisioned goal using point cloud registration and performs an occupancy check to determine the final movement in each step, as illustrated in Sec. V-E. The robot iteratively executes the action, transforming scenes into intermediate states and updating the observation until it reaches the goal state.

Scene Graph. The scene graph we use is a semantic scene graph [20], denoted as $\mathcal{G}=\{\mathcal{V},\mathcal{E}\}$, which serves as a structured representation of a visual scene. In this representation, $\mathcal{V}=\{v_i \mid i=1,\ldots,N\}$ refers to the set of object nodes, while $\mathcal{E}=\{e_{i\to j} \mid i,j=1,\ldots,N,\ i\neq j\}$ represents the set of directed edges connecting each pair of nodes $v_i \rightarrow v_j$. As structured in the left of Fig. 3.b, each node $v_i$ can encompass various extensible attributes, e.g., object category information $o_i\in O$, with $O$ containing all categories. As with the node representation, each edge $e_{i\to j}$ is associated with a class label $\gamma_{i\to j}\in\Gamma$. In this paper, $\Gamma$ contains all pre-defined edge types, i.e., {left/right, front/behind, standing on, close by}.
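As a rough illustration of this definition, the graph can be held in a small container with one category per node and one label per directed edge. This is a minimal sketch rather than the authors' implementation: the class name `SceneGraph`, the method `add_edge`, and the decision to split "left/right" and "front/behind" into four separate labels are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Edge types named in the text; splitting "left/right" and "front/behind"
# into separate labels, and the exact spellings, are assumptions.
EDGE_TYPES = {"left", "right", "front", "behind", "standing on", "close by"}


@dataclass
class SceneGraph:
    """Semantic scene graph G = {V, E} with labelled directed edges."""

    categories: list                           # o_i for node v_i, indexed by i
    edges: dict = field(default_factory=dict)  # (i, j) -> gamma_{i->j}

    def add_edge(self, i, j, label):
        """Attach label gamma_{i->j} to the directed edge v_i -> v_j."""
        n = len(self.categories)
        if not (0 <= i < n and 0 <= j < n) or i == j:
            raise ValueError("an edge must join two distinct existing nodes")
        if label not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {label!r}")
        self.edges[(i, j)] = label


# Hypothetical three-object scene: fork left of the plate, knife right of it.
graph = SceneGraph(categories=["plate", "fork", "knife"])
graph.add_edge(1, 0, "left")
graph.add_edge(2, 0, "right")
```

Keying the edges by the ordered pair `(i, j)` mirrors the constraint $i \neq j$ and the direction $v_i \rightarrow v_j$; it stores a single label per ordered pair, which is a simplification.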

IV SG-Bot: Overview

IV-A Problem Definition

From an initial layout state $\mathcal{S}_0$, the embodied agent is tasked with a sequential transformation of objects towards a desired goal state $\mathcal{S}^{*}$. This transformation is achieved by utilizing sequential motion policies $\mathcal{P}$, guided by sensory observations.

IV-B Inference Workflow

Observation. Given an RGB-D image capturing the initial object layout state $\mathcal{S}_0$, as shown in Fig. 2.a, SG-Bot first extracts all target objects as nodes $\mathcal{V}(O)$ via an arbitrary object detector, e.g., MaskRCNN [68].

Imagination. The extracted object nodes are constructed as a scene graph $\mathcal{G}$ according to commonsense or user-defined rules, as shown in Fig. 2.b and explained in Sec. V-B. Next, we evolve $\mathcal{G}$ to a latent shape-aware scene graph $\mathcal{G}^{\beta}_{z}$ with shape priors $\beta$ from the initial scene and the learned layout-shape distribution $Z$ mentioned in Sec. V-C. Finally, SG-Bot imagines a goal scene $\mathcal{S}^{*}$ conditioned on $\mathcal{G}^{\beta}_{z}$ via the shape decoder $\mathit{\Phi}_D$ and layout decoder $\mathcal{L}_D$ of a scene generative model Graph-to-3D [19], where $\mathcal{S}^{*}$ comprises a dense point cloud and a corresponding bounding box for each object.

Execution. Each target object in $\mathcal{S}_0$ is first extracted and represented as the back-projected point cloud from the depth map. Then, as shown in Fig. 2.c and explained in Sec. V-E, these objects are matched with the corresponding dense point clouds in $\mathcal{S}^{*}$ through iterative registration, e.g., ICP [69, 70]. Based on the outcomes of this registration process, SG-Bot generates per-object manipulation policies $\mathcal{P}_t$ filtered and refined by object occupancy checking at each action step $t$. SG-Bot continues to iteratively reposition objects in $\mathcal{S}_0$ towards $\mathcal{S}^{*}$ until all objects are effectively rearranged.
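The iterative registration step can be sketched as a plain point-to-point ICP loop. The code below is a generic textbook variant written with NumPy, not the specific registration method used by SG-Bot: the function names `best_rigid_transform` and `icp`, the brute-force nearest-neighbour search, and the fixed iteration count are assumptions made for illustration.

```python
import numpy as np


def best_rigid_transform(P, Q):
    """Least-squares R, t with R @ p + t ~ q for paired rows (Kabsch / SVD)."""
    mu_p, mu_q = P.mean(axis=0), Q.mean(axis=0)
    H = (P - mu_p).T @ (Q - mu_q)            # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T  # d = -1 undoes a reflection
    t = mu_q - R @ mu_p
    return R, t


def icp(P, Q, iters=30):
    """Point-to-point ICP aligning source P (N, 3) to target Q (M, 3)."""
    R_total, t_total = np.eye(3), np.zeros(3)
    P_cur = P.copy()
    for _ in range(iters):
        # Brute-force nearest neighbour in Q for every current source point.
        d2 = ((P_cur[:, None, :] - Q[None, :, :]) ** 2).sum(axis=2)
        matches = Q[d2.argmin(axis=1)]
        R, t = best_rigid_transform(P_cur, matches)
        P_cur = P_cur @ R.T + t
        # Compose the incremental update with the accumulated transform.
        R_total, t_total = R @ R_total, R @ t_total + t
    return R_total, t_total
```

Here `P` would hold the observed points of one object and `Q` the imagined points of its counterpart in the goal scene. The determinant check keeps the estimate a proper rotation, and a k-d tree would normally replace the quadratic distance matrix for dense clouds.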

V SG-Bot: Methodology

V-A Object Extraction

Given a cluttered scene $\mathcal{S}_0$ as the initial state, SG-Bot first performs semantic instance segmentation to segment all target objects, as shown in Fig. 2.a. Specifically, we adopt MaskRCNN to jointly predict the object masks and category labels. Then, each object is represented as the back-projected point cloud from the depth map. These objects, denoted as $\mathcal{V}(O)=\{v_i(o_i) \mid i=1,\ldots,N\}$, are further collected and processed in the following Imagination module. This module aims to generate the desired goal scene by treating these objects as individual scene graph nodes.
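Back-projecting one segmented object from the depth map can be sketched as follows, assuming a standard pinhole camera with a 3x3 intrinsic matrix $K$, a metric depth image, and a boolean instance mask. The function name `back_project` and the convention that zero depth marks an invalid pixel are assumptions for this example.

```python
import numpy as np


def back_project(depth, mask, K):
    """Lift masked depth pixels to 3D camera-frame points (pinhole model)."""
    # Keep pixels inside the instance mask that carry a valid depth reading.
    v, u = np.nonzero(mask & (depth > 0))    # rows (v) and columns (u)
    z = depth[v, u]
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)       # shape (num_points, 3)
```

Calling this once per predicted mask yields one partial point cloud per node $v_i$, expressed in the camera frame.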

After obtaining target objects $\mathcal{V}(O)$, we follow a coarse-to-fine scheme to generate the desired goal scene, which is leveraged to guide the object action.

V-B Coarse Stage: Goal Scene Graph Construction

SG-Bot establishes a goal scene graph $\mathcal{G}=\{\mathcal{V}(O),\mathcal{E}(\Gamma)\}$ by determining the edge type $\gamma_{i\to j}\in\Gamma$ for each edge in $\mathcal{E}(\Gamma)$, as shown in Fig. 2.b. In this paper, two modes are supported to define edges between nodes:

Commonsense mode. Following the recent trend of knowledge representation with graphs [71], we represent common human knowledge in the form of edge attributes $\Gamma$ within a scene graph. For instance, for a scene containing a plate, the fork and knife are typically placed to the left and right of the plate. Additionally, the spoon needs to be placed in front of the plate if it exists. For the case without a plate, the spoon tends to be placed close by the bowl or cup. Moreover, other objects need to be placed in front of the plate, bowl, cup, etc. Any unusual objects that appear on the table will be identified as obstacles and subsequently removed, leaving $M$ final nodes out of the $N$ detected elements, $M\leq N$. Similar rules are naturally introduced based on the category of the object and commonsense. One way to achieve this is to use an LLM to choose the optimal relationship according to the provided $\Gamma$.
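The tableware rules above can be written as a small hand-coded graph constructor. This is only a sketch of the rule-based variant, not the LLM-based one, and it makes several assumptions: the set `KNOWN` of accepted categories, at most one instance per category, and the function name `build_goal_edges` are all hypothetical.

```python
# Categories treated as tableware; anything else is removed as an obstacle.
# The contents of this set are an assumption for the example.
KNOWN = {"plate", "fork", "knife", "spoon", "bowl", "cup"}


def build_goal_edges(categories):
    """Return (kept_categories, edges), edges as (subject, relation, object)."""
    kept = [c for c in categories if c in KNOWN]   # M kept out of N detected
    edges = []
    if "plate" in kept:
        if "fork" in kept:
            edges.append(("fork", "left", "plate"))
        if "knife" in kept:
            edges.append(("knife", "right", "plate"))
        if "spoon" in kept:
            edges.append(("spoon", "front", "plate"))
    elif "spoon" in kept:
        # Without a plate, the spoon goes close by the bowl or the cup.
        for anchor in ("bowl", "cup"):
            if anchor in kept:
                edges.append(("spoon", "close by", anchor))
                break
    return kept, edges
```

For a detected set such as plate, fork, knife, spoon, and an unknown object, the unknown object is dropped and three edges towards the plate are produced.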

User-defined mode. In contrast to the uncontrollable Commonsense mode, we demonstrate that one of the main advantages of introducing the scene graph representation is that it enables the controllable User-defined mode. Users can manipulate the scene graph from a long-term perspective, e.g., using a GUI, by directly editing the edges and nodes in $\mathcal{G}$ to interact with the edge database $\Gamma$ and the nodes.

V-C Fine Stage: Graph to Scene Generation

SG-Bot stacks the architecture of Graph-to-3D [19] to generate a plausible goal scene. Graph-to-3D conditions on the latent shape-aware scene graph denoted as $\mathcal{G}^{\beta}_{z}$, which evolves from $\mathcal{G}$ and ensures the coherent shape transformation from the initial scene to the goal scene.

Shape auto-encoders. For this purpose, we first train two shape auto-encoder entities $\mathcal{A},\mathcal{B}$ of AtlasNet [72] for different usages, as shown in Fig. 3.a. We train $\mathcal{A}(\mathcal{A}_E,\mathcal{A}_D)$ with full points under canonical view, whose encoder $\mathcal{A}_E$ offers shape codes $\alpha$ for the subsequent training of Graph-to-3D. $\mathcal{B}(\mathcal{B}_E,\mathcal{B}_D)$ is trained with normalized object points under camera view in initial scenes to have initial shape priors $\beta$. The encoder $\mathcal{B}_E$ of $\mathcal{B}$ is preserved to produce $\beta$ during the training of Graph-to-3D and the final SG-Bot workflow. The training process of $\mathcal{A},\mathcal{B}$ aligns with the original AtlasNet.

Scene generative model. After obtaining $\alpha$ and $\beta$, the training of Graph-to-3D starts with embedding $\mathcal{G}$ shown in Fig. 3.b. The category information $c_i\in\mathcal{C}^{node}$ for the $i$-th node is obtained by passing its textual information $o_i$ through node embedding layers $\mathcal{M}_O$, while $c_{i\to j}\in\mathcal{C}^{edge}$ is obtained by edge embedding layers $\mathcal{M}_\Gamma$ with $\gamma_{i\to j}$. Based on $\mathcal{G}\mapsto\mathcal{G}=\{\mathcal{V}(\mathcal{C}^{node}),\mathcal{E}(\mathcal{C}^{edge})\}$, Graph-to-3D, a subsequent dual-branch GCN architecture, is trained by modeling the layout-shape joint distribution $Z$ of goal scenes. As shown in Fig. 3.c, in training, the shape branch $\mathit{\Phi}(\mathit{\Phi}_E,\mathit{\Phi}_D)$ requires the graph to be augmented with ground truth shape codes $\alpha$ in goal scenes as input, whose output $\hat{\alpha}$ is supervised by the same shape codes. In the meantime, the layout branch $\mathcal{L}(\mathcal{L}_E,\mathcal{L}_D)$ takes the scene graph with ground truth bounding boxes $B=\{b_i \mid i=1,\ldots,M\}$ as input and the supervision labels. The two branches interact with each other in the bottleneck to model a latent graph $\mathcal{G}_z$, which shares the idea of the latent code in the VAE [73]. $\mathcal{G}_z=\{\mathcal{V}(z,\mathcal{C}^{node}),\mathcal{E}(\mathcal{C}^{edge})\}$ consists of $\mathcal{G}$ with a code $z$ sampled from the modeled $Z$. More details can be found in [19]. Here, we change $\mathcal{G}_z$ to $\mathcal{G}^{\beta}_{z}$ by offering each node its shape prior $\beta$ extracted from its counterpart in the initial scene, i.e., $\mathcal{G}^{\beta}_{z}=\{\mathcal{V}(z,\beta,\mathcal{C}^{node}),\mathcal{E}(\mathcal{C}^{edge})\}$, to make $\hat{\alpha}$ and $\hat{b}$ aware of initial shapes.

Controllable scene imagination. After training, we generate the desired goal scene $\mathcal{S}^{*}$ conditioned on $\mathcal{G}^{\beta}_{z}$, as shown in Fig. 2.b. This is accomplished through the combination of the code decoder $\mathit{\Phi}_D$, shape decoder $\mathcal{A}_D$, and layout decoder $\mathcal{L}_D$:

$$S=\mathcal{A}_D(\hat{\alpha}),\quad \hat{\alpha}=\mathit{\Phi}_D(\mathcal{G}_z^{\beta}),\quad \hat{\alpha}=\{\hat{\alpha}_i \mid i=1,\ldots,M\}, \tag{1a}$$
$$\hat{B}=\mathcal{L}_D(\mathcal{G}_z^{\beta}),\quad \hat{B}=\{\hat{b}_i \mid i=1,\ldots,M\}, \tag{1b}$$

where $\hat{\alpha}$ denotes the set of estimated shape codes, and $S$ is the set of normalized shapes decoded from $\hat{\alpha}$. $\hat{B}$ denotes the layout of object bounding boxes in the desired scene $\mathcal{S}^{*}$. $S$ is then transformed and populated into $\hat{B}$ to synthesize $\mathcal{S}^{*}$.
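Populating one decoded shape into its predicted box amounts to a scale, a rotation, and a translation. The sketch below assumes a box parameterization the text does not spell out: a center, an axis-aligned size, and a yaw angle about the vertical axis, with the normalized shape centered at the origin with unit extent. The function name `populate` is hypothetical.

```python
import numpy as np


def populate(shape, center, size, yaw):
    """Place a normalized shape (N, 3) into a box given by center, size, yaw."""
    c, s = np.cos(yaw), np.sin(yaw)
    # Rotation about the vertical (z) axis; this box convention is an assumption.
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    # Scale to the box size, rotate, then move to the box center.
    return (shape * size) @ R.T + center
```

Applying this to every pair of shape and box in $S$ and $\hat{B}$ would give the dense per-object point clouds that make up the goal scene.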

Figure 3: Modular Training. a) $\mathcal{A}_E,\mathcal{A}_D$ are trained using full shapes in the canonical view to have the shape code $\alpha$, while $\mathcal{B}_E,\mathcal{B}_D$ are trained on partial shapes in the initial scenes under the camera view to have the shape priors $\beta$. $\mathcal{A}_D$ and $\mathcal{B}_E$ are retained during inference. b) A scene graph with textual information is processed through embedding layers $\mathcal{M}_O,\mathcal{M}_\Gamma$ to have implicit class features $c_i,c_{i\to j}$ on each node and edge. c) For training Graph-to-3D on goal scenes, the processed scene graph is first concatenated with $\alpha$ and bounding box parameters $B$ on the shape branch $\mathit{\Phi}(\mathit{\Phi}_E,\mathit{\Phi}_D)$ and layout branch $\mathcal{L}(\mathcal{L}_E,\mathcal{L}_D)$ respectively. $\mathit{\Phi}$ and $\mathcal{L}$ jointly model the layout-shape distribution $Z$ [19]. This model incorporates $\beta$ from initial scenes to create $\mathcal{G}_z^{\beta}$, subsequently estimating $\hat{\alpha}$ and $\hat{B}$. Modules in b) and c) are jointly trained, with $\mathcal{M}_O,\mathcal{M}_\Gamma$, $\mathit{\Phi}_D$ and $\mathcal{L}_D$ used during inference.

V-D Advantages of Coarse-to-Fine Scheme

SG-Bot features three key advantages: First, in the coarse stage, it utilizes a scene graph as an intermediary form of the target scene. This graph allows for multiple relationships between any two objects and enhances natural and intuitive human-computer interaction. Users can intuitively perceive the spatial distribution of objects within the scene through a 2D graphical scene graph, enabling direct editing through a GUI. Second, leveraging the scene graph as an intermediate representation allows for the seamless integration of commonsense knowledge, enabling automated scene rearrangement. Third, in the fine stage, we introduce the generative model to supplement missing fine-grained details, such as object shapes and poses, in the scene graph representation. This guides the robot in performing precise operations.

V-E Goal-Guided Object Matching and Manipulation

After obtaining $\mathcal{S}^*$, SG-Bot performs object matching by point cloud registration and rearranges objects after occupancy check in each round, as shown in Fig. 2.c, transferring $\mathcal{S}_0$ to $\mathcal{S}^*$. We illustrate the process with the first round:

Object matching. SG-Bot compares $\mathcal{S}^*$ with the initial scene $\mathcal{S}_0$ to calculate the necessary transformation $\mathbf{T} = [\mathbf{R}|\mathbf{t}]$ for each object, where $\mathbf{R} \in \mathbb{R}^{3 \times 3}$ and $\mathbf{t} \in \mathbb{R}^{3}$ represent rotation and translation respectively. Therefore, in this module, the objective can be defined as,

$$\left[\mathbf{R}^{*}, \mathbf{t}^{*}\right] = \underset{\mathbf{R},\mathbf{t}}{\operatorname{argmin}} \sum_{i=1}^{N_P} \left( \min_{q \in Q} \left\| \mathbf{R} p_i + \mathbf{t} - q \right\|^{2} \right) + I_{SO(3)}(\mathbf{R}), \tag{2}$$

where $\mathbf{R}^{*}$ and $\mathbf{t}^{*}$ represent the optimal rotation and translation parameters we aim to find. $p_i$ denotes one of the $N_P$ points in object $P$ of initial scene $\mathcal{S}_0$. After transforming $p_i$ from $\mathcal{S}_0$ to the goal scene $\mathcal{S}^*$ with $\mathbf{R}, \mathbf{t}$, its corresponding nearest point in $\mathcal{S}^*$ is denoted as $q$ inside object $Q$. $I_{SO(3)}(\mathbf{R})$ enforces that $\mathbf{R}$ lies in the special orthogonal group $SO(3)$ [74]. Since the generated objects in the goal scene are dense and complete, we observe that vanilla ICP can effectively solve the problem in Eq. 2 when provided with a well-suited initialization.

Given an object $P$ from the initial scene $\mathcal{S}_0$, its goal location is indicated by the generated object $Q$ in $\mathcal{S}^*$. We initialize the pose $\mathbf{T}$ by first centralizing each point cloud and then uniformly generating candidate rotations. We represent rotation using angles around the $x$, $y$, and $z$ axes, dividing the interval of each axis's rotation angle $[-\pi, \pi]$ into $n$ segments, resulting in a total of $n^3$ candidate rotations, where $n = 5$ in the implementation. Finally, we apply ICP to estimate $\mathbf{R}^{*}, \mathbf{t}^{*}$, where $\mathbf{t}$ is initialized with the zero vector, while $\mathbf{R}$ is initialized with each candidate rotation in turn. This yields one ICP outcome per candidate rotation, i.e., $n^3$ outcomes in total. We select the solution that minimizes Eq. 2 as the final result.
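The initialization-and-selection procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the Euler composition order $R_z R_y R_x$, taking the left endpoint of each of the $n$ angle segments, the brute-force nearest-neighbor search, the SVD-based (Kabsch) update, and the fixed iteration count are all choices made here, and every function name is hypothetical.

```python
import itertools
import numpy as np


def euler_to_matrix(rx, ry, rz):
    """Rotation R = Rz @ Ry @ Rx from angles (radians) about x, y, z (assumed order)."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx


def candidate_rotations(n=5):
    """One angle per segment of [-pi, pi] on each axis, giving n**3 rotations."""
    angles = np.linspace(-np.pi, np.pi, n, endpoint=False)
    return [euler_to_matrix(*a) for a in itertools.product(angles, repeat=3)]


def nearest(P, Q):
    """For each row of P, the closest row of Q and the squared distance to it."""
    d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)
    return Q[idx], d2[np.arange(len(P)), idx]


def best_rigid_fit(P, Qm):
    """Closed-form R in SO(3) and t minimizing sum ||R p_i + t - q_i||^2."""
    mp, mq = P.mean(0), Qm.mean(0)
    U, _, Vt = np.linalg.svd((P - mp).T @ (Qm - mq))
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    return R, mq - R @ mp


def icp(P, Q, R0, iters=30):
    """Vanilla point-to-point ICP started from rotation R0 and t = 0."""
    R, t = R0, np.zeros(3)
    for _ in range(iters):
        Qm, _ = nearest(P @ R.T + t, Q)
        R, t = best_rigid_fit(P, Qm)
    _, d2 = nearest(P @ R.T + t, Q)
    return R, t, d2.sum()  # d2.sum() is the data term of Eq. 2


def register(P, Q, n=5):
    """Centralize both clouds, run ICP from every candidate, keep the lowest cost."""
    cp, cq = P.mean(0), Q.mean(0)
    runs = [icp(P - cp, Q - cq, R0) for R0 in candidate_rotations(n)]
    R, t, cost = min(runs, key=lambda run: run[2])
    return R, t + cq - R @ cp, cost  # express t in the original frame
```

With $n = 5$ this launches ICP from 125 starting rotations, so for dense clouds a spatial index such as a KD-tree would be a more practical choice than the brute-force search used in this sketch.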

Figure 4: Visualization results in simulation. We compare SG-Bot with state-of-the-art methods StructFormer [13] and Socratic Models [16]. We highlight the superiority of SG-Bot via rectangles.
TABLE I: Performance evaluation on three aspects: rearrangement errors ($rad$, $cm$), success rate (%) and scene fidelity. $R_\text{e}$, $t_\text{e}$, $R_\text{f}$, $t_\text{f}$ are rearrangement errors (↓), $\text{IoU}_{0.25}$ and $\text{IoU}_{0.50}$ are success rates (↑), and FID and FID-CLIP measure scene fidelity (↓).

| Method | $R_\text{e}$ ↓ | $t_\text{e}$ ↓ | $R_\text{f}$ ↓ | $t_\text{f}$ ↓ | $\text{IoU}_{0.25}$ ↑ | $\text{IoU}_{0.50}$ ↑ | FID ↓ | FID-CLIP ↓ |
|---|---|---|---|---|---|---|---|---|
| StructFormer [13] | 0.28 | 10.58 | 0.18 | 11.17 | 28.03 | 14.01 | 91.46 | 6.32 |
| Socratic Models [16] | - | 12.09 | - | 13.36 | 43.71 | 36.58 | 86.46 | 6.96 |
| SG-Bot (Ours) | 0.38 | 4.49 | 0.09 | 4.61 | 53.92 | 34.20 | 58.29 | 3.91 |

Object manipulation. To determine the final robot action, we select an object $P$ from $\mathcal{S}_0$ and check for occupancy: We measure the point-wise $L_2$ distance between its counterpart $Q$ in $\mathcal{S}^*$ and all objects in $\mathcal{S}_0$. If the shortest distance $d$ is smaller than a set threshold $\sigma$, it implies a potential collision. We then bypass moving $P$ and evaluate the next object. This continues until an object with $d > \sigma$ is found, which is then moved to the target pose by its $\mathbf{T}$.

The rearrangement ends in this manner when all objects are in their ideal poses.
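The round-by-round occupancy check can be sketched as follows. This is a hedged illustration rather than the authors' code: it assumes that the object being tested is excluded from its own occupancy check, that a move is modeled by simply swapping in the goal point cloud, and that the loop stops if every pending goal is blocked. The threshold value and all names are hypothetical.

```python
import numpy as np


def min_distance(A, B):
    """Shortest point-wise L2 distance between point clouds A and B."""
    diff = A[:, None, :] - B[None, :, :]
    return float(np.sqrt((diff ** 2).sum(-1)).min())


def next_movable(current, goals, done, sigma):
    """First pending object whose goal region is unoccupied, or None."""
    for name in current:
        if name in done:
            continue
        # Assumption: compare the goal cloud Q against all *other* objects.
        others = [pts for other, pts in current.items() if other != name]
        d = min((min_distance(goals[name], pts) for pts in others), default=np.inf)
        if d > sigma:
            return name
    return None


def rearrange(current, goals, sigma=0.01):
    """Move one object per round; return the order in which objects were moved."""
    current, done, order = dict(current), set(), []
    while len(done) < len(current):
        name = next_movable(current, goals, done, sigma)
        if name is None:  # every pending goal is occupied (assumed stop condition)
            break
        current[name] = goals[name]  # stand-in for executing the transform T
        done.add(name)
        order.append(name)
    return order
```

In this sketch an object whose goal is blocked in one round becomes movable in a later round once the blocking object has been relocated, which is the behavior the skipping rule is meant to produce.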

VI Experiment

VI-A Implementation Details

Dataset. We collect a synthetic dataset containing 1,042 realistic initial-goal RGB-D scene pairs with scene graph labels. First, we mix the meshes in Google Scanned Objects [75] and HouseCat6D [76] as the object database. Then, we randomly place objects on the tables to render the initial scenes into images using NVISII [77]. The goal scenes are set up using the rules mentioned in Sec. V-B. Then, we construct scene graph labels by comparing the spatial relations of the objects following [23, 33]. We define six types of relations as the edge class database $\Gamma$, including spatial, proximity, and support information, representing the User-defined mode.
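To make the idea of deriving edge labels from object geometry concrete, here is a small sketch. The relation names, thresholds, and object format are placeholders invented for illustration; they are not the six edge classes in $\Gamma$, which this section does not enumerate. The sketch only shows how spatial, proximity, and support relations could coexist on a single directed edge.

```python
def edge_labels(a, b, near=0.15, contact_tol=0.01):
    """Return every (hypothetical) relation that holds from object a to object b.

    Objects are dicts with 'center' (x, y, z) and 'size' (dx, dy, dz) in meters.
    """
    (ax, ay, az), (bx, by, bz) = a["center"], b["center"]
    labels = ["left of" if ax < bx else "right of"]          # spatial
    planar = ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5
    if planar < near:                                        # proximity
        labels.append("close to")
    # Vertical gap between the bottom face of a and the top face of b.
    gap = (az - a["size"][2] / 2) - (bz + b["size"][2] / 2)
    inside_footprint = planar < min(b["size"][0], b["size"][1]) / 2
    if abs(gap) < contact_tol and inside_footprint:          # support
        labels.append("standing on")
    return labels


def build_scene_graph(objects):
    """Directed multigraph {(src, dst): [labels]} over all ordered object pairs."""
    return {
        (i, j): edge_labels(objects[i], objects[j])
        for i in objects
        for j in objects
        if i != j
    }
```

For a cup resting on a plate, the edge from cup to plate would carry a spatial, a proximity, and a support label at once, while the reverse edge would carry only the first two.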

Trainval setup. We use 952 scenes as the training split and 90 scenes as the validation (test) split. All modules in our pipeline are trained on a single NVIDIA 3090 GPU. We adopt the Adam optimizer with an initial learning rate of 1e-4 to train each module. $\mathcal{A}$ is trained for 500 epochs on the meshes in the training split. $\mathcal{B}$ is trained for 5 epochs on the partial point clouds of each object in the training split. $\mathcal{M}_O, \mathcal{M}_\Gamma, \mathit{\Phi}, \mathcal{L}$ are jointly trained for 600 epochs.


Figure 5: Real-world experiment. a) We tested unseen cross-category objects with a physical manipulator. b) Action decomposition of one trial during the rearrangement.

VI-B Evaluation Protocols

Baselines. We reproduce two methods representing different routines on the dataset for the comparison: First, StructFormer [13], a transformer-based method that autoregressively transforms objects to the goal state based on the current observation and previous states, is fully trained on our dataset. Second, Socratic Models [16], an LLM-based method that connects an object detection module [2], GPT [6], and the motion planning method CLIPort [78] in series, where we use text-davinci-002 as the LLM and train CLIPort solely on our dataset. All training and evaluation procedures use the same trainval splits as our method. More details about the reproduction can be found on our project website.

Metrics. First, for evaluating the rearrangement accuracy, we report the errors of estimated rotation $R_\text{e}$ and translation $t_\text{e}$ comparing final positions with ground truth following [13]. We also report the errors of final poses $(R_\text{f}, t_\text{f})$, as the final states of rearrangement are slightly different from the predicted ones because of the table-object physical interaction. Second, for the rearrangement success rate, we calculate the IoU between the bounding boxes of rearranged and ground truth objects. A placement counts as a success if IoU $> \sigma$, with $\sigma \in \{0.25, 0.50\}$. Note that this is a strict metric: objects tend to be tiny, so even a small misalignment can cause failure. Additionally, inspired by some research on indoor scene synthesis [79, 80, 33], we believe that measuring the fidelity of the rearranged scene is critical for evaluating global performance. For this, we render rearranged scenes of all methods and ground truth scenes under a specific viewpoint, and then we employ the commonly adopted Fréchet Inception Distance (FID) [81] and recent FID-CLIP [82].
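The success-rate metric can be illustrated with a short sketch. It assumes axis-aligned 3D boxes given as (min corner, max corner) in centimeters; the section does not state how the bounding boxes are parameterized, so this is an assumption, and the function names are hypothetical.

```python
def box_volume(lo, hi):
    """Volume of an axis-aligned box with corners lo and hi."""
    return (hi[0] - lo[0]) * (hi[1] - lo[1]) * (hi[2] - lo[2])


def box_iou(a, b):
    """IoU of two axis-aligned boxes, each given as (min_xyz, max_xyz)."""
    (amin, amax), (bmin, bmax) = a, b
    inter = 1.0
    for lo_a, hi_a, lo_b, hi_b in zip(amin, amax, bmin, bmax):
        inter *= max(0.0, min(hi_a, hi_b) - max(lo_a, lo_b))  # overlap per axis
    union = box_volume(amin, amax) + box_volume(bmin, bmax) - inter
    return inter / union if union > 0 else 0.0


def success_rate(pred_boxes, gt_boxes, threshold):
    """Percentage of objects whose IoU with ground truth exceeds the threshold."""
    hits = sum(box_iou(p, g) > threshold for p, g in zip(pred_boxes, gt_boxes))
    return 100.0 * hits / len(gt_boxes)


# A 4 cm cube placed 2 cm off target along one axis.
gt = ((0.0, 0.0, 0.0), (4.0, 4.0, 4.0))
pred = ((2.0, 0.0, 0.0), (6.0, 4.0, 4.0))
iou = box_iou(pred, gt)  # intersection 32, union 96, so IoU = 1/3
```

In this toy case a 2 cm offset on a 4 cm object already drops the IoU to one third, which passes the 0.25 threshold but fails the 0.50 threshold, illustrating why the metric is strict for small objects.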

Figure 6: Functional shape priors. Without shape priors, SG-Bot-dummy generates inconsistent shapes (left). SG-Bot controls the generated shapes close to the ground truth (right) with the help of initial shape priors (middle).

VI-C Simulation Experiments

We import meshes with their initial poses to a PyBullet environment [41] to evaluate each method. In the simulation, we leverage ground truth instance masks and remove the effect of the robotic low-level control.

Quantitative results. As shown in Table I, our method surpasses the previous approaches on most metrics by a large margin. SG-Bot obtains lower rearrangement errors on the final states and yields competitive success rates, indicating that SG-Bot achieves more accurate object-level rearrangement. For instance, SG-Bot reduces $R_\text{f}$ by 50.0% and $t_\text{f}$ by 58.7% compared with StructFormer [13]. When using $\text{IoU}_{0.25}$, SG-Bot improves the success rate by 10.21 percentage points compared with Socratic Models [16]. On the scene-level comparison, SG-Bot shows more fidelity in rearranged scenes than other methods, modeling a scene distribution closer to the ground truth, as supported by lower FID and FID-CLIP.
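The quoted improvements follow directly from the Table I entries. The short check below reproduces the arithmetic; the dictionary layout and function name are illustrative only.

```python
# Final-state errors and IoU_0.25 success rates copied from Table I.
table_one = {
    "StructFormer": {"R_f": 0.18, "t_f": 11.17},
    "Socratic Models": {"IoU_0.25": 43.71},
    "SG-Bot": {"R_f": 0.09, "t_f": 4.61, "IoU_0.25": 53.92},
}


def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline


ours = table_one["SG-Bot"]
sf = table_one["StructFormer"]
sm = table_one["Socratic Models"]

r_f_drop = relative_reduction(sf["R_f"], ours["R_f"])   # relative, in percent
t_f_drop = relative_reduction(sf["t_f"], ours["t_f"])   # relative, in percent
iou_gain = ours["IoU_0.25"] - sm["IoU_0.25"]            # absolute, in percentage points
```

Note that the two error reductions are relative changes, whereas the success-rate gain is an absolute difference between two percentages.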

Qualitative results. We show several qualitative comparisons of rearranged scenes in Fig. 4, where our method shows clear advantages over the others. For example, in the first scene, the rearranged knife collides with the plate or the cup in StructFormer and Socratic Models, whereas our method places it more sensibly. In the last scene, our method separates objects at a sensible distance, while the others distribute them unevenly.

Ablation study. We ablate the shape priors, resulting in SG-Bot-dummy, a framework that only takes the original latent scene graph $\mathcal{G}_z$. As shown in Fig. 6, SG-Bot powered by $\mathcal{G}_z^{\beta}$ is more controllable than SG-Bot-dummy, generating shapes that are more consistent with the objects in the scenes. We also report quantitative comparisons in Table II.

TABLE II: Ablation on errors ($rad$, $cm$), success rate (%) and scene fidelity. $R_\text{f}$ and $t_\text{f}$ are errors (↓), $\text{IoU}_{0.25}$ and $\text{IoU}_{0.50}$ are success rates (↑), and FID and FID-CLIP measure scene fidelity (↓).

| Method | $R_\text{f}$ ↓ | $t_\text{f}$ ↓ | $\text{IoU}_{0.25}$ ↑ | $\text{IoU}_{0.50}$ ↑ | FID ↓ | FID-CLIP ↓ |
|---|---|---|---|---|---|---|
| SG-Bot-dummy | 0.09 | 4.86 | 46.32 | 27.08 | 64.28 | 4.20 |
| SG-Bot | 0.09 | 4.61 | 53.92 | 34.20 | 58.29 | 3.91 |

VI-D Real-world Experiments

We test SG-Bot in real-world scenarios using a 7-DoF Franka Panda robot with a parallel-jaw gripper as the end-effector. The sensor mounted on the gripper base is a RealSense L515 RGB-D camera. The framework runs on an NVIDIA 3080 laptop GPU. Unlike the strategy in the simulation, we use Contact-GraspNet [48] to generate appropriate grasps on each masked object and rearrange them by reasoning about the relative pose and executing the best grasp with MoveIt! [83]. Fig. 5 shows an example workflow from one of 5 rounds in which we test with unseen objects. More trials can be found on the project website. Our method maintains rearrangement performance consistent with that in the simulation.

VII CONCLUSIONS

In this paper, we present a novel robotic rearrangement framework, SG-Bot, which follows a three-phase procedure: observation, imagination, and execution to handle this task. With its unique coarse-to-fine design, SG-Bot embraces the synergy of commonsense priors and dynamic generation capabilities, all within a lightweight, real-time, and customizable pipeline. Extensive experiments on both simulation and real-world datasets demonstrate the superiority of SG-Bot. Future work will explore deformable point cloud matching for enhanced accuracy or accelerated point cloud alignment [84].

Acknowledgement

We are truly grateful for the reviews provided! Due to the page limit, we are not able to add more content, but we are very open to and feel excited for further discussions! We would like to thank Mr. Shun-Cheng Wu for the early discussion. We also would like to thank Ms. Chang Gao (open to graphic design jobs, email: [email protected]) for the teaser design.

References

  • [1] D. Batra, A. X. Chang, S. Chernova, A. J. Davison, J. Deng, V. Koltun, S. Levine, J. Malik, I. Mordatch, R. Mottaghi et al., “Rearrangement: A challenge for embodied ai,” 2020. [Online]. Available: https://arxiv.org/abs/2011.01975
  • [2] X. Gu, T.-Y. Lin, W. Kuo, and Y. Cui, “Open-vocabulary object detection via vision and language knowledge distillation,” in ICLR, 2022.
  • [3] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in ECCV, 2020.
  • [4] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., “Segment anything,” in ICCV, 2023.
  • [5] Y. Di, F. Manhardt, G. Wang, X. Ji, N. Navab, and F. Tombari, “So-pose: Exploiting self-occlusion for direct 6d pose estimation,” in ICCV, 2021.
  • [6] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” in NeurIPS, 2020.
  • [7] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann et al., “Palm: Scaling language modeling with pathways,” 2022. [Online]. Available: https://arxiv.org/abs/2204.02311
  • [8] Y. Jiang, A. Gupta, Z. Zhang, G. Wang, Y. Dou, Y. Chen, L. Fei-Fei, A. Anandkumar, Y. Zhu, and L. Fan, “Vima: General robot manipulation with multimodal prompts,” in ICML, 2023.
  • [9] C. Li, R. Zhang, J. Wong, C. Gokmen, S. Srivastava, R. Martín-Martín, C. Wang, G. Levine, M. Lingelbach, J. Sun et al., “Behavior-1k: A benchmark for embodied ai with 1,000 everyday activities and realistic simulation,” in CoRL, 2023.
  • [10] Y. Ding, X. Zhang, C. Paxton, and S. Zhang, “Task and motion planning with large language models for object rearrangement,” 2023. [Online]. Available: https://arxiv.org/abs/2303.06247
  • [11] A. Goyal, A. Mousavian, C. Paxton, Y.-W. Chao, B. Okorn, J. Deng, and D. Fox, “Ifor: Iterative flow minimization for robotic object rearrangement,” in CVPR, 2022.
  • [12] W. Goodwin, S. Vaze, I. Havoutis, and I. Posner, “Semantically grounded object matching for robust robotic scene rearrangement,” in ICRA, 2022.
  • [13] W. Liu, C. Paxton, T. Hermans, and D. Fox, “Structformer: Learning spatial structure for language-guided semantic rearrangement of novel objects,” in ICRA, 2022.
  • [14] I. Kapelyukh, V. Vosylius, and E. Johns, “Dall-e-bot: Introducing web-scale diffusion models to robotics,” RA-L, vol. 8, no. 7, pp. 3956–3963, 2023.
  • [15] M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gopalakrishnan, K. Hausman et al., “Do as i can, not as i say: Grounding language in robotic affordances,” in CoRL, 2022.
  • [16] A. Zeng, M. Attarian, K. M. Choromanski, A. Wong, S. Welker, F. Tombari, A. Purohit, M. S. Ryoo, V. Sindhwani, J. Lee et al., “Socratic models: Composing zero-shot multimodal reasoning with language,” in ICLR, 2023.
  • [17] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn et al., “Rt-2: Vision-language-action models transfer web knowledge to robotic control,” in CoRL, 2023.
  • [18] D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu et al., “Palm-e: An embodied multimodal language model,” in ICML, 2023.
  • [19] H. Dhamo, F. Manhardt, N. Navab, and F. Tombari, “Graph-to-3d: End-to-end generation and manipulation of 3d scenes using scene graphs,” in ICCV, 2021.
  • [20] X. Chang, P. Ren, P. Xu, Z. Li, X. Chen, and A. Hauptmann, “A comprehensive survey of scene graphs: Generation and application,” T-PAMI, vol. 45, no. 1, pp. 1–26, 2021.
  • [21] J. Johnson, R. Krishna, M. Stark, L.-J. Li, D. Shamma, M. Bernstein, and L. Fei-Fei, “Image retrieval using scene graphs,” in CVPR, 2015.
  • [22] J. Johnson, A. Gupta, and L. Fei-Fei, “Image generation from scene graphs,” in CVPR, 2018.
  • [23] J. Wald, H. Dhamo, N. Navab, and F. Tombari, “Learning 3d semantic scene graphs from 3d indoor reconstructions,” in CVPR, 2020.
  • [24] I. Armeni, Z.-Y. He, J. Gwak, A. R. Zamir, M. Fischer, J. Malik, and S. Savarese, “3d scene graph: A structure for unified semantics, 3d space, and camera,” in ICCV, 2019.
  • [25] H. Dhamo, A. Farshad, I. Laina, N. Navab, G. D. Hager, F. Tombari, and C. Rupprecht, “Semantic image manipulation using scene graphs,” in CVPR, 2020.
  • [26] S.-C. Wu, J. Wald, K. Tateno, N. Navab, and F. Tombari, “Scenegraphfusion: Incremental 3d scene graph prediction from rgb-d sequences,” in CVPR, 2021.
  • [27] D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei, “Scene graph generation by iterative message passing,” in CVPR, 2017.
  • [28] M. Fessenden, “Scenegraph,” https://github.com/mfessenden/SceneGraph, 2017.
  • [29] L. Yang, Z. Huang, Y. Song, S. Hong, G. Li, W. Zhang, B. Cui, B. Ghanem, and M.-H. Yang, “Diffusion-based scene graph to image generation with masked contrastive pre-training,” 2022. [Online]. Available: https://arxiv.org/abs/2211.11138
  • [30] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma et al., “Visual genome: Connecting language and vision using crowdsourced dense image annotations,” IJCV, vol. 123, pp. 32–73, 2017.
  • [31] A. Rosinol, A. Violette, M. Abate, N. Hughes, Y. Chang, J. Shi, A. Gupta, and L. Carlone, “Kimera: From slam to spatial perception with 3d dynamic scene graphs,” IJRR, vol. 40, no. 12-14, pp. 1510–1546, 2021.
  • [32] A. Luo, Z. Zhang, J. Wu, and J. B. Tenenbaum, “End-to-end optimization of scene layout,” in CVPR, 2020.
  • [33] G. Zhai, E. P. Örnek, S.-C. Wu, Y. Di, F. Tombari, N. Navab, and B. Busam, “Commonscenes: Generating commonsense 3d indoor scenes with scene graphs,” in NeurIPS, 2023.
  • [34] B. Tang and G. S. Sukhatme, “Selective object rearrangement in clutter,” in CoRL, 2023.
  • [35] Y. Zhu, J. Tremblay, S. Birchfield, and Y. Zhu, “Hierarchical planning for long-horizon manipulation with geometric and symbolic scene graphs,” in ICRA, 2021.
  • [36] K. Rana, J. Haviland, S. Garg, J. Abou-Chakra, I. Reid, and N. Suenderhauf, “Sayplan: Grounding large language models using 3d scene graphs for scalable task planning,” in CoRL, 2023.
  • [37] A. Cosgun, T. Hermans, V. Emeli, and M. Stilman, “Push planning for object placement on cluttered table surfaces,” in IROS, 2011.
  • [38] J. E. King, M. Cognetti, and S. S. Srinivasa, “Rearrangement planning using object-centric and robot-centric action spaces,” in ICRA, 2016.
  • [39] J. E. King, V. Ranganeni, and S. S. Srinivasa, “Unobservable monte carlo planning for nonprehensile rearrangement tasks,” in ICRA, 2017.
  • [40] J. Lee, Y. Cho, C. Nam, J. Park, and C. Kim, “Efficient obstacle rearrangement for object manipulation tasks in cluttered environments,” in ICRA, 2019.
  • [41] E. Coumans and Y. Bai, “Pybullet, a python module for physics simulation for games, robotics and machine learning,” http://pybullet.org, 2016–2021.
  • [42] C. Wang, D. Xu, Y. Zhu, R. Martín-Martín, C. Lu, L. Fei-Fei, and S. Savarese, “Densefusion: 6d object pose estimation by iterative dense fusion,” in CVPR, 2019.
  • [43] S. Peng, Y. Liu, Q. Huang, X. Zhou, and H. Bao, “Pvnet: Pixel-wise voting network for 6dof pose estimation,” in CVPR, 2019.
  • [44] F. Manhardt, D. M. Arroyo, C. Rupprecht, B. Busam, T. Birdal, N. Navab, and F. Tombari, “Explaining the ambiguity of object detection and 6d pose from visual data,” in ICCV, 2019.
  • [45] R. Zhang, Y. Di, F. Manhardt, F. Tombari, and X. Ji, “Ssp-pose: Symmetry-aware shape prior deformation for direct category-level object pose estimation,” in IROS, 2022.
  • [46] Y. Di, R. Zhang, Z. Lou, F. Manhardt, X. Ji, N. Navab, and F. Tombari, “Gpv-pose: Category-level object pose estimation via geometry-guided point-wise voting,” in CVPR, 2022.
  • [47] G. Zhai, Y. Zheng, Z. Xu, X. Kong, Y. Liu, B. Busam, Y. Ren, N. Navab, and Z. Zhang, “DA$^2$ dataset: Toward dexterity-aware dual-arm grasping,” RA-L, vol. 7, no. 4, pp. 8941–8948, 2022.
  • [48] M. Sundermeyer, A. Mousavian, R. Triebel, and D. Fox, “Contact-graspnet: Efficient 6-dof grasp generation in cluttered scenes,” in ICRA, 2021.
  • [49] G. Zhai, D. Huang, S.-C. Wu, H. Jung, Y. Di, F. Manhardt, F. Tombari, N. Navab, and B. Busam, “Monograspnet: 6-dof grasping with a single rgb image,” in ICRA, 2023.
  • [50] R. Wang, K. Gao, D. Nakhimovich, J. Yu, and K. E. Bekris, “Uniform object rearrangement: From complete monotone primitives to efficient non-monotone informed search,” in ICRA, 2021.
  • [51] S. H. Cheong, B. Y. Cho, J. Lee, C. Kim, and C. Nam, “Where to relocate?: Object rearrangement inside cluttered and confined environments for robotic manipulation,” in ICRA, 2020.
  • [52] K. Gao, D. Lau, B. Huang, K. E. Bekris, and J. Yu, “Fast high-quality tabletop rearrangement in bounded workspace,” in ICRA, 2022.
  • [53] M. Savva, A. Kadian, O. Maksymets, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik et al., “Habitat: A platform for embodied ai research,” in ICCV, 2019.
  • [54] A. Szot, A. Clegg, E. Undersander, E. Wijmans, Y. Zhao, J. Turner, N. Maestre, M. Mukadam, D. S. Chaplot, O. Maksymets et al., “Habitat 2.0: Training home assistants to rearrange their habitat,” in NeurIPS, 2021.
  • [55] E. Kolve, R. Mottaghi, W. Han, E. VanderBilt, L. Weihs, A. Herrasti, M. Deitke, K. Ehsani, D. Gordon, Y. Zhu et al., “Ai2-thor: An interactive 3d environment for visual ai,” 2017. [Online]. Available: https://arxiv.org/abs/1712.05474
  • [56] S. James, Z. Ma, D. R. Arrojo, and A. J. Davison, “Rlbench: The robot learning benchmark & learning environment,” RA-L, vol. 5, no. 2, pp. 3019–3026, 2020.
  • [57] F. Xiang, Y. Qin, K. Mo, Y. Xia, H. Zhu, F. Liu, M. Liu, H. Jiang, Y. Yuan, H. Wang et al., “Sapien: A simulated part-based interactive environment,” in CVPR, 2020.
  • [58] B. Shen, F. Xia, C. Li, R. Martín-Martín, L. Fan, G. Wang, C. Pérez-D’Arpino, S. Buch, S. Srivastava, L. Tchapmi et al., “igibson 1.0: A simulation environment for interactive tasks in large realistic scenes,” in IROS, 2021.
  • [59] M. Wu, F. Zhong, Y. Xia, and H. Dong, “Targf: Learning target gradient field to rearrange objects without explicit goal specification,” in NeurIPS, 2022.
  • [60] Q. A. Wei, S. Ding, J. J. Park, R. Sajnani, A. Poulenard, S. Sridhar, and L. Guibas, “Lego-net: Learning regular rearrangements of objects in rooms,” in CVPR, 2023.
  • [61] N. Gkanatsios, A. Jain, Z. Xian, Y. Zhang, C. Atkeson, and K. Fragkiadaki, “Energy-based models as zero-shot planners for compositional scene rearrangement,” in RSS, 2023.
  • [62] I. Kapelyukh, Y. Ren, I. Alzugaray, and E. Johns, “Dream2real: Zero-shot 3d object rearrangement with vision-language models,” in ICRA, 2024.
  • [63] I. Singh, V. Blukis, A. Mousavian, A. Goyal, D. Xu, J. Tremblay, D. Fox, J. Thomason, and A. Garg, “Progprompt: Generating situated robot task plans using large language models,” in ICRA, 2023.
  • [64] J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. Florence, and A. Zeng, “Code as policies: Language model programs for embodied control,” in ICRA, 2023.
  • [65] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with clip latents,” 2022. [Online]. Available: https://arxiv.org/abs/2204.06125
  • [66] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., “Evaluating large language models trained on code,” 2021. [Online]. Available: https://arxiv.org/abs/2107.03374
  • [67] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu et al., “Rt-1: Robotics transformer for real-world control at scale,” 2022. [Online]. Available: https://arxiv.org/abs/2212.06817
  • [68] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in ICCV, 2017.
  • [69] P. J. Besl and N. D. McKay, “Method for registration of 3-d shapes,” in Sensor fusion IV: control paradigms and data structures, vol. 1611.   Spie, 1992, pp. 586–606.
  • [70] Z. Zhang, “Iterative point matching for registration of free-form curves and surfaces,” IJCV, vol. 13, no. 2, pp. 119–152, 1994.
  • [71] X. Chen, S. Jia, and Y. Xiang, “A review: Knowledge reasoning over knowledge graph,” Expert Systems with Applications, vol. 141, p. 112948, 2020.
  • [72] T. Groueix, M. Fisher, V. G. Kim, B. C. Russell, and M. Aubry, “A papier-mâché approach to learning 3d surface generation,” in CVPR, 2018.
  • [73] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in ICLR, 2014.
  • [74] J. Zhang, Y. Yao, and B. Deng, “Fast and robust iterative closest point,” T-PAMI, vol. 44, no. 7, pp. 3450–3466, 2021.
  • [75] L. Downs, A. Francis, N. Koenig, B. Kinman, R. Hickman, K. Reymann, T. B. McHugh, and V. Vanhoucke, “Google scanned objects: A high-quality dataset of 3d scanned household items,” in ICRA, 2022.
  • [76] H. Jung, G. Zhai, S.-C. Wu, P. Ruhkamp, H. Schieber, P. Wang, G. Rizzoli, H. Zhao, S. D. Meier, D. Roth, N. Navab et al., “Housecat6d–a large-scale multi-modal category level 6d object perception dataset with household objects in realistic scenarios,” 2022. [Online]. Available: https://arxiv.org/abs/2212.10428
  • [77] N. Morrical, J. Tremblay, S. Birchfield, and I. Wald, “NVISII: Nvidia scene imaging interface,” 2020, https://github.com/owl-project/NVISII/.
  • [78] M. Shridhar, L. Manuelli, and D. Fox, “Cliport: What and where pathways for robotic manipulation,” in CoRL, 2022.
  • [79] D. Ritchie, K. Wang, and Y.-a. Lin, “Fast and flexible indoor scene synthesis via deep convolutional generative models,” in CVPR, 2019.
  • [80] D. Paschalidou, A. Kar, M. Shugrina, K. Kreis, A. Geiger, and S. Fidler, “Atiss: Autoregressive transformers for indoor scene synthesis,” in NeurIPS, 2021.
  • [81] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” in NeurIPS, 2017.
  • [82] T. Kynkäänniemi, T. Karras, M. Aittala, T. Aila, and J. Lehtinen, “The role of imagenet classes in fréchet inception distance,” in ICLR, 2023.
  • [83] D. Coleman, I. Sucan, S. Chitta, and N. Correll, “Reducing the barrier to entry of complex robotic software: a moveit! case study,” 2014. [Online]. Available: https://arxiv.org/abs/1404.3785
  • [84] E. Malis, “Complete closed-form and accurate solution to pose estimation from 3d correspondences,” RA-L, vol. 8, no. 3, pp. 1786–1793, 2023.