Language-driven Grasp Detection

An Dinh Vuong1, Minh Nhat Vu2,∗, Baoru Huang3,∗, Nghia Nguyen1, Hieu Le1, Thieu Vo4, Anh Nguyen5
1FPT Software AI Center, Vietnam 2Automation & Control Institute, TU Wien, Austria 3Imperial College London, UK
4Ton Duc Thang University, Vietnam 5University of Liverpool, UK ∗Co-corresponding authors
https://airvlab.github.io/grasp-anything
Abstract

Grasp detection is a persistent and intricate challenge with various industrial applications. Recently, many methods and datasets have been proposed to tackle the grasp detection problem. However, most of them do not consider using natural language as a condition to detect the grasp poses. In this paper, we introduce Grasp-Anything++, a new language-driven grasp detection dataset featuring 1M samples, over 3M objects, and upwards of 10M grasping instructions. We utilize foundation models to create a large-scale scene corpus with corresponding images and grasp prompts. We approach the language-driven grasp detection task as a conditional generation problem. Drawing on the success of diffusion models in generative tasks and given that language plays a vital role in this task, we propose a new language-driven grasp detection method based on diffusion models. Our key contribution is the contrastive training objective, which explicitly contributes to the denoising process to detect the grasp pose given the language instructions. We show that our approach is theoretically supported. Extensive experiments show that our method outperforms state-of-the-art approaches and allows real-world robotic grasping. Finally, we demonstrate that our large-scale dataset enables zero-shot grasp detection and is a challenging benchmark for future work.

Figure 1: We present a new dataset and method for the language-driven grasp detection task.

1 Introduction

Imagine we want an assistant robot to grasp a cup among a clutter of daily objects such as a knife, a fork, a cup, and a pair of scissors. Conventionally, to convey the idea of grasping this specific object, humans use the natural language command, “give me the cup”, for instance. Although humans intuitively know how to grasp the cup given the linguistic command, determining specific grasp actions for objects based on natural language instructions or language-driven grasp detection remains challenging for robots [78]. First, natural language is usually overlooked in existing grasp datasets [69] while training vision-and-language neural networks necessitates an excessive number of labeled examples [80]. Second, recent works usually focus on particular manipulation tasks with limited objects [28], imposing a bottleneck for in-the-wild robot execution [77]. Finally, despite recent developments, bridging the gap between language, vision, and control for real-world robotic experiments remains a challenging task [100].

Recently, language-driven robotic frameworks are gaining traction, offering the potential for robots to process natural language, and bridging the gap between robotic manipulations and real-world human-robot interaction [63]. PaLM-E [24], EgoCOT [63], and ConceptFusion [38] are some notable embodied robots with the ability to comprehend natural language by harnessing the power of large foundation models such as ChatGPT [67]. However, most works assume the high-level actions of robots and ignore the fundamental grasping actions, restricting the structure for generalization across robotic domains, tasks, and skills [62]. In this paper, we explore training a language-driven agent to implement low-level actions, focusing on the task of object grasping via image observations. Specifically, our hypothesis is centered around the establishment of a robotic system that can execute grasping actions following a given language instruction for any universal object.

We first present Grasp-Anything++ to serve as a large-scale dataset for language-driven grasp detection. Our dataset is based on the Grasp-Anything dataset [87] and is synthesized from foundation models. Compared to the original Grasp-Anything dataset, we provide more than 10M grasp prompts, 3M associated object masks, and 6M ground truth poses at the object part level. Our dataset showcases the ability to facilitate grasp detection using language instructions. We label the ground truth at both the object level and part level, providing a comprehensive understanding of real-world scenarios. For example, our ground truth includes both general instructions such as “give me the knife” and detailed ones such as “grasp the handle of the steak knife”. We empirically show that our large-scale dataset successfully facilitates zero-shot grasp detection on both vision-based tasks and real-world robotic experiments.

To tackle the challenging language-driven grasp detection task, we propose a new diffusion model-based method. Our selection of diffusion models is motivated by their proven efficacy in conditional generation tasks [33]. These models have shown efficiency beyond image synthesis, including other image-based tasks such as image segmentation [92], and visual grounding [51]. Despite achieving notable success, integrating visual and text features effectively remains a challenge [14] as the majority of existing literature employs latent strategies to combine visual and text features [46]. We address this challenge by employing a new training strategy for learning text and image features, focusing on the use of feature maps as guidance information for grasp pose generation. Our main contribution is a new training objective that incorporates the feature maps and explicitly contributes to the denoising process. In summary, our contributions are three-fold:

  • We propose Grasp-Anything++, a large-scale language-driven dataset for grasp detection tasks.

  • We propose a diffusion model with a training objective that explicitly contributes to the denoising process to detect the grasp poses.

  • We demonstrate that our Grasp-Anything++ dataset and the proposed method outperform other approaches and enable successful robotic applications.

2 Related Work

Grasp Detection. Grasp detection is a popular task in both the computer vision and robotics communities [18, 65, 78, 2, 27]. Recently, establishing robotic systems with the ability to follow natural language commands has been actively researched [100, 78, 96]. The prevalent solution to the language-driven grasp detection task is to split it into two stages: one grounds the target object, and the other synthesizes grasp poses from the grounded visual-text correlations [96, 1]. Training in two stages may result in longer inference time [53]. In addition, several works [100, 98, 107] adopt foundation models, such as GroundDINO [54] and GPT-3 [8]. Access to such commercial foundation models is not always available [85], especially on robotic systems with limited resources or an unstable internet connection [47]. In our work, we directly train the model on the large-scale Grasp-Anything++ dataset to inherit the power of a foundation-based dataset, while ensuring a straightforward inference process for the downstream robotic applications.

Language-driven Grasp Detection Datasets. While many grasp datasets have been introduced [39, 18, 68, 49, 94, 57, 61, 25, 27, 60, 26, 10], the majority of them overlook the text modality. Therefore, grasping objects out of clutter typically suffers from ambiguity about which object to grasp [105]. DailyGrasp [100] is one of the first grasp datasets employing natural language for scene descriptions; however, the scene description corpus in this dataset is relatively small and does not specify which part of the object should be grasped. In our work, we present Grasp-Anything++, which is a large-scale language-driven grasping dataset. Furthermore, Grasp-Anything++ describes the grasping object at both the part level and object level, providing more information for the robot to execute the grasping [103].

Step | Role | Description
Scene Generation | User | Please help me generate scene descriptions for natural arrangements of daily objects. Each description has the following form: <Object_1><Object_2>…<Verb><Container_Object>. Please also ensure the incorporation of a rich and varied lexicon in the scene descriptions.
Scene Generation | Sample | A steel knife, a polished fork and a pristine ceramic plate on a wooden table.
Scene Generation | Text-to-Image | We use Stable Diffusion [74] to perform text-to-image generation.
Object Masking | User | For object part-level description, given an input list {<Object_1>, <Object_2>, …}, the output will be a list that describes the parts of objects as: {<Object_1>: [<Part_1.1>, <Part_1.2>, …], <Object_2>: [<Part_2.1>, <Part_2.2>, …]}.
Object Masking | Sample | {knife: [handle, blade], fork: [handle, neck, stem, tines], plate: [rim, base]}
Object Masking | Post Process | We use OFA [89] and SAM [42] to locate the region describing the objects.
Part Masking | User | Given the object list and part lists of each scene description, you will generate for me all prompts with the following format: {<Manipulation_Action><Object_ID><Part_ID>}. The part that is more suitable for human grasping is positioned at the start of the list to represent the grasping actions.
Part Masking | Sample | Give me the steel knife; Grasp the knife at its handle.
Part Masking | Post Process | We leverage VLPart [81] to locate the region describing the parts of objects.
Grasp Generation | User | Generate for me a scene description with grasp instructions following the templates.
Grasp Generation | Sample | Scene description: A steel knife, a polished fork and a pristine ceramic plate on a wooden table. Object list: {knife, fork, plate}. Part lists: {knife: [handle, blade], fork: [handle, neck, stem, tines], plate: [rim, base]}. Prompts: Give me the steel knife; Grasp the knife at its handle.
Grasp Generation | Grasp Labelling | We utilize a pretrained RAGT-3/3 [10] to generate grasp poses corresponding to the located region.
Table 1: Grasp-Anything++ creation pipeline. We utilize ChatGPT to generate scene descriptions and grasp instructions from the user input. We generate images given scene descriptions and automatically synthesize the grasp poses.

Diffusion Models for Robotic Applications. Diffusion models [33] have emerged as the new state-of-the-art method for generative tasks [101]. Recently, we have witnessed growing attention to utilizing diffusion models in robotic applications [75]. Liu et al. [55] propose a diffusion model to handle the language-guided object rearrangement task. Diffusion models are also applied to other robotic tasks such as motion planning [11] and trajectory optimization [37]. The authors in [84] present a diffusion model to determine grasp poses by minimizing an SDF loss. Overall, the diffusion models employed in previous works often combine visual and text features in a latent mechanism [7], which may cause interpretability problems [100] for robotic systems that require low-level controls [22]. To tackle this challenge, we propose a training objective that explicitly contributes to the denoising process. We demonstrate that our proposed strategy is theoretically supported and is more effective than the latent strategy.

3 The Grasp-Anything++ Dataset

We utilize large-scale foundation models to create Grasp-Anything++. Our dataset offers open-vocabulary grasping commands and images with associated ground truth. There are three key steps in establishing our dataset: i) prompting procedure, ii) image synthesis and grasp pose annotation, and iii) post-processing.

3.1 Prompting Procedure

We first establish prompt-based procedures to generate a large-scale scene description corpus as well as grasp prompt instructions. In particular, we utilize ChatGPT to generate the prompts for two tasks: i) Scene descriptions: Sentences capturing the scene arrangement, including the extracted object and part lists, and ii) Grasp instructions: Prompts directing the robot to grasp specific objects or parts.

We follow the procedure in Table 1 to implement ChatGPT’s output templates. The reference target in the grasp instruction may be either an object or an object’s part. When the reference is an object’s part, that part is directly selected as the reference in the grasp instruction sentence. If the reference is an object, we place the grasping region on the part of the object that is likely to be grasped in everyday scenarios, as described in affordance theory [66].
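The rule for resolving the reference target can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `resolve_grasp_target` and the `PART_LISTS` dictionary are hypothetical, and the sketch assumes, following the Part Masking prompt in Table 1, that the first entry of each part list is the part most suitable for human grasping.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical part lists in the format of the Table 1 sample; the first part
# of each list is assumed to be the one most suitable for human grasping.
PART_LISTS: Dict[str, List[str]] = {
    "knife": ["handle", "blade"],
    "fork": ["handle", "neck", "stem", "tines"],
    "plate": ["rim", "base"],
}


def resolve_grasp_target(
    obj: str, part: Optional[str], part_lists: Dict[str, List[str]]
) -> Tuple[str, str]:
    """Return the (object, part) pair whose region should be grasped."""
    parts = part_lists[obj]
    if part is not None:
        # Part-level instruction: the named part is used directly.
        if part not in parts:
            raise ValueError(f"{part!r} is not a listed part of {obj!r}")
        return obj, part
    # Object-level instruction: fall back to the first (most graspable) part.
    return obj, parts[0]
```

For instance, "Give me the steel knife" would correspond to `resolve_grasp_target("knife", None, PART_LISTS)`, while "Grasp the knife at its handle" names the part explicitly.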

3.2 Image Synthesis and Grasp Annotation

Image Synthesis. Given the scene description corpus, we first utilize a large-scale pretrained text-to-image model, namely Stable Diffusion [74], to generate images from scene descriptions. Next, we perform a series of visual grounding and image segmentation steps using OFA [89], Segment-Anything [42], and VLPart [81] to locate the object or part referenced in the grasp instruction.

Figure 2: Data statistics. (a) Number of categories. (b) Scene description length. (c) Number of object/part references. We analyze object categories and the use of natural language in the Grasp-Anything++ dataset.

Grasp Annotation. Grasp poses are represented as 2D rectangles, which is consistent with prior research and compatible in practice with real-world parallel grippers [39, 18]. Utilizing a pretrained network [10], we annotate grasp poses based on part segmentation masks. Since these candidate poses may be inaccurate, we follow the procedure defined in [87] to evaluate the quality of the generated grasp poses and discard unreasonable ones.

Specifically, grasp quality is evaluated through net torque $\mathcal{T}=\left(\tau_{1}+\tau_{2}\right)-RMg$, where resistance at contact points is $\tau_{i}=K\mu_{s}F\cos\alpha_{i}$. With constants such as $M$ (mass), $g$ (gravitational acceleration), $K$ (geometrical characteristics), $\mu_{s}$ (static friction coefficient), and $F$ (applied force), accurately determining $\mathcal{T}$ directly is challenging due to the physical difficulties in precisely measuring $M$, $K$, and $\mu_{s}$. Thus, we employ a surrogate measure, $\tilde{\mathcal{T}}=\dfrac{\cos\alpha_{1}+\cos\alpha_{2}}{R}$, as an alternative. As shown in [13], antipodal grasps score higher on $\tilde{\mathcal{T}}$, indicating better quality. Consequently, grasps are evaluated based on $\tilde{\mathcal{T}}$, with positive values indicating positive grasps and all others considered negative.
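The surrogate measure and the resulting positive/negative labelling can be sketched as follows. This is a minimal illustration, not the annotation code used for the dataset: the function names are hypothetical, the contact angles are assumed to be given in radians, and $R$ is assumed to be a positive length.

```python
import math


def surrogate_torque(alpha1: float, alpha2: float, R: float) -> float:
    """Surrogate grasp quality (cos(alpha1) + cos(alpha2)) / R.

    alpha1, alpha2: contact angles in radians (assumed unit).
    R: assumed to be a positive length.
    """
    if R <= 0:
        raise ValueError("R is assumed to be positive")
    return (math.cos(alpha1) + math.cos(alpha2)) / R


def is_positive_grasp(alpha1: float, alpha2: float, R: float) -> bool:
    """A grasp is labelled positive when the surrogate measure is positive."""
    return surrogate_torque(alpha1, alpha2, R) > 0.0
```

Under this sketch, a grasp with both contact angles at zero attains the largest score for a given $R$, in line with the statement that antipodal grasps score higher.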

Post Processing. Despite being trained on extensive datasets, Stable Diffusion [74] may produce subpar content, commonly termed hallucination [36], when generating images from text prompts. To address this, we manually review and filter out such images, with qualitative examples shown in our figures. Our process includes checks at every stage to prevent duplicate or hallucinated content. However, manual inspection introduces biases, which we counter with guidelines focusing on abnormal structures or implausible gravity (Fig. 3), in line with approaches in the literature [76].

Figure 3: Failure image generation cases. Images generated by Stable Diffusion [74] may exhibit hallucinatory artifacts such as sunglasses lacking a lens, scissors with an anomalous structure, and a spoon not resting properly on a table.

Additionally, ChatGPT-generated scene prompts often contain duplicates [56]. To address this, we apply duplication checking, filtering out near-identical prompts with BERTScore [106], which assesses sentence similarity through cosine similarities of token embeddings. We remove sentences with a BERTScore above 0.85, as in a prior study [90].
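To make the duplicate filter concrete, the sketch below computes a BERTScore-style F1 from two token-embedding matrices by greedy cosine matching and then keeps only prompts that score at or below the 0.85 threshold against every prompt already kept. It is a simplified illustration: the embeddings are assumed to be precomputed by some language model, the function names are hypothetical, and the greedy first-come filtering order is an assumption rather than a description of the authors' pipeline.

```python
from typing import List

import numpy as np


def bertscore_f1(cand: np.ndarray, ref: np.ndarray) -> float:
    """Greedy-matching F1 over token embeddings (one row per token)."""
    cand = cand / np.linalg.norm(cand, axis=1, keepdims=True)
    ref = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    sim = cand @ ref.T                   # pairwise cosine similarities
    precision = sim.max(axis=1).mean()   # best match for each candidate token
    recall = sim.max(axis=0).mean()      # best match for each reference token
    if precision + recall <= 0:
        return 0.0                       # no positive similarity at all
    return float(2 * precision * recall / (precision + recall))


def deduplicate(embeddings: List[np.ndarray], threshold: float = 0.85) -> List[int]:
    """Return indices of prompts kept after removing near-duplicates."""
    kept: List[int] = []
    for i, emb in enumerate(embeddings):
        if all(bertscore_f1(emb, embeddings[j]) <= threshold for j in kept):
            kept.append(i)
    return kept
```

The pairwise loop is quadratic in the number of prompts, so a corpus at the scale of the dataset would likely need batching or approximate nearest-neighbour search; the sketch only shows the scoring and thresholding logic.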

3.3 Data Statistics

Number of Categories. To evaluate object category diversity, we apply a methodology akin to that in [17]. Utilizing 300 categories from the LVIS dataset [29], we employ a pretrained model [108] to identify 300 candidate objects from our dataset for each category. We then curate a subset comprising 90,000 objects, refining it by excluding items that do not align semantically with their designated categories. A category is considered significant if it has more than 40 objects. Fig. 2(a) shows the results. Overall, our dataset spans over 236 categories from the LVIS dataset, indicating a notable degree of object diversity.

Scene Descriptions. Fig. 2(b) shows the distribution of scene descriptions based on sentence length. The analysis reveals a wide range of sentence lengths, spanning from 10 words to 100 words per sentence. On average, each scene description consists of approximately 54 words, indicative of detailed and descriptive sentences. These scene descriptions correspond to sets of grasp instructions. Fig. 2(c) further shows the objects and object parts in scene descriptions.

Diversity Analysis. We assess the diversity of occlusion and lighting conditions in the dataset. Regarding occlusion, we use a pretrained YOLOv5 model to identify objects within images. The results indicate that 93.8% of images have a substantial overlap of five or more bounding boxes, which suggests a diverse range of occlusion within the Grasp-Anything++ dataset. Regarding lighting conditions, we convert images to YCbCr to analyze the Y channel (luminance) and find that Grasp-Anything++ has the most diverse lighting conditions, as indicated by the lowest Gini coefficient (a metric that measures the inequality of a distribution) of 0.26, compared to VMRD [104] (0.31), OCID-grasp [2] (0.32), Cornell [39] (0.62), and Jacquard [18] (0.91).
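One plausible way to turn the lighting analysis into code is sketched below. The text does not say which distribution the Gini coefficient is computed over, so this sketch assumes it is the histogram of luminance values pooled over the dataset; the BT.601 weights for the Y channel, the bin count, and the function names are likewise assumptions. A flatter luminance histogram then yields a lower coefficient, matching the reading that lower values indicate more diverse lighting.

```python
from typing import Iterable

import numpy as np


def luminance(rgb: np.ndarray) -> np.ndarray:
    """Y channel of YCbCr (ITU-R BT.601 weights) for RGB values in [0, 255]."""
    return 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]


def gini(values: np.ndarray) -> float:
    """Gini coefficient of non-negative values (0 means perfectly even)."""
    x = np.sort(np.asarray(values, dtype=np.float64).ravel())
    n = x.size
    total = x.sum()
    if n == 0 or total == 0:
        return 0.0
    ranks = np.arange(1, n + 1)
    return float(2.0 * np.sum(ranks * x) / (n * total) - (n + 1) / n)


def lighting_gini(images: Iterable[np.ndarray], bins: int = 32) -> float:
    """Gini coefficient of the pooled luminance histogram (assumed procedure)."""
    y = np.concatenate([luminance(img).ravel() for img in images])
    y = np.clip(y, 0.0, 255.0)  # guard against floating-point overshoot
    hist, _ = np.histogram(y, bins=bins, range=(0.0, 255.0))
    return gini(hist)
```

A set of uniformly dark images concentrates all mass in one bin and scores close to 1, whereas images covering the full brightness range evenly score close to 0.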

4 Language-driven Grasp Detection

Motivation. The use of diffusion models for language-driven grasp detection is motivated by their efficiency in various generative tasks [33, 101, 44, 55, 16]. Conditional generation, such as our language-driven grasp detection task, aligns seamlessly with diffusion models’ capabilities [34]. Moreover, language-driven grasp detection represents a fine-grained problem in which the outputs strongly depend on the text input [4]. For example, “grasp the steak knife” and “grasp the kraft knife” refer to two different objects in the image. To this end, we propose using a contrastive loss with a diffusion model to tackle this task, as contrastive learning is a popular solution for fine-grained tasks [9, 21, 102].

4.1 Contrastive Loss for Diffusion Model

We represent the target grasp pose as $\mathbf{x}_{0}$ in the diffusion model. The objective of our diffusion process of language-driven grasp detection involves denoising from a noisy state $\mathbf{x}_{T}$ to the original grasp pose $\mathbf{x}_{0}$, conditioned on the input image and grasp instruction represented by $y$.

In a diffusion process [33], assume that $q(\mathbf{x}_{1:T}|\mathbf{x}_{0})$ is the forward process and we parameterize the reverse process by $p_{\theta}(\mathbf{x}_{0:T})$. The conditional diffusion process [20] assumes $\hat{q}$ is the forward process but with the inclusion of a condition $y$. The goal of the reverse process is to optimize the variational bound on negative log likelihood [33]

$$\mathcal{L}=\mathbb{E}\left[-\log p_{\theta}(\mathbf{x}_{T})-\sum_{t\geq 1}\log\frac{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{q(\mathbf{x}_{t}|\mathbf{x}_{t-1})}\right]. \tag{1}$$

We prove in the Appendix that

$$\begin{gathered}\mathcal{L}=\mathbb{E}\bigg[\underbrace{C-\log p_{\theta}(\mathbf{x}_{T})}_{\textrm{Constant}}+\underbrace{\log\hat{q}(\mathbf{x}_{0}|\mathbf{x}_{T},y)}_{\textrm{Contrastive}}+\\ \underbrace{\sum_{t>1}D_{\text{KL}}(\hat{q}(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0},y)\|p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t}))-\log p_{\theta}(\mathbf{x}_{0}|\mathbf{x}_{1},y)}_{\textrm{Denoising score}}\bigg].\end{gathered} \tag{2}$$

The terms $D_{\text{KL}}(\hat{q}(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0},y)\|p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t}))$ and $\log p_{\theta}(\mathbf{x}_{0}|\mathbf{x}_{1},y)$ in Equation 2 are similar to the concept of denoising score used in [33]. Thus, we can represent the quantity $\sum_{t>1}D_{\text{KL}}(\hat{q}(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0},y)\|p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t}))-\log p_{\theta}(\mathbf{x}_{0}|\mathbf{x}_{1},y)$ by the loss utilized in [82, 83]

$$\mathcal{L}_{\rm{diffusion}}=\mathbb{E}_{\mathbf{x}_{0}\sim q(\mathbf{x}_{0}|y),\,t\sim[1,T]}\left[\mathbf{x}_{0}-f(\mathbf{x}_{t+1},t+1,\tilde{\mathbf{x}}_{0})\right]^{2}. \tag{3}$$

Since $q(\mathbf{x}_{T}|\cdot)$ is equivalent to an isotropic Gaussian distribution as $T\rightarrow+\infty$, the estimation quantity $\log p_{\theta}(\mathbf{x}_{T})$ converges to a constant when $\theta\rightarrow\theta^{\ast}$. Therefore, we can ignore the first term of Equation 2.

Finally, the term $\hat{q}(\mathbf{x}_{0}|\mathbf{x}_{T},y)$ of Equation 2 provides more information about the relation between $\mathbf{x}_{T}$ and $\mathbf{x}_{0}$. As this term is intractable [79], we parameterize $\hat{q}(\mathbf{x}_{0}|\mathbf{x}_{T},y)$ as $p_{\psi,y}(\mathbf{x}_{0},\mathbf{x}_{T})$. This estimation resembles the noise-contrastive estimation [30], where the $\mathbf{x}_{T}$, $\mathbf{x}_{0}$ can be considered as a pair of contrastive estimation and $\psi$ can be estimated by a contrastive loss.

In [88, 20, 82], the authors indicate that predicting $\mathbf{x}_{0}$ is often infeasible but predicting an estimation $\tilde{\mathbf{x}}_{0}$ is tractable and can be used as a ‘pseudo’ estimation of $\mathbf{x}_{0}$. We denote $\tilde{\mathbf{x}}_{0}$ as an estimation of $\mathbf{x}_{0}$. The loss term $\hat{q}(\mathbf{x}_{0}|\mathbf{x}_{T},y)$ can be approximated by using the following contrastive loss

$$\mathcal{L}_{\rm{contrastive}}=\max\left(0,\left\|\frac{\sqrt{\overline{\alpha_{T}}}\tilde{\mathbf{x}}_{0}-\mathbf{x}_{T}}{\sqrt{1-\overline{\alpha_{T}}}}\right\|_{2}^{2}-M\right), \tag{4}$$

where $M$ is the number of dimensions of $\mathbf{x}_{0}$, and $\alpha_{t}$ is the variance schedule at timestep $t$ ($t=\overline{1;T}$).
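Equation 4 is short enough to write out directly. The NumPy sketch below is an illustration under assumptions that this section does not state: $\overline{\alpha_{T}}$ is read as the cumulative product of the schedule at the final timestep, the noisy state is produced by the standard DDPM forward marginal of [33], and the function names are hypothetical.

```python
import numpy as np


def forward_noise(x0: np.ndarray, alpha_bar_T: float, eps: np.ndarray) -> np.ndarray:
    """Assumed DDPM forward marginal: x_T = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    return np.sqrt(alpha_bar_T) * x0 + np.sqrt(1.0 - alpha_bar_T) * eps


def contrastive_loss(x0_tilde: np.ndarray, x_T: np.ndarray, alpha_bar_T: float) -> np.ndarray:
    """Equation 4: max(0, ||(sqrt(a) * x0_tilde - x_T) / sqrt(1 - a)||^2 - M).

    The last axis holds the M pose dimensions; leading axes are batch axes.
    """
    M = x0_tilde.shape[-1]
    residual = (np.sqrt(alpha_bar_T) * x0_tilde - x_T) / np.sqrt(1.0 - alpha_bar_T)
    return np.maximum(0.0, np.sum(residual ** 2, axis=-1) - M)
```

With a perfect prediction $\tilde{\mathbf{x}}_{0}=\mathbf{x}_{0}$, the residual reduces to the injected noise, so the hinge is active only when the squared norm of that noise exceeds $M$.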

Proposition 1.

Suppose that $\tilde{\mathbf{x}}_{0},\,\mathbf{x}_{0}$ and $\epsilon$ are independent, and that

$$\left\|\frac{\sqrt{\overline{\alpha_{T}}}\tilde{\mathbf{x}}_{0}-\mathbf{x}_{T}}{\sqrt{1-\overline{\alpha_{T}}}}\right\|_{2}^{2}\geq M.$$

Then there exists $C>0$ such that: for arbitrary $\delta>0$, if $\mathcal{L}_{\rm{contrastive}}<\delta$, then

$$\mathbb{E}\left[\|\tilde{\mathbf{x}}_{0}-\mathbf{x}_{0}\|_{2}^{2}\right]<C\delta.$$
Proof.

See Supplementary Material. ∎

Remark 1.1.
  • Proposition 1 suggests that if the contrastive loss $\mathcal{L}_{\rm contrastive}$ tends to zero, then the prediction $\tilde{\mathbf{x}}_0$ will approach the ground truth $\mathbf{x}_0$.

Remark 1.2.
  • The condition $\left\|\frac{\sqrt{\overline{\alpha_T}}\,\tilde{\mathbf{x}}_0 - \mathbf{x}_T}{\sqrt{1-\overline{\alpha_T}}}\right\|_2^2 \geq M$ is suitable for our language-driven grasp detection task because $\tilde{\mathbf{x}}_0$ and $\mathbf{x}_T$ are two contrastive quantities. We can therefore assume there is a minimum distance between $\tilde{\mathbf{x}}_0$ and $\mathbf{x}_T$.
In addition, in the proof of Proposition 1, we see that $\mathbb{E}\left[\left\|\frac{\sqrt{\overline{\alpha_T}}\,\tilde{\mathbf{x}}_0 - \mathbf{x}_T}{\sqrt{1-\overline{\alpha_T}}}\right\|_2^2 - M\right] = \beta^2\,\mathbb{E}\left[\|\tilde{\mathbf{x}}_0 - \mathbf{x}_0\|_2^2\right]$, which is always nonnegative. Therefore, it is both theoretically and experimentally reasonable to add this assumption.
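
The identity quoted in Remark 1.2 can be sketched in a few lines. The derivation below is only an illustration, not the proof from the Supplementary Material. It assumes the standard DDPM forward process $\mathbf{x}_T = \sqrt{\overline{\alpha_T}}\,\mathbf{x}_0 + \sqrt{1-\overline{\alpha_T}}\,\epsilon$ with $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_M)$, and it assumes that $\beta$ denotes $\sqrt{\overline{\alpha_T}/(1-\overline{\alpha_T})}$; neither is stated in this section.

```latex
\begin{align*}
\frac{\sqrt{\overline{\alpha_T}}\,\tilde{\mathbf{x}}_0 - \mathbf{x}_T}{\sqrt{1-\overline{\alpha_T}}}
  &= \beta\,(\tilde{\mathbf{x}}_0 - \mathbf{x}_0) - \epsilon, \\
\mathbb{E}\left[\left\|\beta\,(\tilde{\mathbf{x}}_0 - \mathbf{x}_0) - \epsilon\right\|_2^2\right]
  &= \beta^2\,\mathbb{E}\left[\|\tilde{\mathbf{x}}_0 - \mathbf{x}_0\|_2^2\right]
   - 2\beta\,\mathbb{E}\left[(\tilde{\mathbf{x}}_0 - \mathbf{x}_0)^{\top}\epsilon\right]
   + \mathbb{E}\left[\|\epsilon\|_2^2\right] \\
  &= \beta^2\,\mathbb{E}\left[\|\tilde{\mathbf{x}}_0 - \mathbf{x}_0\|_2^2\right] + M .
\end{align*}
```

The cross term vanishes because $\epsilon$ has zero mean and is independent of $\tilde{\mathbf{x}}_0$ and $\mathbf{x}_0$, and $\mathbb{E}\left[\|\epsilon\|_2^2\right] = M$ for an $M$-dimensional standard Gaussian. Subtracting $M$ from both sides gives the expression in the remark.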

Figure 4: Language-driven Grasp Detection (LGD) network. We present the network architecture (left) and the proposed training objectives of the denoising process (right).

4.2 Language-driven Grasp Detection Network

Figure 5: Language-driven grasp detection results visualization.

Network. Our network operates on two conditions: an image, denoted $I$, and a corresponding text prompt, denoted $e$. We employ a vision encoder to extract visual features from $I$ and a text encoder to derive textual embeddings from $e$. The resulting feature vectors, $I'$ and $e'$, are then passed to a fusion module, ALBEF [50]. We use the attention mask generated by the ALBEF fusion module as the estimate $\tilde{\mathbf{x}}_0$ of $\mathbf{x}_0$. Next, we aggregate three elements: the estimated region $\tilde{\mathbf{x}}_0$, the grasp pose at the current timestep $\mathbf{x}_{t+1}$, and the timestep $t+1$. These inputs are combined using MLP layers, similar to the approach in [82]. Specifically, the output can be expressed as $\mathbf{x}_t = f(\mathbf{x}_{t+1}, t+1, \tilde{\mathbf{x}}_0)$, where the function $f$ is a composition of multiple MLP layers. Further details of these MLP layers are provided in the Supplementary Material.
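
The update $\mathbf{x}_t = f(\mathbf{x}_{t+1}, t+1, \tilde{\mathbf{x}}_0)$ can be illustrated with a minimal NumPy sketch. Everything concrete here is an assumption made for illustration: the 5-dimensional grasp pose, the flattened 16-dimensional stand-in for $\tilde{\mathbf{x}}_0$, the sinusoidal timestep embedding, the two-layer MLP, and all function names. The weights are random and untrained, so the output only demonstrates the data flow, not a meaningful grasp.

```python
import numpy as np

POSE_DIM = 5    # assumed grasp pose: (x, y, width, height, angle)
COND_DIM = 16   # assumed size of the flattened estimate x0_tilde
TIME_DIM = 8    # assumed size of the timestep embedding
HIDDEN = 32     # assumed hidden width of the MLP


def timestep_embedding(t, dim=TIME_DIM):
    """Sinusoidal embedding of an integer timestep (an assumed choice)."""
    freqs = np.exp(-np.log(10000.0) * np.arange(dim // 2) / (dim // 2))
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])


def init_mlp(rng, in_dim, hidden, out_dim):
    """Random weights for a two-layer MLP standing in for f."""
    return {
        "w1": rng.normal(0.0, 0.1, (in_dim, hidden)),
        "b1": np.zeros(hidden),
        "w2": rng.normal(0.0, 0.1, (hidden, out_dim)),
        "b2": np.zeros(out_dim),
    }


def denoise_step(params, x_next, t_next, x0_tilde):
    """One reverse step: x_t = f(x_{t+1}, t+1, x0_tilde)."""
    inputs = np.concatenate([x_next, timestep_embedding(t_next), x0_tilde])
    hidden = np.maximum(inputs @ params["w1"] + params["b1"], 0.0)  # ReLU
    return hidden @ params["w2"] + params["b2"]


def reverse_process(params, x_T, x0_tilde, num_steps):
    """Apply f repeatedly, from timestep num_steps down to timestep 0."""
    x = x_T
    for t_next in range(num_steps, 0, -1):
        x = denoise_step(params, x, t_next, x0_tilde)
    return x


rng = np.random.default_rng(0)
params = init_mlp(rng, POSE_DIM + TIME_DIM + COND_DIM, HIDDEN, POSE_DIM)
x_T = rng.normal(size=POSE_DIM)        # start from Gaussian noise
x0_tilde = rng.normal(size=COND_DIM)   # placeholder for the ALBEF estimate
x_0 = reverse_process(params, x_T, x0_tilde, num_steps=10)
```

In this sketch the three inputs are simply concatenated before the first layer; the actual aggregation used by LGD is described in the Supplementary Material and may differ.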

Training Objective. In our context, conditioned grasp detection models the distribution $p(\mathbf{x}_0 | y)$ as the reversed diffusion process of gradually cleaning $\mathbf{x}_{t+1}$. Instead of predicting $\mathbf{x}_t$ as formulated by [33], we follow Ramesh et al. [72] and predict the signal itself, i.e., $\mathbf{x}_t = f(\mathbf{x}_{t+1}, t+1, \tilde{\mathbf{x}}_0)$ with the simple objective [33]. To this end, we utilize the contrastive loss as in Equation 4 to explicitly improve the learning objective of the denoising process:

$$\mathcal{L}_{\rm total} = \mathcal{L}_{\rm contrastive} + \mathcal{L}_{\rm diffusion} . \tag{5}$$

5 Experiments

We conduct experiments to evaluate our proposed method and the Grasp-Anything++ dataset using both vision-based metrics and real robot experiments. We then demonstrate zero-shot grasp results and discuss the challenges and open questions for future work.

5.1 Language-driven Grasp Detection Results

Baselines. We compare our language-driven grasp detection method (LGD) with the linguistically supported versions of GR-ConvNet [43], Det-Seg-Refine [2], GG-CNN [59], CLIPORT [78] and CLIP-Fusion [96]. In all cases, we employ a pretrained CLIP [71] or BERT [19] as the text embedding. The implementation details of all baselines can be found in our Supplementary Material.

Setup. To assess the generalization of all methods trained on Grasp-Anything++, we utilize the concept of base and new labels [109] in zero-shot learning. We categorize LVIS labels from Section 3.3 to form labels for our experiment. In particular, we select 70% of these labels by frequency for ‘Base’ and assign the remaining 30% to ‘New’. We also use the harmonic mean (‘H’) to measure the overall success rates [109]. Our primary evaluation metric is the success rate, defined similarly to [43]: a predicted grasp is successful if its IoU with the ground truth grasp exceeds $25\%$ and its offset angle is less than $30^{\circ}$.
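
The success criterion and the harmonic mean can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the evaluation code used for the tables. It assumes a grasp is a rotated rectangle `(cx, cy, w, h, theta)` with `theta` in radians, that gripper angles are equivalent modulo $\pi$, and it computes the rotated-rectangle IoU with Sutherland-Hodgman polygon clipping. All function names are hypothetical.

```python
import math


def rect_corners(cx, cy, w, h, theta):
    """Corners of a rotated rectangle in counter-clockwise order."""
    c, s = math.cos(theta), math.sin(theta)
    half = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(cx + x * c - y * s, cy + x * s + y * c) for x, y in half]


def polygon_area(poly):
    """Shoelace formula; returns 0 for degenerate polygons."""
    if len(poly) < 3:
        return 0.0
    total = 0.0
    for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def clip_convex(subject, clipper):
    """Sutherland-Hodgman: clip a convex polygon by a convex CCW polygon."""
    output = subject
    for (ax, ay), (bx, by) in zip(clipper, clipper[1:] + clipper[:1]):
        if not output:
            break
        inputs, output = output, []

        def side(p):
            # Positive when p lies to the left of the directed edge a -> b.
            return (bx - ax) * (p[1] - ay) - (by - ay) * (p[0] - ax)

        for p, q in zip(inputs, inputs[1:] + inputs[:1]):
            sp, sq = side(p), side(q)
            if sp >= 0:
                output.append(p)
            if (sp > 0 and sq < 0) or (sp < 0 and sq > 0):
                r = sp / (sp - sq)
                output.append((p[0] + r * (q[0] - p[0]), p[1] + r * (q[1] - p[1])))
    return output


def grasp_iou(g1, g2):
    """Intersection over union of two rotated grasp rectangles."""
    p1, p2 = rect_corners(*g1), rect_corners(*g2)
    inter = polygon_area(clip_convex(p1, p2))
    union = polygon_area(p1) + polygon_area(p2) - inter
    return inter / union if union > 0 else 0.0


def angle_offset(t1, t2):
    """Smallest difference between two gripper angles (assumed pi-periodic)."""
    d = abs(t1 - t2) % math.pi
    return min(d, math.pi - d)


def is_success(pred, gt, iou_thresh=0.25, angle_thresh=math.radians(30)):
    """IoU above 25% and angle offset below 30 degrees."""
    return grasp_iou(pred, gt) > iou_thresh and angle_offset(pred[4], gt[4]) < angle_thresh


def harmonic_mean(base, new):
    """Harmonic mean 'H' of the base and new success rates."""
    return 2 * base * new / (base + new) if base + new > 0 else 0.0
```

For example, `harmonic_mean(0.48, 0.42)` rounds to 0.45. Recomputing H from rounded success rates can differ in the last digit from values derived from unrounded rates.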

Baseline | Seen | Unseen | H
GR-ConvNet [43] + CLIP [71] | 0.37 | 0.18 | 0.24
Det-Seg-Refine [2] + CLIP [71] | 0.30 | 0.15 | 0.20
GG-CNN [59] + CLIP [71] | 0.12 | 0.08 | 0.10
CLIPORT [78] | 0.36 | 0.26 | 0.29
CLIP-Fusion [96] | 0.40 | 0.29 | 0.33
LGD (ours) + BERT [19] | 0.44 | 0.38 | 0.41
LGD (ours) + CLIP [71] | 0.48 | 0.42 | 0.45
Table 2: Language-driven grasp detection results.

Main Results. Table 2 shows the results of language-driven grasp detection on the Grasp-Anything++ dataset. Our LGD shows a notable performance advantage over the baselines, outperforming the next best-performing baseline (CLIP-Fusion) by a margin of $0.14$.

Baseline | Seen | Unseen | H
LGD w/o predicting $\tilde{\mathbf{x}}_0$ | 0.15 | 0.08 | 0.10
LGD w/o contrastive loss | 0.45 | 0.40 | 0.42
LGD w/ contrastive loss | 0.48 | 0.42 | 0.45
Table 3: Contrastive loss analysis.
Figure 6: Loss visualization.
Figure 7: t-SNE visualization. We apply t-SNE to cluster the vision-and-language features with and without contrastive loss.

Contrastive Loss Analysis. Table 3 presents the performance of LGD under varied configurations. The outcomes emphasize the substantial influence of the training objective (contrastive loss) and the importance of language instructions in enhancing LGD performance on both seen and unseen classes in the Grasp-Anything++ dataset.

Fig. 6 shows the contrastive loss $\mathcal{L}_{\rm contrastive}$ approaching $0$ during training, indicating that the grasp pose estimate $\tilde{\mathbf{x}}_0$ aligns with the ground truth $\mathbf{x}_0$, as anticipated by Proposition 1. The attention map visualization in Fig. 8 shows that the attention region is meaningful and improves the results when our proposed contrastive loss is used, compared with when it is absent. Moreover, we employ t-SNE to visualize the vision-and-language embeddings, as in [58], by processing $2{,}000$ samples from the Grasp-Anything++ dataset through the ALBEF module. The results reveal that our contrastive loss facilitates better object classification, as evidenced in Fig. 7 by the clearer separation of pixel embeddings across semantic classes.

Qualitative Results. Fig. 5 presents qualitative results of the language-driven grasp detection task, suggesting that our LGD method generates more semantically plausible grasp poses than other baselines. Despite its satisfactory performance, LGD occasionally predicts incorrect results; a detailed analysis of these cases is available in our Appendix.

Figure 8: Attention map visualization. We compare the attention map when utilizing our proposed contrastive loss and when not.
Baseline | Single | Cluttered
GR-ConvNet [43] + CLIP [71] | 0.33 | 0.30
Det-Seg-Refine [2] + CLIP [71] | 0.30 | 0.23
GG-CNN [59] + CLIP [71] | 0.10 | 0.07
CLIPORT [78] | 0.27 | 0.30
CLIP-Fusion [96] | 0.40 | 0.40
LGD (ours) | 0.43 | 0.42
Table 4: Robotic language-driven grasp detection results.

Robotic Validation. We provide quantitative results by integrating our language-driven grasp detection pipeline into a robotic grasping application with a KUKA LBR iiwa R820 robot. Using a RealSense D435i camera, the grasp pose inferred by each approach in Table 4 is transformed into a 6DoF grasp pose, similar to [43]. The optimization-based trajectory planner in [86, 6] is employed to execute the grasps. Experiments are conducted in two scenarios, the single-object scenario and the cluttered-scene scenario, with a set of $20$ real-world daily objects. In each scenario, we run $30$ experiments using the baselines listed in Table 4 and a predefined grasping prompt corpus. The results show that our LGD outperforms the other baselines. Furthermore, although LGD is trained on Grasp-Anything++, a purely synthetic dataset created by foundation models, it still shows reasonable results on real-world objects.
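
One common way to lift a planar grasp into the camera frame is pinhole back-projection with a depth reading, sketched below. This is an assumption about how such a transformation might look, not the exact procedure of [43] or of our pipeline. The function names, the grasp layout `(u, v, width_px, theta)`, and the intrinsics used in the usage note are hypothetical, and the sketch assumes a top-down grasp that rotates only about the camera's optical axis. Converting the result to the robot base frame additionally requires a hand-eye calibration, which is omitted here.

```python
import math


def deproject_pixel(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into the camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def grasp_to_camera_pose(grasp, depth, intrinsics):
    """Lift a planar grasp to a position, rotation matrix and gripper opening."""
    u, v, width_px, theta = grasp
    fx, fy, cx, cy = intrinsics
    position = deproject_pixel(u, v, depth, fx, fy, cx, cy)
    # Gripper opening in the same unit as depth (pixels scaled by depth / fx).
    opening = width_px * depth / fx
    # Top-down assumption: rotate about the camera's optical (z) axis only.
    rotation = [
        [math.cos(theta), -math.sin(theta), 0.0],
        [math.sin(theta), math.cos(theta), 0.0],
        [0.0, 0.0, 1.0],
    ]
    return position, rotation, opening
```

With hypothetical intrinsics `(600.0, 600.0, 320.0, 240.0)`, a grasp centred at the principal point with a 60-pixel width at a depth of 0.5 m maps to the position `(0.0, 0.0, 0.5)` and an opening of 0.05 m.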

Each dataset cell lists Base / New / H.
Baseline | Grasp-Anything++ (ours) | Jacquard [18] | Cornell [39] | VMRD [104] | OCID-grasp [2]
GR-ConvNet [43] | 0.71 / 0.59 / 0.64 | 0.88 / 0.66 / 0.75 | 0.98 / 0.74 / 0.84 | 0.77 / 0.64 / 0.70 | 0.86 / 0.67 / 0.75
Det-Seg-Refine [2] | 0.62 / 0.57 / 0.59 | 0.86 / 0.60 / 0.71 | 0.99 / 0.76 / 0.86 | 0.75 / 0.60 / 0.66 | 0.80 / 0.62 / 0.70
GG-CNN [59] | 0.68 / 0.57 / 0.62 | 0.78 / 0.56 / 0.65 | 0.96 / 0.75 / 0.84 | 0.69 / 0.53 / 0.59 | 0.71 / 0.63 / 0.67
LGD (no text) (ours) | 0.74 / 0.63 / 0.68 | 0.89 / 0.69 / 0.77 | 0.97 / 0.76 / 0.85 | 0.79 / 0.66 / 0.72 | 0.88 / 0.68 / 0.76
Table 5: Base-to-new zero-shot grasp detection results.

5.2 Zero-shot Grasp Detection

Our proposed Grasp-Anything++ is a large-scale dataset. Apart from the language-driven grasp detection task, we believe it can be used for other purposes. In this experiment, we seek to answer the question: Can Grasp-Anything++ be useful in the traditional grasp detection task without text? To this end, we compare Grasp-Anything++ and LGD (no text) against existing datasets and grasping methods.

Setup. We set up an LGD (no text) version and compare it with other state-of-the-art grasp detection methods: GR-ConvNet [43], Det-Seg-Refine [2], and GG-CNN [59]. We use five datasets in this experiment: our Grasp-Anything++, Jacquard [18], Cornell [39], VMRD [104], and OCID-grasp [2].

Zero-shot Results. Table 5 summarizes the base-to-new grasp detection results on five datasets. Overall, LGD performs better than the other baselines across all datasets, even without the language branch. Furthermore, the table shows that our Grasp-Anything++ dataset is more challenging: with the same approaches, the detection results are lower than on related datasets, owing to the greater coverage of unseen objects in the testing phase.

Train \ Test | Jacquard | Cornell | VMRD | OCID-grasp | Grasp-Anything++
Jacquard [18] | 0.87 | 0.51 | 0.13 | 0.21 | 0.17
Cornell [39] | 0.07 | 0.98 | 0.20 | 0.12 | 0.13
VMRD [104] | 0.06 | 0.21 | 0.79 | 0.11 | 0.10
OCID-grasp [2] | 0.09 | 0.12 | 0.20 | 0.74 | 0.11
Grasp-Anything++ (ours) | 0.41 | 0.63 | 0.30 | 0.39 | 0.65
Table 6: Cross-dataset grasp detection results.

Cross-dataset Evaluation. To further verify the usefulness of our Grasp-Anything++ dataset, we conduct the cross-dataset validation in Table 6. We use GR-ConvNet [43] so that we can reuse its results on existing grasp datasets. GR-ConvNet is trained on one dataset (row) and evaluated on another dataset (column). For example, training on Jacquard and testing on Cornell yields an accuracy of $0.51$. Notably, training with our dataset improves performance by approximately 10% to 33% compared to other datasets.
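
One plausible way to read the cross-dataset improvement is as the per-column gap between training on Grasp-Anything++ and the best other training set, excluding the in-domain case where train and test sets coincide. The sketch below computes that gap from the Table 6 values. The aggregation rule and the function name are assumptions made for illustration; the exact computation behind the reported range is not specified here.

```python
TEST_SETS = ["Jacquard", "Cornell", "VMRD", "OCID-grasp", "Grasp-Anything++"]

# Rows: training dataset. Columns follow TEST_SETS. Values copied from Table 6.
RESULTS = {
    "Jacquard": [0.87, 0.51, 0.13, 0.21, 0.17],
    "Cornell": [0.07, 0.98, 0.20, 0.12, 0.13],
    "VMRD": [0.06, 0.21, 0.79, 0.11, 0.10],
    "OCID-grasp": [0.09, 0.12, 0.20, 0.74, 0.11],
    "Grasp-Anything++": [0.41, 0.63, 0.30, 0.39, 0.65],
}


def cross_dataset_gains(results, ours="Grasp-Anything++"):
    """Gap between `ours` and the best other out-of-domain training set."""
    gains = {}
    for col, test_set in enumerate(TEST_SETS):
        if test_set == ours:
            continue
        best_other = max(
            scores[col]
            for train_set, scores in results.items()
            if train_set not in (ours, test_set)  # skip ours and in-domain
        )
        gains[test_set] = round(results[ours][col] - best_other, 2)
    return gains


gains = cross_dataset_gains(RESULTS)
```

Under this reading, the gains are 0.32 on Jacquard, 0.12 on Cornell, 0.10 on VMRD, and 0.18 on OCID-grasp, which is roughly in line with the range stated above.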

In the wild grasp detection. Fig. 9 shows visualization results of LGD (no text), trained on our Grasp-Anything++ dataset, on random internet images and on images from other datasets. We can see that the detected grasp poses are adequate in quality and quantity. This demonstrates that although our Grasp-Anything++ is created entirely by foundation models without any real images, models trained on it still generalize well to real-world images.

Figure 9: In the wild grasp detection. Top row images are from GraspNet [27], YCB-Video [94], NBMOD [10] datasets; bottom row shows internet images.

5.3 Discussion

Our experiments indicate that Grasp-Anything++ can serve as a foundation dataset for both language-driven and traditional grasp detection tasks. However, there are certain limitations. First, our dataset lacks depth images, which prevents it from being applied directly to robotic applications [64]. Second, the creation of our dataset is time-consuming and relies on access to the ChatGPT API. Fortunately, future research can reuse our provided assets (images, prompts, etc.) without starting from scratch. Furthermore, our experiments show that adding language to the grasp detection task (Table 2) poses a more challenging problem than the standard grasp detection task (Table 5).

We see several interesting future research directions. First, future work could investigate the use of text or image-to-3D models [95] or image-to-depth [73] and reuse our dataset’s prompts and images to construct 3D language-driven grasp datasets. Additionally, beyond linguistic grasp instruction adherence, our dataset holds potential for varied applications, including scene understanding [97] and scene generation [3], hallucination analysis [36], and human-robot interaction [12].

6 Conclusion

We introduce Grasp-Anything++, a large-scale dataset with 1M images and 10M grasp prompts for language-driven grasp detection tasks. We propose LGD, a diffusion-based method to tackle the language-driven grasp detection task. Our diffusion model employs a contrastive training objective, which explicitly contributes to the denoising process. Empirically, we have shown that Grasp-Anything++ serves as a foundation grasp detection dataset. Finally, our LGD outperforms other baselines, and the real-world robotic experiments further validate the effectiveness of our dataset and approach.

References

  • [1] Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022.
  • [2] Stefan Ainetter and Friedrich Fraundorfer. End-to-end trainable deep neural network for robotic grasp detection and semantic segmentation from rgb. In ICRA, 2021.
  • [3] Dor Arad Hudson and Larry Zitnick. Compositional transformers for scene generation. NeurIPS, 2021.
  • [4] Umar Asif, Jianbin Tang, and Stefan Harrer. Graspnet: An efficient convolutional neural network for real-time grasp detection for low-powered devices. In IJCAI, 2018.
  • [5] Florian Beck, Minh Nhat Vu, Christian Hartl-Nesic, and Andreas Kugi. Singularity avoidance with application to online trajectory optimization for serial manipulators. arXiv preprint arXiv:2211.02516, 2022.
  • [6] Florian Beck, Minh Nhat Vu, Christian Hartl-Nesic, and Andreas Kugi. Singularity avoidance with application to online trajectory optimization for serial manipulators. IFAC-PapersOnLine, 56(2):284–291, 2023.
  • [7] Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In CVPR, 2023.
  • [8] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. NeurIPS, 33, 2020.
  • [9] Guy Bukchin, Eli Schwartz, Kate Saenko, Ori Shahar, Rogerio Feris, Raja Giryes, and Leonid Karlinsky. Fine-grained angular contrastive learning with coarse labels. In CVPR, 2021.
  • [10] Boyuan Cao, Xinyu Zhou, Congmin Guo, Baohua Zhang, Yuchen Liu, and Qianqiu Tan. Nbmod: Find it and grasp it in noisy background. arXiv preprint arXiv:2306.10265, 2023.
  • [11] Joao Carvalho, An T Le, Mark Baierl, Dorothea Koert, and Jan Peters. Motion planning diffusion: Learning and planning of robot motions with diffusion models. arXiv preprint arXiv:2308.01557, 2023.
  • [12] Paola Cascante-Bonilla, Hui Wu, Letao Wang, Rogerio S Feris, and Vicente Ordonez. Simvqa: Exploring simulated environments for visual question answering. In CVPR, 2022.
  • [13] I-Ming Chen and Joel W Burdick. Finding antipodal point grasps on irregularly shaped objects. T-RA, 1993.
  • [14] Sijia Chen and Baochun Li. Language-guided diffusion model for visual grounding. arXiv preprint arXiv:2308.09599, 2023.
  • [15] Xiahan Chen, Weishen Wang, Yu Jiang, and Xiaohua Qian. A dual-transformation with contrastive learning framework for lymph node metastasis prediction in pancreatic cancer. Medical Image Analysis, 85:102753, 2023.
  • [16] Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah. Diffusion models in vision: A survey. TPAMI, 2023.
  • [17] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In CVPR, 2023.
  • [18] Amaury Depierre, Emmanuel Dellandréa, and Liming Chen. Jacquard: A large scale dataset for robotic grasp detection. In IROS, 2018.
  • [19] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • [20] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. NeurIPS, 34, 2021.
  • [21] Tuong Do, Huy Tran, Erman Tjiputra, Quang D Tran, and Anh Nguyen. Fine-grained visual classification using self assessment classifier. arXiv preprint arXiv:2205.10529, 2022.
  • [22] Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.
  • [23] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • [24] Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023.
  • [25] Clemens Eppner, Arsalan Mousavian, and Dieter Fox. A billion ways to grasp: An evaluation of grasp sampling schemes on a dense, physics-based grasp data set. In ISRR, 2019.
  • [26] Clemens Eppner, Arsalan Mousavian, and Dieter Fox. Acronym: A large-scale grasp dataset based on simulation. In ICRA, 2021.
  • [27] Hao-Shu Fang, Chenxi Wang, Minghao Gou, and Cewu Lu. Graspnet-1billion: A large-scale benchmark for general object grasping. In CVPR, 2020.
  • [28] Maximilian Gilles, Yuhao Chen, Tim Robin Winter, E Zhixuan Zeng, and Alexander Wong. Metagraspnet: A large-scale benchmark dataset for scene-aware ambidextrous bin picking via physics-based metaverse synthesis. In CASE, 2022.
  • [29] Agrim Gupta, Piotr Dollar, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation. In CVPR, 2019.
  • [30] Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS, 2010.
  • [31] Zeyu Han, Yuhan Wang, Luping Zhou, Peng Wang, Binyu Yan, Jiliu Zhou, Yan Wang, and Dinggang Shen. Contrastive diffusion model with auxiliary guidance for coarse-to-fine pet reconstruction. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 239–249. Springer, 2023.
  • [32] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • [33] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 2020.
  • [34] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. arXiv preprint arXiv:2204.03458, 2022.
  • [35] Jiaheng Hua, Xiaodong Cui, Xianghua Li, Keke Tang, and Peican Zhu. Multimodal fake news detection through data augmentation-based contrastive learning. Applied Soft Computing, 136:110125, 2023.
  • [36] Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232, 2023.
  • [37] Michael Janner, Yilun Du, Joshua Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. In ICML, 2022.
  • [38] Krishna Murthy Jatavallabhula, Alihusein Kuwajerwala, Qiao Gu, Mohd Omama, Tao Chen, Shuang Li, Ganesh Iyer, Soroush Saryazdi, Nikhil Keetha, Ayush Tewari, et al. Conceptfusion: Open-set multimodal 3d mapping. arXiv preprint arXiv:2302.07241, 2023.
  • [39] Yun Jiang, Stephen Moseson, and Ashutosh Saxena. Efficient grasping from rgbd images: Learning using a new rectangle representation. In ICRA, 2011.
  • [40] Zhenyu Jiang, Yifeng Zhu, Maxwell Svetlik, Kuan Fang, and Yuke Zhu. Synergies between affordance and geometry: 6-dof grasp detection via implicit representations. arXiv preprint arXiv:2104.01542, 2021.
  • [41] Ishay Kamon, Tamar Flash, and Shimon Edelman. Learning to grasp using visual information. In ICRA, 1996.
  • [42] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
  • [43] Sulabh Kumra, Shirin Joshi, and Ferat Sahin. Antipodal robotic grasping using generative residual convolutional neural network. In IROS, 2020.
  • [44] Nhat Le, Tuong Do, Khoa Do, Hien Nguyen, Erman Tjiputra, Quang D Tran, and Anh Nguyen. Controllable group choreography using contrastive diffusion. TOG, 2023.
  • [45] Chaejeong Lee, Jayoung Kim, and Noseong Park. Codi: Co-evolving contrastive diffusion models for mixed-type tabular synthesis. arXiv preprint arXiv:2304.12654, 2023.
  • [46] Jaewoong Lee, Sangwon Jang, Jaehyeong Jo, Jaehong Yoon, Yunji Kim, Jin-Hwa Kim, Jung-Woo Ha, and Sung Ju Hwang. Text-conditioned sampling framework for text-to-image generation with masked generative models. arXiv preprint arXiv:2304.01515, 2023.
  • [47] Meng-Lun Lee, Sara Behdad, Xiao Liang, and Minghui Zheng. Task allocation and planning for product disassembly with human–robot collaboration. Robotics and Computer-Integrated Manufacturing, 76, 2022.
  • [48] Ian Lenz, Honglak Lee, and Ashutosh Saxena. Deep learning for detecting robotic grasps. IJRR, 2015.
  • [49] Sergey Levine, Peter Pastor, Alex Krizhevsky, Julian Ibarz, and Deirdre Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. IJRR, 2018.
  • [50] Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation. NeurIPS, 2021.
  • [51] Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. In CVPR, 2023.
  • [52] Hongzhuo Liang, Xiaojian Ma, Shuang Li, Michael Görner, Song Tang, Bin Fang, Fuchun Sun, and Jianwei Zhang. Pointnetgpd: Detecting grasp configurations from point sets. In ICRA, 2019.
  • [53] Jirong Liu, Ruo Zhang, Hao-Shu Fang, Minghao Gou, Hongjie Fang, Chenxi Wang, Sheng Xu, Hengxu Yan, and Cewu Lu. Target-referenced reactive grasping for dynamic objects. In CVPR, 2023.
  • [54] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
  • [55] Weiyu Liu, Tucker Hermans, Sonia Chernova, and Chris Paxton. Structdiffusion: Object-centric diffusion for semantic rearrangement of novel objects. arXiv preprint arXiv:2211.04604, 2022.
  • [56] Yiheng Liu, Tianle Han, Siyuan Ma, Jiayue Zhang, Yuanyuan Yang, Jiaming Tian, Hao He, Antong Li, Mengshen He, Zhengliang Liu, et al. Summary of chatgpt-related research and perspective towards the future of large language models. Meta-Radiology, page 100017, 2023.
  • [57] Jeffrey Mahler, Jacky Liang, Sherdil Niyaz, Michael Laskey, Richard Doan, Xinyu Liu, Juan Aparicio Ojea, and Ken Goldberg. Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. arXiv preprint arXiv:1703.09312, 2017.
  • [58] Jiaxu Miao, Zongxin Yang, Leilei Fan, and Yi Yang. Fedseg: Class-heterogeneous federated learning for semantic segmentation. In CVPR, 2023.
  • [59] Douglas Morrison, Peter Corke, and Jürgen Leitner. Closing the loop for robotic grasping: A real-time, generative grasp synthesis approach. arXiv preprint arXiv:1804.05172, 2018.
  • [60] Douglas Morrison, Peter Corke, and Jürgen Leitner. Egad! an evolved grasping analysis dataset for diversity and reproducibility in robotic manipulation. RA-L, 2020.
  • [61] Arsalan Mousavian, Clemens Eppner, and Dieter Fox. 6-dof graspnet: Variational grasp generation for object manipulation. In ICCV, 2019.
  • [62] Yao Mu, Shunyu Yao, Mingyu Ding, Ping Luo, and Chuang Gan. Ec2: Emergent communication for embodied control. In CVPR, 2023.
  • [63] Yao Mu, Qinglong Zhang, Mengkang Hu, Wenhai Wang, Mingyu Ding, Jun Jin, Bin Wang, Jifeng Dai, Yu Qiao, and Ping Luo. Embodiedgpt: Vision-language pre-training via embodied chain of thought. arXiv preprint arXiv:2305.15021, 2023.
  • [64] Rhys Newbury, Morris Gu, Lachlan Chumbley, Arsalan Mousavian, Clemens Eppner, Jürgen Leitner, Jeannette Bohg, Antonio Morales, Tamim Asfour, Danica Kragic, et al. Deep learning approaches to grasp synthesis: A review. T-RO, 2023.
  • [65] Anh Nguyen, Dimitrios Kanoulas, Darwin G Caldwell, and Nikos G Tsagarakis. Preparatory object reorientation for task-oriented grasping. In IROS, 2016.
  • [66] Toan Nguyen, Minh Nhat Vu, An Vuong, Dzung Nguyen, Thieu Vo, Ngan Le, and Anh Nguyen. Open-vocabulary affordance detection in 3d point clouds. In IROS, 2023.
  • [67] OpenAI. Introducing ChatGPT. Software. Accessed: July 6th 2023.
  • [68] Lerrel Pinto and Abhinav Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In ICRA, 2016.
  • [69] Robert Platt. Grasp learning: Models, methods, and performance. Annual Review of Control, Robotics, and Autonomous Systems, 2023.
  • [70] Farhad Pourpanah, Moloud Abdar, Yuxuan Luo, Xinlei Zhou, Ran Wang, Chee Peng Lim, Xi-Zhao Wang, and QM Jonathan Wu. A review of generalized zero-shot learning methods. TPAMI, 2022.
  • [71] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
  • [72] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  • [73] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In ICCV, 2021.
  • [74] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
  • [75] Kallol Saha, Vishal Mandadi, Jayaram Reddy, Ajit Srikanth, Aditya Agarwal, Bipasha Sen, Arun Singh, and Madhava Krishna. Edmp: Ensemble-of-costs-guided diffusion for motion planning. arXiv preprint arXiv:2309.11414, 2023.
  • [76] Schuhmann et al. Laion-5b: An open large-scale dataset for training next generation image-text models. NeurIPS, 2022.
  • [77] Dhruv Shah, Błażej Osiński, Brian Ichter, and Sergey Levine. Lm-nav: Robotic navigation with large pre-trained models of language, vision, and action. In Karen Liu, Dana Kulic, and Jeff Ichnowski, editors, Proceedings of The 6th Conference on Robot Learning, volume 205 of PMLR. PMLR, 2023.
  • [78] Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Cliport: What and where pathways for robotic manipulation. In CoRL, 2022.
  • [79] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015.
  • [80] Chan Hee Song, Jiaman Wu, Clayton Washington, Brian M Sadler, Wei-Lun Chao, and Yu Su. Llm-planner: Few-shot grounded planning for embodied agents with large language models. In ICCV, 2023.
  • [81] Peize Sun, Shoufa Chen, Chenchen Zhu, Fanyi Xiao, Ping Luo, Saining Xie, and Zhicheng Yan. Going denser with open-vocabulary part segmentation. arXiv preprint arXiv:2305.11173, 2023.
  • [82] Guy Tevet, Sigal Raab, Brian Gordon, Yoni Shafir, Daniel Cohen-Or, and Amit Haim Bermano. Human motion diffusion model. In ICLR, 2022.
  • [83] Jonathan Tseng, Rodrigo Castellon, and Karen Liu. Edge: Editable dance generation from music. In CVPR, 2023.
  • [84] Julen Urain, Niklas Funk, Jan Peters, and Georgia Chalvatzaki. Se (3)-diffusionfields: Learning smooth cost functions for joint grasp and motion optimization through diffusion. In ICRA, 2023.
  • [85] Sai Vemprala, Rogerio Bonatti, Arthur Bucker, and Ashish Kapoor. Chatgpt for robotics: Design principles and model abilities. Microsoft Auton. Syst. Robot. Res, 2023.
  • [86] Minh Nhat Vu, Florian Beck, Michael Schwegel, Christian Hartl-Nesic, Anh Nguyen, and Andreas Kugi. Machine learning-based framework for optimally solving the analytical inverse kinematics for redundant manipulators. Mechatronics, 2023.
  • [87] An Vuong, Minh Nhat Vu, Hieu Le, Baoru Huang, Binh Huynh, Thieu Vo, Andreas Kugi, and Anh Nguyen. Grasp-anything: Large-scale grasp dataset from foundation models. In ICRA, 2024.
  • [88] An Vuong, Minh Nhat Vu, Toan Tien Nguyen, Baoru Huang, Dzung Nguyen, Thieu Vo, and Anh Nguyen. Language-driven scene synthesis using multi-conditional diffusion model. In NeurIPS, 2023.
  • [89] Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In ICML, 2022.
  • [90] Sabine Wehnert, Viju Sudhi, Shipra Dureja, Libin Kutty, Saijal Shahania, and Ernesto W De Luca. Legal norm retrieval with variations of the bert model combined with tf-idf vectorization. In Proceedings of the eighteenth international conference on artificial intelligence and law, pages 285–294, 2021.
  • [91] Bowen Wen, Wenzhao Lian, Kostas Bekris, and Stefan Schaal. Catgrasp: Learning category-level task-relevant grasping in clutter from simulation. In ICRA, 2022.
  • [92] Julia Wolleb, Robin Sandkühler, Florentin Bieder, Philippe Valmaggia, and Philippe C Cattin. Diffusion models for implicit image segmentation ensembles. In MIDL, 2022.
  • [93] Feng Wu, Guoshuai Zhao, Xueming Qian, and Li-wei Lehman. A diffusion model with contrastive learning for icu false arrhythmia alarm reduction. In IJCAI, 2023.
  • [94] Yu Xiang, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox. Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes. arXiv preprint arXiv:1711.00199, 2017.
  • [95] Dejia Xu, Yifan Jiang, Peihao Wang, Zhiwen Fan, Yi Wang, and Zhangyang Wang. Neurallift-360: Lifting an in-the-wild 2d photo to a 3d object with 360deg views. In CVPR, 2023.
  • [96] Kechun Xu, Shuqi Zhao, Zhongxiang Zhou, Zizhang Li, Huaijin Pi, Yifeng Zhu, Yue Wang, and Rong Xiong. A joint modeling of vision-language-action for target-oriented grasping in clutter. arXiv preprint arXiv:2302.12610, 2023.
  • [97] Yiteng Xu, Peishan Cong, Yichen Yao, Runnan Chen, Yuenan Hou, Xinge Zhu, Xuming He, Jingyi Yu, and Yuexin Ma. Human-centric scene understanding for 3d large-scale scenarios. In ICCV, 2023.
  • [98] Zhixuan Xu, Kechun Xu, Rong Xiong, and Yue Wang. Object-centric inference for language conditioned placement: A foundation model based approach. In ICARM, 2023.
  • [99] Xinchen Yan, Jasmine Hsu, Mohammad Khansari, Yunfei Bai, Arkanath Pathak, Abhinav Gupta, James Davidson, and Honglak Lee. Learning 6-dof grasping interaction via deep geometry-aware 3d representations. In ICRA, 2018.
  • [100] Jiange Yang, Wenhui Tan, Chuhao Jin, Bei Liu, Jianlong Fu, Ruihua Song, and Limin Wang. Pave the way to grasp anything: Transferring foundation models for universal pick-place robots. arXiv preprint arXiv:2306.05716, 2023.
  • [101] Ling Yang, Zhilong Zhang, Yang Song, Shenda Hong, Runsheng Xu, Yue Zhao, Wentao Zhang, Bin Cui, and Ming-Hsuan Yang. Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys, 2022.
  • [102] Serin Yang, Hyunmin Hwang, and Jong Chul Ye. Zero-shot contrastive loss for text-guided diffusion image style transfer. ICCV, 2023.
  • [103] Yang Yang, Xibai Lou, and Changhyun Choi. Interactive robotic grasping with attribute-guided disambiguation. In ICRA, 2022.
  • [104] Hanbo Zhang, Xuguang Lan, Site Bai, Xinwen Zhou, Zhiqiang Tian, and Nanning Zheng. Roi-based robotic grasp detection for object overlapping scenes. In IROS, 2019.
  • [105] Hanbo Zhang, Yunfan Lu, Cunjun Yu, David Hsu, Xuguang Lan, and Nanning Zheng. Invigorate: Interactive visual grounding and grasping in clutter. arXiv preprint arXiv:2108.11092, 2021.
  • [106] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert. In ICLR, 2019.
  • [107] Kaizhi Zheng, Xiaotong Chen, Odest Chadwicke Jenkins, and Xin Wang. Vlmbench: A compositional benchmark for vision-and-language manipulation. NeurIPS, 35, 2022.
  • [108] Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, et al. Regionclip: Region-based language-image pretraining. In CVPR, 2022.
  • [109] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In CVPR, 2022.

Appendix A Theoretical Findings

In this section, we first show the derivation of Equation 4 in our main paper. We then show the proof of Proposition 1 in the main paper.

A.1 Derivation of Equation 4

It was shown in [20] that $q(\mathbf{x}_{t}|\mathbf{x}_{t-1})=\hat{q}(\mathbf{x}_{t}|\mathbf{x}_{t-1},y)$; therefore, the loss in Equation 3 of the main paper can be written as

$$\mathcal{L}=\mathbb{E}\left[-\log p_{\theta}(\mathbf{x}_{T})-\sum_{t\geq 1}\log\frac{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{\hat{q}(\mathbf{x}_{t}|\mathbf{x}_{t-1},y)}\right]. \tag{6}$$

Using Bayes' theorem, we can further expand the term $\hat{q}(\mathbf{x}_{t}|\mathbf{x}_{t-1},y)$ of Equation 6 as follows

$$\begin{aligned}
\hat{q}(\mathbf{x}_{t}|\mathbf{x}_{t-1},y)
&=\frac{\hat{q}(\mathbf{x}_{t},\mathbf{x}_{t-1},y)}{\hat{q}(\mathbf{x}_{t-1},y)}\\
&=\frac{\hat{q}(\mathbf{x}_{t},\mathbf{x}_{t-1},y)}{\hat{q}(\mathbf{x}_{t},\mathbf{x}_{t-1},\mathbf{x}_{0},y)}\,\frac{\hat{q}(\mathbf{x}_{t},\mathbf{x}_{t-1},\mathbf{x}_{0},y)}{\hat{q}(\mathbf{x}_{t-1},y)}\\
&=\frac{1}{\hat{q}(\mathbf{x}_{0}|\mathbf{x}_{t-1},\mathbf{x}_{t},y)}\,\frac{\hat{q}(\mathbf{x}_{t},\mathbf{x}_{t-1},\mathbf{x}_{0},y)}{\hat{q}(\mathbf{x}_{t},\mathbf{x}_{0},y)}\,\frac{\hat{q}(\mathbf{x}_{t},\mathbf{x}_{0},y)}{\hat{q}(\mathbf{x}_{t-1},y)}\\
&=\frac{\hat{q}(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0},y)}{\hat{q}(\mathbf{x}_{0}|\mathbf{x}_{t-1},\mathbf{x}_{t},y)}\,\frac{\hat{q}(\mathbf{x}_{t},\mathbf{x}_{0},y)}{\hat{q}(\mathbf{x}_{t-1},y)}\\
&=\frac{\hat{q}(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0},y)}{\hat{q}(\mathbf{x}_{0}|\mathbf{x}_{t-1},\mathbf{x}_{t},y)}\,\frac{\hat{q}(\mathbf{x}_{t},\mathbf{x}_{0},y)}{\hat{q}(\mathbf{x}_{t-1},\mathbf{x}_{0},y)}\,\frac{\hat{q}(\mathbf{x}_{t-1},\mathbf{x}_{0},y)}{\hat{q}(\mathbf{x}_{t-1},y)}\\
&=\frac{\hat{q}(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0},y)}{\hat{q}(\mathbf{x}_{0}|\mathbf{x}_{t-1},\mathbf{x}_{t},y)}\,\frac{\hat{q}(\mathbf{x}_{t}|\mathbf{x}_{0},y)}{\hat{q}(\mathbf{x}_{t-1}|\mathbf{x}_{0},y)}\,\hat{q}(\mathbf{x}_{0}|\mathbf{x}_{t-1},y).
\end{aligned}$$

Following Ho et al. [33], we can assume that $\hat{q}(\mathbf{x}_{0}|\mathbf{x}_{t-1},\mathbf{x}_{t},y)=\hat{q}(\mathbf{x}_{0}|\mathbf{x}_{t-1},y)$ due to the Markov property of the chain. Thus, $\hat{q}(\mathbf{x}_{t}|\mathbf{x}_{t-1},y)$ simplifies to

$$\hat{q}(\mathbf{x}_{t}|\mathbf{x}_{t-1},y)=\hat{q}(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0},y)\,\frac{\hat{q}(\mathbf{x}_{t}|\mathbf{x}_{0},y)}{\hat{q}(\mathbf{x}_{t-1}|\mathbf{x}_{0},y)}. \tag{7}$$
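As a sanity check on the structure of Equation 7, the sketch below evaluates both sides in the simplest setting that admits a closed form: the unconditional scalar Gaussian forward process of Ho et al. [33], with the condition $y$ dropped. The per-step coefficient `alpha_t`, the cumulative product `alpha_bar_prev`, and all helper names are assumptions introduced for illustration; they are not taken from the paper's implementation.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a scalar Gaussian N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def eq7_sides(x0, x_prev, x_t, alpha_t, alpha_bar_prev):
    """Return (lhs, rhs) of Equation 7 for a scalar DDPM-style forward process.

    Assumed notation: alpha_t is the per-step coefficient at step t and
    alpha_bar_prev is the cumulative product up to step t-1.
    """
    alpha_bar_t = alpha_t * alpha_bar_prev
    step_var = 1.0 - alpha_t  # variance of the noise added at step t

    # Left-hand side: one-step transition q(x_t | x_{t-1}).
    lhs = normal_pdf(x_t, math.sqrt(alpha_t) * x_prev, step_var)

    # Posterior q(x_{t-1} | x_t, x_0), a Gaussian with closed-form moments.
    post_mean = (
        math.sqrt(alpha_bar_prev) * step_var * x0
        + math.sqrt(alpha_t) * (1.0 - alpha_bar_prev) * x_t
    ) / (1.0 - alpha_bar_t)
    post_var = (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t) * step_var
    posterior = normal_pdf(x_prev, post_mean, post_var)

    # Marginals q(x_t | x_0) and q(x_{t-1} | x_0).
    marg_t = normal_pdf(x_t, math.sqrt(alpha_bar_t) * x0, 1.0 - alpha_bar_t)
    marg_prev = normal_pdf(x_prev, math.sqrt(alpha_bar_prev) * x0, 1.0 - alpha_bar_prev)

    rhs = posterior * marg_t / marg_prev
    return lhs, rhs


lhs, rhs = eq7_sides(x0=0.3, x_prev=-0.2, x_t=0.5, alpha_t=0.9, alpha_bar_prev=0.5)
print(lhs, rhs)
```

Because the identity is Bayes' rule combined with the Markov property, the two values should agree up to floating-point error for any inputs with `0 < alpha_t < 1` and `0 < alpha_bar_prev < 1`.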

From Equation 6 and Equation 7, we can express the negative log likelihood loss as follows

$$L=\mathbb{E}\bigg[-\log\frac{p_{\theta}(\mathbf{x}_{T})}{\hat{q}(\mathbf{x}_{T}|\mathbf{x}_{0},y)}-\sum_{t>1}\log\frac{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{\hat{q}(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0},y)}-\log p_{\theta}(\mathbf{x}_{0}|\mathbf{x}_{1},y)\bigg]. \tag{8}$$

By using Bayes' theorem again, we can rewrite $\hat{q}(\mathbf{x}_{T}|\mathbf{x}_{0},y)$ of Equation 8 as follows

$$\begin{aligned}
\hat{q}(\mathbf{x}_{T}|\mathbf{x}_{0},y)
&=\frac{\hat{q}(\mathbf{x}_{T},\mathbf{x}_{0},y)}{\hat{q}(\mathbf{x}_{0},y)}\\
&=\frac{\hat{q}(\mathbf{x}_{T},\mathbf{x}_{0},y)}{\hat{q}(\mathbf{x}_{0})}\,\frac{\hat{q}(\mathbf{x}_{0})}{\hat{q}(\mathbf{x}_{0},y)}\\
&=\frac{\hat{q}(\mathbf{x}_{0}|\mathbf{x}_{T},y)}{\hat{q}(y|\mathbf{x}_{0})}.
\end{aligned} \tag{9}$$

Since $\hat{q}(y|\mathbf{x}_{0})$ is the known label of each sample [20], its contribution to the loss can be treated as a constant $C$. Substituting Equation 9 into Equation 8, we conclude the derivation with

$$\begin{aligned}
L&=\mathbb{E}\bigg[-\log\frac{p_{\theta}(\mathbf{x}_{T})}{\hat{q}(\mathbf{x}_{0}|\mathbf{x}_{T},y)}+\log\hat{q}(y|\mathbf{x}_{0})\\
&\qquad-\sum_{t>1}\log\frac{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{\hat{q}(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0},y)}-\log p_{\theta}(\mathbf{x}_{0}|\mathbf{x}_{1},y)\bigg]\\
&=\mathbb{E}\bigg[C-\log p_{\theta}(\mathbf{x}_{T})+\log\hat{q}(\mathbf{x}_{0}|\mathbf{x}_{T},y)\\
&\qquad+\sum_{t>1}D_{\text{KL}}\big(\hat{q}(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0},y)\,\|\,p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})\big)\\
&\qquad-\log p_{\theta}(\mathbf{x}_{0}|\mathbf{x}_{1},y)\bigg].
\end{aligned} \tag{10}$$

A.2 Proof of Proposition 1

Proof.

The relation between $\mathbf{x}_{0}$ and $\mathbf{x}_{T}$ is given by [33]

$$\mathbf{x}_{T}=\sqrt{\overline{\alpha}_{T}}\,\mathbf{x}_{0}+\sqrt{1-\overline{\alpha}_{T}}\,\epsilon,\quad \epsilon\sim\mathcal{N}(0,\mathbf{I}). \tag{11}$$

It follows from Equation 11 that

$$\begin{aligned}
\frac{\sqrt{\overline{\alpha}_{T}}\,\tilde{\mathbf{x}}_{0}-\mathbf{x}_{T}}{\sqrt{1-\overline{\alpha}_{T}}}
&=\frac{\sqrt{\overline{\alpha}_{T}}\,\tilde{\mathbf{x}}_{0}-\left(\sqrt{\overline{\alpha}_{T}}\,\mathbf{x}_{0}+\sqrt{1-\overline{\alpha}_{T}}\,\epsilon\right)}{\sqrt{1-\overline{\alpha}_{T}}}\\
&=\beta(\tilde{\mathbf{x}}_{0}-\mathbf{x}_{0})-\epsilon,
\end{aligned}$$

with $\beta=\sqrt{\frac{\overline{\alpha}_{T}}{1-\overline{\alpha}_{T}}}$. Thus, under the condition $\left\|\frac{\sqrt{\overline{\alpha}_{T}}\,\tilde{\mathbf{x}}_{0}-\mathbf{x}_{T}}{\sqrt{1-\overline{\alpha}_{T}}}\right\|_{2}^{2}\geq M$, we have

$$\begin{aligned}
\mathbb{E}[\mathcal{L}_{\text{contrastive}}]
&=\mathbb{E}\left[\left\|\frac{\sqrt{\overline{\alpha}_{T}}\,\tilde{\mathbf{x}}_{0}-\mathbf{x}_{T}}{\sqrt{1-\overline{\alpha}_{T}}}\right\|_{2}^{2}-M\right]\\
&=\mathbb{E}\left[\|\beta(\tilde{\mathbf{x}}_{0}-\mathbf{x}_{0})-\epsilon\|_{2}^{2}-\|\epsilon\|_{2}^{2}\right]\\
&=\mathbb{E}\left[\|\beta(\tilde{\mathbf{x}}_{0}-\mathbf{x}_{0})\|_{2}^{2}-2\left<\beta(\tilde{\mathbf{x}}_{0}-\mathbf{x}_{0}),\epsilon\right>\right]\\
&=\mathbb{E}\left[\|\beta(\tilde{\mathbf{x}}_{0}-\mathbf{x}_{0})\|_{2}^{2}\right]-2\beta\,\mathbb{E}\left[\left<\tilde{\mathbf{x}}_{0}-\mathbf{x}_{0},\epsilon\right>\right].
\end{aligned}$$

However, since 𝐱~0𝐱0subscript~𝐱0subscript𝐱0\tilde{\mathbf{x}}_{0}-\mathbf{x}_{0}over~ start_ARG bold_x end_ARG start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT - bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT and ϵitalic-ϵ\epsilonitalic_ϵ are independent and 𝔼[ϵ]=0𝔼delimited-[]italic-ϵ0\mathbb{E}[\epsilon]=0blackboard_E [ italic_ϵ ] = 0, we have

𝔼[(𝐱~0𝐱0),ϵ]=𝔼[𝐱~0𝐱0],𝔼[ϵ]=0 .𝔼delimited-[]subscript~𝐱0subscript𝐱0italic-ϵ𝔼delimited-[]subscript~𝐱0subscript𝐱0𝔼delimited-[]italic-ϵ0 .\mathbb{E}\left[\left<(\tilde{\mathbf{x}}_{0}-\mathbf{x}_{0}),\epsilon\right>% \right]=\left<\mathbb{E}\left[\tilde{\mathbf{x}}_{0}-\mathbf{x}_{0}\right],% \mathbb{E}\left[\epsilon\right]\right>=0\text{~{}\@.}blackboard_E [ ⟨ ( over~ start_ARG bold_x end_ARG start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT - bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) , italic_ϵ ⟩ ] = ⟨ blackboard_E [ over~ start_ARG bold_x end_ARG start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT - bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ] , blackboard_E [ italic_ϵ ] ⟩ = 0 .

Thus,

𝔼[contrastive]=𝔼[β(𝐱~0𝐱0)22]=β2𝔼[𝐱~0𝐱022] .𝔼delimited-[]subscriptcontrastive𝔼delimited-[]superscriptsubscriptnorm𝛽subscript~𝐱0subscript𝐱022superscript𝛽2𝔼delimited-[]superscriptsubscriptnormsubscript~𝐱0subscript𝐱022 .\displaystyle\mathbb{E}[\mathcal{L}_{\text{contrastive}}]=\mathbb{E}\left[\|% \beta(\tilde{\mathbf{x}}_{0}-\mathbf{x}_{0})\|_{2}^{2}\right]=\beta^{2}\mathbb% {E}\left[\|\tilde{\mathbf{x}}_{0}-\mathbf{x}_{0}\|_{2}^{2}\right]\text{~{}\@.}blackboard_E [ caligraphic_L start_POSTSUBSCRIPT contrastive end_POSTSUBSCRIPT ] = blackboard_E [ ∥ italic_β ( over~ start_ARG bold_x end_ARG start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT - bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] = italic_β start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT blackboard_E [ ∥ over~ start_ARG bold_x end_ARG start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT - bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] .

Hence, with C=β2𝐶superscript𝛽2C=\beta^{-2}italic_C = italic_β start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT, we have 𝔼[𝐱~0𝐱022]Cδ𝔼delimited-[]superscriptsubscriptnormsubscript~𝐱0subscript𝐱022𝐶𝛿\mathbb{E}\left[\|\tilde{\mathbf{x}}_{0}-\mathbf{x}_{0}\|_{2}^{2}\right]\leq C\deltablackboard_E [ ∥ over~ start_ARG bold_x end_ARG start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT - bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] ≤ italic_C italic_δ as desired. ∎
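The last steps of the derivation can be sanity-checked numerically. The following is a minimal sketch, not part of the proof: the value of beta, the dimension, the sample count, and the uniform distribution used for the difference vector are all illustrative assumptions. It verifies the per-sample expansion of the squared norm and checks by Monte Carlo that the cross term vanishes when the noise is zero-mean and independent.

```python
import numpy as np

rng = np.random.default_rng(0)
beta, dim, n_samples = 0.5, 5, 200_000  # illustrative values only (assumptions)

# d stands in for (x~_0 - x_0); eps is zero-mean noise drawn independently of d.
d = rng.uniform(-1.0, 1.0, size=(n_samples, dim))
eps = rng.standard_normal((n_samples, dim))

# Per-sample quantity inside the expectation: ||beta*d - eps||^2 - ||eps||^2
lhs_samples = np.sum((beta * d - eps) ** 2, axis=1) - np.sum(eps ** 2, axis=1)

# Algebraic expansion: ||a - eps||^2 - ||eps||^2 = ||a||^2 - 2<a, eps>, with a = beta*d
expanded = np.sum((beta * d) ** 2, axis=1) - 2.0 * np.sum(beta * d * eps, axis=1)
identity_holds = np.allclose(lhs_samples, expanded)

lhs = lhs_samples.mean()                           # estimate of the full expectation
cross_term = np.mean(np.sum(d * eps, axis=1))      # estimate of E[<d, eps>], near 0
rhs = beta ** 2 * np.mean(np.sum(d ** 2, axis=1))  # beta^2 * E[||d||^2]

print(f"lhs={lhs:.4f}  rhs={rhs:.4f}  cross term={cross_term:.4f}")
```

The two estimates agree up to Monte Carlo error, which is what the independence argument predicts.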

Appendix B Remark on Related Works

Grasp Datasets. Numerous grasp datasets have been introduced recently [64], each with varying characteristics such as data representation (RGB-D or 3D point clouds), grasp labels (rectangle-based or 6-DoF), and quantity [70]. Our Grasp-Anything++ dataset differs primarily in its universality, in contrast to the limited object selection of existing benchmarks. It covers a wide range of everyday objects and includes natural scene descriptions, facilitating research in language-driven grasp detection. Furthermore, Grasp-Anything++ presents natural object arrangements, whereas previous datasets use more strictly controlled configurations [69]. Grasp-Anything++ also exceeds other benchmarks in both the number of objects and the number of samples.

Contrastive Loss for Diffusion Models. Contrastive learning has recently attracted considerable attention in diffusion model research, as evidenced in [31]. Most studies in the diffusion literature regard contrastive learning primarily as a method of data augmentation [35] for improving the performance of models on fine-grained prediction [15]; notable exceptions include [93, 31, 45, 102]. For instance, Yang et al. [102] leverage intermediate layer features for calculating contrastive loss in negative sample pairs. Contrary to these perspectives, our paper considers contrastive learning as an integral aspect of the training objective, explicitly contributing to the denoising process of diffusion models.

Grasp Detection. Deep learning has significantly advanced grasp detection, with initial efforts by Lenz et al. [48] employing deep learning for grasp pose detection. Following this, deep learning-based approaches [99, 52, 40, 91, 2, 43, 10] have become the predominant methodology in the field. Despite extensive research, the real-world application of deep learning for robotic grasping remains a challenge, primarily due to the limited size and diversity of grasp datasets [28, 69].

Figure 10: Rectangle representation of grasp poses. The ground-truth rectangle is defined by 5 parameters $\{x, y, w, h, \theta\}$, where $(x, y)$ is the center point of the rectangle, $(w, h)$ are the width and height of the rectangle, and $\theta$ is the rotation angle of the rectangle with respect to the image plane.

Grasp Ground Truth Definition. 6-DoF and rectangle representations are the most prevalent in grasp detection literature. While 6-DoF poses offer greater flexibility and adaptability [27] for complex tasks, rectangle grasp poses are advantageous for their simplicity [39], efficiency in specific scenarios, and lower hardware and computational requirements [18]. Considering that objects in synthesized images from foundation models often lack 3D information [41], the rectangle representation of grasp poses appears more suitable for our task. The ground truth rectangle is defined in Fig. 10.
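To make the rectangle representation concrete, the sketch below converts the five parameters $(x, y, w, h, \theta)$ into the four corner points of the rectangle. It is an illustration only: the function name `grasp_rectangle_corners`, the corner ordering, and the choice of radians for $\theta$ are assumptions, not conventions stated in the paper.

```python
import math

def grasp_rectangle_corners(x, y, w, h, theta):
    """Return the four corners of a grasp rectangle as (px, py) tuples.

    (x, y) is the center, (w, h) the width and height, and theta the rotation
    angle in radians (the unit is an assumption of this sketch).
    """
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Corner offsets in the rectangle's own frame, before rotation.
    offsets = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    # Rotate each offset by theta, then translate by the center.
    return [
        (x + dx * cos_t - dy * sin_t, y + dx * sin_t + dy * cos_t)
        for dx, dy in offsets
    ]

corners = grasp_rectangle_corners(x=100.0, y=50.0, w=40.0, h=20.0, theta=0.0)
rotated = grasp_rectangle_corners(x=100.0, y=50.0, w=40.0, h=20.0, theta=math.pi / 2)
```

With `theta=0.0` the corners form an axis-aligned box around the center; a quarter turn swaps the roles of width and height.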

Appendix C Grasp-Anything++ Analysis

Additional Visualization. Fig. 11 provides further examples from the Grasp-Anything++ dataset, covering objects typically found in everyday environments such as home offices, kitchens, and living spaces. The samples include objects of various shapes, sizes, and types, such as writing instruments, electronic devices, and household items, each situated within its own designated space. Furthermore, the annotated grasp poses produced by the Grasp-Anything++ pipeline demonstrate high fidelity, offering a foundation for both qualitative and quantitative grasp detection research.

Figure 11: More samples of our Grasp-Anything++ dataset.

Appendix D LGD Implementation Details

In this section, we first discuss our observation about the attention mask during the diffusion process, which motivated our use of contrastive diffusion. We then provide the implementation details of our LGD network.

D.1 Observation

In Fig. 12, we present a visualization of the attention mask generated by the vision transformer backbone and the grasp pose throughout the diffusion process. During the initial time steps, there is a substantial overlap between the guiding region and the grasp pose, which diminishes as time progresses. This finding suggests that, by the end of the forward process, the guiding region and the noisy grasp pose can be regarded as a contrasting pair. We express this contrastive relationship mathematically in terms of the guiding region $\tilde{\mathbf{x}}_{0}$ and the noisy grasp pose $\mathbf{x}_{t}$. This contrastive relationship plays a central role in our network design, as depicted in the overview of our method in the main paper.

Figure 12: Observation. Comparison between the grasp poses and attention map output by the network backbone.

D.2 LGD Implementation Details

For an image $I$ of resolution $W\times H$, we employ a ResNet-50 vision encoder backbone [32] to derive feature representations $I^{\prime}\in\mathbb{R}^{w\times h}$ with latent dimensions $w$ and $h$. Similarly, we obtain text embedding $e^{\prime}\in\mathbb{R}^{|D|}$ using a text encoder, such as CLIP [71] or BERT [19]. The dimensionality of these latent features depends on the text encoder’s architecture.

| Component | Description | Input size | Output size |
|---|---|---|---|
| (i) | A vision encoder (ResNet-50 [32]) | $[H, W, 3]$ | $[h, w, 3]$ |
| (ii-a) | A text encoder (CLIP [71] or BERT [19]) | Any | $[\lvert D\rvert]$ |
| (ii-b) | MLP layers | $[\lvert D\rvert]$ | $[N+1, d_{\text{text}}]$ |
| (iii) | ALBEF [50] | $[h, w, 3]$, $[N+1, d_{\text{text}}]$ | $[W, H]$, $[\ell]$ |
| (iv) | MLP layers of Equation 12 | $[1]$ | $[d_{\text{ts}}]$ |
| (v) | MLP layers of Equation 13 | $[\ell]$ | $[d_{\text{vl}}]$ |
| (vi) | MLP layers of Equation 14 | $[M]$ | $[d_{\text{df}}]$ |
| (vii) | MLP layers to output denoising state | $[d_{\text{ts}}+d_{\text{vl}}+d_{\text{df}}]$ | $[M]$ |

Table 7: Architecture specifications of our method.

| Hyperparameter | Value |
|---|---|
| $W$ | 224 |
| $H$ | 224 |
| $N$ | 196 |
| $M$ (number of grasp parameters) | 5 |
| $\lvert D_{\text{CLIP}}\rvert$ of (ii-a) | 512 |
| $\lvert D_{\text{BERT}}\rvert$ of (ii-a) | 768 |
| $d_{\text{text}}$ of (ii-b) | 128 |
| $\ell$ of (iii) | 1024 |
| $d_{\text{ts}}$ of (iv) | 32 |
| $d_{\text{vl}}$ of (v) | 256 |
| $d_{\text{df}}$ of (vi) | 256 |
| Num. attention layers (ALBEF) | 6 |

Table 8: Hyperparameter details.

Utilizing the ALBEF architecture [50], we integrate vision and language embeddings. Specifically, we encode each intermediate feature $I^{\prime}$ into a set $\mathbf{v}=\{\mathbf{v}_{\text{cls}},\mathbf{v}_{1},\mathbf{v}_{2},\dots,\mathbf{v}_{N}\}$, where $N$ denotes the number of segmented patches, similar to the approach in [23]. We feed text embeddings $e^{\prime}$ through MLP layers, producing a sequence of embeddings $\mathbf{u}=\{\mathbf{u}_{\text{cls}},\mathbf{u}_{1},\ldots,\mathbf{u}_{N}\}$. Cross-attention mechanisms in the multimodal encoder integrate image features $\mathbf{v}$ with text features $\mathbf{u}$. The multimodal attention layer outputs an attention map $\tilde{\mathbf{x}}_{0}\in\mathbb{R}^{W\times H}$ and a final text-image representation $z_{\text{vl}}^{\ast}\in\mathbb{R}^{\ell}$.
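As a rough illustration of how cross-attention can fuse the two token sequences and yield a per-patch attention map, consider the single-head NumPy sketch below. It is not the ALBEF implementation: the shared token width of 128, the single head, the absence of learned projections, the use of text tokens as queries, and the $14\times 14$ patch grid (chosen so that $N=196$) are all assumptions made for this example, and the upsampling of the map to $W\times H$ is omitted.

```python
import numpy as np

def softmax(a, axis=-1):
    """Numerically stable softmax along the given axis."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product attention; returns fused tokens and attention weights."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d), axis=-1)
    return weights @ values, weights

rng = np.random.default_rng(0)
N, d = 196, 128                       # N from Table 8; token width d is assumed
v = rng.standard_normal((N + 1, d))   # image tokens {v_cls, v_1, ..., v_N}
u = rng.standard_normal((N + 1, d))   # text tokens  {u_cls, u_1, ..., u_N}

# Text tokens attend to image tokens (an assumption about the attention direction).
fused, weights = cross_attention(u, v, v)

# Attention of the text cls token over the N image patches, laid out on a grid.
patch_attention = weights[0, 1:].reshape(14, 14)
```

Each row of `weights` sums to one, so `patch_attention` can be read as a coarse spatial distribution indicating which image patches the text focuses on.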

Using a sequence of MLP layers, we integrate features from timestep $t+1$, the current state $\mathbf{x}_{t+1}$, and the final text-image representation $z_{\text{vl}}^{\ast}\in\mathbb{R}^{\ell}$ as follows:

$$z_{\text{ts}}=\text{MLP}(t+1)\in\mathbb{R}^{d_{\text{ts}}}. \tag{12}$$
$$z_{\text{vl}}=\text{MLP}(z_{\text{vl}}^{\ast})\in\mathbb{R}^{d_{\text{vl}}}. \tag{13}$$
$$z_{\text{df}}=\text{MLP}(\mathbf{x}_{t+1})\in\mathbb{R}^{d_{\text{df}}}. \tag{14}$$

Finally, we concatenate $z_{\text{ts}}$, $z_{\text{vl}}$, and $z_{\text{df}}$ to form $z$, which is then processed through an additional MLP to yield the decoded state for the denoising process:

$$\mathbf{x}_{t}=\text{MLP}(z)\in\mathbb{R}^{d_{\text{df}}}. \tag{15}$$

Architecture Summary. As outlined in the main paper, our network architecture contains: (i) a vision encoder, (ii) a text encoder followed by MLP layers, (iii) an ALBEF module, (iv) MLP layers to encode timestep information, (v) MLP layers to encode text-image features $z_{\text{vl}}^{\ast}$, (vi) MLP layers to encode the noisy state $\mathbf{x}_{t}$, and (vii) MLP layers to output the denoising state. We summarize the architecture and hyperparameters of LGD in Table 7 and Table 8.
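Components (iv) to (vii) can be sketched as a small NumPy forward pass to show how the sizes in Table 7 and Table 8 fit together. This is a shape-level illustration with random weights, not the trained LGD network: the two-layer ReLU MLPs, the hidden width of 128, and the helper names `make_mlp` and `denoise_step` are assumptions. The output size of component (vii) follows Table 7, i.e. $[M]$.

```python
import numpy as np

# Sizes taken from Table 8.
M, ELL = 5, 1024
D_TS, D_VL, D_DF = 32, 256, 256

rng = np.random.default_rng(0)

def make_mlp(in_dim, out_dim, hidden=128):
    """Two-layer MLP with ReLU; depth and hidden width are assumptions."""
    w1 = rng.standard_normal((in_dim, hidden)) * 0.02
    w2 = rng.standard_normal((hidden, out_dim)) * 0.02

    def forward(x):
        return np.maximum(x @ w1, 0.0) @ w2

    return forward

mlp_ts = make_mlp(1, D_TS)                  # (iv)  timestep encoder, Equation 12
mlp_vl = make_mlp(ELL, D_VL)                # (v)   text-image encoder, Equation 13
mlp_df = make_mlp(M, D_DF)                  # (vi)  noisy-state encoder, Equation 14
mlp_out = make_mlp(D_TS + D_VL + D_DF, M)   # (vii) output size [M] as in Table 7

def denoise_step(t_plus_1, x_next, z_vl_star):
    """Encode the three inputs, concatenate them, and decode the next state."""
    z_ts = mlp_ts(np.array([float(t_plus_1)]))
    z_vl = mlp_vl(z_vl_star)
    z_df = mlp_df(x_next)
    z = np.concatenate([z_ts, z_vl, z_df])
    return mlp_out(z), z

x_t, z = denoise_step(t_plus_1=10,
                      x_next=rng.standard_normal(M),
                      z_vl_star=rng.standard_normal(ELL))
print(z.shape, x_t.shape)  # (544,) (5,)
```

The concatenated feature has $32+256+256=544$ entries, and the decoded state has one entry per grasp parameter.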

Appendix E Experimental Setups

We present implementation details of other baselines in the language-driven grasp detection task.

E.1 Baseline Setups

Linguistic versions of GR-ConvNet [43], Det-Seg-Refine [2], GG-CNN [59]. We make slight modifications to these baselines by adding a component to fuse image and text features from the input. Specifically, we utilize the CLIP text encoder [71] to extract text embeddings $e^{\prime}$. To ensure a fair comparison between methods, we also utilize the ALBEF architecture [50] to fuse the text embedding with the visual features. The training losses and remaining parameters are inherited from the original works.

CLIPORT [78]. The original CLIPORT architecture learns a policy $\pi$, which does not directly solve our task. We modify the CLIPORT architecture’s final layers with appropriately sized MLPs to output grasp poses defined by five parameters $(x, y, w, h, \theta)$. This adaptation ensures consistency with our grasp detection baselines, diverging from CLIPORT’s original policy $\pi$ learning framework.

CLIP-Fusion [96]. In our re-implementation of the architecture from [96], we follow the cross-attention module in CLIP-Fusion with constructed MLP layers. The final MLP layers in the architecture are modified to output five parameters, corresponding to predicted grasp poses.

Figure 13: Overview of the robotic experiment setup. Diagram labels: RealSense, Query, LGD, ROS, Grasping pose generation, Trajectory optimization, NIC, TwinCAT, Real time controller, Robotiq 2F-85.
Figure 14: Additional language-driven grasp detection visualizations. In these fine-grained grasp detection cases, our method successfully differentiates between objects of similar structure but varying color, such as a green bottle and a blue bottle.

E.2 Robotic Setup

In Figure 13, we present the setup of the robotic evaluation conducted on a KUKA robot. Grasp poses are detected by our proposed LGD and the other methods listed in the real robot experiments (Table 4 of the main paper), and the results are translated into 6-DoF grasp poses using depth images captured by an Intel RealSense D435i depth camera, as in [43]. A trajectory planner [86, 5] is employed to execute the grasp. We use two computers for the experiment. The first computer (PC1) runs the real-time control software Beckhoff TwinCAT and interfaces with the Intel RealSense D435i camera and the Robotiq 2F-85 gripper, while the second computer (PC2) runs ROS Noetic on Ubuntu 20.04. PC1 communicates with the robot via a network interface card (NIC) using the EtherCAT protocol. Inference is performed on PC2 with an NVIDIA 3080 GPU. Our assessment encompasses both single-object and cluttered scenarios, involving a diverse set of 20 real-world daily objects (Fig. 17). To ensure robustness and reliability, we repeat each experiment 30 times for every method.
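One ingredient of lifting an image-space grasp to 3D is back-projecting the grasp center with its measured depth. The sketch below shows the standard pinhole-camera back-projection as a generic illustration; it is not the exact procedure of [43], the function name `deproject_pixel` is hypothetical, and the intrinsics are placeholder values rather than calibration data from the D435i used in the experiments.

```python
import numpy as np

def deproject_pixel(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with depth in metres to a 3D camera-frame point.

    Uses the pinhole model without lens distortion (an assumption of this sketch).
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

# Hypothetical intrinsics; real values come from camera calibration.
fx, fy, cx, cy = 600.0, 600.0, 320.0, 240.0

# Grasp center at pixel (380, 240) with a measured depth of 0.5 m.
grasp_center = deproject_pixel(u=380.0, v=240.0, depth_m=0.5,
                               fx=fx, fy=fy, cx=cx, cy=cy)
```

The resulting point is expressed in the camera frame; a hand-eye calibration transform would still be needed to express it in the robot base frame.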

Appendix F Extra Experiments

Number of Parameter Comparison. Table 9 lists the number of parameters of each method alongside its success rate. Among the baselines, larger models tend to achieve higher success rates, indicating a trade-off between performance and the number of parameters. LGD offers a favorable balance between performance and computational resource utilization: with 5.18M parameters, it attains the highest success rate (0.45), above the larger CLIPORT (10.65M, 0.29) and CLIP-Fusion (13.51M, 0.33).

| Baseline | #Parameters | Success rate |
|---|---|---|
| GR-ConvNet [43] + CLIP [71] | 2.07M | 0.24 |
| Det-Seg-Refine [2] + CLIP [71] | 1.82M | 0.20 |
| GG-CNN [59] + CLIP [71] | 1.24M | 0.10 |
| CLIPORT [78] | 10.65M | 0.29 |
| CLIP-Fusion [96] | 13.51M | 0.33 |
| LGD (ours) | 5.18M | 0.45 |

Table 9: Number of parameter comparison.
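The comparison in Table 9 can be read off programmatically. The short script below is only a convenience sketch over the numbers copied from the table; the variable names are illustrative. It identifies the method with the highest success rate and lists which methods use fewer or more parameters than it.

```python
# (parameters in millions, success rate), copied from Table 9
table9 = {
    "GR-ConvNet + CLIP": (2.07, 0.24),
    "Det-Seg-Refine + CLIP": (1.82, 0.20),
    "GG-CNN + CLIP": (1.24, 0.10),
    "CLIPORT": (10.65, 0.29),
    "CLIP-Fusion": (13.51, 0.33),
    "LGD (ours)": (5.18, 0.45),
}

# Method with the highest success rate.
best = max(table9, key=lambda name: table9[name][1])
best_params = table9[best][0]

# Methods with fewer / more parameters than the best-performing one.
smaller_than_best = [n for n, (p, _) in table9.items() if p < best_params]
larger_than_best = [n for n, (p, _) in table9.items() if p > best_params]

# Print the table sorted by parameter count.
for name, (params_m, success) in sorted(table9.items(), key=lambda kv: kv[1][0]):
    print(f"{name:<24s} {params_m:6.2f}M  success={success:.2f}")
```

Sorting by parameter count makes it easy to see that the two largest models do not achieve the highest success rate.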
Figure 15: Prediction failure cases. In cases where objects have similar structures, such as jars and bottles, our LGD occasionally fails to detect correct grasp poses.

Failure Cases. Though achieving satisfactory results, our method still predicts incorrect grasp poses in some cases. The large number of objects and grasping prompts in our dataset poses a significant challenge for the task. Some failure cases are depicted in Fig. 15. From this figure, we can see that the text and the attention map of the visual features are not well aligned, which leads to incorrect prediction of the grasp poses.

Figure 16: Detection results in robotic experiments. Images are captured from a RealSense camera during the experiments shown in Fig. 18.
Figure 17: Set of 20 objects used in the robotic experiment.

Additional Detection Visualization. Fig. 14 illustrates additional language-driven grasp pose detection results of the LGD method. The results demonstrate that our method reasonably aligns grasp poses with linguistic instructions in fine-grained scenarios, for example when distinguishing a green bottle from a blue bottle. These qualitative examples are in line with the contrastive loss objective outlined in our main paper.

Robotic Demonstration. In Fig. 18, we show a sequence of actions when the KUKA robot grasps different objects in cluttered scenes. Fig. 16 further shows the detection result of our LGD method on an image captured by a RealSense camera mounted on the robot. The robotic experiments demonstrate that although our LGD method is trained on the synthetic Grasp-Anything++ dataset, it is still able to generalize to grasp pose detection in real-world images. More illustrations can be found in our Demonstration Video.

Figure 18: Snapshots of two example robotic experiments.