Language-driven Grasp Detection

An Dinh Vuong¹, Minh Nhat Vu²,∗, Baoru Huang³,∗, Nghia Nguyen¹, Hieu Le¹, Thieu Vo⁴, Anh Nguyen⁵
¹FPT Software AI Center, Vietnam ²Automation & Control Institute, TU Wien, Austria ³Imperial College London, UK
⁴Ton Duc Thang University, Vietnam ⁵University of Liverpool, UK ∗Co-corresponding authors
https://airvlab.github.io/grasp-anything

Abstract

Grasp detection is a persistent and intricate challenge with various industrial applications. Recently, many methods and datasets have been proposed to tackle the grasp detection problem. However, most of them do not consider using natural language as a condition to detect the grasp poses. In this paper, we introduce Grasp-Anything++, a new language-driven grasp detection dataset featuring 1M samples, over 3M objects, and upwards of 10M grasping instructions. We utilize foundation models to create a large-scale scene corpus with corresponding images and grasp prompts. We approach the language-driven grasp detection task as a conditional generation problem. Drawing on the success of diffusion models in generative tasks and given that language plays a vital role in this task, we propose a new language-driven grasp detection method based on diffusion models. Our key contribution is the contrastive training objective, which explicitly contributes to the denoising process to detect the grasp pose given the language instructions. We show that our approach is theoretically supported. Extensive experiments show that our method outperforms state-of-the-art approaches and allows real-world robotic grasping. Finally, we demonstrate that our large-scale dataset enables zero-shot grasp detection and is a challenging benchmark for future work.

Figure 1: We present a new dataset and method for the language-driven grasp detection task.

1 Introduction

Imagine we want an assistant robot to grasp a cup among a clutter of daily objects such as a knife, a fork, a cup, and a pair of scissors. To convey the idea of grasping this specific object, humans typically use a natural language command such as “give me the cup”. Although humans intuitively know how to grasp the cup given the linguistic command, determining specific grasp actions for objects based on natural language instructions, i.e., language-driven grasp detection, remains challenging for robots [78]. First, natural language is usually overlooked in existing grasp datasets [69], while training vision-and-language neural networks requires a large number of labeled examples [80]. Second, recent works usually focus on particular manipulation tasks with limited objects [28], imposing a bottleneck for in-the-wild robot execution [77]. Finally, despite recent developments, bridging the gap between language, vision, and control for real-world robotic experiments remains a challenging task [100].

Recently, language-driven robotic frameworks have been gaining traction, offering the potential for robots to process natural language and bridging the gap between robotic manipulation and real-world human-robot interaction [63]. PaLM-E [24], EgoCOT [63], and ConceptFusion [38] are some notable embodied robotic systems with the ability to comprehend natural language by harnessing the power of large foundation models such as ChatGPT [67]. However, most works assume the high-level actions of robots and ignore the fundamental grasping actions, which restricts generalization across robotic domains, tasks, and skills [62]. In this paper, we explore training a language-driven agent to implement low-level actions, focusing on the task of object grasping via image observations. Specifically, our hypothesis is centered around the establishment of a robotic system that can execute grasping actions following a given language instruction for any object.

We first present Grasp-Anything++ to serve as a large-scale dataset for language-driven grasp detection. Our dataset is based on Grasp-Anything [87] and is synthesized from foundation models. Compared to the original Grasp-Anything dataset, we provide more than 10M grasp prompts, 3M associated object masks, and 6M ground truth poses at the object part level. Our dataset showcases the ability to facilitate grasp detection using language instructions. We label the ground truth at both the object level and part level, providing a comprehensive understanding of real-world scenarios. For example, our ground truth includes both general instructions such as “give me the knife” and detailed ones such as “grasp the handle of the steak knife”. We empirically show that our large-scale dataset successfully facilitates zero-shot grasp detection on both vision-based tasks and real-world robotic experiments.

To tackle the challenging language-driven grasp detection task, we propose a new diffusion model-based method. Our selection of diffusion models is motivated by their proven efficacy in conditional generation tasks [33]. These models have shown efficiency beyond image synthesis, including other image-based tasks such as image segmentation [92], and visual grounding [51]. Despite achieving notable success, integrating visual and text features effectively remains a challenge [14] as the majority of existing literature employs latent strategies to combine visual and text features [46]. We address this challenge by employing a new training strategy for learning text and image features, focusing on the use of feature maps as guidance information for grasp pose generation. Our main contribution is a new training objective that incorporates the feature maps and explicitly contributes to the denoising process. In summary, our contributions are three-fold:

• We propose Grasp-Anything++, a large-scale language-driven dataset for grasp detection tasks.

• We propose a diffusion model with a training objective that explicitly contributes to the denoising process to detect the grasp poses.

• We demonstrate that our Grasp-Anything++ dataset and the proposed method outperform other approaches and enable successful robotic applications.

2 Related Work

Grasp Detection. Grasp detection is a popular task in both the computer vision and robotics communities [18, 65, 78, 2, 27]. Recently, establishing robotic systems with the ability to follow natural language commands has been actively researched [100, 78, 96]. The prevalent solution to the language-driven grasp detection task is to split it into two stages: one for grounding the target object, and the other for synthesizing grasp poses from the grounded visual-text correlations [96, 1]. Training in two stages may result in longer inference time [53]. In addition, several works [100, 98, 107] adopt foundation models, such as GroundDINO [54] and GPT-3 [8]. Access to such commercial foundation models is not always available [85], especially on robotic systems with limited resources or an unstable internet connection [47]. In our work, we directly train the model on the large-scale Grasp-Anything++ dataset to inherit the power of a foundation-based dataset, while ensuring a straightforward inference process for the downstream robotic applications.

Language-driven Grasp Detection Datasets. While many grasp datasets have been introduced [39, 18, 68, 49, 94, 57, 61, 25, 27, 60, 26, 10], the majority of them overlook the text modality. Therefore, grasping objects out of clutter typically suffers from ambiguity about which object to grasp [105]. DailyGrasp [100] is one of the first grasp datasets employing natural language for scene descriptions; however, the scene description corpus in this dataset is relatively small and does not specify which part of the object should be grasped. In our work, we present Grasp-Anything++, which is a large-scale language-driven grasping dataset. Furthermore, Grasp-Anything++ describes the grasping object at both the part level and object level, providing more information for the robot to execute the grasping [103].

Diffusion Models for Robotic Applications. Diffusion models [33] have emerged as the new state-of-the-art method for generative tasks [101]. Recently, we have witnessed growing attention to utilizing diffusion models in robotic applications [75]. Liu et al. [55] propose a diffusion model to handle the language-guided object rearrangement task. Diffusion models are also applied to other robotic tasks such as motion planning [11] and trajectory optimization [37]. The authors in [84] present a diffusion model to determine grasp poses by minimizing an SDF loss. Overall, the diffusion models employed in previous works often combine visual and text features in a latent mechanism [7], which may cause interpretability problems [100] for robotic systems that require low-level controls [22]. To tackle this challenge, we propose a training objective that explicitly contributes to the denoising process. We demonstrate that our proposed strategy is theoretically supported and is more effective than the latent strategy.

3 The Grasp-Anything++ Dataset

We utilize large-scale foundation models to create Grasp-Anything++. Our dataset offers open-vocabulary grasping commands and images with associated ground truth. There are three key steps in establishing our dataset: i) prompting procedure, ii) image synthesis and grasp pose annotation, and iii) post-processing.

3.1 Prompting Procedure

We first establish prompt-based procedures to generate a large-scale scene description corpus as well as grasp prompt instructions. In particular, we utilize ChatGPT to generate the prompts for two tasks: i) Scene descriptions: Sentences capturing the scene arrangement, including the extracted object and part lists, and ii) Grasp instructions: Prompts directing the robot to grasp specific objects or parts.

We follow a procedure in Table 1 to implement ChatGPT’s output templates. The reference target in the grasp instruction may be either an object or an object’s part. When the reference is an object’s part, that part is directly selected as the reference in the grasp instruction sentence. If the reference is an object, we determine the grasping region on the part of the object that is likely to be grasped in everyday scenarios as described in affordance theory [66].

3.2 Image Synthesis and Grasp Annotation

Image Synthesis. Given the scene description corpus, we first utilize a large-scale pretrained text-to-image model, namely Stable Diffusion [74], to generate images from scene descriptions. Next, we perform a series of visual grounding and image segmentation steps using OFA [89], Segment-Anything [42], and VLPart [81] to locate the object or part referenced in the grasp instruction.

Figure 2(a): Number of categories.

Grasp Annotation. Grasp poses are represented as 2D rectangles, consistent with prior research and practical compatibility with real-world parallel grippers [39, 18]. Utilizing a pretrained network [10], we annotate grasp poses based on part segmentation masks. Since potential inaccuracies in these candidate poses could occur, we follow the procedure as defined in [87] to evaluate the quality of generated grasp poses to discard unreasonable grasp poses.

Specifically, grasp quality is evaluated through the net torque $\mathcal{T}=\left(\tau_{1}+\tau_{2}\right)-RMg$, where the resistance at each contact point is $\tau_{i}=K\mu_{s}F\cos\alpha_{i}$. Here $M$ (mass), $g$ (gravitational acceleration), $K$ (geometrical characteristics), $\mu_{s}$ (static friction coefficient), and $F$ (applied force) are constants. Determining $\mathcal{T}$ directly is challenging due to the physical difficulties in precisely measuring $M$, $K$, and $\mu_{s}$. Thus, we employ a surrogate measure, $\tilde{\mathcal{T}}=\dfrac{\cos\alpha_{1}+\cos\alpha_{2}}{R}$, as an alternative. As shown in [13], antipodal grasps score higher on $\tilde{\mathcal{T}}$, indicating better quality. Consequently, grasps are evaluated based on $\tilde{\mathcal{T}}$, with positive values indicating positive grasps and all others considered negative.
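The surrogate measure above can be sketched in a few lines of Python. This is only an illustration, not the annotation code used for the dataset: the function names are hypothetical, and we assume that $\alpha_{1}$ and $\alpha_{2}$ are given in radians and that $R$ is a positive scalar, since the text does not fix units or conventions.

```python
import math


def surrogate_torque(alpha1: float, alpha2: float, r: float) -> float:
    """Surrogate grasp quality (cos(alpha1) + cos(alpha2)) / R.

    Assumptions (not stated in the text): angles are in radians and
    R is a strictly positive scalar.
    """
    if r <= 0:
        raise ValueError("R must be positive")
    return (math.cos(alpha1) + math.cos(alpha2)) / r


def is_positive_grasp(alpha1: float, alpha2: float, r: float) -> bool:
    """A grasp is labeled positive when the surrogate score is above zero."""
    return surrogate_torque(alpha1, alpha2, r) > 0.0
```

Under these assumptions, a grasp with both angles at zero attains the largest score for a given $R$, and the score becomes negative once the cosines sum to a negative value.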

Post Processing. Despite training on extensive datasets, Stable Diffusion [74] may produce subpar content, commonly termed hallucination [36], when generating images from the text prompts. To address this, we perform manual reviews to filter out such images, with qualitative examples in our figures. Our process includes checks at every stage to prevent duplicate or hallucinated content. However, manual inspection introduces biases, which we counter with guidelines focusing on abnormal structures or implausible gravity (Fig. 3), in line with the approach in the literature [76].

Additionally, ChatGPT-generated scene prompts often contain duplicates [56]. To address this, we use duplication checking, filtering out near-identical prompts with BERTScore [106], which assesses sentence similarity through cosine similarities of token embeddings. We remove sentences with a BERTScore above 0.85, as in a prior study [90].
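The filtering step can be illustrated with a minimal sketch. To keep the example self-contained, a simple word-overlap (Jaccard) score stands in for BERTScore; this substitution, the greedy keep-first order, and the function names are our assumptions, and in practice the `similarity` argument would be replaced by a real BERTScore computation.

```python
from typing import Callable, List


def jaccard_similarity(a: str, b: str) -> float:
    """Toy stand-in for BERTScore: word-set overlap in [0, 1]."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def deduplicate(
    prompts: List[str],
    similarity: Callable[[str, str], float] = jaccard_similarity,
    threshold: float = 0.85,
) -> List[str]:
    """Greedily keep a prompt only if it is not too similar to any kept one."""
    kept: List[str] = []
    for prompt in prompts:
        # Drop the prompt if its score against any kept prompt exceeds the threshold.
        if all(similarity(prompt, other) <= threshold for other in kept):
            kept.append(prompt)
    return kept
```

The greedy scheme compares each prompt against all kept prompts, so its cost grows quadratically with corpus size; a large corpus would likely need batching or approximate nearest-neighbor search.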

3.3 Data Statistics

Number of Categories. To evaluate the object category diversity, we apply a methodology akin to that in [17]. Utilizing 300 categories from the LVIS dataset [29], we employ a pretrained model [108] to identify 300 candidate objects from our dataset for each category. We then curate a subset comprising 90,000 objects, refining it by excluding items that do not align semantically with their designated categories. A category is considered significant if it has more than 40 objects. Fig. 2(a) shows the results. Overall, our dataset spans over $236$ categories from the LVIS dataset, indicating a notable degree of object diversity in our dataset.

Scene Descriptions. Fig. 2(b) shows the distribution of scene descriptions based on sentence length. The analysis reveals a wide range of sentence lengths, spanning from 10 words to 100 words per sentence. On average, each scene description consists of approximately 54 words, indicative of detailed and descriptive sentences. These scene descriptions correspond to sets of grasp instructions. Fig. 2(c) further shows the objects and object parts in scene descriptions.

Diversity Analysis. We assess the diversity of occlusion and lighting conditions in the dataset. Regarding occlusion, we use a pretrained YOLOv5 model to identify objects within images. The results indicate that $93.8\%$ of images have a substantial overlap of five or more bounding boxes, which suggests a diverse range of occlusion within the Grasp-Anything++ dataset. Regarding lighting conditions, we convert images to YCbCr to analyze the Y channel (luminance) and find that Grasp-Anything++ has the most diverse lighting conditions, as indicated by the lowest Gini coefficient (a metric that measures the inequality of a distribution) of $0.26$, compared to VMRD [104] ($0.31$), OCID-grasp [2] ($0.32$), Cornell [39] ($0.62$), and Jacquard [18] ($0.91$).
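As a rough illustration of the lighting analysis, the sketch below computes a luminance value and a Gini coefficient in plain Python. The text does not specify which quantity the coefficient is computed over, so applying it to a list of non-negative values (for example, the bin counts of a luminance histogram) is an assumption, as are the function names and the use of the ITU-R BT.601 weights for the Y channel.

```python
from typing import Sequence


def luminance(r: float, g: float, b: float) -> float:
    """Y channel of YCbCr for 8-bit RGB, using ITU-R BT.601 weights."""
    return 0.299 * r + 0.587 * g + 0.114 * b


def gini(values: Sequence[float]) -> float:
    """Gini coefficient of non-negative values (0 means perfectly even)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the sorted values, with ranks starting at 1.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n
```

With this definition, evenly populated values give a coefficient of 0, while mass concentrated in a single entry pushes it towards 1, which is consistent with reading a lower coefficient as more diverse lighting.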

4 Language-driven Grasp Detection

Motivation. The use of diffusion models for language-driven grasp detection is motivated by their effectiveness in various generative tasks [33, 101, 44, 55, 16]. Conditional generation, such as our language-driven grasp detection task, aligns seamlessly with diffusion models’ capabilities [34]. Moreover, language-driven grasp detection represents a fine-grained problem in which the outputs strongly depend on the text input [4]. For example, “grasp the steak knife” and “grasp the kraft knife” refer to two different objects in the image. To this end, we propose using a contrastive loss with a diffusion model to tackle this task, as contrastive learning is a popular solution for fine-grained tasks [9, 21, 102].

4.1 Contrastive Loss for Diffusion Model

We represent the target grasp pose as $\mathbf{x}_{0}$ in the diffusion model. The objective of our diffusion process of language-driven grasp detection involves denoising from a noisy state $\mathbf{x}_{T}$ to the original grasp pose $\mathbf{x}_{0}$ , conditioned on the input image and grasp instruction represented by $y$ .

In a diffusion process [33], assume that $q(\mathbf{x}_{1:T}|\mathbf{x}_{0})$ is the forward process and we parameterize the reverse process by $p_{\theta}(\mathbf{x}_{0:T})$ . The conditional diffusion process [20] assumes $\hat{q}$ is the forward process but with the inclusion of a condition $y$ . The goal of the reverse process is to optimize the variational bound on negative log likelihood [33]

\mathcal{L}=\mathbb{E}\left[-\log p_{\theta}(\mathbf{x}_{T})-\sum_{t\geq 1}\log\frac{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{q(\mathbf{x}_{t}|\mathbf{x}_{t-1})}\right].

(1)

We prove in the Appendix that

\begin{gathered}\mathcal{L}=\mathbb{E}\bigg[\underbrace{C-\log p_{\theta}(\mathbf{x}_{T})}_{\text{Constant}}+\underbrace{\log\hat{q}(\mathbf{x}_{0}|\mathbf{x}_{T},y)}_{\text{Contrastive}}+\\ \underbrace{\sum_{t>1}D_{\text{KL}}(\hat{q}(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0},y)\|p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t}))-\log p_{\theta}(\mathbf{x}_{0}|\mathbf{x}_{1},y)}_{\text{Denoising score}}\bigg].\end{gathered}

(2)

The terms $D_{\text{KL}}(\hat{q}(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0},y)\|p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t}))$ and $\log p_{\theta}(\mathbf{x}_{0}|\mathbf{x}_{1},y)$ in Equation 2 are similar to the concept of denoising score used in [33]. Thus, we can represent the quantity $\sum_{t>1}D_{\text{KL}}(\hat{q}(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0},y)\|p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t}))-\log p_{\theta}(\mathbf{x}_{0}|\mathbf{x}_{1},y)$ by the loss utilized in [82, 83]

\mathcal{L}_{\rm{diffusion}}=\mathbb{E}_{\mathbf{x}_{0}\sim q(\mathbf{x}_{0}|y),t\sim[1,T]}\left[\mathbf{x}_{0}-f(\mathbf{x}_{t+1},t+1,\tilde{\mathbf{x}}_{0})\right]^{2}.

(3)

Since $q(\mathbf{x}_{T}|\cdot)$ is equivalent to an isotropic Gaussian distribution as $T\rightarrow+\infty$ , the estimation quantity $\log p_{\theta}(\mathbf{x}_{T})$ converges to a constant when $\theta\rightarrow\theta^{\ast}$ . Therefore, we can ignore the first term of Equation 2.

Finally, the term $\hat{q}(\mathbf{x}_{0}|\mathbf{x}_{T},y)$ of Equation 2 provides more information about the relation between $\mathbf{x}_{T}$ and $\mathbf{x}_{0}$. As this term is intractable [79], we parameterize $\hat{q}(\mathbf{x}_{0}|\mathbf{x}_{T},y)$ as $p_{\psi,y}(\mathbf{x}_{0},\mathbf{x}_{T})$. This estimation resembles noise-contrastive estimation [30], where $\mathbf{x}_{T}$ and $\mathbf{x}_{0}$ can be considered as a contrastive pair and $\psi$ can be estimated by a contrastive loss.

In [88, 20, 82], the authors indicate that predicting $\mathbf{x}_{0}$ is often infeasible but predicting an estimation $\tilde{\mathbf{x}}_{0}$ is tractable and can be used as a ‘pseudo’ estimation of $\mathbf{x}_{0}$ . We denote $\tilde{\mathbf{x}}_{0}$ as an estimation of $\mathbf{x}_{0}$ . The loss term ${\hat{q}(\mathbf{x}_{0}|\mathbf{x}_{T},y)}$ can be approximated by using the following contrastive loss

\mathcal{L}_{\rm{contrastive}}=\max\left(0,\left\|\frac{\sqrt{\overline{\alpha_{T}}}\tilde{\mathbf{x}}_{0}-\mathbf{x}_{T}}{\sqrt{1-\overline{\alpha_{T}}}}\right\|_{2}^{2}-M\right),

(4)

where $M$ is the number of dimensions of $\mathbf{x}_{0}$, and $\alpha_{t}$ is the variance schedule at timestep $t$ ($t=\overline{1;T}$).
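Equation 4 can be evaluated for a single sample as in the sketch below. This is an illustration under stated assumptions rather than the training code: the function name is hypothetical, $\tilde{\mathbf{x}}_{0}$ and $\mathbf{x}_{T}$ are treated as plain vectors of equal length $M$, and $\overline{\alpha_{T}}$ is taken to be a scalar in $(0,1)$, which we assume to be the cumulative product of the schedule as in [33].

```python
from typing import Sequence


def contrastive_loss(
    x0_est: Sequence[float], x_T: Sequence[float], alpha_bar_T: float
) -> float:
    """Hinge-style loss max(0, ||(sqrt(a) * x0_est - x_T) / sqrt(1 - a)||^2 - M).

    Here a = alpha_bar_T and M = len(x0_est). Treating the inputs as flat
    vectors is an assumption made for illustration.
    """
    if not 0.0 < alpha_bar_T < 1.0:
        raise ValueError("alpha_bar_T must lie strictly between 0 and 1")
    if len(x0_est) != len(x_T):
        raise ValueError("x0_est and x_T must have the same length")
    m = len(x0_est)
    scale = alpha_bar_T ** 0.5
    # Squared norm of the scaled difference; dividing by (1 - a) is the same
    # as dividing the vector by sqrt(1 - a) before taking the squared norm.
    sq_norm = sum((scale * a - b) ** 2 for a, b in zip(x0_est, x_T))
    sq_norm /= 1.0 - alpha_bar_T
    return max(0.0, sq_norm - m)
```

The hinge at zero means the loss only penalizes samples whose normalized squared distance exceeds $M$; samples below that level contribute nothing.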

Proposition 1.

Suppose that $\tilde{\mathbf{x}}_{0},\,\mathbf{x}_{0}$ and $\epsilon$ are independent, and that

\left\|\frac{\sqrt{\overline{\alpha_{T}}}\tilde{\mathbf{x}}_{0}-\mathbf{x}_{T}}{\sqrt{1-\overline{\alpha_{T}}}}\right\|_{2}^{2}\geq M.

Then there exists $C>0$ such that: for arbitrary $\delta>0$ , if $\mathcal{L}_{\rm{contrastive}}<\delta$ , then

\mathbb{E}\left[\|\tilde{\mathbf{x}}_{0}-\mathbf{x}_{0}\|_{2}^{2}\right]<C\delta.

Proof.

See Supplementary Material. ∎

Remark 1.1.

Proposition 1 suggests that if the contrastive loss $\mathcal{L}_{\rm{contrastive}}$ tends to zero, then the prediction $\tilde{\mathbf{x}}_{0}$ will approach the ground truth $\mathbf{x}_{0}$ .

Remark 1.2.

The condition $\left\|\frac{\sqrt{\overline{\alpha_{T}}}\tilde{\mathbf{x}}_{0}-\mathbf{x}_{T}}{\sqrt{1-\overline{\alpha_{T}}}}\right\|_{2}^{2}\geq M$ is suitable for our language-driven grasp detection task: since $\tilde{\mathbf{x}}_{0}$ and $\mathbf{x}_{T}$ are two contrastive quantities, we can assume there is a minimum distance between them. In addition, in the proof of Proposition 1, we see that $\mathbb{E}\left[\left\|\frac{\sqrt{\overline{\alpha_{T}}}\tilde{\mathbf{x}}_{0}-\mathbf{x}_{T}}{\sqrt{1-\overline{\alpha_{T}}}}\right\|_{2}^{2}-M\right]=\beta^{2}\mathbb{E}\left[\|\tilde{\mathbf{x}}_{0}-\mathbf{x}_{0}\|_{2}^{2}\right]$, which is always nonnegative. Therefore, it is both theoretically and experimentally reasonable to add this assumption.
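The identity quoted in Remark 1.2 can be made plausible with a short calculation. The sketch below is not the proof from the Supplementary Material; it assumes the standard closed-form forward process of [33] with $\epsilon\sim\mathcal{N}(\mathbf{0},\mathbf{I}_{M})$, together with the independence assumption of Proposition 1. Under these assumptions, the constant $\beta^{2}$ would correspond to $\overline{\alpha_{T}}/(1-\overline{\alpha_{T}})$, which is our reading rather than a definition taken from the text.

```latex
\begin{align*}
\mathbf{x}_{T} &= \sqrt{\overline{\alpha_{T}}}\,\mathbf{x}_{0}
  + \sqrt{1-\overline{\alpha_{T}}}\,\epsilon
  && \text{(assumed forward process)} \\
\frac{\sqrt{\overline{\alpha_{T}}}\,\tilde{\mathbf{x}}_{0}-\mathbf{x}_{T}}
     {\sqrt{1-\overline{\alpha_{T}}}}
  &= \sqrt{\frac{\overline{\alpha_{T}}}{1-\overline{\alpha_{T}}}}\,
     (\tilde{\mathbf{x}}_{0}-\mathbf{x}_{0}) - \epsilon \\
\mathbb{E}\left[\left\|
  \frac{\sqrt{\overline{\alpha_{T}}}\,\tilde{\mathbf{x}}_{0}-\mathbf{x}_{T}}
       {\sqrt{1-\overline{\alpha_{T}}}}\right\|_{2}^{2}\right]
  &= \frac{\overline{\alpha_{T}}}{1-\overline{\alpha_{T}}}\,
     \mathbb{E}\left[\|\tilde{\mathbf{x}}_{0}-\mathbf{x}_{0}\|_{2}^{2}\right]
   + \mathbb{E}\left[\|\epsilon\|_{2}^{2}\right] \\
  &= \frac{\overline{\alpha_{T}}}{1-\overline{\alpha_{T}}}\,
     \mathbb{E}\left[\|\tilde{\mathbf{x}}_{0}-\mathbf{x}_{0}\|_{2}^{2}\right] + M
\end{align*}
```

The cross term vanishes because $\epsilon$ has zero mean and is independent of $\tilde{\mathbf{x}}_{0}-\mathbf{x}_{0}$, and $\mathbb{E}\left[\|\epsilon\|_{2}^{2}\right]=M$ for a standard Gaussian in $M$ dimensions. This also gives an intuition for subtracting $M$ in Equation 4: a perfect estimate $\tilde{\mathbf{x}}_{0}=\mathbf{x}_{0}$ leaves only the noise term, whose expected squared norm is $M$.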

4.2 Language-driven Grasp Detection Network

Network. Our network operates on two conditions: an image denoted as $I$ and a corresponding text prompt represented as $e$. To process these conditions, we employ a vision encoder to extract visual features from $I$ and a text encoder to derive textual embeddings from $e$. The resulting feature vectors, denoted as $I^{\prime}$ and $e^{\prime}$, are subsequently passed to a fusion module, ALBEF [50]. We leverage the attention mask generated by the ALBEF fusion module as the estimation $\tilde{\mathbf{x}}_{0}$ of $\mathbf{x}_{0}$. Next, we aggregate three elements: the estimation region $\tilde{\mathbf{x}}_{0}$, the grasp pose at the current timestep $\mathbf{x}_{t+1}$, and the timestep $t+1$. These inputs are combined using MLP layers, similar to the approach outlined in [82]. Specifically, the output operation can be expressed as $\mathbf{x}_{t}=f(\mathbf{x}_{t+1},t+1,\tilde{\mathbf{x}}_{0})$, where the function $f$ is a composition of multiple MLP layers. Additional specifics regarding these MLP layers are provided in the Supplementary Material.
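The recurrence $\mathbf{x}_{t}=f(\mathbf{x}_{t+1},t+1,\tilde{\mathbf{x}}_{0})$ can be unrolled from the noisy state to the final grasp pose as in the sketch below. It only illustrates the control flow: `denoise` and `toy_f` are hypothetical names, `toy_f` is a hand-written stand-in for the learned MLP composition, and representing the grasp pose and the guidance $\tilde{\mathbf{x}}_{0}$ as equal-length lists of floats is an assumption.

```python
from typing import Callable, List

Vector = List[float]
StepFn = Callable[[Vector, int, Vector], Vector]


def denoise(x_T: Vector, x0_est: Vector, f: StepFn, num_steps: int) -> Vector:
    """Apply x_t = f(x_{t+1}, t+1, x0_est) for t = num_steps-1, ..., 0."""
    x = list(x_T)
    for t in range(num_steps - 1, -1, -1):
        # x currently holds x_{t+1}; f returns the less noisy state x_t.
        x = f(x, t + 1, x0_est)
    return x


def toy_f(x_next: Vector, step: int, x0_est: Vector) -> Vector:
    """Hypothetical stand-in for the learned MLPs: move halfway to the guidance."""
    return [0.5 * (a + b) for a, b in zip(x_next, x0_est)]
```

In the actual network, `f` would be the trained model and the guidance would come from the fusion module; the loop structure is the only part this sketch is meant to convey.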

Training Objective. In our context, conditioned grasp detection models the distribution $p(\mathbf{x}_{0}|y)$ as the reverse diffusion process that gradually denoises $\mathbf{x}_{t+1}$. Instead of predicting $\mathbf{x}_{t}$ as formulated by [33], we follow Ramesh et al. [72] and predict the signal itself, i.e., $\mathbf{x}_{t}=f(\mathbf{x}_{t+1},t+1,\tilde{\mathbf{x}}_{0})$ with the simple objective [33]. To this end, we utilize the contrastive loss in Equation 4 to explicitly improve the learning objective of the denoising process:

\mathcal{L}_{\rm{total}}=\mathcal{L}_{\rm{contrastive}}+\mathcal{L}_{\rm{diffusion}}.

(5)

5 Experiments

We conduct experiments to evaluate our proposed method and Grasp-Anything++ dataset using both the vision-based metrics and real robot experiments. We then demonstrate zero-shot grasp results and discuss the challenges and open questions for future works.

5.1 Language-driven Grasp Detection Results

Baselines. We compare our language-driven grasp detection method (LGD) with the linguistically supported versions of GR-ConvNet [43], Det-Seg-Refine [2], GG-CNN [59], CLIPORT [78], and CLIP-Fusion [96]. In all cases, we employ a pretrained CLIP [71] or BERT [19] as the text embedding. The implementation details of all baselines can be found in our Supplementary Material.

Setup. To assess the generalization of all methods trained on Grasp-Anything++, we utilize the concept of base and new labels [109] in zero-shot learning. We categorize LVIS labels from Section 3.3 to form labels for our experiment. In particular, we select 70% of these labels by frequency for ‘Base’ and assign the remaining 30% to ‘New’. We also use the harmonic mean (‘H’) to measure the overall success rates [109]. Our primary evaluation metric is the success rate, defined similarly to [43], necessitating an IoU score of the predicted grasp exceeding $25\%$ with the ground truth grasp and an offset angle less than $30^{\circ}$ .
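The success criterion and the harmonic mean described above can be written compactly as follows. This is a sketch with hypothetical function names: the IoU is taken as a precomputed input (computing it for rotated rectangles is omitted), and comparing angles modulo $180^{\circ}$ is our assumption, based on a grasp rectangle being unchanged by a half-turn.

```python
def angle_offset_deg(pred_angle: float, gt_angle: float) -> float:
    """Smallest offset between two grasp angles, treated as 180-degree periodic."""
    diff = abs(pred_angle - gt_angle) % 180.0
    return min(diff, 180.0 - diff)


def is_successful(
    iou: float,
    pred_angle: float,
    gt_angle: float,
    iou_threshold: float = 0.25,
    angle_threshold: float = 30.0,
) -> bool:
    """Success requires IoU above 25% and an angle offset below 30 degrees."""
    return (
        iou > iou_threshold
        and angle_offset_deg(pred_angle, gt_angle) < angle_threshold
    )


def harmonic_mean(base: float, new: float) -> float:
    """Harmonic mean 'H' of the success rates on 'Base' and 'New' labels."""
    if base + new == 0:
        return 0.0
    return 2.0 * base * new / (base + new)
```

The harmonic mean is dominated by the smaller of the two rates, so a method cannot obtain a high 'H' by doing well on 'Base' labels alone.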

Baseline	Seen	Unseen	H
GR-ConvNet [43] + CLIP [71]	0.37	0.18	0.24
Det-Seg-Refine [2] + CLIP [71]	0.30	0.15	0.20
GG-CNN [59] + CLIP [71]	0.12	0.08	0.10
CLIPORT [78]	0.36	0.26	0.29
CLIP-Fusion [96]	0.40	0.29	0.33
LGD (ours) + BERT [19]	0.44	0.38	0.41
LGD (ours) + CLIP [71]	0.48	0.42	0.45

Table 2: Language-driven grasp detection results.

Main Results. Table 2 shows the results of language-driven grasp detection on the Grasp-Anything++ dataset. The findings indicate a notable performance advantage of our LGD over the baselines: with CLIP text embeddings, LGD outperforms the next best baseline (CLIP-Fusion) by $0.08$ on seen classes, $0.13$ on unseen classes, and $0.12$ in harmonic mean.

Baseline	Seen	Unseen	H
LGD w/o predicting $\tilde{\mathbf{x}}_{0}$	0.15	0.08	0.10
LGD w/o contrastive loss	0.45	0.40	0.42
LGD w/ contrastive loss	0.48	0.42	0.45

Table 3: Contrastive loss analysis.

Contrastive Loss Analysis. Table 3 presents the performance of LGD under varied configurations. The outcomes emphasize the substantial influence of the training objective (contrastive loss) and the importance of language instructions in enhancing LGD performance on both seen and unseen classes in the Grasp-Anything++ dataset.

Fig. 6 shows the contrastive loss $\mathcal{L}_{\text{contrastive}}$ approaching $0$ during training, indicating that the grasp pose estimation $\tilde{\mathbf{x}}_{0}$ aligns with the ground truth $\mathbf{x}_{0}$, as anticipated by Proposition 1. The attention map visualization in Fig. 8 shows that the attention region is meaningful and that the results improve when our proposed contrastive loss is employed. Moreover, we employ t-SNE for vision-and-language embedding visualization, as in [58], by processing $2,000$ samples from the Grasp-Anything++ dataset through the ALBEF module. The outcomes reveal that our contrastive loss facilitates better object classification, as evidenced in Fig. 7 by the clearer separation of pixel embeddings across semantic classes.

Qualitative Results. Fig. 5 presents qualitative results of the language-driven grasp detection task, suggesting that our LGD method generates more semantically plausible grasp poses than other baselines. Despite satisfactory performance, LGD occasionally predicts incorrect results, with a detailed analysis of these cases available in our Appendix.

Baseline	Single	Cluttered
GR-ConvNet [43] + CLIP [71]	0.33	0.30
Det-Seg-Refine [2] + CLIP [71]	0.30	0.23
GG-CNN [59] + CLIP [71]	0.10	0.07
CLIPORT [78]	0.27	0.30
CLIP-Fusion [96]	0.40	0.40
LGD (ours)	0.43	0.42

Table 4: Robotic language-driven grasp detection results.

Robotic Validation. We provide quantitative results by integrating our language-driven grasp detection pipeline into a robotic grasping application with a KUKA LBR iiwa R820 robot. Using a RealSense D435i camera, the grasp pose inferred by each approach in Table 4 is transformed into a 6DoF grasp pose, similar to [43]. The optimization-based trajectory planner in [86, 6] is employed to execute the grasps. Experiments are conducted for two scenarios, i.e., the single object scenario and the cluttered scene scenario, with a set of $20$ real-world daily objects. In each scenario, we run $30$ experiments using the baselines listed in Table 4 and a predefined grasping prompt corpus. The results show that our LGD outperforms the other baselines. Furthermore, although LGD is trained on Grasp-Anything++, which is a purely synthetic dataset created by foundation models, it still shows reasonable results on real-world objects.

	Grasp-Anything++ (ours)			Jacquard [18]			Cornell [39]			VMRD [104]			OCID-grasp [2]
Baseline	Base	New	H	Base	New	H	Base	New	H	Base	New	H	Base	New	H
GR-ConvNet [43]	0.71	0.59	0.64	0.88	0.66	0.75	0.98	0.74	0.84	0.77	0.64	0.70	0.86	0.67	0.75
Det-Seg-Refine [2]	0.62	0.57	0.59	0.86	0.60	0.71	0.99	0.76	0.86	0.75	0.60	0.66	0.80	0.62	0.70
GG-CNN [59]	0.68	0.57	0.62	0.78	0.56	0.65	0.96	0.75	0.84	0.69	0.53	0.59	0.71	0.63	0.67
LGD (no text) (ours)	0.74	0.63	0.68	0.89	0.69	0.77	0.97	0.76	0.85	0.79	0.66	0.72	0.88	0.68	0.76

Table 5: Base-to-new zero-shot grasp detection results.

5.2 Zero-shot Grasp Detection

Our proposed Grasp-Anything++ is a large-scale dataset. Apart from the language-driven grasp detection task, we believe it can be used for other purposes. In this experiment, we seek to answer the question: Can Grasp-Anything++ be useful in the traditional grasp detection task without text? Consequently, we verify our Grasp-Anything++ and LGD (no text) with other existing datasets and grasping methods.

Setup. We set up an LGD (no text) version and compare it with other state-of-the-art grasp detection methods: GR-ConvNet [43], Det-Seg-Refine [2], and GG-CNN [59]. We use five datasets in this experiment: our Grasp-Anything++, Jacquard [18], Cornell [39], VMRD [104], and OCID-grasp [2].

Zero-shot Results. Table 5 summarizes the base-to-new grasp detection results on five datasets. Overall, LGD, even without the language branch, outperforms the other baselines on four of the five datasets and remains within $0.02$ of the best result on Cornell. Furthermore, this table also shows that our Grasp-Anything++ dataset is more challenging to train on, as the detection results are lower than on related datasets using the same approaches, due to the greater coverage of unseen objects in the testing phase.

	Jacquard	Cornell	VMRD	OCID-grasp	Grasp-Anything++
Jacquard [18]	0.87	0.51	0.13	0.21	0.17
Cornell [39]	0.07	0.98	0.20	0.12	0.13
VMRD [104]	0.06	0.21	0.79	0.11	0.10
OCID-grasp [2]	0.09	0.12	0.20	0.74	0.11
Grasp-Anything++ (ours)	0.41	0.63	0.30	0.39	0.65

Table 6: Cross-dataset grasp detection results.

Cross-dataset Evaluation. To further verify the usefulness of our Grasp-Anything++ dataset, we conduct the cross-dataset validation in Table 6. We use GR-ConvNet [43] so that we can reuse its results on existing grasp datasets. GR-ConvNet is trained on one dataset (row) and evaluated on another dataset (column). For example, training on Jacquard and testing on Cornell yields an accuracy of $0.51$. Notably, training with our dataset improves performance by approximately $10$–$33\%$ compared to other datasets.

In-the-wild Grasp Detection. Fig. 9 shows visualization results of LGD (no text), trained on our Grasp-Anything++ dataset, on random internet images and images from other datasets. We can see that the detected grasp poses are adequate in quality and quantity. This demonstrates that although Grasp-Anything++ is fully created by foundation models without any real images, models trained on it still generalize well to real-world images.

5.3 Discussion

Our experiments indicate that Grasp-Anything++ can serve as a foundation dataset for both language-driven and traditional grasp detection tasks. However, there are certain limitations. First, our dataset lacks depth images, which prevents it from being applied directly to robotic applications [64]. Second, we remark that the creation of our dataset is time-consuming and relies on access to the ChatGPT API. Fortunately, future research can reuse our provided assets (images, prompts, etc.) without starting from scratch. Furthermore, our experiments show that adding language to the grasp detection task (Table 2) poses a more challenging problem than the standard grasp detection task (Table 5).

We see several interesting future research directions. First, future work could investigate the use of text or image-to-3D models [95] or image-to-depth [73] and reuse our dataset’s prompts and images to construct 3D language-driven grasp datasets. Additionally, beyond linguistic grasp instruction adherence, our dataset holds potential for varied applications, including scene understanding [97] and scene generation [3], hallucination analysis [36], and human-robot interaction [12].

6 Conclusion

We introduce Grasp-Anything++, a large-scale dataset with 1M images and 10M grasp prompts for language-driven grasp detection tasks. We propose LGD, a diffusion-based method to tackle the language-driven grasp detection task. Our diffusion model employs a contrastive training objective, which explicitly contributes to the denoising process. Empirically, we have shown that Grasp-Anything++ serves as a foundation grasp detection dataset. Finally, our LGD outperforms other baselines, and the real-world robotic experiments further validate the effectiveness of our dataset and approach.

References

[1] Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022.
[2] Stefan Ainetter and Friedrich Fraundorfer. End-to-end trainable deep neural network for robotic grasp detection and semantic segmentation from rgb. In ICRA, 2021.
[3] Dor Arad Hudson and Larry Zitnick. Compositional transformers for scene generation. NeurIPS, 2021.
[4] Umar Asif, Jianbin Tang, and Stefan Harrer. Graspnet: An efficient convolutional neural network for real-time grasp detection for low-powered devices. In IJCAI, 2018.
[5] Florian Beck, Minh Nhat Vu, Christian Hartl-Nesic, and Andreas Kugi. Singularity avoidance with application to online trajectory optimization for serial manipulators. arXiv preprint arXiv:2211.02516, 2022.
[6] Florian Beck, Minh Nhat Vu, Christian Hartl-Nesic, and Andreas Kugi. Singularity avoidance with application to online trajectory optimization for serial manipulators. IFAC-PapersOnLine, 56(2):284–291, 2023.
[7] Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In CVPR, 2023.
[8] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. NeurIPS, 33, 2020.
[9] Guy Bukchin, Eli Schwartz, Kate Saenko, Ori Shahar, Rogerio Feris, Raja Giryes, and Leonid Karlinsky. Fine-grained angular contrastive learning with coarse labels. In CVPR, 2021.
[10] Boyuan Cao, Xinyu Zhou, Congmin Guo, Baohua Zhang, Yuchen Liu, and Qianqiu Tan. Nbmod: Find it and grasp it in noisy background. arXiv preprint arXiv:2306.10265, 2023.
[11] Joao Carvalho, An T Le, Mark Baierl, Dorothea Koert, and Jan Peters. Motion planning diffusion: Learning and planning of robot motions with diffusion models. arXiv preprint arXiv:2308.01557, 2023.
[12] Paola Cascante-Bonilla, Hui Wu, Letao Wang, Rogerio S Feris, and Vicente Ordonez. Simvqa: Exploring simulated environments for visual question answering. In CVPR, 2022.
[13] I-Ming Chen and Joel W Burdick. Finding antipodal point grasps on irregularly shaped objects. T-RA, 1993.
[14] Sijia Chen and Baochun Li. Language-guided diffusion model for visual grounding. arXiv preprint arXiv:2308.09599, 2023.
[15] Xiahan Chen, Weishen Wang, Yu Jiang, and Xiaohua Qian. A dual-transformation with contrastive learning framework for lymph node metastasis prediction in pancreatic cancer. Medical Image Analysis, 85:102753, 2023.
[16] Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah. Diffusion models in vision: A survey. TPAMI, 2023.
[17] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In CVPR, 2023.
[18] Amaury Depierre, Emmanuel Dellandréa, and Liming Chen. Jacquard: A large scale dataset for robotic grasp detection. In IROS, 2018.
[19] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[20] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. NeurIPS, 34, 2021.
[21] Tuong Do, Huy Tran, Erman Tjiputra, Quang D Tran, and Anh Nguyen. Fine-grained visual classification using self assessment classifier. arXiv preprint arXiv:2205.10529, 2022.
[22] Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.
[23] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[24] Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023.
[25] Clemens Eppner, Arsalan Mousavian, and Dieter Fox. A billion ways to grasp: An evaluation of grasp sampling schemes on a dense, physics-based grasp data set. In ISRR, 2019.
[26] Clemens Eppner, Arsalan Mousavian, and Dieter Fox. Acronym: A large-scale grasp dataset based on simulation. In ICRA, 2021.
[27] Hao-Shu Fang, Chenxi Wang, Minghao Gou, and Cewu Lu. Graspnet-1billion: A large-scale benchmark for general object grasping. In CVPR, 2020.
[28] Maximilian Gilles, Yuhao Chen, Tim Robin Winter, E Zhixuan Zeng, and Alexander Wong. Metagraspnet: A large-scale benchmark dataset for scene-aware ambidextrous bin picking via physics-based metaverse synthesis. In CASE, 2022.
[29] Agrim Gupta, Piotr Dollar, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation. In CVPR, 2019.
[30] Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS, 2010.
[31] Zeyu Han, Yuhan Wang, Luping Zhou, Peng Wang, Binyu Yan, Jiliu Zhou, Yan Wang, and Dinggang Shen. Contrastive diffusion model with auxiliary guidance for coarse-to-fine pet reconstruction. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 239–249. Springer, 2023.
[32] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[33] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 2020.
[34] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. arXiv preprint arXiv:2204.03458, 2022.
[35] Jiaheng Hua, Xiaodong Cui, Xianghua Li, Keke Tang, and Peican Zhu. Multimodal fake news detection through data augmentation-based contrastive learning. Applied Soft Computing, 136:110125, 2023.
[36] Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232, 2023.
[37] Michael Janner, Yilun Du, Joshua Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. In ICML, 2022.
[38] Krishna Murthy Jatavallabhula, Alihusein Kuwajerwala, Qiao Gu, Mohd Omama, Tao Chen, Shuang Li, Ganesh Iyer, Soroush Saryazdi, Nikhil Keetha, Ayush Tewari, et al. Conceptfusion: Open-set multimodal 3d mapping. arXiv preprint arXiv:2302.07241, 2023.
[39] Yun Jiang, Stephen Moseson, and Ashutosh Saxena. Efficient grasping from rgbd images: Learning using a new rectangle representation. In ICRA, 2011.
[40] Zhenyu Jiang, Yifeng Zhu, Maxwell Svetlik, Kuan Fang, and Yuke Zhu. Synergies between affordance and geometry: 6-dof grasp detection via implicit representations. arXiv preprint arXiv:2104.01542, 2021.
[41] Ishay Kamon, Tamar Flash, and Shimon Edelman. Learning to grasp using visual information. In ICRA, 1996.
[42] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
[43] Sulabh Kumra, Shirin Joshi, and Ferat Sahin. Antipodal robotic grasping using generative residual convolutional neural network. In IROS, 2020.
[44] Nhat Le, Tuong Do, Khoa Do, Hien Nguyen, Erman Tjiputra, Quang D Tran, and Anh Nguyen. Controllable group choreography using contrastive diffusion. TOG, 2023.
[45] Chaejeong Lee, Jayoung Kim, and Noseong Park. Codi: Co-evolving contrastive diffusion models for mixed-type tabular synthesis. arXiv preprint arXiv:2304.12654, 2023.
[46] Jaewoong Lee, Sangwon Jang, Jaehyeong Jo, Jaehong Yoon, Yunji Kim, Jin-Hwa Kim, Jung-Woo Ha, and Sung Ju Hwang. Text-conditioned sampling framework for text-to-image generation with masked generative models. arXiv preprint arXiv:2304.01515, 2023.
[47] Meng-Lun Lee, Sara Behdad, Xiao Liang, and Minghui Zheng. Task allocation and planning for product disassembly with human–robot collaboration. Robotics and Computer-Integrated Manufacturing, 76, 2022.
[48] Ian Lenz, Honglak Lee, and Ashutosh Saxena. Deep learning for detecting robotic grasps. IJRR, 2015.
[49] Sergey Levine, Peter Pastor, Alex Krizhevsky, Julian Ibarz, and Deirdre Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. IJRR, 2018.
[50] Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation. NeurIPS, 2021.
[51] Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. In CVPR, 2023.
[52] Hongzhuo Liang, Xiaojian Ma, Shuang Li, Michael Görner, Song Tang, Bin Fang, Fuchun Sun, and Jianwei Zhang. Pointnetgpd: Detecting grasp configurations from point sets. In ICRA, 2019.
[53] Jirong Liu, Ruo Zhang, Hao-Shu Fang, Minghao Gou, Hongjie Fang, Chenxi Wang, Sheng Xu, Hengxu Yan, and Cewu Lu. Target-referenced reactive grasping for dynamic objects. In CVPR, 2023.
[54] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
[55] Weiyu Liu, Tucker Hermans, Sonia Chernova, and Chris Paxton. Structdiffusion: Object-centric diffusion for semantic rearrangement of novel objects. arXiv preprint arXiv:2211.04604, 2022.
[56] Yiheng Liu, Tianle Han, Siyuan Ma, Jiayue Zhang, Yuanyuan Yang, Jiaming Tian, Hao He, Antong Li, Mengshen He, Zhengliang Liu, et al. Summary of chatgpt-related research and perspective towards the future of large language models. Meta-Radiology, page 100017, 2023.
[57] Jeffrey Mahler, Jacky Liang, Sherdil Niyaz, Michael Laskey, Richard Doan, Xinyu Liu, Juan Aparicio Ojea, and Ken Goldberg. Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. arXiv preprint arXiv:1703.09312, 2017.
[58] Jiaxu Miao, Zongxin Yang, Leilei Fan, and Yi Yang. Fedseg: Class-heterogeneous federated learning for semantic segmentation. In CVPR, 2023.
[59] Douglas Morrison, Peter Corke, and Jürgen Leitner. Closing the loop for robotic grasping: A real-time, generative grasp synthesis approach. arXiv preprint arXiv:1804.05172, 2018.
[60] Douglas Morrison, Peter Corke, and Jürgen Leitner. Egad! an evolved grasping analysis dataset for diversity and reproducibility in robotic manipulation. RA-L, 2020.
[61] Arsalan Mousavian, Clemens Eppner, and Dieter Fox. 6-dof graspnet: Variational grasp generation for object manipulation. In ICCV, 2019.
[62] Yao Mu, Shunyu Yao, Mingyu Ding, Ping Luo, and Chuang Gan. Ec2: Emergent communication for embodied control. In CVPR, 2023.
[63] Yao Mu, Qinglong Zhang, Mengkang Hu, Wenhai Wang, Mingyu Ding, Jun Jin, Bin Wang, Jifeng Dai, Yu Qiao, and Ping Luo. Embodiedgpt: Vision-language pre-training via embodied chain of thought. arXiv preprint arXiv:2305.15021, 2023.
[64] Rhys Newbury, Morris Gu, Lachlan Chumbley, Arsalan Mousavian, Clemens Eppner, Jürgen Leitner, Jeannette Bohg, Antonio Morales, Tamim Asfour, Danica Kragic, et al. Deep learning approaches to grasp synthesis: A review. T-RO, 2023.
[65] Anh Nguyen, Dimitrios Kanoulas, Darwin G Caldwell, and Nikos G Tsagarakis. Preparatory object reorientation for task-oriented grasping. In IROS, 2016.
[66] Toan Nguyen, Minh Nhat Vu, An Vuong, Dzung Nguyen, Thieu Vo, Ngan Le, and Anh Nguyen. Open-vocabulary affordance detection in 3d point clouds. In IROS, 2023.
[67] OpenAI. Introducing ChatGPT. Software. Accessed: July 6th 2023.
[68] Lerrel Pinto and Abhinav Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In ICRA, 2016.
[69] Robert Platt. Grasp learning: Models, methods, and performance. Annual Review of Control, Robotics, and Autonomous Systems, 2023.
[70] Farhad Pourpanah, Moloud Abdar, Yuxuan Luo, Xinlei Zhou, Ran Wang, Chee Peng Lim, Xi-Zhao Wang, and QM Jonathan Wu. A review of generalized zero-shot learning methods. TPAMI, 2022.
[71] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
[72] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
[73] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In ICCV, 2021.
[74] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
[75] Kallol Saha, Vishal Mandadi, Jayaram Reddy, Ajit Srikanth, Aditya Agarwal, Bipasha Sen, Arun Singh, and Madhava Krishna. Edmp: Ensemble-of-costs-guided diffusion for motion planning. arXiv preprint arXiv:2309.11414, 2023.
[76] Schuhmann et al. Laion-5b: An open large-scale dataset for training next generation image-text models. NeurIPS, 2022.
[77] Dhruv Shah, Błażej Osiński, Brian Ichter, and Sergey Levine. Lm-nav: Robotic navigation with large pre-trained models of language, vision, and action. In Karen Liu, Dana Kulic, and Jeff Ichnowski, editors, Proceedings of The 6th Conference on Robot Learning, volume 205 of PMLR. PMLR, 2023.
[78] Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Cliport: What and where pathways for robotic manipulation. In CoRL, 2022.
[79] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015.
[80] Chan Hee Song, Jiaman Wu, Clayton Washington, Brian M Sadler, Wei-Lun Chao, and Yu Su. Llm-planner: Few-shot grounded planning for embodied agents with large language models. In ICCV, 2023.
[81] Peize Sun, Shoufa Chen, Chenchen Zhu, Fanyi Xiao, Ping Luo, Saining Xie, and Zhicheng Yan. Going denser with open-vocabulary part segmentation. arXiv preprint arXiv:2305.11173, 2023.
[82] Guy Tevet, Sigal Raab, Brian Gordon, Yoni Shafir, Daniel Cohen-or, and Amit Haim Bermano. Human motion diffusion model. In ICLR, 2022.
[83] Jonathan Tseng, Rodrigo Castellon, and Karen Liu. Edge: Editable dance generation from music. In CVPR, 2023.
[84] Julen Urain, Niklas Funk, Jan Peters, and Georgia Chalvatzaki. Se (3)-diffusionfields: Learning smooth cost functions for joint grasp and motion optimization through diffusion. In ICRA, 2023.
[85] Sai Vemprala, Rogerio Bonatti, Arthur Bucker, and Ashish Kapoor. Chatgpt for robotics: Design principles and model abilities. Microsoft Auton. Syst. Robot. Res, 2023.
[86] Minh Nhat Vu, Florian Beck, Michael Schwegel, Christian Hartl-Nesic, Anh Nguyen, and Andreas Kugi. Machine learning-based framework for optimally solving the analytical inverse kinematics for redundant manipulators. Mechatronics, 2023.
[87] An Vuong, Minh Nhat Vu, Hieu Le, Baoru Huang, Binh Huynh, Thieu Vo, Andreas Kugi, and Anh Nguyen. Grasp-anything: Large-scale grasp dataset from foundation models. In ICRA, 2024.
[88] An Vuong, Minh Nhat Vu, Toan Tien Nguyen, Baoru Huang, Dzung Nguyen, Thieu Vo, and Anh Nguyen. Language-driven scene synthesis using multi-conditional diffusion model. In NeurIPS, 2023.
[89] Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In ICML, 2022.
[90] Sabine Wehnert, Viju Sudhi, Shipra Dureja, Libin Kutty, Saijal Shahania, and Ernesto W De Luca. Legal norm retrieval with variations of the bert model combined with tf-idf vectorization. In Proceedings of the eighteenth international conference on artificial intelligence and law, pages 285–294, 2021.
[91] Bowen Wen, Wenzhao Lian, Kostas Bekris, and Stefan Schaal. Catgrasp: Learning category-level task-relevant grasping in clutter from simulation. In ICRA, 2022.
[92] Julia Wolleb, Robin Sandkühler, Florentin Bieder, Philippe Valmaggia, and Philippe C Cattin. Diffusion models for implicit image segmentation ensembles. In MIDL, 2022.
[93] Feng Wu, Guoshuai Zhao, Xueming Qian, and Li-wei Lehman. A diffusion model with contrastive learning for icu false arrhythmia alarm reduction. In IJCAI, 2023.
[94] Yu Xiang, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox. Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes. arXiv preprint arXiv:1711.00199, 2017.
[95] Dejia Xu, Yifan Jiang, Peihao Wang, Zhiwen Fan, Yi Wang, and Zhangyang Wang. Neurallift-360: Lifting an in-the-wild 2d photo to a 3d object with 360deg views. In CVPR, 2023.
[96] Kechun Xu, Shuqi Zhao, Zhongxiang Zhou, Zizhang Li, Huaijin Pi, Yifeng Zhu, Yue Wang, and Rong Xiong. A joint modeling of vision-language-action for target-oriented grasping in clutter. arXiv preprint arXiv:2302.12610, 2023.
[97] Yiteng Xu, Peishan Cong, Yichen Yao, Runnan Chen, Yuenan Hou, Xinge Zhu, Xuming He, Jingyi Yu, and Yuexin Ma. Human-centric scene understanding for 3d large-scale scenarios. In ICCV, 2023.
[98] Zhixuan Xu, Kechun Xu, Rong Xiong, and Yue Wang. Object-centric inference for language conditioned placement: A foundation model based approach. In ICARM, 2023.
[99] Xinchen Yan, Jasmine Hsu, Mohammad Khansari, Yunfei Bai, Arkanath Pathak, Abhinav Gupta, James Davidson, and Honglak Lee. Learning 6-dof grasping interaction via deep geometry-aware 3d representations. In ICRA, 2018.
[100] Jiange Yang, Wenhui Tan, Chuhao Jin, Bei Liu, Jianlong Fu, Ruihua Song, and Limin Wang. Pave the way to grasp anything: Transferring foundation models for universal pick-place robots. arXiv preprint arXiv:2306.05716, 2023.
[101] Ling Yang, Zhilong Zhang, Yang Song, Shenda Hong, Runsheng Xu, Yue Zhao, Wentao Zhang, Bin Cui, and Ming-Hsuan Yang. Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys, 2022.
[102] Serin Yang, Hyunmin Hwang, and Jong Chul Ye. Zero-shot contrastive loss for text-guided diffusion image style transfer. ICCV, 2023.
[103] Yang Yang, Xibai Lou, and Changhyun Choi. Interactive robotic grasping with attribute-guided disambiguation. In ICRA, 2022.
[104] Hanbo Zhang, Xuguang Lan, Site Bai, Xinwen Zhou, Zhiqiang Tian, and Nanning Zheng. Roi-based robotic grasp detection for object overlapping scenes. In IROS, 2019.
[105] Hanbo Zhang, Yunfan Lu, Cunjun Yu, David Hsu, Xuguang Lan, and Nanning Zheng. Invigorate: Interactive visual grounding and grasping in clutter. arXiv preprint arXiv:2108.11092, 2021.
[106] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert. In ICLR, 2019.
[107] Kaizhi Zheng, Xiaotong Chen, Odest Chadwicke Jenkins, and Xin Wang. Vlmbench: A compositional benchmark for vision-and-language manipulation. NeurIPS, 35, 2022.
[108] Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, et al. Regionclip: Region-based language-image pretraining. In CVPR, 2022.
[109] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In CVPR, 2022.

Appendix A Theoretical Findings

In this section, we first show the derivation of Equation 4 in our main paper. We then show the proof of Proposition 1 in the main paper.

A.1 Derivation of Equation 4

It was shown in [20] that $q(\mathbf{x}_{t}|\mathbf{x}_{t-1})=\hat{q}(\mathbf{x}_{t}|\mathbf{x}_{t-1},y)$. Therefore, the loss in Equation 3 of our main paper can be written as

$$
\mathcal{L}=\mathbb{E}\left[-\log p_{\theta}(\mathbf{x}_{T})-\sum_{t\geq 1}\log\frac{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{\hat{q}(\mathbf{x}_{t}|\mathbf{x}_{t-1},y)}\right].
\tag{6}
$$

Using Bayes’ Theorem, we can further derive the term $\hat{q}(\mathbf{x}_{t}|\mathbf{x}_{t-1},y)$ of Equation 6 as follows

	$\displaystyle\hat{q}$	$\displaystyle(\mathbf{x}_{t}\|\mathbf{x}_{t-1},y)=\frac{\hat{q}(\mathbf{x}_{t},% \mathbf{x}_{t-1},y)}{\hat{q}(\mathbf{x}_{t-1},y)}$
		$\displaystyle=\frac{\hat{q}(\mathbf{x}_{t},\mathbf{x}_{t-1},y)}{\hat{q}(% \mathbf{x}_{t},\mathbf{x}_{t-1},\mathbf{x}_{0},y)}\frac{\hat{q}(\mathbf{x}_{t}% ,\mathbf{x}_{t-1},\mathbf{x}_{0},y)}{\hat{q}(\mathbf{x}_{t-1},y)}$
		$\displaystyle=\frac{1}{\hat{q}(\mathbf{x}_{0}\|\mathbf{x}_{t-1},\mathbf{x}_{t},% y)}\frac{\hat{q}(\mathbf{x}_{t},\mathbf{x}_{t-1},\mathbf{x}_{0},y)}{\hat{q}(% \mathbf{x}_{t},\mathbf{x}_{0},y)}\frac{\hat{q}(\mathbf{x}_{t},\mathbf{x}_{0},y% )}{\hat{q}(\mathbf{x}_{t-1},y)}$
		$\displaystyle=\frac{\hat{q}(\mathbf{x}_{t-1}\|\mathbf{x}_{t},\mathbf{x}_{0},y)}% {\hat{q}(\mathbf{x}_{0}\|\mathbf{x}_{t-1},\mathbf{x}_{t},y)}\frac{\hat{q}(% \mathbf{x}_{t},\mathbf{x}_{0},y)}{\hat{q}(\mathbf{x}_{t-1},y)}$
		$\displaystyle=\frac{\hat{q}(\mathbf{x}_{t-1}\|\mathbf{x}_{t},\mathbf{x}_{0},y)}% {\hat{q}(\mathbf{x}_{0}\|\mathbf{x}_{t-1},\mathbf{x}_{t},y)}\frac{\hat{q}(% \mathbf{x}_{t},\mathbf{x}_{0},y)}{\hat{q}(\mathbf{x}_{t-1},\mathbf{x}_{0},y)}% \frac{\hat{q}(\mathbf{x}_{t-1},\mathbf{x}_{0},y)}{\hat{q}(\mathbf{x}_{t-1},y)}$
		$\displaystyle=\frac{\hat{q}(\mathbf{x}_{t-1}\|\mathbf{x}_{t},\mathbf{x}_{0},y)}% {\hat{q}(\mathbf{x}_{0}\|\mathbf{x}_{t-1},\mathbf{x}_{t},y)}\frac{\hat{q}(% \mathbf{x}_{t}\|\mathbf{x}_{0},y)}{\hat{q}(\mathbf{x}_{t-1}\|\mathbf{x}_{0},y)}% \hat{q}(\mathbf{x}_{0}\|\mathbf{x}_{t-1},y)\text{~{}\@.}$

Following Ho et al. [33], we can assume that $\hat{q}(\mathbf{x}_{0}|\mathbf{x}_{t-1},\mathbf{x}_{t},y)=\hat{q}(\mathbf{x}_{0}|\mathbf{x}_{t-1},y)$ due to the Markov property of the chain. Thus, $\hat{q}(\mathbf{x}_{t}|\mathbf{x}_{t-1},y)$ can be further derived as follows

$$
\hat{q}(\mathbf{x}_{t}|\mathbf{x}_{t-1},y)=\hat{q}(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0},y)\frac{\hat{q}(\mathbf{x}_{t}|\mathbf{x}_{0},y)}{\hat{q}(\mathbf{x}_{t-1}|\mathbf{x}_{0},y)}.
\tag{7}
$$

From Equation 6 and Equation 7, we can express the negative log likelihood loss as follows

$$
\begin{gathered}
\mathcal{L}=\mathbb{E}\bigg[-\log\frac{p_{\theta}(\mathbf{x}_{T})}{\hat{q}(\mathbf{x}_{T}|\mathbf{x}_{0},y)}-\sum_{t>1}\log\frac{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{\hat{q}(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0},y)}\\
-\log p_{\theta}(\mathbf{x}_{0}|\mathbf{x}_{1},y)\bigg].
\end{gathered}
\tag{8}
$$

By using Bayes’ Theorem again, we can formulate $\hat{q}(\mathbf{x}_{T}|\mathbf{x}_{0},y)$ of Equation 8 as follows

$$
\begin{aligned}
\hat{q}(\mathbf{x}_{T}|\mathbf{x}_{0},y)
&=\frac{\hat{q}(\mathbf{x}_{T},\mathbf{x}_{0},y)}{\hat{q}(\mathbf{x}_{0},y)}\\
&=\frac{\hat{q}(\mathbf{x}_{T},\mathbf{x}_{0},y)}{\hat{q}(\mathbf{x}_{0})}\frac{\hat{q}(\mathbf{x}_{0})}{\hat{q}(\mathbf{x}_{0},y)}\\
&=\frac{\hat{q}(\mathbf{x}_{0}|\mathbf{x}_{T},y)}{\hat{q}(y|\mathbf{x}_{0})}.
\end{aligned}
\tag{9}
$$

Since $\hat{q}(y|\mathbf{x}_{0})$ is determined by the known label of each sample [20], the corresponding term does not depend on $\theta$ and can be treated as a constant $C$. We conclude with the final derivation of Equation 8 by

$$
\begin{aligned}
\mathcal{L}
&=\mathbb{E}\bigg[-\log\frac{p_{\theta}(\mathbf{x}_{T})}{\hat{q}(\mathbf{x}_{0}|\mathbf{x}_{T},y)}-\log\hat{q}(y|\mathbf{x}_{0})\\
&\qquad-\sum_{t>1}\log\frac{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{\hat{q}(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0},y)}-\log p_{\theta}(\mathbf{x}_{0}|\mathbf{x}_{1},y)\bigg]\\
&=\mathbb{E}\bigg[C-\log p_{\theta}(\mathbf{x}_{T})+\log\hat{q}(\mathbf{x}_{0}|\mathbf{x}_{T},y)\\
&\qquad+\sum_{t>1}D_{\text{KL}}\big(\hat{q}(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0},y)\,\|\,p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})\big)\\
&\qquad-\log p_{\theta}(\mathbf{x}_{0}|\mathbf{x}_{1},y)\bigg].
\end{aligned}
\tag{10}
$$
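As an illustration of the KL terms in Equation 10, the following minimal sketch evaluates the closed-form KL divergence between two isotropic Gaussians. It assumes, as is common in DDPM-style models [33] but not stated in this appendix, that both $\hat{q}(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0},y)$ and $p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})$ are Gaussian; the function name and arguments are hypothetical.

```python
import math


def gaussian_kl(mu_q, mu_p, sigma_q, sigma_p):
    """KL( N(mu_q, sigma_q^2 I) || N(mu_p, sigma_p^2 I) ) for vectors given as lists.

    Assumption: both distributions are isotropic Gaussians of the same dimension.
    """
    d = len(mu_q)
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu_q, mu_p))
    return (
        d * math.log(sigma_p / sigma_q)
        + (d * sigma_q ** 2 + sq_dist) / (2.0 * sigma_p ** 2)
        - d / 2.0
    )


# With equal variances the KL term reduces to a scaled squared distance
# between the two means: ||mu_q - mu_p||^2 / (2 sigma^2).
kl_same = gaussian_kl([0.0, 0.0], [0.0, 0.0], 1.0, 1.0)
kl_shift = gaussian_kl([1.0, 2.0], [0.0, 0.0], 0.5, 0.5)
print(kl_same, kl_shift)
```

Under this Gaussian assumption with matched variances, minimizing each KL term amounts to regressing the predicted mean onto the posterior mean.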

A.2 Proof of Proposition 1

Proof.

The relation between $\mathbf{x}_{0}$ and $\mathbf{x}_{T}$ is given by [33]

$$
\mathbf{x}_{T}=\sqrt{\overline{\alpha}_{T}}\mathbf{x}_{0}+\sqrt{1-\overline{\alpha}_{T}}\mathbf{\epsilon},\quad\mathbf{\epsilon}\sim\mathcal{N}(0,\mathbf{I}).
\tag{11}
$$

It follows from Equation 11 that

$$
\begin{aligned}
\frac{\sqrt{\overline{\alpha}_{T}}\tilde{\mathbf{x}}_{0}-\mathbf{x}_{T}}{\sqrt{1-\overline{\alpha}_{T}}}
&=\frac{\sqrt{\overline{\alpha}_{T}}\tilde{\mathbf{x}}_{0}-\left(\sqrt{\overline{\alpha}_{T}}\mathbf{x}_{0}+\sqrt{1-\overline{\alpha}_{T}}\mathbf{\epsilon}\right)}{\sqrt{1-\overline{\alpha}_{T}}}\\
&=\beta(\tilde{\mathbf{x}}_{0}-\mathbf{x}_{0})-\epsilon,
\end{aligned}
$$

with $\beta=\sqrt{\frac{\overline{\alpha}_{T}}{1-\overline{\alpha}_{T}}}$. Thus, under the condition $\left\|\frac{\sqrt{\overline{\alpha}_{T}}\tilde{\mathbf{x}}_{0}-\mathbf{x}_{T}}{\sqrt{1-\overline{\alpha}_{T}}}\right\|^{2}_{2}\geq M$, we have

$$
\begin{aligned}
\mathbb{E}[\mathcal{L}_{\text{contrastive}}]
&=\mathbb{E}\left[\left\|\frac{\sqrt{\overline{\alpha}_{T}}\tilde{\mathbf{x}}_{0}-\mathbf{x}_{T}}{\sqrt{1-\overline{\alpha}_{T}}}\right\|_{2}^{2}-M\right]\\
&=\mathbb{E}\left[\|\beta(\tilde{\mathbf{x}}_{0}-\mathbf{x}_{0})-\epsilon\|_{2}^{2}-\|\epsilon\|_{2}^{2}\right]\\
&=\mathbb{E}\left[\|\beta(\tilde{\mathbf{x}}_{0}-\mathbf{x}_{0})\|_{2}^{2}-2\left<\beta(\tilde{\mathbf{x}}_{0}-\mathbf{x}_{0}),\epsilon\right>\right]\\
&=\mathbb{E}\left[\|\beta(\tilde{\mathbf{x}}_{0}-\mathbf{x}_{0})\|_{2}^{2}\right]-2\beta\mathbb{E}\left[\left<(\tilde{\mathbf{x}}_{0}-\mathbf{x}_{0}),\epsilon\right>\right].
\end{aligned}
$$

Since $\tilde{\mathbf{x}}_{0}-\mathbf{x}_{0}$ and $\epsilon$ are independent and $\mathbb{E}[\epsilon]=0$, we have

$$
\mathbb{E}\left[\left<(\tilde{\mathbf{x}}_{0}-\mathbf{x}_{0}),\epsilon\right>\right]=\left<\mathbb{E}\left[\tilde{\mathbf{x}}_{0}-\mathbf{x}_{0}\right],\mathbb{E}\left[\epsilon\right]\right>=0.
$$

Thus,

$$
\mathbb{E}[\mathcal{L}_{\text{contrastive}}]=\mathbb{E}\left[\|\beta(\tilde{\mathbf{x}}_{0}-\mathbf{x}_{0})\|_{2}^{2}\right]=\beta^{2}\mathbb{E}\left[\|\tilde{\mathbf{x}}_{0}-\mathbf{x}_{0}\|_{2}^{2}\right].
$$

Hence, with $C=\beta^{-2}$ , we have $\mathbb{E}\left[\|\tilde{\mathbf{x}}_{0}-\mathbf{x}_{0}\|_{2}^{2}\right]\leq C\delta$ as desired. ∎
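The key identity in the proof, $\mathbb{E}\left[\|\beta(\tilde{\mathbf{x}}_{0}-\mathbf{x}_{0})-\epsilon\|_{2}^{2}-\|\epsilon\|_{2}^{2}\right]=\beta^{2}\|\tilde{\mathbf{x}}_{0}-\mathbf{x}_{0}\|_{2}^{2}$ for a fixed difference vector, can be sanity-checked numerically. The sketch below is a Monte Carlo check only, not part of the training code; the difference vector `delta`, the value of $\overline{\alpha}_{T}$, and the sample count are hypothetical choices made for illustration.

```python
import math
import random


def contrastive_gap(delta, alpha_bar_T, n_samples=100_000, seed=0):
    """Monte Carlo estimate of E[||beta*delta - eps||^2 - ||eps||^2], eps ~ N(0, I).

    delta plays the role of (x0_tilde - x0) and is held fixed (assumption).
    Returns the estimate together with beta = sqrt(alpha_bar_T / (1 - alpha_bar_T)).
    """
    rng = random.Random(seed)
    beta = math.sqrt(alpha_bar_T / (1.0 - alpha_bar_T))
    total = 0.0
    for _ in range(n_samples):
        eps = [rng.gauss(0.0, 1.0) for _ in delta]
        shifted = sum((beta * d - e) ** 2 for d, e in zip(delta, eps))
        total += shifted - sum(e * e for e in eps)
    return total / n_samples, beta


# Hypothetical 5-dimensional difference, matching M = 5 grasp parameters.
delta = [0.3, -0.2, 0.1, 0.4, -0.1]
estimate, beta = contrastive_gap(delta, alpha_bar_T=0.5)
closed_form = beta ** 2 * sum(d * d for d in delta)
print(estimate, closed_form)
```

The two printed values should agree up to sampling noise, because the cross term $-2\beta\left<\tilde{\mathbf{x}}_{0}-\mathbf{x}_{0},\epsilon\right>$ averages to zero.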

Appendix B Remark on Related Works

Grasp Datasets. Numerous grasp datasets have been introduced recently [64], each with varying characteristics such as data representation (RGB-D or 3D point clouds), grasp labels (rectangle-based or 6-DoF), and quantity [70]. Our Grasp-Anything++ dataset differs primarily in its universality, in contrast to the limited object selection in existing benchmarks. It covers a wide range of everyday objects and includes natural scene descriptions, facilitating research in language-driven grasp detection. Furthermore, the Grasp-Anything++ dataset uniquely presents natural object arrangements, in contrast to the more strictly controlled configurations in previous datasets [69]. Grasp-Anything++ exceeds other benchmarks in both the number of objects and the number of samples.

Contrastive Loss for Diffusion Models. Contrastive learning has recently attracted considerable attention in diffusion model research, as evidenced in [31]. While most studies in the diffusion literature regard contrastive learning primarily as a method of data augmentation [35] for improving the performance of models on fine-grained prediction [15], several notable works, including [93, 31, 45, 102], incorporate a contrastive loss into diffusion models. For instance, Yang et al. [102] leverage intermediate layer features for calculating contrastive loss in negative sample pairs. Contrary to these perspectives, our paper considers contrastive learning as an integral aspect of the training objective, explicitly contributing to the denoising process of diffusion models.

Grasp Detection. Deep learning has significantly advanced grasp detection, with initial efforts by Lenz et al. [48] employing deep learning for grasp pose detection. Following this, deep learning-based approaches [99, 52, 40, 91, 2, 43, 10] have become the predominant methodology in the field. Despite extensive research, the real-world application of deep learning for robotic grasping remains a challenge, primarily due to the limited size and diversity of grasp datasets [28, 69].

Grasp Ground Truth Definition. 6-DoF and rectangle representations are the most prevalent in grasp detection literature. While 6-DoF poses offer greater flexibility and adaptability [27] for complex tasks, rectangle grasp poses are advantageous for their simplicity [39], efficiency in specific scenarios, and lower hardware and computational requirements [18]. Considering that objects in synthesized images from foundation models often lack 3D information [41], the rectangle representation of grasp poses appears more suitable for our task. The ground truth rectangle is defined in Fig. 10.
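To make the rectangle representation concrete, the sketch below converts a five-parameter grasp into its four image-plane corners. It assumes the common parameterization $(x, y, w, h, \theta)$ with $(x, y)$ the rectangle center, $w$ and $h$ its width and height, and $\theta$ its rotation in radians; the exact convention of Fig. 10 may differ, and the function name is hypothetical.

```python
import math


def grasp_rectangle_corners(x, y, w, h, theta):
    """Return the four corners of a rotated grasp rectangle.

    Assumed convention (not taken from the paper): (x, y) is the center,
    w and h are the side lengths, theta is the rotation in radians.
    """
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Corner offsets in the rectangle's own frame, in a fixed cyclic order.
    offsets = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    # Rotate each offset by theta and translate by the center.
    return [
        (x + dx * cos_t - dy * sin_t, y + dx * sin_t + dy * cos_t)
        for dx, dy in offsets
    ]


axis_aligned = grasp_rectangle_corners(10.0, 20.0, 4.0, 2.0, 0.0)
rotated = grasp_rectangle_corners(10.0, 20.0, 4.0, 2.0, math.pi / 2)
print(axis_aligned)
print(rotated)
```

Five numbers per grasp are consistent with the value $M=5$ used for the number of grasp parameters in the implementation details.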

Appendix C Grasp-Anything++ Analysis

Additional Visualization. Fig. 11 provides further examples from the Grasp-Anything++ dataset, illustrating its diverse and extensive representation of everyday objects. These samples showcase objects typically found in everyday environments, such as home offices, kitchens, and living spaces. The collection includes a variety of shapes, sizes, and types of objects such as writing instruments, electronic devices, and household items, each situated within its own designated space. Furthermore, the annotated grasp poses, produced by the Grasp-Anything++ pipeline, demonstrate high fidelity, thereby offering a foundation for both qualitative and quantitative grasp detection research.

Appendix D LGD Implementation Details

In this section, we first discuss an observation about the attention mask during the diffusion process, which motivated our use of contrastive diffusion. We then provide the implementation details of our LGD network.

D.1 Observation

In Fig. 12, we present a visualization of the attention mask generated by the vision transformer backbone and the grasp pose throughout the diffusion process. It is evident that during the initial time steps, there is a substantial overlap between the guiding region and the grasp pose, which diminishes as time progresses. This finding suggests that, by the end of the forward process, the guiding region and the noisy grasp pose can be regarded as a contrasting pair. We mathematically express this contrastive relationship between the guiding region $\tilde{\mathbf{x}}_{0}$ and $\mathbf{x}_{t}$. This contrastive relationship plays a central role in our network design, as depicted in the overview of our method in the main paper.

D.2 LGD Implementation Details

For an image I of resolution $W\times H$, we employ a ResNet-50 vision encoder backbone [32] to derive feature representations $\textit{I}^{\prime}\in\mathbb{R}^{w\times h}$ with latent dimensions $w$ and $h$. Similarly, we obtain text embedding $e^{\prime}\in\mathbb{R}^{|D|}$ using a text encoder, such as CLIP [71] or BERT [19]. The dimensionality of these latent features depends on the text encoder’s architecture.

Component	Description	Input size	Output size
(i)	A vision encoder (ResNet-50 [32])	$[H,W,3]$	$[h,w,3]$
(ii-a)	A text encoder (CLIP [71] or BERT [19])	Any	$[|D|]$
(ii-b)	MLP layers	$[|D|]$	$[N+1,d_{\text{text}}]$
(iii)	ALBEF [50]	$[h,w,3],[N+1,d_{\text{text}}]$	$[W,H],[\ell]$
(iv)	MLP layers of Equation 12	$[1]$	$[d_{\text{ts}}]$
(v)	MLP layers of Equation 13	$[\ell]$	$[d_{\text{vl}}]$
(vi)	MLP layers of Equation 14	$[M]$	$[d_{\text{df}}]$
(vii)	MLP layers to output denoising state	$[d_{\text{ts}}+d_{\text{vl}}+d_{\text{df}}]$	$[M]$

Table 7: Architecture specifications of our method.

Hyperparameter	Value
$W$	224
$H$	224
$N$	196
$M$ (number of grasp parameters)	5
$|D_{\text{CLIP}}|$ of (ii-a)	512
$|D_{\text{BERT}}|$ of (ii-a)	768
$d_{\text{text}}$ of (ii-b)	128
$\ell$ of (iii)	1024
$d_{\text{ts}}$ of (iv)	32
$d_{\text{vl}}$ of (v)	256
$d_{\text{df}}$ of (vi)	256
Num. attention layers (ALBEF)	6

Table 8: Hyperparameter details.

Utilizing the ALBEF architecture [50], we integrate vision and language embeddings. Specifically, we encode each intermediate feature $\textit{{I}}^{\prime}$ into a set $\textit{{v}}=\{\textit{{v}}_{\text{cls}},\textit{{v}}_{1},\textit{{v}}_{2},% \dots,\textit{{v}}_{N}\}$ , where $N$ denotes the number of segmented patches, similar to the approach in [23]. We feed text embeddings $e^{\prime}$ through MLP layers, producing a sequence of embeddings $\textit{{u}}=\{\textit{{u}}_{\text{cls}},\textit{{u}}_{1},\ldots,\textit{{u}}_% {N}\}$ . Cross-attention mechanisms in the multimodal encoder integrate image features v with text features u. The multimodal attention layer outputs an attention map $\tilde{\mathbf{x}}_{0}\in\mathbb{R}^{W\times H}$ and a final text-image representation $z_{\text{vl}}^{\ast}\in\mathbb{R}^{\ell}$ .

Using a sequence of MLP layers, we encode the timestep $t+1$, the current state $\mathbf{x}_{t+1}$, and the final text-image representation $z_{\text{vl}}^{\ast}\in\mathbb{R}^{\ell}$ as follows:

$$z_{\text{ts}}=\text{MLP}(t+1)\in\mathbb{R}^{d_{\text{ts}}}. \tag{12}$$

$$z_{\text{vl}}=\text{MLP}(z_{\text{vl}}^{\ast})\in\mathbb{R}^{d_{\text{vl}}}. \tag{13}$$

$$z_{\text{df}}=\text{MLP}(\mathbf{x}_{t+1})\in\mathbb{R}^{d_{\text{df}}}. \tag{14}$$

Finally, we concatenate $z_{\text{ts}},z_{\text{vl}},z_{\text{df}}$ to form $z$ , which is then processed through an additional MLP to yield the decoded state for the denoising process.

$$\mathbf{x}_{t}=\text{MLP}(z)\in\mathbb{R}^{M}. \tag{15}$$

Architecture Summary. As outlined in the main paper, our network architecture contains: (i) a vision encoder, (ii) a text encoder followed by MLP layers, (iii) an ALBEF module, (iv) MLP layers to encode timestep information, (v) MLP layers to encode the text-image features $z_{\text{vl}}^{\ast}$, (vi) MLP layers to encode the noisy state $\mathbf{x}_{t+1}$, and (vii) MLP layers to output the denoised state. We summarize the architecture and hyperparameters of LGD in Table 7 and Table 8.
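To make the data flow of components (iv) to (vii) concrete, the following is a minimal NumPy sketch of one decoding step. The sizes are taken from Table 8 and the output size $M$ from row (vii) of Table 7; everything else is an assumption made for illustration, including the two-layer MLPs, the hidden width of 128, the ReLU activation, the random weights, and the helper names `init_mlp`, `mlp`, and `denoise_step`. It is not the released implementation.

```python
import numpy as np

# Sizes from Table 8. Depth, hidden width, and activation of each MLP are assumptions.
M, ELL = 5, 1024
D_TS, D_VL, D_DF = 32, 256, 256


def init_mlp(rng, d_in, d_out, d_hidden=128):
    """Create random weights for a hypothetical two-layer MLP."""
    return {
        "w1": rng.normal(0.0, 0.02, (d_in, d_hidden)),
        "b1": np.zeros(d_hidden),
        "w2": rng.normal(0.0, 0.02, (d_hidden, d_out)),
        "b2": np.zeros(d_out),
    }


def mlp(params, x):
    """Apply the two-layer MLP with a ReLU between the layers."""
    hidden = np.maximum(x @ params["w1"] + params["b1"], 0.0)
    return hidden @ params["w2"] + params["b2"]


def denoise_step(heads, t_next, z_vl_star, x_next):
    """Map (t+1, z_vl*, x_{t+1}) to x_t following Eqs. (12)-(15)."""
    z_ts = mlp(heads["ts"], np.array([float(t_next)]))  # Eq. (12): timestep feature
    z_vl = mlp(heads["vl"], z_vl_star)                   # Eq. (13): text-image feature
    z_df = mlp(heads["df"], x_next)                      # Eq. (14): noisy-state feature
    z = np.concatenate([z_ts, z_vl, z_df])               # length d_ts + d_vl + d_df
    return mlp(heads["out"], z)                          # Eq. (15): decoded state of size M


rng = np.random.default_rng(0)
heads = {
    "ts": init_mlp(rng, 1, D_TS),
    "vl": init_mlp(rng, ELL, D_VL),
    "df": init_mlp(rng, M, D_DF),
    "out": init_mlp(rng, D_TS + D_VL + D_DF, M),
}
x_t = denoise_step(
    heads,
    t_next=10,
    z_vl_star=rng.normal(size=ELL),  # stand-in for the ALBEF output
    x_next=rng.normal(size=M),       # stand-in for the noisy grasp pose
)
print(x_t.shape)
```

In this sketch the ALBEF output is replaced by a random vector, so only the tensor shapes, not the values, are meaningful.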

Appendix E Experimental Setups

We present implementation details of other baselines in the language-driven grasp detection task.

E.1 Baseline Setups

Linguistic versions of GR-ConvNet [43], Det-Seg-Refine [2], GG-CNN [59]. We make slight modifications to these baselines by adding a component to fuse image and text features from the input. Specifically, we utilize the CLIP text encoder [71] to extract text embeddings $e^{\prime}$. To ensure a fair comparison between methods, we also utilize the ALBEF architecture [50] to fuse the text embedding with the visual features. The remaining training losses and parameters are inherited from the original works.

CLIPORT [78]. The original CLIPORT architecture learns a policy $\pi$ , which does not directly solve our task. We modify the CLIPORT architecture’s final layers with appropriately sized MLPs to output grasp poses defined by five parameters $(x,y,w,h,\theta)$ . This adaptation ensures consistency with our grasp detection baselines, diverging from CLIPORT’s original policy $\pi$ learning framework.

CLIP-Fusion [96]. In our re-implementation of the architecture from [96], we follow the cross-attention module of CLIP-Fusion with MLP layers. The final MLP layers of the architecture are modified to output five parameters, corresponding to the predicted grasp pose.
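All baselines above output a grasp pose with five parameters $(x,y,w,h,\theta)$. As a minimal sketch of how such a pose can be turned into a drawable rectangle, the snippet below computes the four corners. It assumes that $(x,y)$ is the rectangle centre in pixels and that $\theta$ is a rotation in radians about that centre; the exact angle convention is not specified in this section, and the function name `grasp_to_corners` is hypothetical.

```python
import math


def grasp_to_corners(x, y, w, h, theta):
    """Return the four corners of a grasp rectangle centred at (x, y).

    Assumes theta is given in radians and rotates the rectangle about its centre.
    """
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    half_extents = ((-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2))
    corners = []
    for dx, dy in half_extents:
        # Rotate the offset by theta, then translate to the rectangle centre.
        corners.append((x + dx * cos_t - dy * sin_t, y + dx * sin_t + dy * cos_t))
    return corners


# Example: a 40 x 20 pixel grasp at the centre of a 224 x 224 image, rotated by 90 degrees.
corners = grasp_to_corners(112.0, 112.0, 40.0, 20.0, math.pi / 2)
```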

E.2 Robotic Setup

In Figure 13, we present the robotic evaluation conducted on a KUKA robot. Grasps are detected with our proposed LGD and the other methods listed in the real robot experiments (Table 4 of the main paper), and the results are translated into 6-DoF grasp poses using depth images captured by an Intel RealSense D435i depth camera, as in [43]. The trajectory planner [86, 5] is employed to execute the grasp. We use two computers for the experiment. The first computer (PC1) runs the real-time control software Beckhoff TwinCAT and interfaces with the Intel RealSense D435i camera and the Robotiq 2F-85 gripper, while the second computer (PC2) runs ROS Noetic on Ubuntu 20.04. PC1 communicates with the robot via a network interface card (NIC) using the EtherCAT protocol. The inference process is performed on PC2 with an NVIDIA 3080 GPU. Our assessment encompasses both single-object and cluttered scenarios, involving a diverse set of $20$ real-world daily objects (Fig. 17). To ensure robustness and reliability, we repeat each experiment $30$ times for every method.
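The conversion from an image-space grasp to a 6-DoF pose follows [43] and is not detailed here. As a rough illustration of the first step only, the sketch below back-projects the grasp centre pixel to a 3D point in the camera frame with the standard pinhole model. The intrinsics `fx`, `fy`, `cx`, `cy` and the depth value are hypothetical placeholders, not the calibration of the camera used in the experiments, and the gripper orientation and the camera-to-robot transform are omitted.

```python
def deproject_pixel(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with depth in metres to camera-frame (X, Y, Z).

    Uses the pinhole camera model without lens distortion.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


# Hypothetical intrinsics and depth reading, for illustration only.
point = deproject_pixel(u=400.0, v=300.0, depth_m=0.5, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
```

A pixel at the principal point maps to a point on the optical axis, which is a quick sanity check for any calibration plugged into this function.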

Appendix F Extra Experiments

Number of Parameters Comparison. Table 9 reports the number of parameters of all methods together with their success rates. Among the baselines, larger models tend to achieve higher success rates: CLIP-Fusion (13.51M parameters) reaches 0.33, whereas GG-CNN + CLIP (1.24M parameters) reaches 0.10. LGD offers a favorable balance between performance and computational resources: with 5.18M parameters, it achieves the highest success rate (0.45) while using fewer than half the parameters of CLIPORT and CLIP-Fusion.

Baseline	$\#$ Parameters	Success rate
GR-ConvNet [43] + CLIP [71]	2.07M	0.24
Det-Seg-Refine [2] + CLIP [71]	1.82M	0.20
GG-CNN [59] + CLIP [71]	1.24M	0.10
CLIPORT [78]	10.65M	0.29
CLIP-Fusion [96]	13.51M	0.33
LGD (ours)	5.18M	0.45

Table 9: Number of parameter comparison.
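One way to read the trade-off in Table 9 is to check which methods are Pareto-optimal, that is, not beaten by another method that has both fewer parameters and a higher success rate. The sketch below applies this check to the values copied from Table 9; the choice of Pareto dominance as the criterion and the function name `pareto_front` are ours, not part of the original analysis.

```python
# (parameters in millions, success rate), copied from Table 9.
TABLE_9 = {
    "GR-ConvNet + CLIP": (2.07, 0.24),
    "Det-Seg-Refine + CLIP": (1.82, 0.20),
    "GG-CNN + CLIP": (1.24, 0.10),
    "CLIPORT": (10.65, 0.29),
    "CLIP-Fusion": (13.51, 0.33),
    "LGD (ours)": (5.18, 0.45),
}


def pareto_front(table):
    """Return the methods not dominated by any other method.

    A method is dominated if another one has no more parameters and no lower
    success rate, and is strictly better in at least one of the two.
    """
    front = []
    for name, (params, rate) in table.items():
        dominated = any(
            other_params <= params
            and other_rate >= rate
            and (other_params < params or other_rate > rate)
            for other, (other_params, other_rate) in table.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


front = pareto_front(TABLE_9)
```

Under this criterion, CLIPORT and CLIP-Fusion are dominated by LGD, which has fewer parameters and a higher success rate than both.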

Failure Cases. Although our method achieves satisfactory results, it still predicts incorrect grasp poses in some cases. The large number of objects and grasping prompts in our dataset makes the task considerably challenging. Some failure cases are depicted in Fig. 15. In these cases, the text and the attention map of the visual features are not well aligned, which leads to incorrect grasp pose predictions.

Additional Detection Visualization. Fig. 14 illustrates additional language-driven grasp detection results of our LGD method. The results demonstrate our method's capability to reasonably align grasp poses with linguistic instructions in fine-grained scenarios, as seen with the green and blue bottles. These qualitative examples are in line with the contrastive loss objective outlined in our main paper.

Robotic Demonstration. In Fig. 18, we show a sequence of actions when the KUKA robot grasps different objects in cluttered scenes. Fig. 16 further shows the detection result of our LGD method on an image captured by a RealSense camera mounted on the robot. The robotic experiments demonstrate that although our LGD method is trained on the synthetic Grasp-Anything++ dataset, it is still able to generalize to detect grasp poses in real-world images. More illustrations can be found in our Demonstration Video.

Step	Description
Scene Generation	User	Please help me generate scene descriptions for natural arrangements of daily objects. Each description has the following form: <Object_1><Object_2>…<Verb><Container_Object>. Please also ensure the incorporation of a rich and varied lexicon in the scene descriptions.
	Sample	A steel knife, a polished fork and a pristine ceramic plate on a wooden table.
	Text-to-Image	We use Stable Diffusion [74] to perform text-to-image generation.
Object Masking	User	For object part-level description, given an input list {<Object_1>, <Object_2>, …}, the output will be a list that describes the parts of objects as: {<Object_1>: [<Part_1.1>, <Part_1.2>, …], <Object_2>: [<Part_2.1>, <Part_2.2>, …]}.
	Sample	{knife: [handle, blade], fork: [handle, neck, stem, tines], plate: [rim, base]}
	Post Process	We use OFA [89] and SAM [42] to locate the region describing the objects.
Part Masking	User	Given the object list and part lists of each scene description, you will generate for me all prompts with the following format: {<Manipulation_Action><Object_ID><Part_ID>}. The part that is more suitable for human grasping is positioned at the start of the list to represent the grasping actions.
	Sample	Give me the steel knife; Grasp the knife at its handle.
	Post Process	We leverage VLPart [81] to locate the region describing the parts of objects.
Grasp Generation	User	Generate for me a scene description with grasp instructions following the templates.
	Sample	Scene description: A steel knife, a polished fork and a pristine ceramic plate on a wooden table. Object list: {knife, fork, plate}. Part lists: {knife: [handle, blade], fork: [handle, neck, stem, tines], plate: [rim, base]}. Prompts: Give me the steel knife; Grasp the knife at its handle.
	Grasp Labelling	We utilize a pretrained RAGT-3/3 [10] to generate grasp poses corresponding to the located region.

	$\displaystyle\hat{q}$	$\displaystyle(\mathbf{x}_{t}\|\mathbf{x}_{t-1},y)=\frac{\hat{q}(\mathbf{x}_{t},% \mathbf{x}_{t-1},y)}{\hat{q}(\mathbf{x}_{t-1},y)}$
		$\displaystyle=\frac{\hat{q}(\mathbf{x}_{t},\mathbf{x}_{t-1},y)}{\hat{q}(% \mathbf{x}_{t},\mathbf{x}_{t-1},\mathbf{x}_{0},y)}\frac{\hat{q}(\mathbf{x}_{t}% ,\mathbf{x}_{t-1},\mathbf{x}_{0},y)}{\hat{q}(\mathbf{x}_{t-1},y)}$
		$\displaystyle=\frac{1}{\hat{q}(\mathbf{x}_{0}\|\mathbf{x}_{t-1},\mathbf{x}_{t},% y)}\frac{\hat{q}(\mathbf{x}_{t},\mathbf{x}_{t-1},\mathbf{x}_{0},y)}{\hat{q}(% \mathbf{x}_{t},\mathbf{x}_{0},y)}\frac{\hat{q}(\mathbf{x}_{t},\mathbf{x}_{0},y% )}{\hat{q}(\mathbf{x}_{t-1},y)}$
		$\displaystyle=\frac{\hat{q}(\mathbf{x}_{t-1}\|\mathbf{x}_{t},\mathbf{x}_{0},y)}% {\hat{q}(\mathbf{x}_{0}\|\mathbf{x}_{t-1},\mathbf{x}_{t},y)}\frac{\hat{q}(% \mathbf{x}_{t},\mathbf{x}_{0},y)}{\hat{q}(\mathbf{x}_{t-1},y)}$
		$\displaystyle=\frac{\hat{q}(\mathbf{x}_{t-1}\|\mathbf{x}_{t},\mathbf{x}_{0},y)}% {\hat{q}(\mathbf{x}_{0}\|\mathbf{x}_{t-1},\mathbf{x}_{t},y)}\frac{\hat{q}(% \mathbf{x}_{t},\mathbf{x}_{0},y)}{\hat{q}(\mathbf{x}_{t-1},\mathbf{x}_{0},y)}% \frac{\hat{q}(\mathbf{x}_{t-1},\mathbf{x}_{0},y)}{\hat{q}(\mathbf{x}_{t-1},y)}$
		$\displaystyle=\frac{\hat{q}(\mathbf{x}_{t-1}\|\mathbf{x}_{t},\mathbf{x}_{0},y)}% {\hat{q}(\mathbf{x}_{0}\|\mathbf{x}_{t-1},\mathbf{x}_{t},y)}\frac{\hat{q}(% \mathbf{x}_{t}\|\mathbf{x}_{0},y)}{\hat{q}(\mathbf{x}_{t-1}\|\mathbf{x}_{0},y)}% \hat{q}(\mathbf{x}_{0}\|\mathbf{x}_{t-1},y)\text{~{}\@.}$

$$\begin{aligned}
L&=\mathbb{E}\bigg[-\log\frac{p_{\theta}(\mathbf{x}_{T})}{\hat{q}(\mathbf{x}_{0}\mid\mathbf{x}_{T},y)}+\log\hat{q}(y\mid\mathbf{x}_{0}) \\
&\qquad-\sum_{t>1}\log\frac{p_{\theta}(\mathbf{x}_{t-1}\mid\mathbf{x}_{t})}{\hat{q}(\mathbf{x}_{t-1}\mid\mathbf{x}_{t},\mathbf{x}_{0},y)}-\log p_{\theta}(\mathbf{x}_{0}\mid\mathbf{x}_{1},y)\bigg] \\
&=\mathbb{E}\bigg[C-\log p_{\theta}(\mathbf{x}_{T})+\log\hat{q}(\mathbf{x}_{0}\mid\mathbf{x}_{T},y) \\
&\qquad+\sum_{t>1}D_{\text{KL}}\big(\hat{q}(\mathbf{x}_{t-1}\mid\mathbf{x}_{t},\mathbf{x}_{0},y)\,\|\,p_{\theta}(\mathbf{x}_{t-1}\mid\mathbf{x}_{t})\big) \\
&\qquad\qquad-\log p_{\theta}(\mathbf{x}_{0}\mid\mathbf{x}_{1},y)\bigg].
\end{aligned} \tag{10}$$

$$\begin{aligned}
\mathbb{E}[\mathcal{L}_{\text{contrastive}}]
&=\mathbb{E}\left[\left\|\frac{\sqrt{\overline{\alpha_{T}}}\tilde{\mathbf{x}}_{0}-\mathbf{x}_{T}}{\sqrt{1-\overline{\alpha_{T}}}}\right\|_{2}^{2}-M\right] \\
&=\mathbb{E}\left[\|\beta(\tilde{\mathbf{x}}_{0}-\mathbf{x}_{0})-\epsilon\|_{2}^{2}-\|\epsilon\|_{2}^{2}\right] \\
&=\mathbb{E}\left[\|\beta(\tilde{\mathbf{x}}_{0}-\mathbf{x}_{0})\|_{2}^{2}-2\left\langle\beta(\tilde{\mathbf{x}}_{0}-\mathbf{x}_{0}),\epsilon\right\rangle\right] \\
&=\mathbb{E}\left[\|\beta(\tilde{\mathbf{x}}_{0}-\mathbf{x}_{0})\|_{2}^{2}\right]-2\beta\,\mathbb{E}\left[\left\langle(\tilde{\mathbf{x}}_{0}-\mathbf{x}_{0}),\epsilon\right\rangle\right].
\end{aligned}$$
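The chain of equalities above can be checked numerically. The sketch below does so under assumptions that are not stated in this section but are implied by the step from the first to the second line: the forward process gives $\mathbf{x}_{T}=\sqrt{\overline{\alpha_{T}}}\,\mathbf{x}_{0}+\sqrt{1-\overline{\alpha_{T}}}\,\epsilon$ with $\epsilon\sim\mathcal{N}(0,I_{M})$, and $\beta=\sqrt{\overline{\alpha_{T}}/(1-\overline{\alpha_{T}})}$. For simplicity, $\tilde{\mathbf{x}}_{0}$ and $\mathbf{x}_{0}$ are both treated as fixed random vectors of length $M$, and the schedule value `ALPHA_BAR_T` is a hypothetical placeholder.

```python
import numpy as np

M = 5                # number of grasp parameters (Table 8)
ALPHA_BAR_T = 0.01   # hypothetical value of the cumulative schedule at t = T
BETA = np.sqrt(ALPHA_BAR_T / (1.0 - ALPHA_BAR_T))  # assumed definition of beta

rng = np.random.default_rng(0)
x0 = rng.normal(size=M)               # stand-in for the clean grasp pose
x0_tilde = rng.normal(size=M)         # stand-in for the guiding region
eps = rng.normal(size=(200_000, M))   # Monte Carlo samples of the noise

# Assumed forward process: one noisy x_T per noise sample (broadcast over rows).
x_T = np.sqrt(ALPHA_BAR_T) * x0 + np.sqrt(1.0 - ALPHA_BAR_T) * eps

# First line of the derivation: squared norm of the scaled residual, minus M.
residual = (np.sqrt(ALPHA_BAR_T) * x0_tilde - x_T) / np.sqrt(1.0 - ALPHA_BAR_T)
first_line = np.mean(np.sum(residual**2, axis=1) - M)

# Last line: squared norm of beta * (x0_tilde - x0) minus twice the mean inner product.
diff = BETA * (x0_tilde - x0)
last_line = np.sum(diff**2) - 2.0 * np.mean(eps @ diff)

print(first_line, last_line)
```

The two printed values agree up to Monte Carlo error, because they differ only by the sample mean of $\|\epsilon\|_{2}^{2}-M$, whose expectation is zero. If $\tilde{\mathbf{x}}_{0}$ is additionally independent of $\epsilon$, the inner-product term vanishes in expectation and only $\beta^{2}\|\tilde{\mathbf{x}}_{0}-\mathbf{x}_{0}\|_{2}^{2}$ remains.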