Abstract
In 3D Artificial Intelligence Generated Content (AIGC), compared with generating 3D assets from scratch, editing extant 3D assets to satisfy user prompts allows the creation of diverse and high-quality 3D assets in a time- and labor-saving manner. More recently, text-guided 3D editing, which modifies 3D assets guided by text prompts, has proven user-friendly and practical, evoking a surge in research within this field. In this survey, we comprehensively investigate recent literature on text-guided 3D editing in an attempt to answer two questions: What are the methodologies of existing text-guided 3D editing? How far has current progress in text-guided 3D editing gone? Specifically, we focus on text-guided 3D editing methods published in the past 4 years, delving deeply into their frameworks and principles. We then present a fundamental taxonomy in terms of the editing strategy, optimization scheme, and 3D representation. Based on the taxonomy, we review recent advances in this field, considering factors such as editing scale, type, granularity, and perspective. In addition, we highlight four applications of text-guided 3D editing, including texturing, style transfer, local editing of scenes, and insertion editing, to further explore the 3D editing capacities with in-depth comparisons and discussions. Based on the insights gained from this survey, we discuss open challenges and future research directions. We hope this survey will help readers gain a deeper understanding of this exciting field and foster further advancements in text-guided 3D editing.
1 Introduction
3D content generation has traditionally been labor-intensive and time-consuming, a situation substantially improved by 3D AIGC techniques in recent years. More recently, 3D asset editing has been deemed a promising research direction in 3D AIGC, empowering potential applications such as virtual reality, augmented reality, and autonomous driving (Hoffman et al. 2023; Taniguchi 2019; Cui et al. 2024). Different from 3D generation, which creates new 3D assets from scratch, 3D editing involves altering the appearance and geometry of 3D assets according to user prompts (e.g., texts, sketches, reference images), encompassing a range of editing operations from modification to addition and deletion. Among these prompts, text-guided 3D editing, which manipulates 3D assets adhering to text prompts, is more user-friendly and applicable; it democratizes the 3D editing process and interests a large number of researchers. However, text-guided 3D editing is challenging, as it must modify a 3D asset faithfully to the semantics of the given text prompt. In addition, 3D editing starts from a given 3D asset, where references to extant components have to be resolved with respect to the text prompt, while unreferenced parts should be kept untouched as much as possible. Other challenges, such as the deficiency of paired text-3D datasets, multi-view inconsistency, and entangled geometry and appearance, make the situation even worse.
Recently, the advent of large language-visual models, like Contrastive Language-Image Pre-training (CLIP) (Radford et al. 2021) and diffusion models (Ho et al. 2020; Zhang et al. 2023), has promoted attempts at text-guided editing systems. Since these visual-language models are trained on large-scale text-image pairs, primary efforts have been made in text-guided image generation and editing (Betker et al. 2023; Oppenlaender 2022; Li et al. 2023b; Ruiz et al. 2023). Subsequently, these text-guided 2D editing techniques [e.g., InstructPix2Pix (Li et al. 2023b) and DreamBooth (Ruiz et al. 2023)] are introduced into 3D editing (Haque et al. 2023; Kamata et al. 2023), which avoids the collection of paired text-3D data and leads to pivotal advancements in text-guided 3D editing by lifting 2D image editing to 3D. Following this pipeline, subsequent methods further improve the editing quality and efficiency through multi-view consistent editing (Dong and Wang 2024; Mirzaei et al. 2023b; Song et al. 2023a), generalized editing (Khalid et al. 2023; Chen et al. 2024b; Fang et al. 2023), etc.
In an alternative pipeline, inspired by text-guided 3D generation (Poole et al. 2022; Lin et al. 2023; Shi et al. 2023; Jun and Nichol 2023; Sanghi et al. 2022), exemplified by CLIP-Forge (Sanghi et al. 2022) and DreamFusion (Poole et al. 2022), a straightforward approach to text-guided 3D editing is to globally modify 3D assets to satisfy new text prompts by fine-tuning pre-trained text-guided 3D generation models. Earlier approaches (Wang et al. 2022; Sanghi et al. 2022; Wang et al. 2023b; Michel et al. 2022) realign 3D assets and new text prompts under CLIP supervision to generate the edited meshes or neural models (e.g., neural radiance fields). The pioneering DreamFusion (Poole et al. 2022) presents score distillation sampling (SDS) to distill the 2D text-image prior into 3D neural representations, providing the foundation for promising diffusion-based 3D editing methods (Chen et al. 2023b, c; Sella et al. 2023; Cheng et al. 2023). Beyond global editing, most recent methods (Zhuang et al. 2023; Chen et al. 2024a; Sella et al. 2023; Li et al. 2024a; Dihlmann et al. 2024) support controllable local editing of 3D assets, where editing occurs in regions of interest localized by text-image cross-attention maps, user-defined 3D bounding boxes, etc., ensuring fine-grained 3D editing. These methods achieve impressive editing results, from local modifications of objects or scenes (Zhuang et al. 2023; Chen et al. 2024a) to spatially constrained insertion (Sella et al. 2023; Li et al. 2024a; Dihlmann et al. 2024) or deletion (Wang et al. 2023).
Fig. 1 The core of our survey contains text-guided 3D editing, its applications, and related work. We begin with the foundational related work, including 3D representations, text-guided image generation and editing (G&E), and text-guided 3D editing optimization. Then, we comprehensively survey text-guided 3D editing methods from the aspects of 2D-3D lifting and direct 3D editing. Finally, we shift attention to four typical text-guided 3D editing applications: texturing, style transfer, local editing of scenes, and insertion editing, for detailed discussions of the 3D editing capacities. Pictures are adapted from Wang et al. (2024), Shi et al. (2022), Haque et al. (2023), Brooks et al. (2023), Zhang et al. (2023), Richardson et al. (2023) and Shahbazi et al. (2024)
Given these significant advancements, a survey that systematically reviews and summarizes these contributions is essentially needed. Several efforts have been made to survey text-guided 3D generation and editing systems (Li et al. 2023d, 2024b; Foo et al. 2023; Chen et al. 2023d). Some studies (Li et al. 2023d; Wang et al. 2024) conduct comprehensive surveys on text-guided 3D generation and its use in various applications. Foo et al. (2023) summarize AIGC for various data modalities, such as image, video, text, 3D shape, 3D motion, and audio, of which text-guided 3D generation and editing take up only a small section. A concurrent work (Chen et al. 2023d) offers a survey on 3D neural stylization driven by texts and images, which focuses more on image-guided editing methods and lacks a detailed classification and evaluation of text-guided 3D editing.
Different from these surveys, we offer an in-depth and comprehensive analysis that is distinctly focused on text-guided 3D editing. We delve deeply into the taxonomy and 3D editing capacities of text-guided 3D editing. As described in Fig. 1, our survey mainly contains text-guided 3D editing, its applications, and its related work. To better understand our classification criteria, we begin with a review of the fundamental techniques, covering text-guided image generation and editing, text-guided 3D editing optimization, and 3D representations, in Sect. 2. Then, in Sect. 3, we review research works published in the past 4 years and organize them into two primary categories in terms of the editing strategy: 2D-3D lifting and direct 3D editing. Each category is further divided and discussed in detail in Sects. 3.2 and 3.3 by considering critical factors, including the optimization schemes and 3D representations. We further investigate the methodologies of text-guided 3D editing in Sect. 3.4 with a discussion about 3D editing constraints. In addition, we explore various text-guided 3D editing capacities in terms of editing scale, type, granularity, and perspective. Specifically, we pay attention to both geometry and appearance editing, covering four 3D editing applications: texturing, style transfer, local editing of scenes, and insertion editing, with detailed experimental comparisons and discussions in Sect. 4. Finally, we present current challenges and potential future trends in Sect. 5. The overall structure of our survey is depicted in Fig. 2.
In summary, this is the first review of recent advances in text-guided 3D editing over the past 4 years. It systematically categorizes and thoroughly assesses the current findings in text-guided 3D editing. Based on the proposed fundamental taxonomy and thorough evaluation, this survey aims to provide guidance for future research directions in this rapidly advancing field.
2 Fundamentals of text-guided 3D editing
In this section, we provide an overview of the technical fundamentals underlying text-guided 3D editing. We first review two image-related tasks: text-guided image generation and editing. Then, we introduce text-guided 3D editing optimization to explain how CLIP loss and diffusion loss work in 3D editing driven by text prompts. Lastly, we summarize 3D representations and categorize them into explicit and implicit for a detailed discussion.
2.1 Text-guided image generation and editing
It is common to perform text-guided 3D editing by elevating the impressive results of text-guided 2D image manipulation to 3D, which makes full use of large pre-trained visual-language models while avoiding the collection of paired text-3D datasets. For text-guided image generation and editing, early efforts (Reed et al. 2016; Xu et al. 2018; Karras et al. 2019; Patashnik et al. 2021) combine Generative Adversarial Networks (GANs) and CLIP to perform image manipulation guided by texts. Recently, thanks to advances in pre-trained diffusion models, diffusion-based methods (Nichol et al. 2022; Saharia et al. 2022; Rombach et al. 2022) have enabled the generation of high-resolution and diverse images, spurring their application in 3D editing. Here, we pay special attention to two related areas: text-guided image generation and editing relying on diffusion models.
2.1.1 Text-guided image generation
The emergence of classifier-free guidance (Ho and Salimans 2022) allows for versatile conditions, e.g., text, as guidance. The pioneering GLIDE (Nichol et al. 2022) and Imagen (Saharia et al. 2022) employ diffusion models with text prompts to control the image generation process and produce images in pixel space. In contrast, methods represented by Stable Diffusion (Rombach et al. 2022) and DALLE-2 (Ramesh et al. 2022) introduce diffusion models into a latent space and generate images with high visual fidelity more efficiently and flexibly. Stable Diffusion (Rombach et al. 2022) applies a diffusion model in the low-dimensional space of pre-trained autoencoders, facilitating training on limited computational resources and decreasing inference costs compared to pixel-based diffusion methods. Subsequent methods (Meng et al. 2023; Ge et al. 2023) extend these studies and create higher-quality images.
Although text-guided image generation models have some ability to edit images (e.g., DALLE-2 can inpaint regions), in most cases, they offer no guarantee that unedited regions will be preserved. Even worse, similar text prompts can yield completely different images. For instance, adding the text prompt “cheese” to “cake” often changes the cake’s shape.
2.1.2 Text-guided image editing
Text-guided image editing aims to alter the appearance, structure, or content of an extant image while keeping non-editing regions unchanged. To this end, Prompt-to-prompt (Hertz et al. 2022) proposes to edit images by injecting the cross-attention maps during the diffusion process, controlling which pixels attend to which tokens of the prompt text during which diffusion steps. In this way, Prompt-to-Prompt enables various editing tasks, such as localized editing by replacing a word and style editing by adding a specification. Imagic (Kawar et al. 2023) performs image editing by producing a text embedding that aligns with both the input image and the target text prompt, and subsequently regenerating with the produced text embedding. Facilitated by this editing strategy, Imagic realizes non-rigid changes such as posture changes. More recently, trained on a dataset of text instructions and image descriptions, InstructPix2Pix (Li et al. 2023b) proposes an approach to editing images based on text instructions. It can effectively capture consistent information from the original image while adhering to text instructions.
Some other efforts (Xu et al. 2023; Ruiz et al. 2023) focus on image editing by generating personalized images that maintain a specific object or concept learned from a small collection of images. Textual Inversion (Gal et al. 2022b) learns a unique identifier word to represent a new subject and incorporates this word into the text encoder’s dictionary. Given a few images of the same subject, DreamBooth (Ruiz et al. 2023) fine-tunes the text-guided image diffusion model Imagen (Saharia et al. 2022) to bind the specific subject to a new and rare identifier (denoted as \(*\)). Recently, some methods, including GLIGEN (Li et al. 2023c), Make-A-Scene (Gafni et al. 2022), ControlNet (Zhang et al. 2023), and T2I-Adapter (Mou et al. 2024), incorporate additional conditions (e.g., bounding boxes, segmentation maps) to cooperate with text prompts, making the image generation process more controllable and thus offering these image generation models the ability to edit images. For example, ControlNet (Zhang et al. 2023) expands the pre-trained Stable Diffusion (Rombach et al. 2022) with a new ControlNet structure, applied to each encoder level of the base U-Net, to add different guidance in the denoising process, supporting various task-specific conditions, such as canny edges, human poses, depth maps, surface normals, and sketches. More recently, some works, like CustomDiffusion (Kumari et al. 2023) and FastComposer (Xiao et al. 2023), explore generating personalized images containing multiple concepts or objects.
Finally, progress has also been made in text-guided image inpainting (Rombach et al. 2022; Manukyan et al. 2023; Xie et al. 2023), which involves filling in masked regions of an image with new content consistent with the given text. Stable Diffusion (Rombach et al. 2022) can also perform image inpainting with latent diffusion, achieving impressive results. LaMa inpainting (Suvorov et al. 2022) proposes a new inpainting network architecture using fast Fourier convolutions (FFCs), which have an image-wide receptive field, achieving excellent inpainting performance even in challenging scenarios. We direct readers to Huang et al. (2024a) for a comprehensive discussion of this field.
Discussions on image editing for text-guided 3D editing Lifting text-guided image editing to 3D is not a straightforward extension of existing 2D methodologies due to multi-view editing inconsistency in 3D space. If image editing results across views have large inconsistencies, 3D editing based on them will fail to consolidate in 3D. Even worse, 3D editing cannot recover from incorrect image editing. In addition, the inherent limitations of 2D image editing make 3D editing challenging. For instance, InstructPix2Pix (Brooks et al. 2023) works well for replacement editing, style transfer, and texture modification, but it struggles to complete addition or deletion edits due to its limited spatial reasoning abilities. Some image inpainting methods, like LaMa inpainting (Suvorov et al. 2022) and the image inpainter of Rombach et al. (2022), are usually effective at deletion editing. The performance of depth-conditioned ControlNet (Zhang et al. 2023) depends on the quality of depth maps. In practice, it is essential to consider two factors when applying 2D editing models to text-guided 3D editing: their editing quality and their editing capacities (e.g., style transfer, replacing objects, or removing objects).
Learning from rich experiments and experience, recent work in text-guided 3D editing can successfully distill the strong priors of text-guided image editing methods to 3D, enabling the editing of 3D assets, from modifying the appearance or geometry of 3D content (Kamata et al. 2023; Haque et al. 2023) to removing objects (Wang et al. 2023) and adding new structures (Dihlmann et al. 2024; Shum et al. 2024; Shahbazi et al. 2024). Concretely, to overcome the challenge of inconsistent image editing in 3D, InstructN2N (Haque et al. 2023) proposes a multi-stage dataset updating strategy, which iteratively edits the rendered images of the original 3D scene through InstructPix2Pix (Brooks et al. 2023) and then uses the edited images to reconstruct the target 3D scene. Facilitated by InstructPix2Pix, InstructN2N is effective at various 3D edits, including editing textures, replacing objects, and changing the global properties of a scene. Subsequent attempts further improve InstructN2N in terms of better 3D consistency of editing (Dong and Wang 2024), local editing (Mirzaei et al. 2023b), and highly efficient editing (Song et al. 2023a; Chen et al. 2024a; Wang et al. 2024a). For instance, to be more consistent in 3D, ViCA-NeRF (Dong and Wang 2024) edits key views using InstructPix2Pix and then utilizes the depth maps and camera poses to propagate the editing results to other views.
For removing objects from 3D scenes, some studies (Wang et al. 2023; Mirzaei et al. 2023a) take advantage of text-guided image inpainting methods (Rombach et al. 2022; Xie et al. 2023). For example, InpaintNeRF360 (Wang et al. 2023) removes an arbitrary number of objects from a 3D scene and fills in the missing regions with perceptually plausible and view-consistent content using a pre-trained image inpainter (Rombach et al. 2022). Additionally, some researchers dedicate their efforts to adding new structures (Shum et al. 2024; Dihlmann et al. 2024; Shahbazi et al. 2024) based on these image editing models. LanguageFusion (Shum et al. 2024) adopts DreamBooth (Ruiz et al. 2023) for view synthesis of the scene and the object, and builds a dataset for training the target scene conditioned on the user-provided 3D bounding box. Based on a user-defined proxy mesh placed into the existing scene, SIGNeRF (Dihlmann et al. 2024) renders a collection of color, depth, and mask images, which are then fed into a depth-conditioned ControlNet (Zhang et al. 2023) to produce multi-view edited images of the target NeRF. More details are provided in Sect. 3.2.
2.2 Text-guided 3D editing optimization
Compared with image editing, text-guided 3D editing is more challenging because of the scarcity of paired text-3D datasets. Recently, by means of pre-trained language-visual alignment models, exemplified by Contrastive Language-Image Pre-training (CLIP) (Radford et al. 2021) and the Denoising Diffusion Probabilistic Model (DDPM) (Ho et al. 2020), text-guided 3D generation and editing have emerged as a promising direction.
2.2.1 CLIP-based optimization
The CLIP model bridges the semantic gap between image pixels and natural language prompts by using an image encoder and a text encoder, with the image encoder employing ResNet (He et al. 2016) or the Vision Transformer (ViT) (Dosovitskiy et al. 2020) and the text encoder adopting a text Transformer (Vaswani et al. 2017). Both encoders are jointly trained on a large dataset of 400 million text-image pairs collected from the public Internet, ensuring semantically rich and well-matched representations. Building on these aligned representations, the absolute and relative directional CLIP losses (Gal et al. 2022a; Wang et al. 2023b) are widely used in text-guided 3D editing.
The absolute directional CLIP loss \(\mathcal {L}_{sim}(I, T)\) measures the semantic similarity between the output image I and the target text prompt T by computing the cosine distance between them in the CLIP embedding space, which is defined as:

$$\mathcal {L}_{sim}(I, T) = \mathcal {D}_{cos}\big (E_{img}(I), E_{txt}(T)\big ),$$

where \(\mathcal {D}_{cos}\) denotes the cosine distance, and \(E_{img}(\cdot )\) and \(E_{txt}(\cdot )\) are the image and text encoders of CLIP.
The relative directional CLIP loss \(\mathcal {L}_{dir}\) measures the cosine distance between the direction of the change from the input to the output image and the direction of the change from the source to the target text prompt, which is defined as:

$$\mathcal {L}_{dir} = \mathcal {D}_{cos}\big (\varDelta {I}, \varDelta {T}\big ),$$

where \(\varDelta {I}=E_{img}(I_{target})-E_{img}(I_{source})\) and \(\varDelta {T} = E_{txt}(T_{target}) - E_{txt}(T_{source})\). \(I_{target}\) and \(T_{target}\) are the output image and its text description, and \(I_{source}\) and \(T_{source}\) are the input image and its text description.
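To make the two losses concrete, the following PyTorch sketch computes them from pre-computed CLIP embeddings; the function and tensor names are ours, and the embeddings are assumed to come from an off-the-shelf CLIP encoder.

```python
import torch
import torch.nn.functional as F

def clip_absolute_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
    # Cosine distance between the rendered image embedding and the target text embedding.
    return 1.0 - F.cosine_similarity(img_emb, txt_emb, dim=-1).mean()

def clip_directional_loss(src_img_emb: torch.Tensor, tgt_img_emb: torch.Tensor,
                          src_txt_emb: torch.Tensor, tgt_txt_emb: torch.Tensor) -> torch.Tensor:
    # Cosine distance between the image-space and text-space edit directions.
    delta_img = tgt_img_emb - src_img_emb
    delta_txt = tgt_txt_emb - src_txt_emb
    return 1.0 - F.cosine_similarity(delta_img, delta_txt, dim=-1).mean()
```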
2.2.2 Diffusion-based optimization
Vanilla Diffusion-based Loss The diffusion model (Ho et al. 2020) defines both forward and reverse processes parameterized by a Markov chain. The forward process gradually adds noise to the clean input data \(\text{x}_0\sim {p(\text{x})}\) according to a noise schedule \(\beta _{1:T}\), where T denotes the number of time steps. Theoretically, when \(T\rightarrow \infty \), \(\text{x}_T\) approaches a standard Gaussian distribution.
Conversely, the reverse process denoises the input starting from random noise \(\varvec{\epsilon }\sim \mathcal {N}(\textbf{0},\textbf{I})\) over T time steps. It models the reverse transition \(p_\theta (\text{x}_{t-1}|\text{x}_t)\) by a parametric Gaussian distribution whose mean \(\varvec{\mu }_\theta (\text{x}_t,t)\) and variance \(\varvec{\Sigma }_\theta (\text{x}_t,t)\) are predicted from the noisy data using a U-Net built on residual and Transformer blocks. The diffusion model can synthesize high-quality data and is trained in a stable manner:

$$p_\theta (\text{x}_{0:T}) = p(\text{x}_T)\prod _{t=1}^{T}p_\theta (\text{x}_{t-1}|\text{x}_t), \quad p_\theta (\text{x}_{t-1}|\text{x}_t) = \mathcal {N}\big (\text{x}_{t-1}; \varvec{\mu }_\theta (\text{x}_t,t), \varvec{\Sigma }_\theta (\text{x}_t,t)\big ),$$

where \(p(\text{x}_T) = \mathcal {N}(\text{x}_T;\textbf{0},\textbf{I})\).
Here, we focus on text-guided image generation and editing diffusion models, which use the text prompt y as the condition for the prediction of the noise \(\varvec{\epsilon }_{\phi }(\text{x}_t,y,t)\). The prediction network is trained to minimize the following loss function:

$$\mathcal {L}_{diff} = \mathbb {E}_{\text{x}_0, t, \varvec{\epsilon }\sim \mathcal {N}(\textbf{0},\textbf{I})}\Big [\omega (t)\big \Vert \varvec{\epsilon }_{\phi }(\text{x}_t, y, t) - \varvec{\epsilon }\big \Vert _2^2\Big ],$$

where \(\omega (t)\) is a coefficient that depends on the time step t. Furthermore, classifier-free guidance (CFG) (Ho and Salimans 2022) is widely leveraged to improve the quality of results via a guidance scale parameter \(\omega \):

$$\hat{\varvec{\epsilon }}_{\phi }(\text{x}_t, y, t) = (1+\omega )\,\varvec{\epsilon }_{\phi }(\text{x}_t, y, t) - \omega \,\varvec{\epsilon }_{\phi }(\text{x}_t, t).$$
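The CFG combination amounts to a single extrapolation between the conditional and unconditional noise predictions. A minimal sketch (tensor names are ours) is:

```python
import torch

def classifier_free_guidance(eps_cond: torch.Tensor,
                             eps_uncond: torch.Tensor,
                             guidance_scale: float) -> torch.Tensor:
    # Extrapolate away from the unconditional prediction toward the text-conditioned one:
    # eps_hat = (1 + w) * eps_cond - w * eps_uncond
    return (1.0 + guidance_scale) * eps_cond - guidance_scale * eps_uncond
```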
The vanilla diffusion-based loss, operating on the image or its latent space, is often applied to generate or alter textures of 3D shapes, which will be discussed in detail in Sect. 4.1.
Score Distillation Sampling For text-guided 3D generation and editing, DreamFusion (Poole et al. 2022) introduces score distillation sampling (SDS) to distill the prior knowledge of pre-trained text-to-image diffusion models into a Neural Radiance Field (NeRF), demonstrating impressive 3D content creation capacity conditioned on given text prompts. Concretely, the image \(\text{x}=g(\theta )\) is rendered by a differentiable generator g from a 3D representation parameterized by \(\theta \), and the gradient with respect to \(\theta \) is calculated as:

$$\nabla _{\theta }\mathcal {L}_{SDS}(\phi , \text{x}=g(\theta )) = \mathbb {E}_{t,\varvec{\epsilon }}\Big [\omega (t)\big (\hat{\varvec{\epsilon }}_{\phi }(\text{x}_t; y, t) - \varvec{\epsilon }\big )\frac{\partial \text{x}}{\partial \theta }\Big ],$$

where \(\phi \) denotes the parameters of the frozen pre-trained diffusion model.
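As a rough illustration of how SDS supervises a 3D representation, the following sketch performs one SDS update step. The renderer, the frozen denoiser, and the noise schedule are passed in as placeholders; their names and signatures are hypothetical and not tied to any particular codebase.

```python
import torch

def sds_step(render_fn, denoiser, params, text_emb, alphas_cumprod,
             guidance_scale=100.0, w_fn=lambda t: 1.0):
    """One score distillation sampling step (simplified sketch).

    render_fn(params) -> image tensor x rendered differentiably from 3D parameters
    denoiser(x_t, t, cond) -> predicted noise of a frozen text-to-image diffusion model
    `params` is assumed to require gradients.
    """
    x = render_fn(params)                                  # x = g(theta)
    t = torch.randint(20, 980, (1,), device=x.device)      # random timestep
    alpha_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(x)
    x_t = alpha_bar.sqrt() * x + (1 - alpha_bar).sqrt() * noise   # forward diffusion

    with torch.no_grad():                                  # the diffusion model stays frozen
        eps_cond = denoiser(x_t, t, text_emb)
        eps_uncond = denoiser(x_t, t, None)
        eps_hat = (1.0 + guidance_scale) * eps_cond - guidance_scale * eps_uncond

    # w(t) * (eps_hat - eps) is treated as the gradient w.r.t. the rendered image and
    # backpropagated through the differentiable renderer into the 3D parameters.
    grad = w_fn(t) * (eps_hat - noise)
    loss = (grad.detach() * x).sum()
    loss.backward()
    return loss
```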
Transformer model The Transformer model, introduced by Vaswani et al. (2017), stands as a pivotal factor in the evolution of CLIP and diffusion-based models. The Transformer model consists of embedding layers, position encoding, scaled dot-product attention, and multi-head attention, offering an innovative solution to capturing long-range dependencies in sequences.
Scaled dot-product attention is the cornerstone of the Transformer architecture, which allows the Transformer model to focus on relevant information within the input sequence. It computes a weighted sum of all input value vectors, where the weights are determined dynamically by the dot products of query and key vectors:

$$Attention(Q, K, V) = softmax\Big (\frac{QK^{T}}{\sqrt{d_k}}\Big )V,$$

where Q, K, and V are the query, key, and value matrices, respectively, and \(d_k\) is the dimension of the key vectors.
Multi-head attention performs the attention function in parallel by projecting the \(d_{model}\)-dimensional queries, keys, and values h times with different learned linear projections to \(d_k\), \(d_k\), and \(d_v\) dimensions. The outputs of all heads are then concatenated and linearly transformed, resulting in the final values. This allows the model to jointly capture both local and global information from different representation subspaces, providing a richer representation:

$$MultiHead(Q, K, V) = Concat(head_1, \ldots , head_h)W^{O},$$

where \(head_i = Attention(QW^Q_i, KW^K_i, VW^V_i)\) represents the output of the ith attention head, parameterized by \(W^Q_i\in \mathbb {R}^{d_{model}\times {d_k}}\), \(W^K_i\in \mathbb {R}^{d_{model}\times {d_k}}\), and \(W^V_i\in \mathbb {R}^{d_{model}\times {d_v}}\), respectively, and \(W^{O}\in \mathbb {R}^{hd_v\times {d_{model}}}\) is a learned linear transformation.
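For reference, a minimal PyTorch sketch of scaled dot-product and multi-head attention as formulated above is given below; the module layout and shape conventions (batch, sequence, \(d_{model}\)) are ours.

```python
import math
import torch
import torch.nn as nn

def scaled_dot_product_attention(q, k, v):
    # softmax(Q K^T / sqrt(d_k)) V
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ v

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_k = num_heads, d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)  # the output projection W^O

    def forward(self, q, k, v):
        b = q.size(0)
        # Project, then split into heads: (batch, heads, seq, d_k)
        split = lambda x: x.view(b, -1, self.h, self.d_k).transpose(1, 2)
        out = scaled_dot_product_attention(split(self.w_q(q)),
                                           split(self.w_k(k)),
                                           split(self.w_v(v)))
        # Concatenate heads and apply the output projection
        out = out.transpose(1, 2).contiguous().view(b, -1, self.h * self.d_k)
        return self.w_o(out)
```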
2.3 3D representations
Considering trade-offs between efficiency and representational capability, various 3D representations have been proposed in computer vision and computer graphics. These 3D representations can be mainly categorized as explicit or implicit, and each has its own pros and cons in terms of memory and time efficiency, cooperation with deep neural networks, and so on. Explicit 3D representations, such as voxel grids, point clouds, and meshes, can intuitively represent the surface or outline of objects and are easy to edit. However, most of these representations tend to be heavily memory-consuming. In contrast, implicit 3D representations, especially neural implicit representations like Neural Radiance Fields (NeRF), can be more flexible and lightweight in memory and topology, but they are time-consuming to optimize and rely on effective supervision. To benefit from both implicit and explicit representations, hybrid representations combine them, potentially improving rendering quality and efficiency. In the following sections, we introduce the formulations of these widely used 3D representations and their representative work.
2.3.1 Explicit 3D representations
We introduce the typical explicit 3D representations, including voxel grids, point clouds, and meshes, for their popularity and representativeness. We demonstrate the feature extraction and differentiable rendering of each kind of representation, which can maintain compatibility with gradient-based optimization in text-guided 3D generation and editing.
Point Clouds A point cloud is an unordered collection of points in 3D space, where each point can carry attributes (e.g., colors and normals). Early methods (Atzmon et al. 2018; Shi et al. 2020) convert unordered point clouds to regular voxels to utilize 3D convolution operations for feature extraction. However, these methods suffer from a loss of geometric details due to the quantization of the voxels. Differently, the pioneering PointNet (Qi et al. 2017a) and its variant PointNet++ (Qi et al. 2017b) operate directly on discrete point sets using feedforward neural networks. Subsequent works (Wang et al. 2019; Wu et al. 2019) redefine convolution operations for irregular point clouds. EdgeConv (Wang et al. 2019) builds a dynamic graph and learns local features for each point from its neighbors.
In order to render point clouds, previous methods (Pfister et al. 2000; Zwicker et al. 2001; Yifan et al. 2019) take oriented point clouds with a radius as surfels, disks, or ellipses to model the geometry and define surface splatting-based differentiable rendering pipelines to adjust point locations, normals, colors, etc. Some other approaches (Nalbach et al. 2017; Bui et al. 2018) replace physics-based rendering with a generative neural network, so that some of the mistakes of the modeling pipeline can be rectified by the rendering network. Furthermore, some techniques (Aliev et al. 2020; Rakhimov et al. 2022; Lassner and Zollhofer 2021; Rückert et al. 2022) embed point clouds with learnable features and adopt deep network-based rendering [e.g., a U-Net-like rendering network (Aliev et al. 2020)], enabling high-fidelity rendering results. Nonetheless, the rendering speed of such methods, e.g., Pulsar (Lassner and Zollhofer 2021) and ADOP (Rückert et al. 2022), is still not fast enough. More recently, 3D Gaussian Splatting (Kerbl et al. 2023) achieves real-time rendering thanks to a tile-based rendering algorithm for the projected Gaussians. Some other methods (Xu et al. 2022; Zimny et al. 2024; Hu et al. 2023) try to incorporate point clouds into neural radiance fields to achieve high-resolution rendering results. These recent methods will be discussed in detail in hybrid representations.
Meshes Meshes visually display 3D shape surfaces with a compact collection of vertices, edges, and faces, whose structures are typically triangles or quadrilaterals (Shirman and Sequin 1987). Meshes can be easily manipulated and rendered in traditional computer graphics pipelines and are widely used in applications such as geometry processing, texture mapping, animation, and rendering, making them seamlessly compatible with existing software and hardware workflows for 3D generation and editing. To generate meshes, many methods (Hanocka et al. 2019; Zhou et al. 2020) represent a mesh as a graph and employ graph-based convolutional neural networks to estimate vertex locations and topology. Furthermore, some methods (Wang et al. 2018; Wen et al. 2019; Gao et al. 2023) pre-define a fixed mesh template and predict vertex displacements to deform it toward the target shape.
Some mesh-based differentiable renderers, like OpenDR (Loper and Black 2014), neural mesh renderer (Kato et al. 2018), and soft rasterizer (Liu et al. 2019), update mesh attributes (e.g., color) using classical rendering techniques with gradient-based optimization. Additionally, some methods (Thies et al. 2019; Richardson et al. 2023; Cao et al. 2023) can generate textured meshes with texture maps for high rendering quality. Similar to voxel grids and point clouds, the mesh representation can also be integrated with neural implicit representations (Gao et al. 2020; Yang et al. 2022) for a balance between high-quality and high-speed rendering.
Voxel Grids Similar to pixels in 2D images, voxel grids are regularly and densely arranged in 3D space, where each voxel stores a geometry occupancy (Maturana and Scherer 2015; Wu et al. 2015). The regularity of voxel grids makes them naturally compatible with 3D deep neural networks, and they are widely used in 3D vision tasks such as 3D reconstruction and generation (Choy et al. 2016; Häne et al. 2017; Wu et al. 2016). However, voxel-based methods are highly memory-consuming and thus have difficulty handling high resolutions. For memory efficiency, octree-based methods (Riegler et al. 2017; Tatarchenko et al. 2017) introduce the octree data structure for shape modeling. However, these works still have limitations in preserving shape details such as the smoothness of the surface.
Recent approaches (Müller et al. 2022; Fridovich-Keil et al. 2022) combine voxel grids with implicit representations to improve training and rendering efficiency and quality. These advancements in 3D representation have significantly enhanced the efficiency and effectiveness of 3D generation and editing (Sella et al. 2023; Haque et al. 2023; Kamata et al. 2023).
2.3.2 Implicit 3D representations
Traditional implicit 3D representations include the Signed Distance Function (SDF), the Truncated Signed Distance Function (TSDF), etc. Recently, neural implicit representations, such as DeepSDF and Neural Radiance Fields (NeRF), which use deep neural networks to implicitly represent a shape, have gained the most popularity in 3D reconstruction (Park et al. 2019; Mildenhall et al. 2021; Barron et al. 2022) and 3D generation and editing (Poole et al. 2022; Haque et al. 2023).
Signed Distance Function (SDF) SDF is a continuous function that defines a 3D surface as the zero-level set of a distance field. It can be formulated as:

$$SDF(\text{x}) = s, \quad \text{x}\in \mathbb {R}^{3},\ s\in \mathbb {R},$$

where s is the signed value that depicts the distance from point x in 3D space to its closest surface, with the sign indicating whether the point is inside or outside the surface. If the point is inside the object, its sign is negative; otherwise, it is positive. Thus, the underlying 3D surface is implicitly defined as the isosurface of \(SDF(\cdot ) = 0\), which can be extracted as a smooth mesh using Marching Cubes (Lorensen and Cline 1998).
Truncated Signed Distance Function (TSDF) TSDF extends the SDF by truncating the distance function at a certain threshold value, improving computational efficiency. TSDF is depicted as:

$$TSDF(\text{x}) = \max \big (-D_{trunc}, \min \big (D_{trunc}, SDF(\text{x})\big )\big ),$$

where \(SDF(\text{x})\) is the signed output of SDF and \(D_{trunc}\) is the threshold value determining the distance beyond which the function is truncated. In practice, the 3D shape is organized as voxel grids, where each voxel stores the output of TSDF, representing the truncated distance between the voxel and the nearest surface of the object. Both SDF and TSDF are commonly used in 3D reconstruction and rendering (Newcombe et al. 2011; Curless and Levoy 1996; Stutz and Geiger 2018). Compared to SDF, TSDF can build a smoother transition between the interior and exterior regions of the object with modest computational resources.
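As a toy illustration of the two definitions, the following sketch evaluates the analytic SDF of a sphere and its truncated counterpart; the example geometry and truncation distance are arbitrary choices of ours.

```python
import numpy as np

def sphere_sdf(points: np.ndarray, center: np.ndarray, radius: float) -> np.ndarray:
    # Negative inside the sphere, positive outside, zero on the surface.
    return np.linalg.norm(points - center, axis=-1) - radius

def tsdf(sdf_values: np.ndarray, d_trunc: float) -> np.ndarray:
    # Truncate the signed distance to the band [-d_trunc, d_trunc].
    return np.clip(sdf_values, -d_trunc, d_trunc)

# Example: query a few points around a unit sphere at the origin.
pts = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 0.5], [0.0, 0.0, 2.0]])
print(sphere_sdf(pts, np.zeros(3), 1.0))                    # [-1.0, -0.5, 1.0]
print(tsdf(sphere_sdf(pts, np.zeros(3), 1.0), 0.8))         # [-0.8, -0.5, 0.8]
```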
DeepSDF Inheriting from SDF and TSDF, DeepSDF (Park et al. 2019) represents a 3D shape’s surface in a volumetric manner, where the magnitude of the SDF value indicates the distance from a point to its nearest surface, and the sign indicates whether the point is inside the shape or not. However, a classical SDF can depict only one shape at a time, which is neither feasible nor computationally efficient. Differently, DeepSDF is a learned continuous SDF representation that models various shapes as the zero isosurfaces of SDFs predicted by a feed-forward network. Specifically, by introducing latent codes for various shapes, DeepSDF directly regresses the continuous SDF from the point sample x using a single deep neural network, which can be defined as:

$$f_{\theta }(z_i, \text{x}) \approx SDF^{i}(\text{x}),$$

where \(z_i\) is the latent code of the ith shape, and \(f_{\theta }\) is a multi-layer fully connected neural network. The network is trained by minimizing the \(L_1\) loss between the predicted and real SDF values of sampled points. Once trained, the shape can be discretized for visualization by Marching Cubes.
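A minimal sketch of a DeepSDF-style decoder \(f_{\theta }(z_i, \text{x})\) is given below; the latent dimension, layer widths, and activation choices are illustrative assumptions rather than the exact configuration of the original paper.

```python
import torch
import torch.nn as nn

class DeepSDFDecoder(nn.Module):
    """f_theta(z, x) -> signed distance, for a batch of (latent code, 3D point) pairs."""
    def __init__(self, latent_dim: int = 256, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Tanh(),   # bounded SDF prediction
        )

    def forward(self, z: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # Concatenate the shape latent code with the query point coordinates.
        return self.net(torch.cat([z, x], dim=-1)).squeeze(-1)

# Training minimizes the L1 loss between predicted and ground-truth SDF values, e.g.:
# loss = torch.nn.functional.l1_loss(decoder(z, x), sdf_gt)
```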
Neural Radiance Field (NeRF) Recently, NeRF (Mildenhall et al. 2021) has been proposed to represent higher-resolution geometry and appearance and render photorealistic novel views of complex scenes, which has been popular in 3D generation and editing fields (Poole et al. 2022; Dihlmann et al. 2024; Haque et al. 2023).
Specifically, NeRF represents a scene using a fully connected deep network, which takes a 3D location \(\textbf{x} = (x, y, z)\) and a 2D viewing direction \((\theta , \phi )\) as inputs, and outputs the volume density \(\sigma \) and view-dependent emitted color \(\textbf{c} = (r, g, b)\). The process is formulated as:

$$F_{\mathbf {\Theta }}: (\textbf{x}, \textbf{d}) \rightarrow (\textbf{c}, \sigma ),$$

where \(F_{\mathbf {\Theta }}\) denotes the multi-layer perceptron network, and the viewing direction \((\theta ,\phi )\) is expressed as a 3D Cartesian unit vector \(\textbf{d}\). Then, NeRF synthesizes views by querying 5D coordinates along camera rays and uses volume rendering techniques to project the output colors and densities into an image.
To be practical, given the camera ray \(\textbf{r}(t) = \textbf{o} + t\textbf{d}\) with camera position \(\textbf{o}\), view direction \(\textbf{d}\), and near and far bounds \(t\in [t_n, t_f]\), the projected color of \(\textbf{r}(t)\) is rendered by sampling N points along the ray:

$$\hat{C}(\textbf{r}) = \sum _{i=1}^{N}\Omega _{i}\big (1-exp(-\rho _i\delta _i)\big )c_i,$$

where \(\rho _i\) and \(c_i\) denote the density and color of the ith sampled point, \(\Omega _{i} = exp(-\sum _{j=1}^{i-1}{\rho _j\delta _j})\) indicates the accumulated transmittance along the ray, and \(\delta _i\) is the distance between adjacent points. With differentiable volume rendering techniques, the network can be optimized from a set of images with known camera poses.
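The discrete volume rendering equation above can be implemented in a few lines; the sketch below assumes per-ray densities, colors, and interval lengths have already been sampled, and vectorizes the accumulated transmittance with a cumulative product.

```python
import torch

def volume_render(densities: torch.Tensor, colors: torch.Tensor, deltas: torch.Tensor) -> torch.Tensor:
    """Composite N samples per ray into a pixel color.

    densities: (num_rays, N), colors: (num_rays, N, 3), deltas: (num_rays, N).
    """
    alphas = 1.0 - torch.exp(-densities * deltas)                 # per-sample opacity
    # Accumulated transmittance Omega_i = exp(-sum_{j<i} rho_j * delta_j)
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=-1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)
    weights = trans * alphas                                      # Omega_i * (1 - exp(-rho_i delta_i))
    return (weights.unsqueeze(-1) * colors).sum(dim=-2)           # (num_rays, 3)
```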
2.3.3 Hybrid 3D representations
Explicit representations can visually depict the 3D shape but at limited resolution. Implicit representations can model complex geometry and topology at arbitrary resolution, but explicit meshes need to be extracted by additional computation, e.g., Marching Cubes. Especially for neural implicit representations, training the deep neural networks is difficult without supervision of the target surface. Therefore, marrying the merits of implicit and explicit 3D representations can potentially improve generation and editing quality and rendering efficiency. DMTet (Munkberg et al. 2022) combines meshes with a deep 3D conditional generative model for high-resolution 3D shape synthesis. More recently, some methods (Fridovich-Keil et al. 2022; Müller et al. 2022; Chen et al. 2022; Sun et al. 2022; Barron et al. 2022; Chan et al. 2022) extend naive NeRF by integrating explicit 3D representations, like voxel grids and point clouds, for training and rendering efficiency. Specifically, Plenoxels (Fridovich-Keil et al. 2022), InstantNGP (Müller et al. 2022), DVGO (Sun et al. 2022), and TensoRF (Chen et al. 2022) use voxel grids, DreamEditor (Zhuang et al. 2023) utilizes meshes, while some 3D Gaussian Splatting-based methods (Chen et al. 2024a; Wang et al. 2024a; Ren et al. 2023) employ point clouds. For example, Direct Voxel Grid Optimization (DVGO) (Sun et al. 2022) uses dense voxel grids to represent 3D geometry and builds a shallow network in the feature space of the voxel grids to capture complex, viewpoint-dependent appearance, achieving rendering quality comparable to NeRF with only about 15 minutes of training. Nvidia's Instant Neural Graphics Primitives (Instant-NGP) (Müller et al. 2022) uses a small fully connected network together with multi-resolution hash encoding for acceleration, facilitating GPU parallel computation and reducing training time to just a few seconds while maintaining rendering quality. Triplane (Chan et al. 2022) accelerates training and inference in a different way: it decomposes the 3D space into three orthogonal planes (such as the XY, XZ, and YZ planes) and represents the features of the 3D shape on these planes. Incorporating discrete explicit representations into implicit representations can effectively exploit the network capacity of the implicit representation and speed up training and rendering.
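To illustrate the triplane idea, the sketch below queries features for 3D points by projecting them onto the three axis-aligned planes and bilinearly sampling each feature map; summing the per-plane features (rather than concatenating them) is one of several possible aggregation choices, and the tensor layout is our assumption.

```python
import torch
import torch.nn.functional as F

def triplane_features(planes: torch.Tensor, points: torch.Tensor) -> torch.Tensor:
    """Query triplane features for 3D points in [-1, 1]^3.

    planes: (3, C, H, W) feature maps for the XY, XZ, and YZ planes.
    points: (N, 3) query coordinates; returns (N, C) aggregated features.
    """
    # Project each point onto the three axis-aligned planes.
    coords = torch.stack([points[:, [0, 1]],   # XY
                          points[:, [0, 2]],   # XZ
                          points[:, [1, 2]]])  # YZ -> (3, N, 2)
    grid = coords.unsqueeze(1)                 # (3, 1, N, 2) for grid_sample
    feats = F.grid_sample(planes, grid, mode='bilinear', align_corners=True)  # (3, C, 1, N)
    # Sum the per-plane features; a small MLP typically decodes the result.
    return feats.squeeze(2).sum(dim=0).transpose(0, 1)  # (N, C)
```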
Deep Marching Tetrahedra (DMTet) DMTet (Munkberg et al. 2022) is a deep 3D conditional model for producing high-resolution 3D shapes, where a hybrid representation is adopted, taking full advantage of meshes and deep implicit representations. It proposes a deformable tetrahedral grid to encode a discretized SDF and a differentiable marching tetrahedra layer to convert the SDF into an explicit surface mesh. During optimization, DMTet learns to adapt the grid resolution by deforming and selectively subdividing tetrahedra, improving the quality of the output shape. Finally, the geometry and topology of the surface can be jointly optimized with a loss function defined explicitly on the surface mesh. Following this line, starting from coarse voxel grids or point clouds, DMTet can generate complex meshes with arbitrary topology and is more performant than Marching Cubes (Lorensen and Cline 1998).
3D Gaussian Splatting (3D GS) 3D GS represents the 3D scene with point-like anisotropic Gaussians \(\mathcal {G}=\{g_1, g_2,\ldots ,g_N\}\), where \(g_i=\{\mu ,\Sigma ,c,\alpha \}\) and \(i\in \{1,\ldots ,N\}\). Specifically, each Gaussian point is characterized by its center position \(\mu \in {\mathbb {R}}^3\), covariance \(\Sigma \in \mathbb {R}^7\), its color represented by spherical harmonics coefficients \(c\in {\mathbb {R}}^k\) (where k indicates the degrees of freedom), and its opacity \(\alpha \in \mathbb {R}^1\). In this representation, a 3D Gaussian can be defined as follows:

$$G(\text{x}) = e^{-\frac{1}{2}\text{x}^{T}\Sigma ^{-1}\text{x}},$$

where x denotes the offset between the query point and the center \(\mu \). The covariance matrix \(\Sigma \) can be decomposed into a rotation matrix \(\textbf{R}\) and a scaling matrix \(\textbf{S}\) for differentiable optimization:

$$\Sigma = \textbf{R}\textbf{S}\textbf{S}^{T}\textbf{R}^{T}.$$
For rendering new viewpoints, 3D Gaussian Splatting uses neural point-based rendering (Zwicker et al. 2001; Yifan et al. 2019; Kerbl et al. 2023) to compute the color of each pixel. The rendering process over N ordered points overlapping a pixel is as follows:

$$C = \sum _{i=1}^{N}c_i\alpha _i\prod _{j=1}^{i-1}(1-\alpha _j),$$

where \(c_i\) and \(\alpha _i\) denote the color and the opacity evaluated from the ith projected 2D Gaussian.
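The alpha-blending formula can be sketched for a single pixel as follows, assuming the overlapping Gaussians have already been depth-sorted and their projected opacities evaluated; tensor names are ours.

```python
import torch

def composite_pixel(colors: torch.Tensor, alphas: torch.Tensor) -> torch.Tensor:
    """Blend N depth-ordered Gaussians overlapping one pixel.

    colors: (N, 3) evaluated colors; alphas: (N,) per-Gaussian opacities after 2D projection.
    """
    transmittance = torch.cumprod(1.0 - alphas + 1e-10, dim=0)
    transmittance = torch.cat([torch.ones(1, device=alphas.device), transmittance[:-1]])
    weights = alphas * transmittance            # alpha_i * prod_{j<i} (1 - alpha_j)
    return (weights.unsqueeze(-1) * colors).sum(dim=0)
```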
Benefiting from the explicit point-like Gaussians and the efficient differentiable rendering approach, 3D Gaussian Splatting achieves high-quality real-time rendering and has recently been used as an alternative to NeRF in 3D and 4D reconstruction tasks (Kerbl et al. 2023; Yang et al. 2023; Wu et al. 2024c) as well as 3D and 4D generation and editing (Tang et al. 2023; Ren et al. 2023; Wang et al. 2024a; Chen et al. 2024a).
3 Text-guided 3D editing
Text-guided 3D editing aims to modify the appearance and geometry of an extant 3D shape to satisfy the requirements of text prompts. Unlike text-to-3D generation, which creates 3D assets from scratch, the starting point of the editing task is a pre-existing 3D shape to be edited, enabling enhanced controllability over the generation process for customized 3D assets. Thus, it is essential to edit the regions of 3D content specified by the text prompt while keeping unreferenced parts unchanged, which makes the editing task challenging. In addition, due to the deficiency of paired text-3D datasets, complex scenes, and ambiguous text prompts, methods struggle to edit 3D assets with multi-view consistency that semantically aligns with the given text prompts. In recent years, much research has attempted to overcome these challenges, achieving promising results.
Fig. 3 Taxonomy of text-guided 3D editing methods from multiple perspectives: editing strategies, optimization schemes, and 3D representations. For better illustration, according to the editing strategies, the surveyed methods are generally divided into 2D-3D lifting (orange border) and direct 3D editing (blue border)
3.1 Taxonomy
As demonstrated in previous sections, the key challenge of 3D editing is to edit the regions of the 3D asset specified by the text prompt while keeping unreferenced parts unchanged. To overcome this challenge, recent research explores two promising editing strategies: 2D-3D lifting and direct 3D editing, depending on where the editing operations take place. Moreover, the editing process essentially involves 3D representations and optimization schemes for creating 3D models and supervising editing operations. Thanks to suitable representations of the 3D model with differentiable rendering algorithms, the edited 3D models can be supervised in the image domain by different optimization schemes, such as reconstruction-based loss, diffusion-based loss, CLIP-based loss, or their hybrids. According to these essential connections and differences, we use the editing strategy, 3D representation, and optimization scheme as classification criteria to classify and discuss existing methods. In this section, we first define the categorization criteria and then organize extant papers according to these criteria, investigating text-guided 3D editing methods from the perspective of their technical implementation. Generally, we classify these techniques into two primary types, 2D-3D lifting and direct 3D editing, where each type is further divided in terms of optimization schemes and 3D representations.
- Editing Strategies refer to 2D-3D lifting and direct 3D editing. The 2D-3D lifting benefits from text-guided image editing techniques and then lifts these editing results to 3D. Differently, direct 3D editing tends to perform editing directly over 3D representations, typically by fine-tuning existing text-guided 3D generation techniques to satisfy new text prompts.
- Optimization Schemes, that is, optimization losses, provide supervision during the 3D editing process, including diffusion-based loss, CLIP-based loss, reconstruction-based loss, and their hybrids. The diffusion-based loss is further classified as vanilla diffusion-based and score distillation sampling (SDS)-based.
- 3D Representations used in text-guided 3D editing can be divided into explicit, implicit, and hybrid. Considering trade-offs between efficiency and representational capability, neural radiance fields (NeRF), meshes, and hybrid representations are the most commonly used.
Compared to other prompts, like images and sketches, text prompts offer flexible expression of different editing requirements, from abstract concepts to concrete objects, and are thus more practical and user-friendly. In 3D generation and editing, text prompts are commonly presented in two forms: descriptions and instructions. For example, given a dog, we can use “a dog wearing a hat” (a description) or “add a hat to the dog” (an instruction) to obtain a dog with a hat on its head. Figure 3 depicts the structured taxonomy of text-guided 3D editing methods. Based on the proposed taxonomy, Table 1 highlights the categorization and comparison results of the surveyed techniques, offering a quick reference. In the table, we show the SDS-based loss and its variants, such as cascaded SDS (CDS), masked SDS, SDS with overlapped semantic component suppression (OSCS), delta denoising score (DDS), and score projection sampling (SPS), with detailed descriptions in the corresponding papers (Cheng et al. 2023; Park et al. 2023; Xie et al. 2023). The additional conditions, provided by users, offer further control for accurate editing.
Fig. 4 The 2D-3D lifting-based 3D editing can be optimized by reconstruction loss or SDS loss. The reconstruction loss-based line reconstructs the target 3D asset from the edited images obtained by pre-trained text-guided image editing models. The SDS loss-based line progressively denoises the noisy renderings of the target 3D asset, guided by the score derived from pre-trained 2D image editing models
3.2 2D-3D lifting
Thanks to advancements in text-guided image editing, learning shape and appearance priors from text-image diffusion models and lifting image editing to 3D has emerged as a promising direction with great potential. An overview of these methods is presented in Table 2. In general, these works fall into three categories. The first category (Haque et al. 2023; Song et al. 2023a; Khalid et al. 2023; Karim et al. 2023; Fang et al. 2023; Wang et al. 2024a; Chen et al. 2024a; Dihlmann et al. 2024), called reconstruction-based optimization, conducts 2D image editing in line with the target text prompt to obtain multi-view renderings of the target 3D model, which are then used as the dataset to reconstruct the target 3D model with a reconstruction loss. The second category (Kamata et al. 2023; Chen et al. 2024b; Zhuang et al. 2023), named diffusion-based optimization, attempts to progressively denoise the noisy renderings of the target 3D model using pre-trained 2D image editing diffusion models with the SDS loss (Poole et al. 2022) and its variants (Cheng et al. 2023; Park et al. 2023; Xie et al. 2023). The typical flow diagrams of these two types are shown in Fig. 4a, b. In addition, some methods (Raj et al. 2023; Chen et al. 2024b; Zhuang et al. 2024) combine these two optimization schemes to take advantage of both reconstruction and diffusion-based losses, which we define as hybrid loss-based optimization.
3.2.1 Reconstruction-based optimization
In this line, methods utilize text-guided image editing models to build an edited image dataset from which the target 3D model is reconstructed with a reconstruction-based loss (described in Fig. 4a). We expound these methods, exemplified by InstructN2N (Haque et al. 2023) and its improved versions (Song et al. 2023a; Dong and Wang 2024; Chen et al. 2024a; Wu et al. 2024a), on two 3D representations: NeRF and 3D Gaussian Splatting. Considering the dataset updating strategies, we generally summarize these methods into two types: multi-stage and multi-view. The multi-stage dataset updating-based methods (Haque et al. 2023; Mirzaei et al. 2023b) iteratively edit a single image at a time to update the target model, while the multi-view dataset updating-based ones update the target model using multi-view edited images obtained at once. The typical flow diagrams of these two types are shown in Fig. 5a, b.
Fig. 5 Considering the dataset updating strategies, the reconstruction-based 2D-3D lifting methods can be generally divided into two types: multi-stage and multi-view. The multi-stage dataset updating-based methods iteratively edit a single image to update the target model, while the multi-view dataset updating-based ones update the target model using multi-view edited images
Reconstruction-Based 2D-3D Lifting on NeRF
Multi-stage dataset updating Given a NeRF of a scene, InstructN2N (Haque et al. 2023) presents a primary instruction-based 3D editing method, which alternates between editing the source images using the instruction-based 2D image editing model InstructPix2Pix (Brooks et al. 2023) and optimizing the underlying NeRF with these edited images. To propagate the image edits across different views, it designs an iterative dataset update strategy that repeats these alternating operations in a multi-stage manner. Despite its promising results, InstructN2N inherits the limitations of the pre-trained InstructPix2Pix model, such as inconsistent editing across views, the inability to perform large spatial manipulations, and over-editing.
The subsequent RMNE (Mirzaei et al. 2023b), LatentEditor (Khalid et al. 2023), and InseRF (Shahbazi et al. 2024) follow the dataset updating strategy of InstructN2N (Haque et al. 2023). Furthermore, to avoid the over-editing observed in InstructN2N (Haque et al. 2023), these methods introduce mask guidance to implement local editing. RMNE (Mirzaei et al. 2023b) introduces a relevance-guided NeRF editing method to modify the main NeRF guided by a relevance field, ensuring the modification occurs in relevant regions (Fig. 6). The core of this method is the relevance-guided image editing module, where relevance maps, derived from the discrepancy between the conditional and unconditional predictions of InstructPix2Pix, are used as 2D masks to localize the editable pixels for fine-grained image editing. During training, RMNE iteratively edits the rendered image for the NeRF and updates the corresponding relevance map for the relevance field via the relevance-guided image editing module, enhancing the quality of text-guided 3D editing over NeRF. LatentEditor (Khalid et al. 2023) embeds the real-world scene into a latent space and conducts the iterative dataset update and NeRF editing in this latent space, reducing the editing time by up to 5-fold compared to InstructN2N (Haque et al. 2023). It computes delta scores using InstructPix2Pix, which serve as 2D masks in the latent space, guiding the local modification of images in that space. The edited image latents are then used to iteratively update the latent NeRF, achieving local NeRF editing. Once the latent NeRF is trained, it can be decoded through the Stable Diffusion decoder (Rombach et al. 2022). Although RMNE (Mirzaei et al. 2023b) and LatentEditor (Khalid et al. 2023) employ mask guidance to alleviate the over-editing problem of InstructPix2Pix, they are still unable to recover from cases in which InstructPix2Pix fails badly. Beyond editing extant objects, InseRF (Shahbazi et al. 2024) can insert a new object into the source scene. It first lifts the 2D object insertion in a reference view of the scene, obtained by a mask-conditioned Imagen inpainting model (Saharia et al. 2022; Lugmayr et al. 2022), to 3D using the single-view object reconstruction model SyncDreamer (Liu et al. 2023), which is then fused into the source scene through 3D placement. Finally, the appearance of the fused scene and object can be refined with the iterative updating strategy of InstructN2N (Haque et al. 2023). The performance of InseRF is limited by the capabilities of the underlying single-view object reconstruction model. In general, the multi-stage dataset updating strategy is time-consuming and prone to multi-view inconsistent editing.
Multi-view dataset updating To be more efficient and multi-view consistent, alternative methods, exemplified by EfficientN2N (Song et al. 2023a), ViCA-NeRF (Dong and Wang 2024), and DN2N (Fang et al. 2023), acquire the multi-view renderings of the target model at once, averting iterative updating. EfficientN2N (Song et al. 2023a) simplifies iterative dataset updating by treating it as a text-guided image editing process across multiple views. It applies corresponding-point-based correspondence regularization to InstructPix2Pix (Brooks et al. 2023) to enhance multi-view consistency during the denoising process, resulting in a collection of edited images used to reconstruct the target NeRF. Thus, EfficientN2N can edit 3D assets 10 times faster than InstructN2N (Haque et al. 2023) while mitigating view inconsistency during editing. ViCA-NeRF (Dong and Wang 2024) introduces geometric and learned regularization operations to propagate image edits from edited to unedited views, thereby enhancing view consistency and avoiding iterative dataset updating. The geometric regularization leverages depth maps to establish image correspondences, while the learned regularization further aligns the latent codes in the text-guided image editing model [e.g., InstructPix2Pix (Brooks et al. 2023)] between edited and unedited images, enabling key views to be edited and the updates to be propagated throughout the entire space. Doing so updates the dataset with multi-view consistent edited images, thus facilitating the target NeRF reconstruction. However, this method's efficacy is contingent upon the accuracy of the depth maps derived from NeRF, which underscores its reliance on the quality of NeRF-generated results. Moreover, the edited NeRF results of both EfficientN2N (Song et al. 2023a) and ViCA-NeRF (Dong and Wang 2024) tend to exhibit increased blurriness relative to the original NeRF, which may be due to the averaging operation introduced into InstructPix2Pix (Brooks et al. 2023) during multi-view alignment.
DN2N (Fang et al. 2023) and FreeEditor (Karim et al. 2023) expand the generalization capability of text-guided 3D editing by training a generalized NeRF, allowing users to edit 3D assets without retraining during the inference stage and reducing editing time and memory consumption. DN2N (Fang et al. 2023) obtains a set of edited images by filtering out noisy results with poor consistency produced by the text-guided image editing model Null-text (Mokady et al. 2023); the remaining inconsistency is addressed by building training data pairs with subtle perturbations for all training scenes. Based on these training data pairs, two cross-view regularization terms are integrated into the training process to eliminate perturbations, resulting in a generalized NeRF. During inference, the editing results can be achieved by the generalized NeRF without retraining for new scenes, reducing editing time and improving editing generalization. Inevitably, DN2N is constrained by the results of Null-text (Mokady et al. 2023): 2D editing models may not always achieve reliable editing results, leading to failures in editing 3D scenes. Differently from the above methods, FreeEditor (Karim et al. 2023) proposes a “single-view editing” scheme where only a starting-view image needs to be edited to obtain the target NeRF, sidestepping multi-view inconsistency and reducing editing time. It also trains a generalized NeRF with an Edit Transformer to enforce intra-view consistency and inter-view editing transfer from the starting view to the target one. Therefore, during inference, only a single image needs to be edited to obtain the edited NeRF. To collect view-consistent edited starting and target views for training the generalized NeRF, FreeEditor repeatedly uses the pre-trained InstructPix2Pix (Brooks et al. 2023) to edit a set of images and inspects the editing results. However, obtaining a reasonable set of edited starting and target views requires hundreds of trial-and-error iterations and hand-crafted selection.
Moreover, InpaintNeRF360 (Wang et al. 2023) follows the pipeline of multi-view dataset updating to accomplish object removal specified by text prompts. Given the text prompt, InpaintNeRF360 (Wang et al. 2023) can remove multiple objects from both unbounded and frontal-facing scenes through a multi-view inpainting technique. It implements the 3D inpainting with a pre-trained image inpainter (Rombach et al. 2022) under the guidance of multi-view segmentation maps produced by a promptable segmentation model called Segment Anything Model (SAM) (Kirillov et al. 2023). The inpainted NeRF is then reconstructed from the multi-view inpainted images to produce view-consistent and photo-realistic novel views. InpaintNeRF360 uses inpainted 2D images as priors for training the edited NeRF. However, the results of image inpainting models heavily rely on high-quality segmentation maps. Furthermore, even with a good segmentation map, current image inpainting methods may output different results across views, which means the edited NeRF cannot converge to a perceptually consistent model.
Multi-view dataset updating in a multi-stage manner LanguageFusion (Shum et al. 2024) and SIGNeRF (Dihlmann et al. 2024) combine multi-view and multi-stage dataset updating strategies to balance editing quality and efficiency. LanguageFusion (Shum et al. 2024) first fine-tunes a text-guided image editing model, DreamBooth (Ruiz et al. 2023), in an inpainting manner to generate multi-view images containing the object of interest and the background. Then it introduces a pose-conditioned dataset updating strategy, where camera views near already-trained views are preferred to update the training dataset. By iteratively updating the dataset and reconstructing the target NeRF, LanguageFusion works well with text-guided object insertion and removal tasks. However, fine-tuning DreamBooth to generate images containing both the object of interest and the background is difficult and typically requires many attempts. SIGNeRF (Dihlmann et al. 2024) adopts a depth-conditioned image editing model, ControlNet (Zhang et al. 2023), to obtain a 3D-consistent reference sheet of multi-view edited images without iterative updating. Given a user-defined proxy mesh or a bounding box placed into the existing scene, a collection of corresponding color, depth, and mask images are rendered and arranged into image grids, which are then fed into ControlNet to produce the multi-view edited images of the target NeRF. This update scheme can easily be repeated to further increase quality if needed. SIGNeRF requires a proxy object for object insertion editing, thereby losing a certain degree of freedom and user-friendliness. Moreover, optimal results are obtained when the object occupies the image’s center and is positioned close to the camera.
Reconstruction-Based 2D-3D Lifting on Hybrid Representation In this section, we introduce methods (Wang et al. 2024a; Zhuang et al. 2024) that adopt 3D Gaussian Splatting (GS) as the scene representation, improving the rendering quality and efficiency.
Multi-stage dataset updating Similar to InstructN2N (Haque et al. 2023), GaussianEditor (Wang et al. 2024a) employs a pre-trained InstructPix2Pix (Brooks et al. 2023) to edit the rendering image of the 3D Gaussian Splatting (3D GS), which is then used to update the target model (Fig. 7). Thanks to 3D GS, this iterative process can be more than twice as fast as InstructN2N (Haque et al. 2023). In addition, GaussianEditor incorporates Large Language Models (e.g., GPT-3.5 Turbo) and SAM (Kirillov et al. 2023) to automatically select the 3D Gaussians of interest, achieving delicate and precise editing. The concurrent GaussianEditor-SC (Chen et al. 2024a) also adopts 3D GS as the representation and progressively updates the target 3D GS with edited rendering images. It proposes Gaussian Semantic Tracing to enhance control over the editing process while achieving stabilized editing results through Hierarchical Gaussian Splatting. Additionally, it designs a 3D inpainting algorithm to work well with object removal and integration. Neither method achieves good editing results in scenes where the grounded segmentation or diffusion model performs poorly. Further, in GaussianEditor (Wang et al. 2024a), textual descriptions for different views of the same object may differ from each other, which may cause Large Language Models (LLMs) to misinterpret them as descriptions of different objects.
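A minimal sketch of this multi-stage (iterative) dataset-updating loop is given below; render_fn, edit_image_2d, and recon_loss are hypothetical placeholders rather than the APIs of GaussianEditor or InstructPix2Pix, and the schedule (one training view refreshed every few iterations) is only indicative.

import torch

def iterative_dataset_updating(params, cameras, dataset, prompt,
                               render_fn, edit_image_2d, recon_loss,
                               n_iters=3000, edit_every=10, lr=1e-3):
    """params: optimizable 3D GS tensors; dataset: dict mapping camera -> training image."""
    opt = torch.optim.Adam(params, lr=lr)
    for it in range(n_iters):
        cam = cameras[it % len(cameras)]
        if it % edit_every == 0:
            # Periodically replace one training image with the 2D edit of the current render.
            with torch.no_grad():
                dataset[cam] = edit_image_2d(render_fn(params, cam), prompt)
        # Standard photometric reconstruction step against the partially edited dataset.
        loss = recon_loss(render_fn(params, cam), dataset[cam])
        opt.zero_grad()
        loss.backward()
        opt.step()
    return params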
Multi-view dataset updating GaussCtrl (Wu et al. 2024a) shares the multi-view dataset updating strategy with NeRF-based methods such as ViCA-NeRF (Dong and Wang 2024) and EfficientN2N (Song et al. 2023a), but instead uses a depth-guided image editing model named ControlNet (Zhang et al. 2023) with an attention-based latent code alignment module to encourage multi-view consistent editing. Given these advancements, GaussCtrl performs 3D editing with higher visual quality and efficiency. However, in the attention-based latent code alignment module, the reference views must be selected empirically. OR-NeRF (Yin et al. 2023) enables users to remove objects from a given 3D scene using either point or text prompts on sparse images (Fig. 8). To perform multi-view consistent object removal, it propagates user-defined point prompts to other views and feeds them to SAM (Kirillov et al. 2023) to obtain multi-view segmentation maps. Based on these multi-view segmentation maps, a pre-trained 2D inpainting model, LaMa (Suvorov et al. 2022), is adopted to inpaint multi-view rendering images of the target NeRF, building a dataset to reconstruct the target NeRF. Note that the text prompt can be converted to point prompts through Grounded-SAM (Ren et al. 2024). Similar to InpaintNeRF360, the results of image inpainting models heavily rely on high-quality segmentation maps.
Multi-view dataset updating in a multi-stage manner VcEdit (Wang et al. 2024b) adopts a 2D image editing model named InfEdit (Xu et al. 2023) to edit the rendered images of the target 3D GS and designs cross-attention consistency and editing consistency modules to generate edited images with high view consistency. Combined with the iterative updating pattern of editing rendered images and updating the 3D GS, VcEdit further reduces inconsistencies in the edited images. Despite the high multi-view consistency, VcEdit depends on 2D editing models to generate image guidance for 3D editing.
3.2.2 Diffusion-based optimization
The pioneering DreamFusion (Poole et al. 2022) adopts a pre-trained 2D text-to-image diffusion model [e.g., Imagen (Saharia et al. 2022)] as a prior to perform text-to-3D generation, where score distillation sampling (SDS) is presented to distill the 2D diffusion process into a 3D NeRF. Following this pipeline, 2D-3D lifting-based editing methods feed the noisy renderings of the target 3D models along with the target text prompts into pre-trained text-guided image editing diffusion models (Li et al. 2023b; Ruiz et al. 2023) to obtain the 2D editing, and then use the SDS loss to obtain the scores that guide lifting the 2D editing to 3D (illustrated in Fig. 4b).
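For clarity, the SDS gradient can be sketched in PyTorch-style pseudocode as follows; the diffusion interface (add_noise, predict_noise) and the constant weighting w(t) = 1 are simplifying assumptions for illustration, not the implementation of DreamFusion or of any particular library.

import torch

def sds_gradient(diffusion, rendering, text_emb, guidance_scale=7.5):
    """Score distillation sampling gradient w.r.t. a rendered image of the target 3D model."""
    t = torch.randint(20, 980, (1,), device=rendering.device)   # random diffusion timestep
    noise = torch.randn_like(rendering)
    noisy = diffusion.add_noise(rendering, noise, t)             # forward process q(x_t | x_0)
    eps_cond = diffusion.predict_noise(noisy, t, text_emb)       # text-conditional noise prediction
    eps_uncond = diffusion.predict_noise(noisy, t, None)         # unconditional noise prediction
    eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)  # classifier-free guidance
    return (eps - noise).detach()                                # w(t) set to 1 for simplicity

# The gradient is injected through the differentiable renderer, e.g.:
# rendering.backward(gradient=sds_gradient(diffusion, rendering, text_emb))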
SDS-Based 2D-3D Lifting on Hybrid Representation Instruct3D-to-3D (Kamata et al. 2023) is the first work that transforms a 3D scene according to text instructions with the SDS loss. It takes the rendering image of the source model with text instructions as denoising conditions, which are fed together with the noisy rendering image of the target NeRF into the pre-trained image editing diffusion model InstructPix2Pix (Brooks et al. 2023) in cooperation with the SDS loss, and outputs the denoised rendering image of the target NeRF. By performing this process from various camera viewpoints, it can obtain the target NeRF conforming to the text instruction. Besides, a dynamic scaling scheme is presented to dynamically adjust the intensity of the geometry conversions, achieving a more controllable and smooth 3D editing process. Failure cases show that Instruct3D-to-3D struggles to follow instructions requiring spatial reasoning.
DreamEditor (Zhuang et al. 2023) first distills NeRF into a mesh-based neural field, facilitating localization of the editing regions and decoupling texture and geometry during editing. Based on this, it progressively locates the editable regions and alters the texture and geometry within these regions, optimized by the SDS loss relying on the pre-trained DreamBooth (Ruiz et al. 2023), obtaining realistic and consistent editing results conforming to the text prompt. The mesh-based neural field faces difficulties in effectively reconstructing backgrounds in unbounded scenes, thus making DreamEditor suitable only for object-centric editing in the foreground of the scene. GSEdit (Palandra et al. 2024) explores the application of the SDS loss in text-guided 3D editing using 3D GS. Different from previous NeRF-based methods, like Instruct3D-to-3D (Kamata et al. 2023), it adapts the SDS to work with 3D GS and utilizes the pre-trained InstructPix2Pix (Brooks et al. 2023) with the adapted SDS loss to denoise the noisy rendering image of the target 3D GS, which is used to update the target 3D GS. Once the editing is complete, mesh extraction and texture refinement can be carried out following DreamGaussian (Tang et al. 2023). The completely separate editing and texture refinement operations may not cooperate well. For example, GSEdit fails to follow the text instruction “Put a hat”, while the 2D editing model InstructPix2Pix can handle this instruction. In practice, due to the step-by-step SDS optimization, the Janus problem, where the generated object appears as a front view from different viewpoints, is often unavoidable in Instruct3D-to-3D (Kamata et al. 2023), DreamEditor (Zhuang et al. 2023), and GSEdit (Palandra et al. 2024).
3.2.3 Hybrid loss-based optimization
Hybrid Loss-Based 2D-3D Lifting on NeRF DreamBooth3D (Raj et al. 2023) proposes a subject-specific 3D asset generation framework, which can edit the NeRF to comply with the given text prompt while preserving the geometric and appearance identity of the source model. To be more specific, it designs a 3-stage optimization scheme where the image editing diffusion model DreamBooth (Ruiz et al. 2023) is partially and fully fine-tuned in the first and second stages, generating multi-view pseudo-subject images. In the final stage, the partial DreamBooth is further fine-tuned with these multi-view images, acting as a multi-view DreamBooth to optimize the target NeRF with the SDS loss along with the multi-view reconstruction loss. Besides the Janus problem, failure results also indicate that DreamBooth3D struggles to deal with thin object structures like sunglasses.
Inspired by the text-to-3D generation method Shap-E (Jun and Nichol 2023), Shap-Editor (Chen et al. 2024b) performs 3D editing in latent space, making each edit take approximately one second at inference. During training, it employs the Shap-E encoder to map the 3D model into the latent space and transforms it into the edited latent code using the Shap-Editor network. The original and edited latent codes are then decoded into the source and edited NeRFs, whose rendering images are fed into the pre-trained image generation or editing models with the SDS loss and reconstruction loss, distilling the 2D editing to 3D. During inference, only the latent code needs to be fed into the Shap-Editor network. While Shap-Editor claims it can learn a latent editor that understands multiple instructions, it cannot yet serve as a fully open-ended editor.
Hybrid Loss-based 2D-3D Lifting on 3D Gaussian Splatting TipEditor (Zhuang et al. 2024) devises a 3D editing framework that accepts text and image prompts with a 3D bounding box to specify the editing location and content, ensuring flexible and precise local editing on 3D GS. It uses a stepwise 2D personalization strategy to learn the personalized representation of the existing scene and the reference image based on LoRA (Hu et al. 2021), where a localization loss defined by the 2D projected bounding box encourages correct placement of the reference object in the existing scene. Then, a coarse edited 3D GS is learned with the SDS loss, whose textures are refined with carefully generated pseudo-GT images derived from the rendered and denoised images utilizing the reconstruction loss. With only a single reference image, content personalization is difficult: for example, it tends to overfit to the reference image in style transfer editing, leading to content leakage in addition to the style of the reference image.
3.3 Direct 3D editing
Despite encouraging editing results, 2D-3D lifting-based editing methods operate on images and struggle to perform consistent editing of 3D assets. These methods suffer from the ‘Janus Problem’ of generating multiple faces from different viewpoints since they operate on images without awareness of 3D structure. Instead, direct 3D editing efforts can directly edit the underlying 3D representations conforming to the target text prompts with the help of pre-trained text-vision models (Radford et al. 2021; Ho et al. 2020; Poole et al. 2022). This pipeline optimizes the edited model by relying on differentiable rendering to connect the edited model with powerful pre-trained text-vision models. Previous methods (Wang et al. 2022, 2023b; Hyung et al. 2023; Song et al. 2023b; Michel et al. 2022; Lei et al. 2022; Memery et al. 2023; Ma et al. 2023; Gao et al. 2023; Chen et al. 2023b) leverage differentiable rendering to obtain the rendering images of the target 3D model and match these images to the target text prompt through CLIP similarity (Radford et al. 2021). Illuminated by score distillation sampling (SDS) (Poole et al. 2022), recent direct 3D editing methods (Cheng et al. 2023; Mikaeili et al. 2023; Sella et al. 2023; Zhou et al. 2023; Metzer et al. 2023; Oh et al. 2023; Decatur et al. 2024; Park et al. 2023; He et al. 2024; Li et al. 2024a; Wang et al. 2023a) using pre-trained diffusion models are under increasing exploration and have achieved exciting editing quality. The typical flow diagrams of these two types are shown in Fig. 9a, b.
3.3.1 CLIP-based optimization
In this line, methods (Wang et al. 2022, 2023b; Michel et al. 2022; Ma et al. 2023) deform the geometry or appearance of a given 3D model to the target one by minimizing CLIP loss between the multi-view renderings of the target model and the target text prompt (depicted in Figure 9a). We divide these methods into CLIP-based 3D editing on NeRF and mesh for a detailed description.
CLIP-Based Direct 3D Editing on Mesh In this section, methods attempt to edit the attributes, like vertex positions and textures, of the source model under the supervision of the CLIP loss. As depicted in Fig. 9a, these methods first design an appearance or geometry editing network to alter the source model and predict the attributes of the target model. Then, the target model is updated by minimizing the CLIP loss, i.e., the semantic discrepancy between the multi-view rendering images (2D image projections) of the target model and the text prompt.
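As a concrete illustration, such a CLIP loss over multi-view renderings might be sketched as follows using the open-source OpenAI clip package (an assumption about tooling, not a requirement of the surveyed methods); the differentiable resize and normalization replace CLIP's PIL-based preprocessing so that gradients can flow back to the mesh attributes through the renderer.

import torch
import torch.nn.functional as F
import clip

device = "cuda"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model = clip_model.float()   # keep fp32 so gradients flow cleanly through the renders
CLIP_MEAN = torch.tensor([0.4815, 0.4578, 0.4082], device=device).view(1, 3, 1, 1)
CLIP_STD = torch.tensor([0.2686, 0.2613, 0.2758], device=device).view(1, 3, 1, 1)

def clip_loss(renderings, prompt):
    """renderings: (V, 3, H, W) differentiable multi-view images in [0, 1]."""
    imgs = F.interpolate(renderings, size=224, mode="bicubic", align_corners=False)
    imgs = (imgs - CLIP_MEAN) / CLIP_STD
    img_feat = clip_model.encode_image(imgs)
    txt_feat = clip_model.encode_text(clip.tokenize([prompt]).to(device))
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    # Maximize the cosine similarity of every rendered view to the target text prompt.
    return 1.0 - (img_feat @ txt_feat.T).mean()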
Given a triangle mesh, Text2Deformer (Gao et al. 2023) aims to deform its shape to adhere to the text prompt, producing large low-frequency shape deformations and small high-frequency details. It models the mesh deformation using Jacobians and edits the source mesh by optimizing per-triangle Jacobians with a main CLIP-based semantic loss and two regularization losses, where the CLIP-based semantic loss ensures that the geometry deformation satisfies the target prompt, a view-consistency regularization loss encourages coherent deformation across multiple views, and a Jacobian regularization loss preserves fidelity to the given mesh. Text2Deformer needs to optimize per-triangle Jacobians for each mesh instance, which is time-consuming; instead of optimization, using a single feed-forward network to predict the Jacobians could be faster. Text2Mesh (Michel et al. 2022) and X-Mesh (Ma et al. 2023) can edit both the geometry and color of a given mesh to texture it. Given a bare mesh, Text2Mesh (Michel et al. 2022) proposes a learned neural network to predict the color and displacement of each vertex on the mesh and finally outputs a vertex-colored mesh conforming to the target text prompt. The learned neural network is optimized under the semantic supervision of the CLIP loss, driving multiple rendering images of the vertex-colored mesh and their 2D augmentations to match the text prompt. However, such a text-independent architecture lacks textual guidance during attribute prediction, leading to slow convergence. Besides the learned neural network, X-Mesh (Ma et al. 2023) designs a text-guided dynamic attention module to integrate the text prompt as guidance for predicting color and position offsets, resulting in more accurate attribute prediction and faster convergence (Fig. 10). However, both Text2Mesh and X-Mesh overlook reflectance information, which is essential for photorealistic editing. Differently, TANGO (Lei et al. 2022) adopts two learned neural networks to predict diffuse, roughness, specular, and normal maps, from which the spatially varying bidirectional reflectance distribution function can be derived, enabling photorealistic editing. While TANGO incorporates reflectance information, it is limited in shape manipulation.
CLIP-Based Direct 3D Editing on NeRF These mesh-based direct 3D editing methods (Michel et al. 2022; Ma et al. 2023) can naturally manipulate the explicit geometry but have limited capacity for modeling and rendering complex scenes. Recently, the implicit NeRF has shown its power in reconstructing and rendering complex scenes. However, directly editing NeRF guided by text prompts is challenging because (1) the geometry and appearance of NeRF are implicitly parameterized and entangled by a fully connected network, making it difficult to precisely edit geometry or appearance separately, and (2) the CLIP loss alone is insufficient to ensure fine-grained editing. Recent NeRF-based direct 3D editing methods address these problems by disentangling the appearance and geometry representations (Wang et al. 2022), refining the CLIP loss (Wang et al. 2023b), and introducing masks for local editing (Song et al. 2023b).
NeRF-Art (Wang et al. 2023b) proposes a CLIP-based global-local contrastive loss that encourages the editing results to be closer to the target text prompt and farther from pre-defined negative samples in the CLIP embedding space; combined with a directional CLIP loss, it enables both global structures and local details to adhere to the semantics of the target text prompt. In addition, it adopts a weight regularization to suppress cloudy artifacts and geometry noise when altering the density field of NeRF. In some cases, NeRF-Art tends to over-edit. Besides, it struggles to deal with text prompts that are linguistically ambiguous or contain semantically meaningless words. CLIP-NeRF (Wang et al. 2022) designs a disentangled conditional NeRF framework (Fig. 11), where shape and appearance codes are employed to control the volumetric field and emitted colors. It then proposes shape and appearance mappers that introduce text or image prompts into the shape and appearance deformation; cooperating with the disentangled conditional NeRF, they allow individual geometry and appearance editing under the optimization of the CLIP loss. However, CLIP-NeRF cannot handle fine-grained and out-of-domain shape and appearance edits, due to the limited expressive ability of the latent space and the lack of diverse training data. BlendingNeRF (Song et al. 2023b) presents a layered NeRF architecture containing a pre-trained NeRF for the original scene and an editable NeRF for the target one. The editable NeRF, combined with a blending operation, is trained to render a blended image matching the target text, guided by the editing region mask derived from CLIPSeg (Lüddecke and Ecker 2022) on the rendering image of the original NeRF specified by the text prompt. Supervised by the CLIP loss with localized editing objectives, BlendingNeRF can locally add new objects, modify textures, and remove parts of the original object. During fine-grained editing, the overall performance of BlendingNeRF can be affected by the pre-trained models, CLIPSeg and CLIP. In addition, it adopts Instant-NGP as the 3D representation, which requires a tradeoff between quality improvement and an increase in computational time.
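The directional CLIP term used by NeRF-Art above can be sketched as follows; encode_image and encode_text are assumed to wrap the CLIP encoders as in the earlier CLIP-loss sketch, and the source/target prompts are purely illustrative.

def directional_clip_loss(src_render, edit_render, src_prompt, tgt_prompt,
                          encode_image, encode_text):
    # Direction of change in image space vs. direction of change described by the prompts.
    d_img = encode_image(edit_render) - encode_image(src_render)
    d_txt = encode_text(tgt_prompt) - encode_text(src_prompt)
    d_img = d_img / d_img.norm(dim=-1, keepdim=True)
    d_txt = d_txt / d_txt.norm(dim=-1, keepdim=True)
    # Encourage the edit to move the renderings along the text-specified direction.
    return 1.0 - (d_img * d_txt).sum(dim=-1).mean()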
CLIP-Based Direct 3D Editing on Hybrid Representation LENeRF (Hyung et al. 2023) proposes to perform text-guided 3D local editing directly on a triplane (Chan et al. 2022). It designs an editing framework comprising three add-on modules, where the latent residual mapper and deformation network are used to produce the source and target triplane features. The key attention field module estimates a soft 3D mask in the triplane feature space, trained with CLIP-generated zero-shot pseudo labels. These 3D masks are then used for interpolation between the source and target features, so that local editing is conducted on the triplane features optimized by the CLIP loss. As indicated by the experiments, LENeRF focuses on local editing of faces.
3.3.2 Diffusion-based optimization
Inspired by recent advancements in text-guided 3D generation (Poole et al. 2022; Lin et al. 2023; Shi et al. 2023; Jun and Nichol 2023; Sanghi et al. 2022), a straightforward way to achieve 3D editing is to fine-tune existing 3D assets with new prompts. As illustrated in Fig. 9b, such editing methods (Sella et al. 2023; Li et al. 2024a; Oh et al. 2023) purify the noisy renderings of the target 3D model through the SDS loss (Poole et al. 2022), where the scores generated by the pre-trained 2D image generation models (Rombach et al. 2022; Zhang et al. 2023; Avrahami et al. 2023; Ho et al. 2020) are applied to guide the modification of the extant 3D models.
Diffusion-Based Direct 3D Editing on Mesh Different from previous texturing methods (Chen et al. 2023a; Cao et al. 2023), 3D-Paintbrush (Decatur et al. 2024) aims to automatically texture local regions on meshes. It develops three networks to produce localization, texture, and background maps, in which the localization map is used to specify the edit region, playing a key role in local texturing. These networks are optimized by the refined SDS loss, named cascaded score distillation (CSD), which leverages multiple stages of a cascaded diffusion model at different resolutions, allowing high-quality textured meshes. Additionally, based on the obtained maps, three texturing variants of the same mesh, including the localization, target, and background, are optimized by the CSD losses along with the corresponding text prompts, improving the quality of the localization and texturing. To accommodate CSD losses, the corresponding cascaded text prompts need to be set empirically.
Diffusion-Based Direct 3D Editing on Hybrid Representation Latent-NeRF (Metzer et al. 2023) proposes to train a latent NeRF in the latent space of diffusion models [e.g., Stable Diffusion (Rombach et al. 2022)], which can easily be transformed back into a regular NeRF for further refinement in RGB space. Given a sketched shape, Latent-NeRF can refine its geometric details and generate textures conforming to the target text prompt, optimized by the SDS loss and an occupancy loss. It further proposes Latent-Paint, which embeds the texture map into the latent space and then directly optimizes the texture map through gradients back-propagated from the rendered mesh, allowing for texture generation. As an early work, Latent-NeRF employs the plain Stable Diffusion as the backbone, which tends to generate unsatisfactory images when specifying the desired direction, even with a directional text prompt (e.g., “front”). Vox-E (Sella et al. 2023) claims that an explicit grid structure is beneficial for editing 3D objects, and thus aims to edit the grid-based volumetric representation [e.g., DVGO (Sun et al. 2022)] of a 3D object, implemented by fine-tuning the grid structure to satisfy the target text prompt through the SDS loss. For global editing, a volumetric regularization loss is introduced to encourage correlation between the densities of the input and edited NeRFs. For local editing, it can generate a refined NeRF by merging the input and edited NeRFs, guided by a volumetric binary mask that marks the voxels to be edited. However, Vox-E (Sella et al. 2023) suffers from noisy editing results because geometry and textures are coupled.
To overcome this problem, some methods (Chen et al. 2023b, c; Li et al. 2024a; Oh et al. 2023) propose to disentangle geometry and appearance modeling and employ a two-stage framework guaranteeing high-quality text-guided 3D editing. Given a 3D ellipsoid, Fantasia3D (Chen et al. 2023c) can subsequently edit its geometry and appearance according to the given text prompt. It introduces DMTet (Munkberg et al. 2022) as the 3D geometry representation and renders the surface normal map extracted from DMTet for geometry modeling, while employing the spatially varying bidirectional reflectance distribution function (BRDF) as the appearance representation and rendering the photorealistic image for appearance modeling. Both geometry and appearance modeling are supervised by the SDS loss, supporting separate geometry and appearance editing. Fantasia3D results in an appearance vastly different from the base mesh, because the whole shape is re-optimized according to the prompts. FocalDreamer (Li et al. 2024a) shares the spirit of the geometry and appearance modeling in Fantasia3D (Chen et al. 2023c), but aims to add new objects or parts to a base shape. Given user-specified ellipsoid areas to be edited, it merges the ellipsoid with the base shape and renders the normal map of the merged shape for geometry modeling. For appearance modeling, it follows a physically based rendering material model and renders the base and edited shapes in a dual-path manner with base and editable textures, which are blended by a pixel-wise discriminative mask for a unified appearance. These processes are optimized by the SDS loss with three regularization losses, including a style consistency loss, a geometric focal loss, and a collision avoidance loss. Although these regularizations can improve the editing quality, they do not always work well together. ControlDreamer (Oh et al. 2023) aims to generate a multi-view consistent texture aligned with the source geometry. Its key component is MV-ControlNet, which can understand the multi-view depths of a mesh and generate multi-view consistent textures. Specifically, under the text prompt of the geometry, it employs MVDream (Shi et al. 2023) to create a NeRF, which is then transformed into a mesh via DMTet. Based on the mesh, it uses the proposed MV-ControlNet to generate a multi-view consistent textured mesh conforming to the text prompt of the appearance. Training MV-ControlNet requires additional datasets and training resources. Besides, it is difficult for ControlDreamer to handle significant geometry changes.
Previous methods focus on adding geometries, overwriting textures, or both, but they cannot support non-rigid editing of 3D assets. Plasticine3D (Chen et al. 2023b) pays attention to performing non-rigid editing guided by text prompts. It proposes a framework containing three key parts: geometry processing, multi-view embedding (MVE), and embedding fusion (EF). The geometry processing is designed to obtain a base geometry for different types of editing tasks, including general addition, median-scale, and global non-rigid transformations. The MVE optimization strategy with the EF can obtain a fused embedding for the base and edited parts, preserving the details of the original object and delivering the desired degree of editing. Besides the SDS loss, a score projection sampling (SPS) loss, which provides stronger guidance toward the target prompt direction, is proposed to facilitate significant modifications. However, geometry editing may change the scale of the object, which needs manual correction. Furthermore, the fine-tuning process generally takes more than two hours, which is too slow for practical use.
3.3.3 Hybrid loss-based optimization
In this section, the SDS or CLIP loss is generally utilized to conduct 3D editing, while CLIP, reconstruction, and other regularization losses are introduced to leave the undesired regions unchanged.
Hybrid Loss-Based Direct 3D Editing on Hybrid Representation Repaint-NeRF (Zhou et al. 2023), SKED (Mikaeili et al. 2023), CustomizeNeRF (He et al. 2024), and LucidDreaming (Wang et al. 2023a) apply the SDS loss to be semantically faithful to the target text prompt. However, the SDS-based optimization supervises the whole scene during editing, leading to unwanted alterations. To overcome this challenge, these methods further introduce additional conditions and losses. Repaint-NeRF (Zhou et al. 2023) first optimizes a feature field to match the given text or patch prompt, which masks and separates the part to be changed. While updating the edited NeRF to align with the target text prompt, it reconstructs the unmasked regions in the rendered images of the edited NeRF to be similar to the source NeRF. Besides, a CLIP loss is adopted to fill the unseen regions in the edited NeRF with background information. Repaint-NeRF can generally modify an instant NeRF with text prompts, but in cases where the new object shape differs greatly from the old one, the training process becomes very difficult. SKED (Mikaeili et al. 2023) introduces user-defined sketches from two viewpoints as guidance to edit the existing NeRF, offering fine-grained and flexible control over the regions of interest. To ensure the editing results respect the given sketches, it devises two regularization losses: a preservation loss and a silhouette loss. The preservation loss preserves the unrelated parts by encouraging the similarity of density and color between the source and edited NeRFs outside the sketch regions, while the silhouette loss encourages the editing to be performed within the sketch regions. Unfortunately, two sketches are not enough for consistent editing in some cases, which makes SKED vulnerable to the Janus problem. CustomizeNeRF (He et al. 2024) proposes a local-global iterative editing training strategy assembled with local and global SDS losses, which alternates between foreground region editing and full-image editing, simultaneously editing the foreground layout to reflect the text prompt and maintaining the background content. Further, a background loss enforces the rendered pixel color of the background region to be the same as the original for background preservation. LucidDreaming (Wang et al. 2023a) presents an effective pipeline capable of fine-grained control over 3D generation. It requires minimal input of 3D bounding boxes, which can be deduced from a simple text prompt using a Large Language Model. Specifically, it proposes clipped ray sampling to separately render and optimize objects according to user specifications. It also introduces an object-centric density blob bias, fostering the separation of generated objects. With individual rendering and optimization of objects, it excels not only in controlled content generation from scratch but also within pre-trained NeRF scenes. However, since objects are rendered separately, LucidDreaming struggles to create interactions between them. Moreover, the training time increases linearly with the number of objects, limiting its application to scenes with many objects.
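A common implementation pattern across these methods pairs a semantic SDS term with a background-preservation term. A minimal sketch follows, assuming a binary editable-region mask, a precomputed SDS gradient (as in the earlier SDS sketch), and an illustrative weight lambda_bg; none of these correspond to the exact formulations of the individual papers.

import torch.nn.functional as F

def masked_edit_loss(edit_render, src_render, mask, sds_grad, lambda_bg=10.0):
    """mask: 1 inside the editable region, 0 elsewhere; sds_grad: precomputed SDS gradient."""
    # Surrogate inner product: back-propagating this term applies sds_grad to edit_render.
    sds_term = (sds_grad.detach() * edit_render).sum()
    # Pixels outside the editable mask are pulled back towards the source rendering.
    bg_term = F.mse_loss(edit_render * (1 - mask), src_render * (1 - mask))
    return sds_term + lambda_bg * bg_term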
Furthermore, Progressive3D (Cheng et al. 2023) and ED-NeRF (Park et al. 2023) improve the editing quality by refining the SDS loss. Progressive3D (Cheng et al. 2023) aims to progressively edit the existing NeRF to generate complex scenes with multiple objects (Fig. 12). It proposes a refined SDS loss with overlapped semantic component suppression (OSCS) to suppress the overlapped components and enhance the influence of the differing semantics, forcing the optimization to focus more on the semantic difference. Additionally, given the 2D masks projected from user-specified 3D bounding boxes, it designs a content consistency constraint that keeps the density and color of the source and target NeRFs consistent in unmasked regions, similar to SKED (Mikaeili et al. 2023). Progressive3D decomposes a difficult generation into a series of editing processes, which improves the editing quality but leads to multiplied time costs and more human intervention. Like Latent-NeRF (Metzer et al. 2023) and Shap-Editor (Chen et al. 2024b), ED-NeRF (Park et al. 2023) encodes scenes into the latent space and performs 3D editing based on the latent diffusion model for faster editing speed and high-quality editing results. To cooperate with 3D editing in the latent space, the SDS loss is substituted with the delta denoising score (DDS) distillation loss, which is determined by subtracting the SDS scores of the target and source text prompts. In combination with a binary mask, the masked DDS guides NeRF in the intended direction of the target prompt without causing unintended deformations. Besides, it introduces an additional reconstruction loss to mitigate undesired deformations beyond the mask. The 3D editing quality is affected by the limited expressive ability of the latent space and imperfect binary masks.
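A hedged sketch of the delta denoising score follows, reusing the assumed diffusion interface from the earlier SDS sketch; note that both branches share the same noise sample and timestep, and a masked variant would simply multiply the returned gradient by a binary editing mask.

import torch

def dds_gradient(diffusion, edit_render, src_render, tgt_emb, src_emb, cfg=7.5):
    t = torch.randint(20, 980, (1,), device=edit_render.device)
    noise = torch.randn_like(edit_render)

    def guided_eps(x, emb):
        x_t = diffusion.add_noise(x, noise, t)          # shared noise and timestep for both branches
        e_c = diffusion.predict_noise(x_t, t, emb)
        e_u = diffusion.predict_noise(x_t, t, None)
        return e_u + cfg * (e_c - e_u)

    # Subtracting the source-branch score cancels components shared by both branches,
    # leaving a cleaner editing direction than plain SDS.
    return (guided_eps(edit_render, tgt_emb) - guided_eps(src_render, src_emb)).detach()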
3.4 Discussions
2D-3D lifting-based methods can benefit from the editing priors of text-based image editing to complete 3D editing, improving the controllability of the 3D editing process. However, editing an individual image at a time lacks 3D awareness, compromising multi-view consistent editing. If the editing results across views have large inconsistencies, 2D-3D lifting-based methods fail to consolidate them in 3D. For instance, the 2D image editing model may edit the same object at different spatial locations across views, causing the 3D editing to fail. Currently, some strategies, such as hybrid optimization and multi-view dataset updating, are proposed to mitigate inconsistent editing, but they cannot solve this challenge completely. Worse still, failed image editing directly leads to 3D editing failure. In addition, 2D-3D lifting-based methods inherit many of the limitations of 2D image editing models. For example, InstructPix2Pix struggles to add or remove regions, perform large spatial manipulations, and so on. Depth-conditioned ControlNet heavily relies on the quality of depth maps. DreamBooth requires personalization training for each scene, and good editing performance depends on empirical fine-tuning.
Direct 3D editing methods generally improve the multi-view consistency of 3D editing since they maintain 3D awareness of the underlying 3D representations. They can also take full advantage of 2D and 3D generation priors to improve the editing quality. However, methods with diffusion-based optimization may inherit some limitations from 2D generation diffusion models, such as the Janus problem and unsatisfactory image generation for the desired direction. Multi-view generation models like MVDream can alleviate these challenges. In addition, without the editing priors from text-based image editing, direct 3D editing methods lack control over the editing process, especially for local editing. Compared with 2D-3D lifting-based methods, direct 3D editing methods have a more pressing need for additional editing constraints to determine the editing regions, ensuring that non-edited areas remain unchanged when editing is completed.
In both categories, iterative optimization based on CLIP, SDS, and hybrid losses is adopted to supervise the editing process, which leaves 3D editing with problems such as slow training and difficult optimization. Most existing methods also need to be optimized individually for each scenario, increasing the cost of inference. Moreover, the lack of large 3D editing datasets makes it impossible to build large-scale models for 3D editing.
Discussions on Editing Constraints To obtain accurate and fine-grained 3D editing results, the core is to introduce editing constraints that limit the desired editing to editable regions, implementing 3D editing adhering to the target text prompt while avoiding unwanted modifications. Some 2D-3D lifting-based 3D editing methods implicitly localize the editing region utilizing the powerful text-guided image editing diffusion models. For example, DreamBooth3D (Raj et al. 2023) adopts DreamBooth (Ruiz et al. 2023), GaussCtrl (Wu et al. 2024a) employs ControlNet (Zhang et al. 2023), and some others, like InstructN2N (Haque et al. 2023), EfficientN2N (Song et al. 2023a), and FreeEditor (Karim et al. 2023), utilize InstructPix2Pix (Brooks et al. 2023) to perform image editing and lift 2D editing to 3D without explicit constraints. However, without explicit editing constraints, these methods can only rely on the editing capacities of 2D image editing, which is not ideal for fine-grained 3D editing. We refer to this editing mode as 2D implicit editing.
To ensure dedicated 3D local editing, further methods (Mirzaei et al. 2023b; Dong and Wang 2024; Song et al. 2023b; Khalid et al. 2023; He et al. 2024) explicitly introduce editing constraints in an automatic or manual manner. Some works (Cheng et al. 2023; Zhuang et al. 2024; Dihlmann et al. 2024) directly involve users in defining additional conditions as editing constraints, guiding the localization of the regions of interest. Different from these works requiring user effort, some other methods (Sella et al. 2023; Mirzaei et al. 2023b; Zhuang et al. 2023) can automatically derive the editing constraints from the semantic correspondences between the target text prompt and rendered images. Here, we categorize these editing constraints into 2D and 3D, considering where they actually operate. Specifically, 2D editing constraints are commonly used to locate editing masks on rendering images of the target 3D model; they include 2D text-image cross-attention maps, 2D relevance maps, 2D bounding boxes, 2D segmentation maps, and 2D masks projected from 3D constraints. Differently, 3D editing constraints, denoted by 3D bounding boxes, proxy objects, proxy points, and 3D masks back-projected from 2D constraints, can directly determine the regions of interest over 3D representations. These 2D or 3D constraints then work with the optimization losses, restricting the back-propagation of gradients to the masked regions specified by the editing constraints. A comprehensive overview of editing constraints in fine-grained text-guided 3D editing is shown in Table 3.
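In practice, such constraints are often enforced by masking gradients during back-propagation. A minimal sketch, assuming a 2D editing mask aligned with the rendered image and a rendering tensor that requires gradients, is given below; it is illustrative only and not the mechanism of any single method.

def constrain_to_mask(rendering, mask_2d):
    """rendering: differentiable image tensor (requires_grad=True); mask_2d: 1 inside the editable region."""
    # Zero out gradients flowing back to pixels outside the editable region, so only the
    # 3D parameters responsible for the masked region receive editing updates.
    rendering.register_hook(lambda grad: grad * mask_2d)
    return rendering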
4 3D editing capacity
So far, we have expounded existing text-guided 3D editing methods in depth, but a question still exists: How has current progress in text-guided 3D editing gone so far? To answer this, we generally review state-of-the-art methods from the point of 3D editing capacity, considering the editing scale, type, granularity and perspective. First, we provide the definition and overview of different editing capacities.
- Editing Scale We mainly classify the editing scale as object and scene levels. Object-level editing refers to single or multiple objects with simple or complex attributes, while scene-level editing focuses on object-centric, indoor, and bounded and unbounded outdoor scenes.
- Editing Type We categorize methods into geometric editing and appearance editing, where geometry editing is capable of transforming the initial shape to the desired target shape, such as altering positions of vertices, and appearance editing refers to re-colorization, texturing, material generation, and style transfer. In fact, most methods can change both geometry and appearance according to the text guidance.
- Editing Granularity We consider two granularities of 3D editing: global editing (e.g., “Make it look like a statue”) and local editing (e.g., “Add a party hat to it”). The global granularity changes a 3D model over its overall space, while local editing modifies a 3D asset confined to editable regions and preserves the rest.
- Editing Perspective In addition to modifying the geometry and appearance of 3D assets, some methods can add or delete objects from scenes, expanding the freedom and variety of 3D editing.
A summary of existing methods in terms of editing capacities is provided in Table 4. We then focus on four text-guided 3D editing applications: texturing, style transfer, local editing of scenes, and insertion editing. Broadly, text-guided texturing and style transfer demonstrate the appearance editing capacities of both objects and scenes at the global level. Local editing of scenes involves geometric or appearance modifications in a local manner. Text-guided insertion editing emphasizes 3D editing with higher degrees of freedom. Due to the open-ended nature of these tasks, most studies evaluate their results with subjective case studies and user studies, presenting a challenge for fair comparison and evaluation. Here, we provide quantitative and qualitative results on the same dataset for a fair evaluation of different methods.
4.1 Text-guided texturing of 3D shapes
In this section, we focus on text-guided texturing of 3D shapes that transform a bare mesh to a vertex-colored or textured mesh conforming to the target text prompt. Methods in this field could be generally divided into two types: optimization-based 3D generation and (iteratively) 2D texturing.
The optimization-based 3D generation aims at colorizing the base mesh through the CLIP or SDS loss, building on the advancements of large-scale pre-trained vision-language and image generation models such as CLIP (Radford et al. 2021) and diffusion models (Rombach et al. 2022; Zhang et al. 2023; Avrahami et al. 2023; Ho et al. 2020). Some approaches (Wang et al. 2022; Michel et al. 2022; Lei et al. 2022; Ma et al. 2023) endeavor to utilize the CLIP loss (CLIP-space similarity) as an optimization objective to predict the per-vertex geometric and appearance details of a bare mesh. Text2Mesh (Michel et al. 2022) optimizes per-vertex colors and small geometric displacements to modify a source mesh to adhere to the target text prompt. TANGO (Lei et al. 2022) focuses on generating a photo-realistic appearance of a given mesh by automatically predicting reflectance effects. X-Mesh (Ma et al. 2023) leverages textual semantic guidance to predict the color and position offsets of each vertex over the mesh for a high-quality vertex-colored mesh with fast convergence. Subsequently, alternative methods (Metzer et al. 2023; Chen et al. 2023c; Oh et al. 2023; Decatur et al. 2024) can optimize the texture map of a given mesh through back-propagating gradients, guided by the scores derived from the SDS loss (Poole et al. 2022). Given a shape, Latent-Paint (Latent-NeRF) (Metzer et al. 2023) represents a texture map in the latent space and directly optimizes it by back-propagating through the differentiable renderer on the textured mesh. ControlDreamer (Oh et al. 2023) utilizes the multi-view depth-based MV-ControlNet to align the texture with the mesh shape for a textured mesh, producing multi-view consistent textures with geometric fidelity. 3D Paintbrush (Decatur et al. 2024) can locally texture the given mesh by simultaneously producing a localization map to specify the edit region and a texture map to colorize the editable region.
In contrast, 2D texturing methods (Chen et al. 2023a; Richardson et al. 2023; Cao et al. 2023; Zeng et al. 2024) iteratively harness the capabilities of text-to-image diffusion models (Rombach et al. 2022; Zhang et al. 2023; Avrahami et al. 2023; Ho et al. 2020) to synthesize texture maps matching the target text prompt, supervised by the vanilla diffusion-based loss. These methods offer an alternative approach to painting satisfactory textures over 3D shapes. At their core, these methods iteratively utilize pre-trained depth-to-image models (Rombach et al. 2022; Zhang et al. 2023) to generate rendering images of the target textured mesh from different viewpoints, which are then projected back onto the texture maps of the 3D meshes. TEXTure (Richardson et al. 2023) and Text2Tex (Chen et al. 2023a) use the pre-trained depth-to-image diffusion model to gradually synthesize partial textures of a 3D shape from multiple viewpoints. TEXTure (Richardson et al. 2023) defines a dynamic partitioning of the rendered image into a trimap of three states, “keep”, “refine”, and “generate”, serving as guidance for the refined diffusion sampling process to progressively generate and update partial textures for the corresponding regions from different views. Text2Tex (Chen et al. 2023a) shares the dynamic partitioning scheme with TEXTure (Richardson et al. 2023), but it devises an automatic view sequence generation strategy to determine the next best view for updating the partial texture, saving the human effort of designing different orders of viewpoints for various geometries. To improve multi-view consistency, TexFusion (Cao et al. 2023) proposes to aggregate different denoising predictions on a shared latent texture map from various viewpoints during the denoising process, synthesizing the entire texture map. However, these methods still struggle to synthesize textures without the lighting bias inherited from 2D priors. Paint3D (Zeng et al. 2024) proposes a coarse-to-fine framework, where a coarse texture map is acquired from multi-view images produced by pre-trained 2D image diffusion models. Based on the coarse texture map, a high-quality texture map is generated with a diffusion model in UV space. GenesisTex (Gao et al. 2024) transforms the image diffusion model to texture space by texture space sampling and performs the sampling and denoising operations in the latent texture map for each viewpoint. During the sampling process, style consistency and dynamic alignment mechanisms are integrated to preserve multi-view consistency. Building on TEXTure (Richardson et al. 2023) and Text2Tex (Chen et al. 2023a), a texture refinement module is further introduced to fill in the blank regions and refine the texture maps with dedicated details. Different from Text2Tex (Chen et al. 2023a) and TEXTure (Richardson et al. 2023), which select one view at a time to update the partial texture, TexRO (Wu et al. 2024b) proposes a viewpoint selection strategy to produce an optimal set of viewpoints and generates an initial UV texture from these viewpoints. It then optimizes the initial UV texture with a multi-view and recursive optimization process at increasing resolutions, which bolsters the quality and multi-view consistency of the synthesized textures.
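The iterative depth-conditioned texturing loop shared by these methods can be sketched as follows; all helper callables (render_depth, depth_to_image, backproject_to_uv, next_best_view) are hypothetical placeholders rather than the APIs of TEXTure or Text2Tex.

def iterative_texturing(mesh, texture, viewpoints, prompt,
                        render_depth, depth_to_image, backproject_to_uv, next_best_view):
    remaining = list(viewpoints)
    while remaining:
        view = next_best_view(mesh, texture, remaining)   # e.g., the view covering the largest untextured area
        remaining.remove(view)
        depth = render_depth(mesh, view)
        image = depth_to_image(prompt, depth, texture)    # depth-conditioned diffusion synthesis/inpainting
        # Write newly generated pixels back into the UV map, keeping already-textured regions.
        texture = backproject_to_uv(mesh, texture, image, view)
    return texture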
4.1.1 Discussion
Datasets and Evaluation Metrics We quantitatively and visually compare methods on a subset of textured meshes from the Objaverse (Deitke et al. 2023) dataset. In Objaverse, each textured mesh is described by a brief caption. Quantitative comparisons follow the experimental setting of Text2Tex (Chen et al. 2023a), where 410 high-quality textured meshes across 255 categories are filtered for the experiments. The generated textures are assessed with commonly used metrics for image quality and diversity: the Frechet Inception Distance (FID) (Heusel et al. 2017) and the Kernel Inception Distance (KID\(\times {10}^{-3}\)) (Bińkowski et al. 2018). To compute these metrics, 20 images are rendered for each mesh at a resolution of \(512\times 512\). In addition, the visual experiments are conducted on a subset of textured meshes from the Objaverse (Deitke et al. 2023) dataset, following the experimental settings of Paint3D (Zeng et al. 2024). The selected subset contains 105,301 textured meshes, with 105,000 meshes used for training and 331 for evaluation.
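Assuming the rendered views are available as uint8 tensors, FID and KID can be computed, for example, with torchmetrics as sketched below; the subset size and feature dimensionality are illustrative defaults, not the exact settings used by the compared papers.

from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance

def texture_metrics(real_renders, fake_renders):
    """real_renders, fake_renders: uint8 tensors of shape (N, 3, H, W)."""
    fid = FrechetInceptionDistance(feature=2048)
    kid = KernelInceptionDistance(subset_size=50)
    fid.update(real_renders, real=True)
    fid.update(fake_renders, real=False)
    kid.update(real_renders, real=True)
    kid.update(fake_renders, real=False)
    kid_mean, _ = kid.compute()
    # Report KID in units of 10^-3 (raw value multiplied by 1e3), as in Table 5.
    return fid.compute().item(), kid_mean.item() * 1e3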
Comparisons and Discussions To quantitatively demonstrate the performance of text-guided texturing, the comparison results of typical methods, including Text2Mesh (Michel et al. 2022), Latent-Paint (Metzer et al. 2023), TEXTure (Richardson et al. 2023), Text2Tex (Chen et al. 2023a), and TexRO (Wu et al. 2024b), are reported in Table 5. The results are borrowed from Text2Tex (Chen et al. 2023a) and TexRO (Wu et al. 2024b). Since Text2Tex (Chen et al. 2023a) can automatically determine the next best view for updating the partial texture with better view consistency, it works better than TEXTure (Richardson et al. 2023), which updates the partial texture with fixed viewpoints. Moreover, TexRO (Wu et al. 2024b) generates textures closer to the ground truth and outperforms the suboptimal Text2Tex by a margin of 1.85 in FID (\(\downarrow \)) and 1.97 in KID (\(\downarrow \)), indicating more realistic synthesis quality.
We depict visual results of the state-of-the-art approaches, including Latent-Paint (Latent-NeRF) (Metzer et al. 2023), TEXTure (Richardson et al. 2023), Text2Tex (Chen et al. 2023a), and Paint3D (Zeng et al. 2024), in Fig. 13. TEXTure and Text2Tex work better than Latent-Paint and can synthesize clear and distinct textures with more details. However, TEXTure may generate textures with seams or splicing, while Text2Tex compromises in generating fine-grained textures for complex scenes. Paint3D outperforms the other methods and can generate illumination-free textures, facilitating relighting.
4.2 Text-guided style transfer
In this section, we explore text-guided 3D style transfer on neural radiance fields (NeRFs). Compared to image-reference-based stylization methods (Chiang et al. 2022; Fan et al. 2022), text-guided 3D stylization allows users to specify styles more conveniently, from abstract ones like a certain concept to very concrete ones like a famous painting or character. Exemplified by NeRF-Art (Wang et al. 2023b), InstructN2N (Haque et al. 2023), and its variants (Khalid et al. 2023; Song et al. 2023a; Mirzaei et al. 2023b; Dong and Wang 2024; Karim et al. 2023; Fang et al. 2023), these methods offer promising stylization results, like turning portraits into the styles of notable figures, generating paintings of a specific artistic style, and modifying the time of day and the season. NeRF-Art (Wang et al. 2023b) extends the CLIP loss with a global-local contrastive learning strategy to ensure the uniformity of the style over global structures and local details. InstructN2N (Haque et al. 2023) and its variants rely on the editing abilities of the text-guided image editing model InstructPix2Pix (Brooks et al. 2023), offering flexible and high-quality style transfer. More details are listed in Table 6.
4.2.1 Discussion
Comparisons and Discussions To evaluate the capacities of text-guided style transfer, experiments are performed on 2 cases from NeRF-Art (Wang et al. 2023b). In the first case, we want to convert the figure in the original image into Van Gogh’s style, while in the second one, we want to transform the original image into a fauvism style. We evaluate 7 leading-edge approaches, including NeRF-Art (Wang et al. 2023b), InstructN2N (Haque et al. 2023), DN2N (Fang et al. 2023), DreamEditor (Zhuang et al. 2023), ViCA-NeRF (Dong and Wang 2024), FreeEditor (Karim et al. 2023), and LatentEditor (Khalid et al. 2023). Figure 14 depicts the visual results of these methods with their corresponding text prompts. Additionally, we conduct a user study and ask the participants to vote on the results of these methods according to two evaluation criteria: the overall “Quality”, and the “Alignment” to the given text descriptions. The user study involves 51 participants voting on 2 scenes, resulting in over 100 responses. The edited scenes for all methods are rendered along the same camera trajectory to obtain 4 images, which are used for the user study. The comparison results are reported in Table 7.
Although style editing is somewhat subjective, we can still analyze the results. We take the first case, “converting the figure to Van Gogh’s style”, as an example. NeRF-Art (Wang et al. 2023b) and DreamEditor (Zhuang et al. 2023) follow the text description “Vincent Van Gogh” and modify the figure to look like “Van Gogh”, but change irrelevant regions, like the background and the T-shirt. Other methods utilize InstructPix2Pix (Brooks et al. 2023) to edit images, from which the edited 3D model is reconstructed; thus, their results bear a resemblance to one another. InstructN2N (Haque et al. 2023), DN2N (Fang et al. 2023), ViCA-NeRF (Dong and Wang 2024), and LatentEditor (Khalid et al. 2023) use clearer text instructions or descriptions to improve the editing quality. Generally, they still have the problem of changing unreferenced regions, but can produce some promising results. FreeEditor (Karim et al. 2023) and LatentEditor (Khalid et al. 2023) achieve more realistic results since they keep more areas unrelated to the target prompt unchanged while transferring the style. InstructN2N (Haque et al. 2023) and ViCA-NeRF (Dong and Wang 2024) achieve various style transformations, such as turning portraits into notable figures and changing the seasons. Moreover, ViCA-NeRF (Dong and Wang 2024) can produce more vivid colors and details. The results of the user study also indicate a preference for ViCA-NeRF with a substantial margin on both the quality evaluation (\(41.2\%\) votes) and the alignment evaluation (\(43.1\%\) votes), highlighting ViCA-NeRF’s proficiency in text-guided style transfer.
4.3 Text-guided local editing of scenes
As opposed to global editing, such as texture generation (Chen et al. 2023a; Cao et al. 2023) or style transfer (Wang et al. 2023b; Haque et al. 2023), text-guided local editing typically demands fine-grained localization, guaranteeing that the editing is constrained to the desired region while avoiding unnecessary modifications to non-editing regions. In fact, the editable regions can be determined automatically or manually.
For object-level local editing, Vox-E (Sella et al. 2023) and Shap-Editor (Chen et al. 2024b) automatically address the region localization using cross-attention maps derived from pre-trained text-to-image models (Li et al. 2023b; Ruiz et al. 2023). However, attention maps offer coarser localizations than the actual editing regions, leading to noisy changes. SKED (Mikaeili et al. 2023), FocalDreamer (Li et al. 2024a), and Progressive3D (Cheng et al. 2023) determine the editing regions in the form of 2D sketches on multi-view images, 3D ellipsoid focal regions, and 3D bounding boxes, respectively, which require additional human effort.
It is more challenging to edit a scene accurately since it has a higher degree of freedom in its spatial regions. Some efforts (Haque et al. 2023; Mirzaei et al. 2023b; Dong and Wang 2024) have been made towards object-centric scene editing that keeps the background unchanged. InstructN2N (Haque et al. 2023), Instruct3D-to-3D (Kamata et al. 2023), FreeEditor (Karim et al. 2023), and GaussCtrl (Wu et al. 2024a) rely on powerful 2D image editing diffusion models, like InstructPix2Pix (Li et al. 2023b) and ControlNet (Zhang et al. 2023), to implicitly localize the editing regions, so their local editing capacities are limited by the power of 2D image editing models. Subsequent methods (Mirzaei et al. 2023b; Song et al. 2023a; Dong and Wang 2024; Song et al. 2023b; Khalid et al. 2023; He et al. 2024) follow this pipeline but introduce additional constraints for dedicated local editing. RMNE (Mirzaei et al. 2023b) and LatentEditor (Khalid et al. 2023) identify the discrepancy between InstructPix2Pix (Li et al. 2023b) predictions with and without the instruction, which is then used to guide the generation of the edited images. Repaint-NeRF (Zhou et al. 2023) optimizes a feature field using text or patch prompts to extract the 2D mask, which separates the desired part to change. Blending-NeRF (Song et al. 2023b), ViCA-NeRF (Dong and Wang 2024), CustomizeNeRF (He et al. 2024), ED-NeRF (Park et al. 2023), GaussianEditor (Wang et al. 2024a), and GaussianEditor-SC (Chen et al. 2024a) utilize 2D segmentation maps to localize the regions of interest over 2D images or 3D representations. GaussianEditor employs Large Language Models (e.g., GPT-3.5) to automatically locate the text-specified region of interest, while the other methods assume it is provided through user effort. Instead of learning 2D or 3D regions of interest from the semantic similarities between the text prompt and images, some methods require user effort for flexible localization, resulting in more control during fine-grained editing of geometry and appearance. ViCA-NeRF (Dong and Wang 2024) requires user interaction to specify a single point. SIGNeRF (Dihlmann et al. 2024) enables local editing by user-defined 3D bounding boxes, ensuring flexible editing of any area of the scene.
4.3.1 Discussion
Comparisons and Discussions To evaluate the performance of text-guided local editing of scenes, experiments are conducted on the dataset provided by InstructN2N (Haque et al. 2023). We evaluate four typical methods for text-guided local editing of scenes: InstructN2N (Haque et al. 2023), RMNE (Mirzaei et al. 2023b), GaussianEditor (Wang et al. 2024a), and SIGNeRF (Dihlmann et al. 2024). The visual results are depicted in Fig. 15.
We take the case of “Turn the bear into a grizzly bear” as an example. As shown in Fig. 15, InstructN2N (Haque et al. 2023), GaussianEditor (Wang et al. 2024a), and RMNE (Mirzaei et al. 2023b) suffer from incomplete and blurry editing in areas like the face and the feet. These three methods adopt InstructPix2Pix (Brooks et al. 2023) to conduct 2D editing on images rendered from the source model according to the target text prompt; the edited multi-view images are then used to build a dataset from which the target model is reconstructed. Therefore, their local editing performance is limited by the editing weaknesses of InstructPix2Pix, such as multi-view inconsistency and weak local editing. In addition, the relevance or segmentation maps used in GaussianEditor (Wang et al. 2024a) and RMNE (Mirzaei et al. 2023b) are automatically obtained from the semantic relations between the target text prompt and the images, and are not accurate enough to locate the editing boundaries. SIGNeRF (Dihlmann et al. 2024) produces plausible results with more realistic and structured pelt textures and is superior in terms of localization precision and editing quality. Two factors likely account for this. First, it employs the image editing model ControlNet (Zhang et al. 2023), which is depth-conditioned for better image editing results. Second, it combines multi-view and multi-stage dataset updating strategies, enabling consistent editing.
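The edit-and-reconstruct loop shared by these InstructPix2Pix-based editors can be summarized with the following pseudocode-style sketch. The object methods (`sample_view`, `render`, `sample_rays`, `train_step`) and the `edit_2d` callback are placeholders for the renderer, dataset, 2D editor, and 3D optimization step; this is a generic outline of iterative dataset updating, not the exact implementation of any single method.

```python
def iterative_dataset_update(scene, dataset, text_prompt, edit_2d,
                             num_iters=3000, edit_every=10):
    """Periodically replace one training view with its 2D edit, then keep
    optimizing the 3D representation (NeRF or 3DGS) on the edited set."""
    for it in range(num_iters):
        if it % edit_every == 0:
            view = dataset.sample_view()
            rendered = scene.render(view.camera)        # current 3D state
            # The 2D editor is conditioned on the original capture, the
            # current rendering, and the text prompt (InstructPix2Pix-style).
            view.image = edit_2d(view.original_image, rendered, text_prompt)
        scene.train_step(dataset.sample_rays())         # standard 3D update
    return scene
```

Because each view is edited independently, any inconsistency among the 2D edits is averaged into blur during reconstruction, which explains the incomplete faces and feet observed above.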
In addition, the quantitative results of the compared methods are reported in Table 8. The experiment is conducted on the “Bear statue” and “Face” cases provided by InstructN2N (Haque et al. 2023). Results are borrowed from GaussCtrl (Wu et al. 2024a), EfficientN2N (Song et al. 2023a), LatentEditor (Khalid et al. 2023), and FreeEditor (Karim et al. 2023). The CLIP directional similarity (\(CLIP_{dir}\)) measures how well the change from the source renderings to the edited renderings agrees with the change from the source text to the target text, computed as the cosine similarity between the image-feature difference and the text-feature difference in CLIP space. Different from InstructN2N and ViCA-NeRF, which use InstructPix2Pix (Brooks et al. 2023), GaussCtrl with depth-conditioned ControlNet (Zhang et al. 2023) achieves the best results in terms of \(CLIP_{dir}\). The most likely reason is that depth-conditioned ControlNet keeps editing more consistent across views than InstructPix2Pix in both the “Bear statue” and “Face” cases.
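For reference, \(CLIP_{dir}\) for a single view can be computed as follows. The sketch below uses the Hugging Face CLIP interface purely to illustrate the metric; the model choice, image file names, and prompts are assumed placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_directional_similarity(src_img, edit_img, src_text, edit_text):
    """CLIP_dir for one view: cosine similarity between the editing
    direction in text space and the editing direction in image space."""
    with torch.no_grad():
        txt = processor(text=[src_text, edit_text], return_tensors="pt", padding=True)
        img = processor(images=[src_img, edit_img], return_tensors="pt")
        t = model.get_text_features(**txt)
        v = model.get_image_features(**img)
    d_text = t[1] - t[0]   # change described by the prompts
    d_img = v[1] - v[0]    # change observed in the renderings
    return torch.nn.functional.cosine_similarity(d_text, d_img, dim=0).item()

# Hypothetical usage on one rendered view before and after editing.
score = clip_directional_similarity(Image.open("bear_src.png"),
                                    Image.open("bear_edit.png"),
                                    "a photo of a bear statue",
                                    "a photo of a grizzly bear")
```

In practice the per-view scores are averaged over a fixed set of cameras so that all methods are compared on the same renderings.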
The table also reports the editing time of each method. Editing efficiency is affected by many factors, such as multi-view editing consistency, the 3D representation, and the complexity of the edited scene. InstructN2N-GS replaces the NeRF in InstructN2N (Haque et al. 2023) with 3D Gaussian Splatting. Compared with InstructN2N, InstructN2N-GS reduces the editing time from 90 minutes to 13.5 minutes, highlighting the rendering efficiency of 3D Gaussian Splatting. However, a faster 3D representation alone is not sufficient for better editing efficiency. LatentEditor (Khalid et al. 2023) embeds the real-world scene into a latent space and conducts NeRF editing in this latent space, reducing the editing time by up to 5-fold compared to InstructN2N (Haque et al. 2023). ViCA-NeRF (Dong and Wang 2024), FreeEditor (Karim et al. 2023), GaussCtrl (Wu et al. 2024a), and EfficientN2N (Song et al. 2023a) improve multi-view consistency, thus requiring less editing time. EfficientN2N and FreeEditor can edit one scene in five minutes, making 3D editing more practical.
4.4 Text-guided insertion editing
In this section, we are interested in text-guided insertion editing, which allows adding new structures to existing scenes. In contrast to text-guided texturing (Chen et al. 2023a; Cao et al. 2023), stylization (Wang et al. 2023b), and object removal (Yin et al. 2023; Wang et al. 2023), which rely on the assumption that the objects already exist in the scene, generating new structures in existing scenes is more challenging due to the lack of spatial control. Some methods (Sella et al. 2023; Mikaeili et al. 2023; Chen et al. 2024b; Cheng et al. 2023; Li et al. 2024a) explore adding accessories to existing objects under spatial priors indicated by the text prompt (e.g., a dog wearing a hat) or defined by user input (e.g., a user-defined 3D bounding box or a 2D sketch).
Vox-E (Sella et al. 2023) can add new structures to existing voxel grids optimized by the SDS loss. To keep the existing object globally unchanged, it proposes a volumetric regularization loss that encourages correlation between the density features of the source and edited voxel grids. For local editing, it further determines 3D editing masks from cross-attention maps associated with the target text prompt and merges the source and edited voxel grids based on these 3D masks for accurate insertion editing. SKED (Mikaeili et al. 2023), Progressive3D (Cheng et al. 2023), and FocalDreamer (Li et al. 2024a) require user interactions and then add objects limited to the user-defined regions. Although demanding user effort, these methods offer more control over the editing process, achieving more flexible and accurate insertion editing. SKED (Mikaeili et al. 2023) generates new objects that adhere to the offered sketches and the target text prompt while preserving the original density and radiance fields. Progressive3D (Cheng et al. 2023) decomposes the entire generation process for a complex text prompt into a series of local insertion operations, where multiple objects can be inserted incrementally in regions determined by user-provided 3D bounding boxes. FocalDreamer (Li et al. 2024a) proposes geometry union and dual-path rendering to merge a base shape with editable parts according to user-defined 3D ellipsoids and the target text prompt, enabling independent 3D parts to be generated and assembled into a complete object, tailored for convenient instance reuse and part-wise control.
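One simple way to realize such a volumetric regularizer is to penalize decorrelation between the source and edited density grids, so that geometry outside the edited region is discouraged from drifting. The sketch below is an illustrative approximation of this idea, not Vox-E's exact loss; the grid resolution is likewise an assumption.

```python
import torch

def volumetric_correlation_loss(sigma_src, sigma_edit, eps=1e-8):
    """Encourage the edited density grid to stay correlated with the source
    grid (an approximation of Vox-E's volumetric regularization idea)."""
    s = (sigma_src.flatten() - sigma_src.mean()) / (sigma_src.std() + eps)
    e = (sigma_edit.flatten() - sigma_edit.mean()) / (sigma_edit.std() + eps)
    return 1.0 - (s * e).mean()   # 0 when the two grids are perfectly correlated

# Toy usage on random 128^3 grids standing in for the voxel density fields.
src = torch.rand(128, 128, 128)
loss = volumetric_correlation_loss(src, src + 0.05 * torch.randn_like(src))
```

During optimization such a term is balanced against the SDS objective, which pushes the grid toward the target prompt while the regularizer anchors it to the source geometry.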
However, generating and inserting new objects into scenes remains an open problem due to the lack of spatial constraints. Under the guidance of 3D bounding boxes, LanguageFusion (Shum et al. 2024) proposes to synthesize multi-view images of the foreground object of interest into the given background using DreamBooth (Ruiz et al. 2023) in an inpainting manner. To ensure view consistency, it designs a pose-conditioned dataset updating strategy that progressively prioritizes radiance field training with camera views close to the already-trained views before propagating the training to the remaining views. Given a proxy object placed into the background NeRF, SIGNeRF (Dihlmann et al. 2024) adopts the conditioned image diffusion model ControlNet (Zhang et al. 2023) to iteratively generate a reference sheet of edited images, based on which the target NeRF is fine-tuned to match the target text prompt. GO-NeRF (Dai et al. 2024) and InseRF (Shahbazi et al. 2024) try to reduce users’ interaction burden in determining the desired 3D locations. GO-NeRF (Dai et al. 2024) automatically defines a 3D bounding box from three user-selected points on a rendered image. It then creates a new object in the 3D box and integrates it into an existing NeRF with well-designed losses, such as an inpainting SDS loss, an opacity loss, and a sparsity loss. InseRF (Shahbazi et al. 2024) focuses on generative object insertion in 3D scenes in a 3D-consistent way, requiring only a 2D bounding box on a reference view. It adds a 2D object in a reference view of the scene and then distills the 2D insertion to 3D using single-view object reconstruction [e.g., SyncDreamer (Liu et al. 2023)] and a 3D placement strategy. Thus, it is capable of scene-aware generation and insertion of objects in 3D scenes with only a textual description of the object and a single-view 2D bounding box as spatial guidance.
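To make this scene-level insertion pipeline concrete, the pseudocode-style sketch below outlines the three stages that InseRF-like methods follow. The callbacks (`inpaint_object_2d`, `reconstruct_object_3d`, `estimate_placement`) and the `scene` methods are hypothetical placeholders for a 2D inpainting model, a single-view-to-3D module such as SyncDreamer plus reconstruction, and a depth-based placement step; this is a sketch of the general recipe, not InseRF's actual interface.

```python
def insert_object_into_scene(scene, ref_camera, bbox_2d, text_prompt,
                             inpaint_object_2d, reconstruct_object_3d,
                             estimate_placement):
    """Sketch of generative object insertion driven by a text prompt and a
    single-view 2D bounding box: 2D insert -> lift to 3D -> place and fuse."""
    ref_view = scene.render(ref_camera)
    # Stage 1: text-guided 2D insertion restricted to the user-given box.
    edited_view = inpaint_object_2d(ref_view, bbox_2d, text_prompt)
    # Stage 2: lift the inserted 2D object to a standalone 3D representation.
    object_3d = reconstruct_object_3d(edited_view, bbox_2d)
    # Stage 3: estimate scale and translation from scene depth, then fuse the
    # object representation with the background scene representation.
    pose = estimate_placement(scene, ref_camera, bbox_2d, object_3d)
    return scene.compose(object_3d, pose)
```

The division of labor matters here: the 2D stage supplies appearance, the lifting stage supplies multi-view consistency, and the placement stage supplies the spatial grounding that purely SDS-driven insertion lacks.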
4.4.1 Discussion
Datasets and Evaluation Metrics For better illustration, we quantitatively and visually present the results of the compared methods. CLIP similarity (\(CLIP_{sim}\)) computes the alignment of the performed 3D edits with the text prompts. The CLIP directional similarity (\(CLIP_{dir}\)) evaluates the alignment between the images and text prompts before and after editing, reflecting whether the image editing direction is consistent with the direction of the text change.
Fig. 16 The visual results of text-guided insertion editing on objects. Results adapted from FocalDreamer (Li et al. 2024a)
Comparisons and Discussions For text-guided insertion editing on objects, we compare typical methods, including Vox-E (Sella et al. 2023) and FocalDreamer (Li et al. 2024a), and report their visual results with four text prompts in Fig. 16. Vox-E-Global supports global editing, preserving the global geometry but imposing no local constraints for fine-grained editing. As depicted in Fig. 16, both Vox-E-Global and Vox-E usually output noisy results due to the entangled geometry and appearance of NeRF. Since Vox-E optimizes over different views, it may suffer from multi-view inconsistency and incorrect editing on prompts with complex spatial constraints, e.g., a deer standing on two separate wooden skateboards. Vox-E-Global performs worse and fails to perform the desired editing while keeping the undesired regions unchanged, because of its limited capacity to localize the desired regions. FocalDreamer (Li et al. 2024a) can directly insert new parts in user-defined regions, enabling accurate local insertion with complex attributes.
The quantitative results of the compared methods are reported in Table 9. The results are borrowed from FocalDreamer (Li et al. 2024a). The \(CLIP_{sim}\) and \(CLIP_{dir}\) are calculated with rendered images from the same 100 views. With a user-defined focal region, FocalDreamer (Li et al. 2024a) can execute the desired editing without unnecessary changes, resulting in higher \(CLIP_{sim}\) and \(CLIP_{dir}\). Additionally, a user study is conducted with 65 participants. The user study requires participants to assess the different methods according to base shape preservation and prompt relevance, and to give a preference score ranging from 1 to 10. Compared to Vox-E and Vox-E-Global, FocalDreamer better preserves the base shape using a user-defined focal region while inserting new objects, thus achieving higher preference scores for base shape preservation. Further, its soft geometry union operator and style consistency regularization enable high-fidelity geometry and textures aligned with the text prompts, leading to higher prompt relevance scores. Despite its promising results, FocalDreamer needs user effort to define the initial editing regions, which may be inconvenient for users.
Fig. 17 The visual results of text-guided insertion editing on scenes. Images adapted from InseRF (Shahbazi et al. 2024)
For text-guided insertion editing on scenes, we offer a comparison among InstructN2N (Haque et al. 2023), its multi-view masked variant MV-InstructN2N, and InseRF (Shahbazi et al. 2024). These methods are evaluated on a subset of real indoor and outdoor scenes proposed in MipNeRF-360 (Barron et al. 2022) and InstructN2N (Haque et al. 2023). Figure 17 provides the visual comparison, where MV-InstructN2N follows the iterative dataset updating strategy of InstructN2N and refines it with multi-view masks for the target object region.
Compared to adding new objects to scenes, the original InstructN2N performs better at modifying existing objects. Moreover, it tends to change the scene globally without exact localization. As shown in the first row, InstructN2N offers noisy results with unnecessary changes (denoted by the yellow box) while failing to add a panettone on the tray (denoted by the red box). Thanks to the given multi-view masks, MV-InstructN2N can edit images in local regions but suffers from inconsistent image editing across different views, leading to editing failures at some viewpoints (denoted by the red box). With minimal user effort in determining a 2D bounding box in a single view, InseRF can insert new objects into existing scenes at arbitrary positions. However, the style consistency between the background and the inserted objects should be improved.
5 Open challenges and future work
Despite the promising achievements in text-guided 3D editing, challenges remain in areas such as multi-view consistency, controllability, generalization, and efficiency, and they are worth investigating in future work. In this section, we explore these open challenges and potential future directions.
Datasets Unlike paired text-image datasets, which can be captured easily, building 3D assets requires a great deal of time and professional skill; thus, collecting a large paired text-3D dataset is challenging. The variety of 3D editing styles, scenes, and user requirements makes this situation worse. ChangeIt3D (Achlioptas et al. 2023) introduces an extensive corpus, ShapeTalk, which contains over half a million natural language utterances describing shape differences, yet it is limited to geometry editing. Defining specific rules to collect a large-scale, high-quality 3D dataset for 3D editing is still needed and would facilitate data-driven 3D editing. Meanwhile, utilizing extensive text-2D data for text-guided 3D editing, which avoids the scarcity of text-3D data, has demonstrated great potential, but issues like view inconsistency require additional attention.
Reliable Evaluation Metrics There is a notable absence of reliable metrics for evaluating the quality of editing results. Besides comparing visual results, researchers often adopt evaluation metrics used in text-to-3D generation. For example, the widely used CLIP similarity can only evaluate how well the asset aligns with the input text, lacking evaluations of other aspects such as the multi-view consistency of the edited model. Metrics like the Fréchet Inception Distance (FID) and the Kernel Inception Distance (KID) can be applied to 3D editing to evaluate the quality and diversity of editing results, but they may not always reflect the editing requirements and preferences of users. In addition, these metrics cannot reflect how well the unreferenced regions are maintained, which is critical in 3D editing. User studies are generally performed to evaluate editing results from different perspectives: image quality from any viewpoint, 3D consistency, fidelity to the text instruction, and fidelity to the source 3D scene, but they are time-consuming and subject to user bias. Objectively quantifying the quality of generated 3D models is an important and not widely explored problem. Better metrics that objectively judge the results in terms of generation quality, fidelity to unedited regions, and alignment with the given conditions still need further exploration. Employing recently developed Large Language and Multimodal Models (Chowdhery et al. 2023; Achiam et al. 2023; OpenAI 2023) to evaluate editing results may be a promising direction. For instance, GPTEval3D (Wu et al. 2024c) utilizes GPT-4V (OpenAI 2023) to compare two 3D assets according to user-defined criteria.
Practicability Text-guided 3D editing reduces users’ burden during the editing process, but user interaction is still needed for more precise and controllable editing. These methods are not compatible with current graphics tools and thus require additional interactive tools. Furthermore, incorporating the recreated 3D assets into the traditional rasterization graphics pipeline remains a challenge, since most of these methods rely on implicit representations and render with volume rendering or neural rendering. Although some text-guided texturing methods (Chen et al. 2023a; Richardson et al. 2023; Zeng et al. 2024) can produce textured meshes with UV maps, most of them still fail to generate lighting-less textures.
Information Confidentiality and Security Built on various 3D representations, text-guided 3D generation and editing have shown great potential in generating and recreating 3D assets, which can be applied to digital media such as virtual reality, augmented reality, special effects, and games. Meanwhile, the information confidentiality and data security of these generated 3D assets have become increasingly important concerns. Some efforts are devoted to protecting information security from different aspects, such as information steganography (Zhu et al. 2021; Li et al. 2023a; Huang et al. 2024b) and copyright protection (Luo et al. 2023). However, this field is still under-explored.
More Potential Applications Text-guided 3D editing techniques make the editing of 3D assets more user-friendly and low-cost, allowing these techniques to find applications in many fields, such as 3D animation, educational tools, creative design, and VR game development. In 3D animation and film production, text prompts could be used to alter or create animated sequences, changing the actions, behaviors, or interactions of 3D models, and even creating special effects, bringing more stunning visuals to players and audiences. Creative designers can use 3D editing technologies to change the materials, styles, and shapes of 3D assets, and even create objects or scenes from their imagination, which makes creative design more engaging. In VR game development, 3D editing techniques can be applied to modify terrain, buildings, and vegetation, creating new game scenes and providing players with a rich game environment. These techniques can also be used to create personalized character models in games, including people and animals, making characters more distinctive. Moreover, 3D editing techniques have the potential to cooperate with professional software in many industries, such as education and healthcare. For example, text-guided 3D editing could work with educational software to create dynamic and interactive learning materials that adapt to textual inputs from educators or students.
6 Conclusion
We have comprehensively surveyed recent advancements in text-guided 3D editing from two perspectives: methodology and editing capacity. Our survey reviews over 50 methods and proposes a taxonomy according to editing strategies, optimization schemes, and 3D representations. We then explore their contributions to editing capacity, considering editing scale, type, granularity, and perspective. We further highlight four typical 3D editing tasks, namely text-guided texturing, style transfer, local editing of scenes, and insertion editing, with in-depth comparisons and discussions. To conclude, we discuss the open challenges in text-guided 3D editing. We hope this survey offers a systematic summary that can inspire subsequent work by interested readers.
Data availability
The datasets generated or analyzed during the current study are available from the corresponding author on reasonable request.
References
Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL, Almeida D, Altenschmidt J, Altman S, Anadkat S et al (2023) GPT-4 technical report. arXiv preprint. arXiv:2303.08774
Achlioptas P, Huang I, Sung M, Tulyakov S, Guibas L (2023) Shapetalk: A language dataset and framework for 3D shape edits and deformations. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 12685–12694
Aliev K-A, Sevastopolsky A, Kolos M, Ulyanov D, Lempitsky V (2020) Neural point-based graphics. In: Proceeding of the 16th European conference on computer vision. Springer, pp 696–712
Atzmon M, Maron H, Lipman Y (2018) Point convolutional neural networks by extension operators. ACM Trans Graph 37(4):71
Avrahami O, Fried O, Lischinski D (2023) Blended latent diffusion. ACM Trans Graph (TOG) 42(4):1–11
Barron JT, Mildenhall B, Verbin D, Srinivasan PP, Hedman P (2022) Mip-NeRF 360: unbounded anti-aliased neural radiance fields. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 5470–5479
Betker J, Goh G, Jing L, Brooks T, Wang J, Li L, Ouyang L, Zhuang J, Lee J, Guo Y (2023) Improving image generation with better captions. Computer Science. https://www.cdn.openai.com/papers/dall-e-3.pdf
Bińkowski M, Sutherland DJ, Arbel M, Gretton A (2018) Demystifying MMD GANs. arXiv preprint. arXiv:1801.01401
Brooks T, Holynski A, Efros AA (2023) InstructPix2Pix: learning to follow image editing instructions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 18392–18402
Bui G, Le T, Morago B, Duan Y (2018) Point-based rendering enhancement via deep learning. Vis Comput 34:829–841
Cao T, Kreis K, Fidler S, Sharp N, Yin K (2023) TexFusion: synthesizing 3D textures with text-guided image diffusion models. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 4169–4181
Chan ER, Lin CZ, Chan MA, Nagano K, Pan B, De Mello S, Gallo O, Guibas LJ, Tremblay J, Khamis S (2022) Efficient geometry-aware 3d generative adversarial networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16123–16133
Chen A, Xu Z, Geiger A, Yu J, Su H (2022) TensoRF: tensorial radiance fields. In: European conference on computer vision. Springer, pp 333–350
Chen DZ, Siddiqui Y, Lee H-Y, Tulyakov S, Nießner M (2023a) Text2Tex: text-driven texture synthesis via diffusion models. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 18558–18568
Chen Y, Chen A, Chen S, Yi R (2023b) Plasticine3D: non-rigid 3D editting with text guidance. arXiv preprint. arXiv:2312.10111
Chen R, Chen Y, Jiao N, Jia K (2023c) Fantasia3D: disentangling geometry and appearance for high-quality text-to-3d content creation. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 22246–22256
Chen Y, Shao G, Shum KC, Hua B-S, Yeung S-K (2023d) Advances in 3D neural stylization: a survey. arXiv preprint. arXiv:2311.18328
Chen Y, Chen Z, Zhang C, Wang F, Yang X, Wang Y, Cai Z, Yang L, Liu H, Lin G (2024a) GaussianEditor: swift and controllable 3D editing with Gaussian splatting. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 21476–21485
Chen M, Xie J, Laina I, Vedaldi A (2024b) Shap-Editor: instruction-guided latent 3D editing in seconds. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 26456–26466
Cheng X, Yang T, Wang J, Li Y, Zhang L, Zhang J, Yuan L (2023) Progressive3D: progressively local editing for text-to-3D content creation with complex semantic prompts. arXiv preprint. arXiv:2310.11784
Chiang P-Z, Tsai M-S, Tseng H-Y, Lai W-S, Chiu W-C (2022) Stylizing 3D scene via implicit representation and hypernetwork. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 1475–1484
Chowdhery A, Narang S, Devlin J, Bosma M, Mishra G, Roberts A, Barham P, Chung HW, Sutton C, Gehrmann S (2023) PALM: scaling language modeling with pathways. J Mach Learn Res 24(240):1–113
Choy CB, Xu D, Gwak J, Chen K, Savarese S (2016) 3d-r2n2: a unified approach for single and multi-view 3D object reconstruction. In: Computer Vision–ECCV 2016: 14th European conference, Amsterdam, The Netherlands, 11–14 October 2016, Proceedings, Part VIII 14. Springer, pp 628–644
Cui C, Ma Y, Cao X, Ye W, Wang Z (2024) Receive, reason, and react: drive as you say, with large language models in autonomous vehicles. IEEE Intell Transp Syst Mag 16(4):81–94
Curless B, Levoy M (1996) A volumetric method for building complex models from range images. In: Proceedings of the 23rd annual conference on computer graphics and interactive techniques, pp 303–312
Dai P, Tan F, Yu X, Zhang Y, Qi X (2024) GO-NeRF: generating virtual objects in neural radiance fields. arXiv preprint. arXiv:2401.05750
Decatur D, Lang I, Aberman K, Hanocka R (2024) 3d paintbrush: Local stylization of 3d shapes with cascaded score distillation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 4473–4483
Deitke M, Schwenk D, Salvador J, Weihs L, Michel O, VanderBilt E, Schmidt L, Ehsani K, Kembhavi A, Farhadi A (2023) Objaverse: a universe of annotated 3D objects. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 13142–13153
Dihlmann J-N, Engelhardt A, Lensch H (2024) SIGNeRF: scene integrated generation for neural radiance fields. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 6679–6688
Dong J, Wang Y-X (2024) ViCA-NeRF: view-consistency-aware 3D editing of neural radiance fields. In: NIPS '23: Proceedings of the 37th international conference on neural information processing systems, vol 30, pp 61466–61477
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S et al (2020) An image is worth 16×16 words: transformers for image recognition at scale. arXiv preprint. arXiv:2010.11929
Fan Z, Jiang Y, Wang P, Gong X, Xu D, Wang Z (2022) Unified implicit neural stylization. In: European conference on computer vision. Springer, pp 636–654
Fang S, Wang Y, Yang Y, Tsai Y-H, Ding W, Zhou S, Yang M-H (2023) Editing 3D scenes via text prompts without retraining. arXiv e-prints. arXiv: 2309.04917
Foo LG, Rahmani H, Liu J (2023) AI-generated content (AIGC) for various data modalities: a survey. arXiv preprint. arXiv:2308.14177
Fridovich-Keil S, Yu A, Tancik M, Chen Q, Recht B, Kanazawa A (2022) Plenoxels: radiance fields without neural networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 5501–5510
Gafni O, Polyak A, Ashual O, Sheynin S, Parikh D, Taigman Y (2022) Make-a-scene: scene-based text-to-image generation with human priors. In: European conference on computer vision. Springer, pp 89–106
Gal R, Patashnik O, Maron H, Bermano AH, Chechik G, Cohen-Or D (2022a) StyleGAN-NADA: clip-guided domain adaptation of image generators. ACM Trans Graph (TOG) 41(4):1–13
Gal R, Alaluf Y, Atzmon Y, Patashnik O, Bermano AH, Chechik G, Cohen-Or D (2022b) An image is worth one word: personalizing text-to-image generation using textual inversion. arXiv preprint. arXiv:2208.01618
Gao J, Chen W, Xiang T, Jacobson A, McGuire M, Fidler S (2020) Learning deformable tetrahedral meshes for 3D reconstruction. Adv Neural Inf Process Syst 33:9936–9947
Gao W, Aigerman N, Groueix T, Kim V, Hanocka R (2023) Textdeformer: Geometry manipulation using text guidance. In: ACM SIGGRAPH 2023 conference proceedings, pp 1–11
Gao C, Jiang B, Li X, Zhang Y, Yu Q (2024) Genesistex: adapting image denoising diffusion to texture space. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 4620–4629
Ge S, Park T, Zhu J-Y, Huang J-B (2023) Expressive text-to-image generation with rich text. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 7545–7556
Häne C, Tulsiani S, Malik J (2017) Hierarchical surface prediction for 3D object reconstruction. In: 2017 International conference on 3D vision (3DV). IEEE, pp 412–420
Hanocka R, Hertz A, Fish N, Giryes R, Fleishman S, Cohen-Or D (2019) MeshCNN: a network with an edge. ACM Trans Graph (ToG) 38(4):1–12
Haque A, Tancik M, Efros AA, Holynski A, Kanazawa A (2023) Instruct-NeRF2NeRF: editing 3D scenes with instructions. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 19740–19750
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
He R, Huang S, Nie X, Hui T, Liu L, Dai J, Han J, Li G, Liu S (2024) Customize your NeRF: adaptive source driven 3D scene editing via local-global iterative training. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 6966–6975
Hertz A, Mokady R, Tenenbaum J, Aberman K, Pritch Y, Cohen-Or D (2022) Prompt-to-prompt image editing with cross attention control. arXiv preprint. arXiv:2208.01626
Heusel M, Ramsauer H, Unterthiner T, Nessler B, Hochreiter S (2017) GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: NIPS'17: proceedings of the 31st international conference on neural information processing systems, vol 30, pp 6629–6640
Ho J, Salimans T (2022) Classifier-free diffusion guidance. arXiv preprint. arXiv:2207.12598
Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inf Process Syst 33:6840–6851
Hoffman J, Hu T, Kanyuk P, Marshall S, Nguyen G, Schroers H, Witting P (2023) Creating elemental characters: from sparks to fire. In: ACM SIGGRAPH 2023 Talks, pp 1–2
Hu EJ, Shen Y, Wallis P, Allen-Zhu Z, Li Y, Wang S, Wang L, Chen W (2021) LORA: low-rank adaptation of large language models. arXiv preprint. arXiv:2106.09685
Huang Y, Huang J, Liu Y, Yan M, Lv J, Liu J, Xiong W, Zhang H, Chen S, Cao L (2024a) Diffusion model-based image editing: a survey. arXiv preprint. arXiv:2402.17525
Huang Q, Liao Y, Hao Y, Zhou P (2024b) Noise-NeRF: hide information in neural radiance fields using trainable noise. arXiv preprint. arXiv:2401.01216
Hu T, Xu X, Liu S, Jia J (2023) Point2Pix: photo-realistic point cloud rendering via neural radiance fields. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 8349–8358
Hyung J, Hwang S, Kim D, Lee H, Choo J (2023) Local 3D editing via 3D distillation of clip knowledge. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 12674–12684
Jun H, Nichol A (2023) Shap-E: generating conditional 3D implicit functions. arXiv preprint. arXiv:2305.02463
Kamata H, Sakuma Y, Hayakawa A, Ishii M, Narihira T (2023) Instruct 3D-to-3D: Text instruction guided 3D-to-3D conversion. arXiv preprint. arXiv:2303.15780
Karim N, Khalid U, Iqbal H, Hua J, Chen C (2023) Free-editor: Zero-shot text-driven 3D scene editing. arXiv preprint. arXiv:2312.13663
Karras T, Laine S, Aila T (2019) A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 4401–4410
Kato H, Ushiku Y, Harada T (2018) Neural 3D mesh renderer. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3907–3916
Kawar B, Zada S, Lang O, Tov O, Chang H, Dekel T, Mosseri I, Irani M (2023) Imagic: text-based real image editing with diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 6007–6017
Kerbl B, Kopanas G, Leimkühler T, Drettakis G (2023) 3D Gaussian splatting for real-time radiance field rendering. ACM Trans Graph (TOG) 42(4):139:1–139:14
Khalid U, Iqbal H, Karim N, Hua J, Chen C (2023) Latenteditor: Text driven local editing of 3D scenes. arXiv preprint. arXiv:2312.09313
Kirillov A, Mintun E, Ravi N, Mao H, Rolland C, Gustafson L, Xiao T, Whitehead S, Berg AC, Lo W-Y (2023) Segment anything. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 4015–4026
Kumari N, Zhang B, Zhang R, Shechtman E, Zhu J-Y (2023) Multi-concept customization of text-to-image diffusion. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 1931–1941
Lassner C, Zollhofer M (2021) Pulsar: efficient sphere-based neural rendering. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 1440–1449
Lei J, Zhang Y, Jia K (2022) TANGO: text-driven photorealistic and robust 3d stylization via lighting decomposition. Adv Neural Inf Process Syst 35:30923–30936
Li C, Feng BY, Fan Z, Pan P, Wang Z (2023a) StegaNeRF: embedding invisible information within neural radiance fields. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 441–453
Li J, Liu S, Liu Z, Wang Y, Zheng K, Xu J, Li J, Zhu J (2023b) InstructPix2NeRF: instructed 3d portrait editing from a single image. arXiv preprint. arXiv:2311.02826
Li Y, Liu H, Wu Q, Mu F, Yang J, Gao J, Li C, Lee YJ (2023c) GLIGEN: open-set grounded text-to-image generation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 22511–22521
Li C, Zhang C, Waghwase A, Lee L-H, Rameau F, Yang Y, Bae S-H, Hong CS (2023d) Generative AI meets 3D: a survey on text-to-3D in the AIGC era. arXiv preprint. arXiv:2305.06131
Li Y, Dou Y, Shi Y, Lei Y, Chen X, Zhang Y, Zhou P, Ni B (2024a) FocalDreamer: text-driven 3D editing via focal-fusion assembly. In: Proceedings of the AAAI conference on artificial intelligence, vol 38, pp 3279–3287
Li X, Zhang Q, Kang D, Cheng W, Gao, Y, Zhang J, Liang Z, Liao J, Cao Y-P, Shan Y (2024b) Advances in 3D generation: a survey. arXiv preprint. arXiv:2401.17807
Lin C-H, Gao J, Tang L, Takikawa T, Zeng X, Huang X, Kreis K, Fidler S, Liu M-Y, Lin T-Y (2023) MAGIC3D: high-resolution text-to-3d content creation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 300–309
Liu S, Li T, Chen W, Li H (2019) Soft rasterizer: a differentiable renderer for image-based 3d reasoning. In: Proceedings of the IEEE/CVF International conference on computer vision, pp 7708–7717
Liu Y, Lin C, Zeng Z, Long X, Liu L, Komura T, Wang W (2023) SyncDreamer: generating multiview-consistent images from a single-view image. arXiv preprint. arXiv:2309.03453
Loper MM, Black MJ (2014) OpenDR: an approximate differentiable renderer. In: Computer vision—ECCV 2014: 13th European conference, Zurich, Switzerland, 6–12 September 2014, Proceedings, Part VII 13. Springer, pp 154–169
Lorensen WE, Cline HE (1998) Marching cubes: a high resolution 3D surface construction algorithm. In: Seminal graphics: pioneering efforts that shaped the field, pp 347–353
Lüddecke T, Ecker A (2022) Image segmentation using text and image prompts. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 7086–7096
Lugmayr A, Danelljan M, Romero A, Yu F, Timofte R, Van Gool L (2022) RePaint: inpainting using denoising diffusion probabilistic models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 11461–11471
Luo Z, Guo Q, Cheung KC, See S, Wan R (2023) CopyRNeRF: protecting the copyright of neural radiance fields. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 22401–22411
Ma Y, Zhang X, Sun X, Ji J, Wang H, Jiang G, Zhuang W, Ji R (2023) X-MESH: towards fast and accurate text-driven 3D stylization via dynamic textual guidance. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 2749–2760
Manukyan H, Sargsyan A, Atanyan B, Wang Z, Navasardyan S, Shi H (2023) HD-Painter: high-resolution and prompt-faithful text-guided image inpainting with diffusion models. arXiv preprint. arXiv:2312.14091
Maturana D, Scherer S (2015) VoxNet: a 3D convolutional neural network for real-time object recognition. In: 2015 IEEE/RSJ international conference on intelligent robots and systems (IROS). IEEE, pp 922–928
Memery S, Cedron O, Subr K (2023) Generating parametric BRDFs from natural language descriptions. In: Computer graphics forum, vol 42. Wiley Online Library, p 14980
Meng C, Rombach R, Gao R, Kingma D, Ermon S, Ho J, Salimans T (2023) On distillation of guided diffusion models. In: Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pp 14297–14306
Metzer G, Richardson E, Patashnik O, Giryes R, Cohen-Or D (2023) Latent-Nerf for shape-guided generation of 3D shapes and textures. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 12663–12673
Michel O, Bar-On R, Liu R, Benaim S, Hanocka R (2022) Text2mesh: text-driven neural stylization for meshes. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 13492–13502
Mikaeili A, Perel O, Safaee M, Cohen-Or D, Mahdavi-Amiri A (2023) SKED: sketch-guided text-based 3d editing. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 14607–14619
Mildenhall B, Srinivasan PP, Tancik M, Barron JT, Ramamoorthi R, Ng R (2021) NERF: representing scenes as neural radiance fields for view synthesis. Commun ACM 65(1):99–106
Mirzaei A, Aumentado-Armstrong T, Brubaker MA, Kelly J, Levinshtein, A, Derpanis KG, Gilitschenski I (2023a) Reference-guided controllable inpainting of neural radiance fields. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 17815–17825
Mirzaei A, Aumentado-Armstrong T, Brubaker MA, Kelly J, Levinshtein A, Derpanis KG, Gilitschenski I (2023b) Watch your steps: local image and scene editing by text instructions. arXiv preprint. arXiv:2308.08947
Mokady R, Hertz A, Aberman K, Pritch Y, Cohen-Or D (2023) Null-text inversion for editing real images using guided diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 6038–6047
Mou C, Wang X, Xie L, Wu Y, Zhang J, Qi Z, Shan Y (2024) T2I-Adapter: learning adapters to dig out more controllable ability for text-to-image diffusion models. In: Proceedings of the AAAI conference on artificial intelligence, vol 38, pp 4296–4304
Müller T, Evans A, Schied C, Keller A (2022) Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans Graph (ToG) 41(4):1–15
Munkberg J, Hasselgren J, Shen T, Gao J, Chen W, Evans A, Müller T, Fidler S (2022) Extracting triangular 3D models, materials, and lighting from images. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 8280–8290
Nalbach O, Arabadzhiyska E, Mehta D, Seidel H-P, Ritschel T (2017) Deep shading: convolutional neural networks for screen space shading. In: Computer graphics forum, vol 36. Wiley Online Library, pp 65–78
Newcombe RA, Izadi S, Hilliges O, Molyneaux D, Kim D, Davison AJ, Kohi P, Shotton J, Hodges S, Fitzgibbon A (2011) Kinectfusion: real-time dense surface mapping and tracking. In: 2011 10th IEEE international symposium on mixed and augmented reality. IEEE, pp 127–136
Nichol AQ, Dhariwal P, Ramesh A, Shyam P, Mishkin P, Mcgrew B, Sutskever I, Chen M (2022) Glide: towards photorealistic image generation and editing with text-guided diffusion models. In: International conference on machine learning. PMLR, pp 16784–16804
Oh Y, Choi J, Kim Y, Park M, Shin C, Yoon S (2023) Controldreamer: Stylized 3D generation with multi-view controlnet. arXiv preprint. arXiv:2312.01129
OpenAI (2023) GPT-4V(ision) system card. OpenAI
Oppenlaender J (2022) The creativity of text-to-image generation. In: Proceedings of the 25th international academic mindtrek conference, pp 192–202
Palandra F, Sanchietti A, Baieri D, Rodolà E (2024) GSEDIT: efficient text-guided editing of 3D objects via Gaussian splatting. arXiv preprint. arXiv:2403.05154
Park JJ, Florence P, Straub J, Newcombe R, Lovegrove S (2019) DEEPSDF: learning continuous signed distance functions for shape representation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 165–174
Park J, Kwon G, Ye JC (2023) ED-NERF: efficient text-guided editing of 3D scene using latent space nerf. arXiv preprint. arXiv:2310.02712
Patashnik O, Wu Z, Shechtman E, Cohen-Or D, Lischinski D (2021) STYLECLIP: text-driven manipulation of stylegan imagery. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 2085–2094
Pfister H, Zwicker M, Van Baar J, Gross M (2000) SURFELS: surface elements as rendering primitives. In: Proceedings of the 27th annual conference on computer graphics and interactive techniques, pp 335–342
Poole B, Jain A, Barron JT, Mildenhall B (2022) DREAMFUSION: text-to-3d using 2D diffusion. arXiv preprint. arXiv:2209.14988
Qi CR, Su H, Mo K, Guibas LJ (2017a) PointNet: deep learning on point sets for 3D classification and segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 652–660
Qi CR, Yi L, Su H, Guibas LJ (2017b) PointNet++: deep hierarchical feature learning on point sets in a metric space. In: NIPS'17: Proceedings of the 31st international conference on neural information processing systems, vol 30, pp 5105–5114
Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J (2021) Learning transferable visual models from natural language supervision. In: International conference on machine learning. PMLR, pp 8748–8763
Raj A, Kaza S, Poole B, Niemeyer M, Ruiz N, Mildenhall B, Zada S, Aberman K, Rubinstein M, Barron J (2023) DreamBooth3D: subject-driven text-to-3d generation. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 2349–2359
Rakhimov R, Ardelean A-T, Lempitsky V, Burnaev E (2022) Npbg++: Accelerating neural point-based graphics. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 15969–15979
Ramesh A, Dhariwal P, Nichol A, Chu C, Chen M (2022) Hierarchical text-conditional image generation with CLIP latents. arXiv preprint. arXiv:2204.06125
Reed S, Akata Z, Yan X, Logeswaran L, Schiele B, Lee H (2016) Generative adversarial text to image synthesis. In: International conference on machine learning. PMLR, pp 1060–1069
Ren J, Pan L, Tang J, Zhang C, Cao A, Zeng G, Liu Z (2023) DreamGaussian4D: generative 4D gaussian splatting. arXiv preprint. arXiv:2312.17142
Ren T, Liu S, Zeng A, Lin J, Li K, Cao H, Chen J, Huang X, Chen Y, Yan F et al (2024) Grounded SAM: assembling open-world models for diverse visual tasks. arXiv preprint. arXiv:2401.14159
Richardson E, Metzer G, Alaluf Y, Giryes R, Cohen-Or D (2023) Texture: text-guided texturing of 3D shapes. In: ACM SIGGRAPH 2023 conference proceedings, pp 1–11
Riegler G, Osman Ulusoy A, Geiger A (2017) OctNet: learning deep 3D representations at high resolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3577–3586
Rombach R, Blattmann A, Lorenz D, Esser P, Ommer B (2022) High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10684–10695
Rückert D, Franke L, Stamminger M (2022) ADOP: approximate differentiable one-pixel point rendering. ACM Trans Graph (ToG) 41(4):1–14
Ruiz N, Li Y, Jampani V, Pritch Y, Rubinstein M, Aberman K (2023) Dreambooth: fine tuning text-to-image diffusion models for subject-driven generation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 22500–22510
Saharia C, Chan W, Saxena S, Li L, Whang J, Denton EL, Ghasemipour K, Gontijo Lopes R, Karagol Ayan B, Salimans T (2022) Photorealistic text-to-image diffusion models with deep language understanding. Adv Neural Inf Process Syst 35:36479–36494
Sanghi A, Chu H, Lambourne JG, Wang Y, Cheng C-Y, Fumero M, Malekshan KR (2022) Clip-forge: towards zero-shot text-to-shape generation. In: Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pp 18603–18613
Sella E, Fiebelman G, Hedman P, Averbuch-Elor H (2023) VOX-E: text-guided voxel editing of 3d objects. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 430–440
Shahbazi M, Claessens L, Niemeyer M, Collins E, Tonioni A, Van Gool L, Tombari F (2024) INSERF: text-driven generative object insertion in neural 3D scenes. arXiv preprint. arXiv:2401.05335
Shi S, Guo C, Jiang L, Wang Z, Shi J, Wang X, Li H (2020) PV-RCNN: point-voxel feature set abstraction for 3D object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10529–10538
Shi Z, Peng S, Xu Y, Geiger A, Liao Y, Shen Y (2022) Deep generative models on 3D representations: a survey. arXiv preprint. arXiv:2210.15663
Shi Y, Wang P, Ye J, Long M, Li K, Yang X (2023) MVDREAM: multi-view diffusion for 3D generation. arXiv preprint. arXiv:2308.16512
Shirman LA, Sequin CH (1987) Local surface interpolation with Bézier patches. Computer Aid Geom Des 4(4):279–295
Shum KC, Kim J, Hua, B-S, Nguyen DT, Yeung S-K (2024) Language-driven object fusion into neural radiance fields with pose-conditioned dataset updates. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 5176–5187
Song L, Cao L, Gu J, Jiang Y, Yuan J, Tang H (2023a) Efficient-NeRF2NeRF: streamlining text-driven 3D editing with multiview correspondence-enhanced diffusion models. arXiv preprint. arXiv:2312.08563
Song H, Choi S, Do H, Lee C, Kim T (2023b) Blending-NeRF: text-driven localized editing in neural radiance fields. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 14383–14393
Stutz D, Geiger A (2018) Learning 3D shape completion from laser scan data with weak supervision. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1955–1964
Sun C, Sun M, Chen H-T (2022) Direct voxel grid optimization: super-fast convergence for radiance fields reconstruction. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 5459–5469
Suvorov R, Logacheva E, Mashikhin A, Remizova A, Ashukha A, Silvestrov A, Kong N, Goka H, Park K, Lempitsky V (2022) Resolution-robust large mask inpainting with Fourier convolutions. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 2149–2159
Tang J, Ren J, Zhou H, Liu Z, Zeng G (2023) DreamGaussian: generative gaussian splatting for efficient 3D content creation. arXiv preprint. arXiv:2309.16653
Taniguchi D (2019) AR-Net: immersive augmented reality with real-time neural style transfer. In: ACM SIGGRAPH 2019 virtual, augmented, and mixed reality, pp 1–1
Tatarchenko M, Dosovitskiy A, Brox T (2017) Octree generating networks: efficient convolutional architectures for high-resolution 3D outputs. In: Proceedings of the IEEE international conference on computer vision, pp 2088–2096
Thies J, Zollhöfer M, Nießner M (2019) Deferred neural rendering: image synthesis using neural textures. ACM Trans Graph (TOG) 38(4):1–12
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems, vol 30
Wang N, Zhang Y, Li Z, Fu Y, Liu W, Jiang Y-G (2018) Pixel2Mesh: generating 3D mesh models from single RGB images. In: Proceedings of the European conference on computer vision, pp 52–67
Wang Y, Sun Y, Liu Z, Sarma SE, Bronstein MM, Solomon JM (2019) Dynamic graph CNN for learning on point clouds. ACM Trans Graph (TOG) 38(5):1–12
Wang C, Chai M, He M, Chen D, Liao J (2022) Clip-Nerf: text-and-image driven manipulation of neural radiance fields. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3835–3844
Wang Z, Li M, Chen C (2023a) LucidDreaming: controllable object-centric 3D generation. arXiv preprint. arXiv:2312.00588
Wang C, Jiang R, Chai M, He M, Chen D, Liao J (2023b) NeRF-Art: text-driven neural radiance fields stylization. IEEE Trans Vis Comput Graph 30(8):4983–4996
Wang D, Zhang T, Abboud A, Süsstrunk S (2023c) InpaintNerf360: text-guided 3D inpainting on unbounded neural radiance fields. arXiv preprint. arXiv:2305.15094
Wang J, Fang J, Zhang X, Xie L, Tian Q (2024a) GaussianEditor: editing 3D Gaussians delicately with text instructions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 20902–20911
Wang Y, Yi X, Wu Z, Zhao N, Chen L, Zhang H (2024b) View-consistent 3D editing with gaussian splatting. arXiv preprint. arXiv:2403.11868
Wen C, Zhang Y, Li Z, Fu Y (2019) Pixel2Mesh++: multi-view 3d mesh generation via deformation. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 1042–1051
Wu Z, Song S, Khosla A, Yu F, Zhang L, Tang, X, Xiao J (2015) 3D ShapeNets: a deep representation for volumetric shapes. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1912–1920
Wu J, Zhang C, Xue T, Freeman B, Tenenbaum J (2016) Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In: NIPS'16: Proceedings of the 30th international conference on neural information processing systems, vol 29, pp 82–90
Wu W, Qi Z, Fuxin L (2019) PointConv: deep convolutional networks on 3D point clouds. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 9621–9630
Wu J, Bian J-W, Li X, Wang G, Reid I, Torr P, Prisacariu VA (2024a) GaussCtrl: multi-view consistent text-driven 3D Gaussian splatting editing. arXiv preprint. arXiv:2403.08733
Wu J, Liu X, Wu C, Gao X, Liu J, Liu X, Zhao C, Feng H, Ding E, Wang J (2024b) TEXRO: generating delicate textures of 3D models by recursive optimization. arXiv preprint. arXiv:2403.15009
Wu T, Yang G, Li Z, Zhang K, Liu Z, Guibas L, Lin D, Wetzstein G (2024c) GPT-4V(ision) is a human-aligned evaluator for text-to-3D generation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 22227–22238
Wu G, Yi T, Fang J, Xie L, Zhang X, Wei W, Liu W, Tian Q, Wang X (2024d) 4D Gaussian splatting for real-time dynamic scene rendering. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 20310–20320
Xiao G, Yin T, Freeman WT, Durand F, Han S (2023) FastComposer: tuning-free multi-subject image generation with localized attention. arXiv preprint. arXiv:2305.10431
Xie S, Zhang Z, Lin Z, Hinz T, Zhang K (2023) SmartBrush: text and shape guided object inpainting with diffusion model. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 22428–22437
Xu T, Zhang P, Huang Q, Zhang H, Gan Z, Huang X, He X (2018) AttnGAN: fine-grained text to image generation with attentional generative adversarial networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1316–1324
Xu Q, Xu Z, Philip J, Bi S, Shu Z, Sunkavalli K, Neumann U (2022) Point-Nerf: point-based neural radiance fields. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 5438–5448
Xu S, Huang Y, Pan J, Ma Z, Chai J (2023) Inversion-free image editing with natural language. arXiv preprint. arXiv:2312.04965
Yang B, Bao C, Zeng J, Bao H, Zhang Y, Cui Z, Zhang G (2022) Neumesh: learning disentangled neural mesh-based implicit field for geometry and texture editing. In: European conference on computer vision. Springer, pp 597–614
Yang Z, Yang H, Pan Z, Zhu X, Zhang L (2023) Real-time photorealistic dynamic scene representation and rendering with 4D Gaussian splatting. arXiv preprint. arXiv:2310.10642
Yifan W, Serena F, Wu S, Öztireli C, Sorkine-Hornung O (2019) Differentiable surface splatting for point-based geometry processing. ACM Trans Graph (TOG) 38(6):1–14
Yin Y, Fu Z, Yang F, Lin G (2023) Or-Nerf: object removing from 3D scenes guided by multiview segmentation with neural radiance fields. arXiv preprint. arXiv:2305.10503
Zeng X, Chen X, Qi Z, Liu W, Zhao Z, Wang Z, Fu B, Liu Y, Yu G (2024) Paint3D: paint anything 3D with lighting-less texture diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 4252–4262
Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 3836–3847
Zhou Y, Wu C, Li Z, Cao C, Ye Y, Saragih J, Li H, Sheikh Y (2020) Fully convolutional mesh autoencoder using efficient spatially varying kernels. Adv Neural Inf Process Syst 33:9251–9262
Zhou X, He Y, Yu FR, Li J, Li Y (2023) Repaint-Nerf: nerf editting via semantic masks and diffusion models. In: Proceedings of the thirty-second international joint conference on artificial intelligence, pp 1813–1821
Zhu J, Zhang Y, Zhang X, Cao X (2021) Gaussian model for 3D mesh steganography. IEEE Signal Process Lett 28:1729–1733
Zhuang J, Wang C, Lin L, Liu L, Li G (2023) DREAMEDITOR: text-driven 3D scene editing with neural fields. In: SIGGRAPH Asia 2023 conference papers, pp 1–10
Zhuang J, Kang D, Cao Y-P, Li G, Lin L, Shan Y (2024) Tip-Editor: an accurate 3D editor following both text-prompts and image-prompts. ACM Trans Graph (TOG) 43(4):1–12
Zimny D, Waczyńska J, Trzciński T, Spurek P (2024) Points2Nerf: generating neural radiance fields from 3D point cloud. Pattern Recogn Lett 185:8–14
Zwicker M, Pfister H, Van Baar J, Gross M (2001) Surface splatting. In: Proceedings of the 28th annual conference on computer graphics and interactive techniques, pp 371–378
Acknowledgements
This work is supported by project ZR2021QF062 of the Shandong Provincial Natural Science Foundation.
Author information
Authors and Affiliations
Contributions
Lihua Lu: Conceptualization of this study, methodology, software, writing, visualization, validation, and review. Ruyang Li: Discussion, review, editing, and supervision. Xiaohui Zhang: Visualization, validation, and review. Hui Wei: Discussion, review, editing, and supervision. Guoguang Du: Discussion, review, and editing. Binqiang Wang: Discussion, review, and editing.
Corresponding authors
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Ethical approval
Not applicable.
Informed consent
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Lu, L., Li, R., Zhang, X. et al. Advances in text-guided 3D editing: a survey. Artif Intell Rev 57, 321 (2024). https://doi.org/10.1007/s10462-024-10937-6