Exploring Low-Dimensional Subspaces in Diffusion Models for Controllable Image Editing
Abstract
Recently, diffusion models have emerged as a powerful class of generative models. Despite their success, there is still limited understanding of their semantic spaces. This makes it challenging to achieve precise and disentangled image generation without additional training, especially in an unsupervised way. In this work, we improve the understanding of their semantic spaces from two intriguing observations: within a certain range of noise levels, (1) the learned posterior mean predictor (PMP) in the diffusion model is locally linear, and (2) the singular vectors of its Jacobian lie in low-dimensional semantic subspaces. We provide a solid theoretical basis to justify the linearity and low-rankness in the PMP. These insights allow us to propose an unsupervised, single-step, training-free LOw-rank COntrollable image editing (LOCO Edit) method for precise local editing in diffusion models. LOCO Edit identifies editing directions with nice properties: homogeneity, transferability, composability, and linearity. These properties of LOCO Edit benefit greatly from the low-dimensional semantic subspace. Our method can further be extended to unsupervised or text-supervised editing in various text-to-image diffusion models (T-LOCO Edit). Finally, extensive empirical experiments demonstrate the effectiveness and efficiency of LOCO Edit. The code will be released at https://github.com/ChicyChen/LOCO-Edit.
Key words: diffusion model, controlled generation, precise local edit, low-rank, interpretability
Contents
- 1 Introduction
- 2 Preliminaries on Diffusion Models
- 3 Exploring Linearity & Low-Dimensionality for Image Editing
- 4 Justification of Local Linearity, Low-rankness, & Semantic Direction
- 5 Experiments
- 6 Discussion on Related Works
- 7 Conclusion & Future Directions
- A More Empirical Study on Low-rankness & Local Linearity
- B Extra Details of LOCO Edit and T-LOCO Edit
- C Proofs in Section 4
- D Imaging Editing Experiment Details
1 Introduction
Recently, diffusion models have emerged as a powerful new family of deep generative models with remarkable performance in many applications such as image generation across various domains [1, 2, 3, 4, 5, 6], audio synthesis [7, 8], solving inverse problems [9, 10, 11, 12, 13, 14], and video generation [15, 16, 17]. For example, recent advances in AI-based image generation, revolutionized by diffusion models such as Dalle-2 [18], Imagen [19], and stable diffusion [4], have taken the world of “AI Art generation” by storm, enabling the generation of images directly from descriptive text inputs. These models corrupt images by adding noise through multiple steps of a forward process and then generate samples by progressive denoising through multiple steps of the reverse generative process.
Although modern diffusion models are capable of generating photorealistic images from text prompts, manipulating the content generated by diffusion models remains challenging in practice. Unlike generative adversarial networks [20], the understanding of semantic spaces in diffusion models is still limited. Thus, achieving disentangled and localized control over content generation by direct manipulation of the semantic spaces remains a difficult task for diffusion models. Although effective, some existing editing methods in diffusion models often demand additional training procedures and are limited to global control of content generation [21, 22, 23]. Some methods are training-free or localized but are still based upon heuristics, lack clear mathematical interpretations, or support text-supervised editing only [24, 25, 26, 27, 28]. Others provide analysis of diffusion models [29, 30, 31, 32, 33], but also have difficulty with local edits such as hair color.
In this study, we address the above problem by studying the low-rank semantic subspaces in diffusion models and proposing the LOw-rank COntrollable edit (LOCO Edit) approach. LOCO is the first local editing method that is single-step, training-free, and free of text supervision, and that has other intriguing properties (see Figure 1 for an illustration). Our method is highly intuitive and theoretically grounded, originating from a simple yet intriguing observation in the learned posterior mean predictor (PMP) in diffusion models: for a large portion of denoising time steps,
The PMP is a locally linear mapping between the noise image and the estimated clean image, and the singular vectors of its Jacobian reside within low-dimensional subspaces.
The empirical evidence in Figure 2 consistently shows that this phenomenon occurs when training diffusion models using different network architectures on a range of real-world image datasets. Theoretically, we validated this observation by assuming a mixture of low-rank Gaussian distributions for the data. We then prove the local linearity of the PMP, the low-rank nature of its Jacobian, and that the singular vectors of the Jacobian span the low-dimensional subspaces.
By utilizing the linearity of the PMP, we can edit within the singular vector subspace of its Jacobian to achieve linear control of the image content with no label or text supervision. The editing direction can be efficiently computed using the generalized power method (GPM) [30, 34]. Furthermore, we can manipulate specific regions of interest in the image along a disentangled direction through efficient nullspace projection, taking advantage of the low-rank properties of the Jacobian.
Benefits of LOCO Edit.
Compared to existing editing methods (e.g., [29, 35, 23, 24]) based on diffusion models, the proposed LOCO Edit offers several benefits that we highlight below:
- Precise, single-step, training-free, and unsupervised editing. LOCO enables precise localized editing (Figure 1(a)) in a single timestep without any training. Further, it requires no text supervision based on CLIP [36], thus integrating no intrinsic biases or flaws from CLIP [37]. LOCO is applicable to various diffusion models and datasets (Figure 5).
- Linear, transferable, and composable editing directions. The identified editing direction is linear, meaning that changes along this direction produce proportional changes in a semantic feature in the image space (Figure 1(d)). These editing directions are homogeneous and can be transferred across various images and noise levels (Figure 1(b)). Moreover, combining disentangled editing directions leads to simultaneous semantic changes in the respective region, while maintaining consistency in other areas (Figure 1(c)).
- An intuitive and theoretically grounded approach. Unlike previous works, by leveraging the local linearity of the PMP and the low-rankness of its Jacobian, our method is highly interpretable. The identified properties are well supported by both our empirical observation (Figure 2) and theoretical justifications in Section 4.
Notations.
Throughout the paper, we use to denote the noise-corrupted image space at the time-step . In particular, denotes the clean image space with data distribution , and denote an image. denote the posterior mean space at the time-step . Here, denotes a unit hypersphere in , and denotes the Stiefel manifold. denotes the numerical rank of . denotes the posterior mean and is written as . denotes the span of the columns of . denotes the set of all solutions to the equation . denotes the projection of onto .
Organization.
In Section 2, we introduce preliminaries on diffusion models. In Section 3, we present the exploration of local linearity and low-rankness in PMP, the method intuition based on the insights, and method details of LOCO and T-LOCO Edits. In Section 4, we theoretically justify the low-rankness, linearity, and semantic subspace in the PMP. In Section 5, we conduct comprehensive experiments to demonstrate the superiority of LOCO Edit and investigate its robustness. In Section 6, we thoroughly discuss related works. In Section 7, we conclude with potential future directions.
2 Preliminaries on Diffusion Models
In this section, we start by reviewing the basics of diffusion models [1, 2, 39], followed by several key techniques that will be used in our approach, such as Denoising Diffusion Implicit Models (DDIM) [3] and its inversion [40], T2I diffusion model, and classifier-free guidance [41].
Basics of Diffusion Models.
In general, diffusion models consist of two processes:
- The forward diffusion process. The forward process progressively perturbs the original data to a noisy sample for with Gaussian noise. As in [1], this can be characterized by a conditional Gaussian distribution . Particularly, parameters satisfy: (i) , and thus , and (ii) , and thus .
- The reverse sampling process. To generate a new sample, previous works [1, 3, 42, 43] have proposed various methods to approximate the reverse process of diffusion models. Typically, these methods involve estimating the noise and removing the estimated noise from recursively to obtain an estimate of . Specifically, the sampling step from to with a small can be described as:
  (1)
  where is parameterized by a neural network and trained to predict the noise at time .
Denoiser and Posterior Mean Predictor (PMP).
According to [1], the denoiser is optimized by solving the following problem:
where denotes the network parameters of the denoiser. Once is well trained, recent studies [44, 45] show that the posterior mean , i.e., predicted clean image at time , can be estimated as follows:
(2)
Here, denotes the posterior mean predictor (PMP) [45, 44], and denotes the estimated posterior mean output from PMP given and as the input. For simplicity, we denote as .
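The relation between the noise prediction and the posterior mean can be illustrated with a minimal sketch. It assumes the common variance-preserving parameterization x_t = sqrt(alpha_t) x_0 + sqrt(1 - alpha_t) eps, which is an assumption about notation rather than a statement of the exact schedule used here; the function names `forward_noise` and `posterior_mean` are illustrative, and the true noise stands in for a trained denoiser.

```python
import numpy as np

def forward_noise(x0, alpha_t, rng):
    """Sample x_t = sqrt(alpha_t) * x0 + sqrt(1 - alpha_t) * eps (assumed schedule)."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_t) * x0 + np.sqrt(1.0 - alpha_t) * eps
    return x_t, eps

def posterior_mean(x_t, eps_pred, alpha_t):
    """Predicted clean image: remove the scaled noise estimate, then rescale."""
    return (x_t - np.sqrt(1.0 - alpha_t) * eps_pred) / np.sqrt(alpha_t)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((3, 8, 8))   # stand-in for a clean image
alpha_t = 0.6                         # hypothetical noise level at the editing step
x_t, eps = forward_noise(x0, alpha_t, rng)

# With the true noise in place of a trained denoiser, the estimate recovers x0.
x0_hat = posterior_mean(x_t, eps, alpha_t)
```

With a learned denoiser the prediction is only an estimate of the posterior mean, so `x0_hat` would be a blurry version of the clean image rather than an exact reconstruction.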
DDIM and DDIM Inversion.
Given a noisy sample at time , DDIM [3] can generate clean images by multiple denoising steps. Given a clean sample , DDIM inversion [3] can generate a noisy at time by adding multiple steps of noise following the reversed trajectory of DDIM. DDIM inversion has been widely used in image editing methods [40, 46, 29, 35, 47, 26] to obtain given the original and then perform editing starting from . In our work, after getting given via DDIM inversion, we edit to only at the single time step with the help of PMP, and then utilize DDIM to generate the edited image .
For ease of exposition, for any and with , we denote DDIM operator and its inversion as
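The two operators can be sketched as follows, under the same assumed parameterization x_t = sqrt(alpha_t) x_0 + sqrt(1 - alpha_t) eps. The names `ddim_step`, `ddim`, and `ddim_inv`, the schedule `alphas`, and the constant toy denoiser are all hypothetical choices for illustration.

```python
import numpy as np

def ddim_step(x, eps_fn, a_from, a_to):
    """Deterministic DDIM update from noise level a_from to noise level a_to."""
    eps = eps_fn(x, a_from)
    x0_hat = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0_hat + np.sqrt(1.0 - a_to) * eps

def ddim(x, eps_fn, alphas, t1, t2):
    """Denoise from step t1 down to step t2 (t1 > t2)."""
    for t in range(t1, t2, -1):
        x = ddim_step(x, eps_fn, alphas[t], alphas[t - 1])
    return x

def ddim_inv(x, eps_fn, alphas, t2, t1):
    """Invert from step t2 up to step t1 (t1 > t2) along the reversed trajectory."""
    for t in range(t2, t1):
        x = ddim_step(x, eps_fn, alphas[t], alphas[t + 1])
    return x

T = 10
alphas = np.linspace(1.0, 0.05, T + 1)   # alphas[0] = 1 (clean), decreasing with t
rng = np.random.default_rng(0)
x0 = rng.standard_normal(16)
eps_const = rng.standard_normal(16)
eps_fn = lambda x, a: eps_const          # toy denoiser: constant noise prediction

t_edit = 6
x_t = ddim_inv(x0, eps_fn, alphas, 0, t_edit)   # noisy sample at the editing step
x0_rec = ddim(x_t, eps_fn, alphas, t_edit, 0)   # back to the clean image
```

The round trip is exact here only because the toy denoiser ignores its input; with a trained network, inversion is approximate and improves as the step size shrinks.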
Text-to-image (T2I) Diffusion Models & Classifier-Free Guidance.
So far, our discussion has only focused on unconditional diffusion models. Moreover, our approach can be generalized from unconditional diffusion models to T2I diffusion models [38, 4, 48, 19], where the latter enables controllable image generation guided by a text prompt . In more detail, when training T2I diffusion models, we optimize a conditional denoising function . For sampling, we employ a technique called classifier-free guidance [41], which substitutes the unconditional denoiser in Equation 1 with its conditional counterpart that can be described as follows:
(3)
Here, denotes the empty prompt and denotes the strength for the classifier-free guidance.
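As a minimal sketch, classifier-free guidance in its standard form combines the two noise predictions linearly. The function name `cfg_noise` and the random arrays standing in for denoiser outputs are illustrative assumptions.

```python
import numpy as np

def cfg_noise(eps_uncond, eps_cond, eta):
    """Classifier-free guidance: start from the empty-prompt prediction and move
    toward (eta <= 1) or beyond (eta > 1) the text-conditioned prediction."""
    return eps_uncond + eta * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
eps_uncond = rng.standard_normal(8)   # stand-in for the denoiser output with the empty prompt
eps_cond = rng.standard_normal(8)     # stand-in for the denoiser output with the text prompt
eps_guided = cfg_noise(eps_uncond, eps_cond, eta=7.5)
```

Setting the strength to 0 recovers the unconditional denoiser and setting it to 1 recovers the conditional one; larger values extrapolate past the conditional prediction.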
3 Exploring Linearity & Low-Dimensionality for Image Editing
In this section, we formally introduce the identified low-rank subspace in diffusion models and the proposed LOCO Edit method with the underlying intuitions. In Section 3.1, we present the benign properties in PMP that our method utilizes. Following this, we present the key intuitions behind our method in Section 3.2 and provide a detailed description of the method in Section 3.3.
3.1 Local Linearity and Intrinsic Low-Dimensionality in PMP
First, let us delve into the key intuitions behind the proposed LOCO Edit method, which lie in the benign properties of the PMP . At one given timestep , let us consider the first-order Taylor expansion of at the point :
(4)
where is a perturbation direction with unit length, is the perturbation strength, and is the Jacobian of . Interestingly, we discovered that within a certain range of noise levels, the learned PMP exhibits local linearity, and the singular subspace of its Jacobian is low rank. Notably, these properties are universal across various network architectures (e.g., UNet and Transformers) and datasets.
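As a worked form of Equation 4, the expansion can be written as below. The symbols are assumptions introduced here for readability: f_{\theta,t} for the PMP, J_{\theta,t} for its Jacobian, \Delta x for the unit-length perturbation direction, and \lambda for the perturbation strength.

```latex
f_{\theta,t}(x_t + \lambda \Delta x)
  = f_{\theta,t}(x_t) + \lambda\, J_{\theta,t}(x_t)\, \Delta x + o(\lambda),
\qquad \|\Delta x\|_2 = 1 .
```

Local linearity then means that the remainder term stays negligible even for fairly large \lambda, so the first two terms alone describe the PMP near x_t.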
We measure the low-rankness with rank ratio and the local linearity with norm ratio and cosine similarity. Specifically, (i) rank ratio is the ratio of and the ambient dimension ; (ii) norm ratio is the ratio of and ; (iii) cosine similarity is between and . The detailed experiment settings are provided in Section A.1, and results are illustrated in Figure 2, from which we observe:
- Low-rankness of the Jacobian . As shown in Figure 2(a), the rank ratio for consistently displays a U-shaped pattern across various network architectures and datasets: (i) it is close to near either the pure noise or the clean image , (ii) is low-rank (i.e., rank ratio less than ) for all diffusion models within the range , and (iii) it achieves its lowest value around a mid-to-late timestep, which differs slightly depending on the architecture and dataset.
- Local linearity of the PMP . Moreover, the mapping exhibits strong linearity across a large portion of the timesteps; see Figure 2(b) and Figure 10. Specifically, in Figure 2(b), we evaluate the linearity of at where the rank ratio is close to the lowest value. We can see that even when , which is consistently true among different architectures trained on different datasets.
In addition to comprehensive experimental studies, we will also demonstrate in Section 4 that both properties can be theoretically justified.
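The three measurements can be sketched on a toy map as follows. Several choices here are assumptions for illustration: the numerical rank is taken as the smallest rank that captures 99% of the squared singular-value energy, the norm ratio and cosine similarity compare the perturbed output with its first-order linearization, and the toy map `f` stands in for a trained PMP.

```python
import numpy as np

def jacobian_fd(f, x, h=1e-5):
    """Central finite-difference Jacobian of f at x (only practical in small dimensions)."""
    cols = []
    for e in np.eye(x.size):
        cols.append((f(x + h * e) - f(x - h * e)) / (2 * h))
    return np.stack(cols, axis=1)

def numerical_rank(J, energy=0.99):
    """Smallest r whose top-r singular values hold `energy` of the squared spectrum."""
    s = np.linalg.svd(J, compute_uv=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(cum, energy) + 1)

def linearity_metrics(f, J, x, dx, lam):
    """Norm ratio and cosine similarity between f(x + lam*dx) and its linearization."""
    exact = f(x + lam * dx)
    linear = f(x) + lam * (J @ dx)
    norm_ratio = np.linalg.norm(exact) / np.linalg.norm(linear)
    cosine = exact @ linear / (np.linalg.norm(exact) * np.linalg.norm(linear))
    return norm_ratio, cosine

rng = np.random.default_rng(0)
n, r = 16, 3
B = rng.standard_normal((n, r))
A = B @ B.T / n
f = lambda x: A @ x + 0.01 * np.sin(A @ x)   # toy map: nearly linear, Jacobian of rank <= 3

x = rng.standard_normal(n)
dx = rng.standard_normal(n)
dx /= np.linalg.norm(dx)                     # unit-length perturbation direction

J = jacobian_fd(f, x)
rank_ratio = numerical_rank(J) / n
norm_ratio, cosine = linearity_metrics(f, J, x, dx, lam=5.0)
```

A rank ratio well below 1 together with a norm ratio and cosine similarity close to 1 is the pattern described above; for image-sized inputs the Jacobian would be accessed through Jacobian-vector products rather than formed explicitly.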
3.2 Key Intuitions for Our Image Editing Method
The two benign properties offer valuable insights for image editing with precise control. Here, we first present the high-level intuitions behind our method, with further details postponed to Section 3.3. Specifically, for any given time-step , let us denote the compact singular value decomposition (SVD) of the Jacobian as
(5)
where is the rank of , and denote the left and right singular vectors, and denote the singular values. We write in short for a specific , and denote and .
- Local linearity of PMP for one-step, training-free, and supervision-free editing. Given the PMP is locally linear at the -th timestep, if we perturb by , using one right singular vector of as an example editing direction, then by orthogonality
  (6)
  This implies we can achieve one-step editing along the semantic direction . Notably, the method is training-free and supervision-free since the editing direction can be simply found via the SVD of .
- Local linearity of PMP for linear, homogeneous, and composable image editing. (i) First, the editing direction is linear, where any linear change along results in a linear change along for the edited image. (ii) Second, the editing direction is homogeneous due to its independence of , so it can be applied to any image from the same data distribution and results in the same semantic edit. (iii) Third, editing directions are composable. Any linearly combined editing direction is a valid editing direction which would result in a composable change in the edited image. On the contrary, results in no editing since .
- Low-rankness of Jacobian for localized and efficient editing. is for the entire predicted clean image, thus finds editing directions in the entire image. Denote the Jacobian only for a certain region of interest (ROI), and the Jacobian for regions outside ROI. Similarly, can edit mainly regions within the ROI, and contain directions that do not edit regions outside of ROI. Further projection of onto can result in a more localized editing direction for ROI. To perform such nullspace projection, computing the full SVD can be very expensive. However, we can greatly reduce the computation through a low-rank estimation of the Jacobians with rank . The estimation is efficient yet effective with when the rank of the Jacobian achieves the lowest value.
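The first two intuitions above can be checked on a toy example. The sketch below assumes an exactly linear, low-rank map in place of the PMP, so its Jacobian is the same matrix everywhere; the matrix `A`, the indices `i` and `j`, and the strength `lam` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 32, 4
# Toy locally linear "PMP": f(x) = A x with a rank-r matrix A (its Jacobian is A).
A = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
f = lambda x: A @ x

U, S, Vt = np.linalg.svd(A)
U, S, V = U[:, :r], S[:r], Vt[:r].T      # compact SVD: A = U diag(S) V^T

x_t = rng.standard_normal(n)
lam = 3.0

# One-step edit along the i-th right singular vector: the output moves along u_i.
i = 0
delta_i = f(x_t + lam * V[:, i]) - f(x_t)

# Composability: adding two directions adds their individual effects.
j = 1
delta_ij = f(x_t + lam * (V[:, i] + V[:, j])) - f(x_t)

# A direction in the null space of the Jacobian produces no edit.
w = rng.standard_normal(n)
w_null = w - V @ (V.T @ w)
delta_null = f(x_t + lam * w_null) - f(x_t)
```

Here `delta_i` equals `lam * S[i] * U[:, i]`, which is the content of Equation 6 for a linear map: the change in the predicted clean image is proportional to the editing strength and points along the matching left singular vector.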
Upon publishing our results, we encountered a concurrent study [32] that introduced a method for editing audio and images similar to Equation 6, drawing on an interesting analysis of the posterior distribution presented in [33]. However, our approach offers a distinct perspective, providing complementary insights and new findings. Specifically: (i) We explore the low-rank nature and local linearity in PMP, offering rigorous theoretical analyses of these characteristics and shedding light on the semantic meanings of the low-rank subspaces. (ii) These insights give rise to favorable properties in our editing method, such as transferability and composability; see Figure 1. (iii) Furthermore, we enable localized editing through null space projection (see Section 3) and demonstrate the robustness of the method across a variety of models (see Figure 5). (iv) Finally, we extend the method to unsupervised and text-supervised editing in various text-to-image models; see Figure 4.
3.3 Low-rank Controllable Image Editing Method with Nullspace Projection
In this subsection, we provide a detailed introduction to LOCO Edit, expanding on the discussion in Section 3.1. We first introduce the supervision-free LOCO Edit, where we further enable localized image editing through nullspace projection with masks. Second, we present how to generalize to T-LOCO Edit for T2I diffusion models, with or without text supervision, to define the semantic editing directions.
LOCO Edit.
We first introduce the general pipeline of LOCO Edit. As illustrated in Figure 3, given an original image , we first use to generate a noisy image . In particular, we choose so that the PMP is locally linear and its Jacobian is close to its lowest rank. From Section 3.1, we know that we can edit the image by changing , where is the identified editing direction. After editing to , we use to generate the edited image.
In many practical applications, we often need to edit only specific local regions of an image while leaving the rest unchanged. As discussed in Section 3.2, we can achieve this task by finding a precise local editing direction with localized Jacobians and nullspace projection. Overall, the complete method is in Algorithm 1. We describe the key details as follows.
- Finding localized Jacobians via masking. To enable local editing, we use a mask (i.e., an index set of pixels) to select the region of interest (for datasets that have predefined masks, we can use them directly; for datasets that lack predefined masks, as well as for generated images, we can use Segment Anything (SAM) to generate masks [55]), with denoting the projection onto the index set . For picking a local editing direction, we calculate the Jacobian of restricted to the region of interest, , and select the localized editing direction from the top- singular vectors of (e.g., for some index ). In practice, a top- rank estimation for is calculated through the generalized power method (GPM; Algorithm 2) with to improve efficiency.
- Better semantic disentanglement via nullspace projection. However, the projection introduces extra nonlinearity into the mapping , causing the identified direction to have semantic entanglements with the area outside of the mask. Here, denotes the complementary set of . To address this issue, we can use the nullspace projection method [56, 57]. Specifically, given , nullspace projection projects onto . The projection can be computed as so that the modified editing direction does not change the image in . In practice, we calculate a top- rank estimation for through the generalized power method (GPM; Algorithm 2) with .
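The two steps above can be sketched on toy Jacobians. The block uses plain subspace iteration that touches the Jacobian only through products with it and its transpose, in the spirit of the generalized power method; it is not a reproduction of Algorithm 2. The matrices `J_roi` and `J_out`, their ranks, and all function names are hypothetical.

```python
import numpy as np

def top_k_right_singular(jvp, vjp, n, k, iters=50, seed=0):
    """Estimate the top-k right singular vectors of J by subspace iteration on J^T J,
    touching J only through products J @ V (jvp) and J^T @ U (vjp)."""
    rng = np.random.default_rng(seed)
    V, _ = np.linalg.qr(rng.standard_normal((n, k)))
    for _ in range(iters):
        V, _ = np.linalg.qr(vjp(jvp(V)))   # apply J^T J, then re-orthonormalize
    return V

def nullspace_project(v, V_bar):
    """Project v onto the orthogonal complement of span(V_bar) and renormalize."""
    v_p = v - V_bar @ (V_bar.T @ v)
    return v_p / np.linalg.norm(v_p)

rng = np.random.default_rng(1)
n, n_roi = 40, 10
# Toy Jacobians: output rows inside the mask (ROI) and output rows outside it.
J_roi = rng.standard_normal((n_roi, 3)) @ rng.standard_normal((3, n))      # rank 3
J_out = rng.standard_normal((n - n_roi, 5)) @ rng.standard_normal((5, n))  # rank 5

V_roi = top_k_right_singular(lambda V: J_roi @ V, lambda U: J_roi.T @ U, n, k=3)
V_out = top_k_right_singular(lambda V: J_out @ V, lambda U: J_out.T @ U, n, k=5)

v = V_roi[:, 0]                     # candidate edit direction for the region of interest
v_p = nullspace_project(v, V_out)   # remove the part that changes pixels outside the mask

leak_before = np.linalg.norm(J_out @ v)    # change outside the mask before projection
leak_after = np.linalg.norm(J_out @ v_p)   # change outside the mask after projection
effect_roi = np.linalg.norm(J_roi @ v_p)   # change inside the mask is retained
```

In a real model the two closures would be Jacobian-vector and vector-Jacobian products of the masked PMP computed by automatic differentiation, so the full Jacobian never needs to be formed.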
T-LOCO Edit.
The unsupervised edit method can be seamlessly applied to T2I diffusion models with classifier-free guidance (3) (Algorithm 3). Besides, we can further enable text-supervised image editing with an editing prompt (Algorithm 4). See results in Figure 4(a). This is useful because the additional text prompt allows us to enforce a specified editing direction that cannot be found easily in the semantic subspace of the vanilla Jacobian . As illustrated in Figure 4(b), this includes adding glasses or changing the curly hair of a human face. For simplicity, we introduce the key ideas of text-supervised T-LOCO Edit based upon DeepFloyd IF [19]. Similar procedures are also generalized to Stable Diffusion and Latent Consistency Models with an additional decoding step [4, 38]. We discuss the key intuition below, see Section B.2 and Section B.3 for method details.
We first introduce some notations. Let denote the original prompt, and denote the editing prompt. For example, in Figure 4(b), can be “portrait of a man”, while can be “portrait of a man with glasses”. Correspondingly, given the noisy image for the clean image generated with , let and be the estimated posterior mean and its Jacobian conditioned on the original prompt , and let and be the estimated posterior mean and its Jacobian conditioned on both the editing prompt and .
According to the classifier-free guidance (3), we can estimate the difference of estimated posterior means caused by the editing prompt as , and then set as an initial estimator of the editing direction. (The idea is to identify the editing direction in the space based on changes in the estimated posterior mean caused by the editing prompt; more details are provided in Section B.3.) Based upon this, to enable localized editing, similar to the unsupervised case, we can apply masks to select ROI in and calculate localized Jacobian to get . After that, similarly, we can perform nullspace projection of for better disentanglement to get the final editing direction .
4 Justification of Local Linearity, Low-rankness, & Semantic Direction
In this section, we provide theoretical justification for the benign properties in Section 3.1. First, we assume that the image distribution follows a mixture of low-rank Gaussians defined as follows.
Assumption 1.
The data generated distribution lies on a union of subspaces. The bases of the subspaces are orthogonal to each other with for all , and the subspace dimension is much smaller than the ambient dimension . Moreover, for each , follows a degenerate Gaussian with . Without loss of generality, suppose is from the -th class, that is, where , i.e., . Both is bounded.
Our data assumption is motivated by the intrinsic low-dimensionality of real-world image datasets [58]. Additionally, Wang et al. [59] demonstrated that images generated by an analytical score function derived from a mixture of Gaussians distribution exhibit conceptual similarities to those produced by practically trained diffusion models. Given that is an estimator of the posterior mean , we show that the posterior mean can be analytically derived as follows.
Lemma 1.
Under Assumption 1, for , the posterior mean is
(7)
Lemma 1 shows that the posterior mean could be viewed as a convex combination of , i.e. projected onto each subspace . This lemma leads to the following theorem:
Theorem 1.
Based upon Assumption 1, we can show the following three properties for the posterior mean :
- The Jacobian of posterior mean satisfies for all .
- The posterior mean has local linearity such that
  (8)
  where and is the step size.
- is symmetric and the full SVD of could be written as , where , with and . Let and . It holds that
The proof is deferred to Appendix C. Admittedly, there are gaps between our theory and practice, such as the approximation error between and , assumptions about the data distribution, and the high rank of for and in Figure 2. Nonetheless, Theorem 1 largely supports our empirical observation in Section 3 that we discuss below:
- Low-rankness of the Jacobian. The first property in Theorem 1 demonstrates that the rank of is always no greater than the intrinsic dimension of the data distribution. Given that the intrinsic dimension of the real data distribution is usually much lower than the ambient dimension [58], the rank of on the real dataset should also be low. The results align with our empirical observations in Figure 2 when .
- Linearity of the posterior mean. The second property in Theorem 1 shows that the linear approximation error is within the order of . This implies that when approaches 1, becomes small, resulting in a small approximation error even for large . Empirically, Figure 2 shows that the linear approximation error of is small when and , whereas Figure 10 shows a much larger error for under the same . These observations align well with our theory.
- Low-dimensional semantic subspace. The third property in Theorem 1 shows that, when is close to 1, the left singular vectors associated with the top- singular values form the basis of the image distribution. Thus, if the edited direction is the basis, the edited image will remain within the image distribution. This explains why found in Equation 6 is a semantic direction for image editing.
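The low-rank and symmetry properties can be checked numerically on a small mixture of low-rank Gaussians. The sketch below is derived under explicit assumptions that are not taken from the statement of Lemma 1: zero-mean components with covariance U_k U_k^T, mutually orthogonal bases of equal dimension, and the parameterization x_t = sqrt(alpha_t) x_0 + sqrt(1 - alpha_t) eps. The function names are illustrative.

```python
import numpy as np

def molrg_posterior_mean(x_t, bases, weights, alpha_t):
    """E[x_0 | x_t] for x_0 ~ sum_k pi_k N(0, U_k U_k^T) under the assumed schedule.
    Returns the mean, the posterior class weights, and the projections P_k x_t."""
    beta = alpha_t / (2.0 * (1.0 - alpha_t))
    proj = np.stack([U @ (U.T @ x_t) for U in bases])            # P_k x_t, shape (K, n)
    logits = np.log(weights) + beta * np.sum(proj**2, axis=1)    # uses ||U_k^T x_t||^2
    w = np.exp(logits - logits.max())                            # stabilized softmax
    w /= w.sum()
    return np.sqrt(alpha_t) * (w @ proj), w, proj

def molrg_jacobian(x_t, bases, weights, alpha_t):
    """Analytic Jacobian of the posterior mean above with respect to x_t."""
    beta = alpha_t / (2.0 * (1.0 - alpha_t))
    _, w, proj = molrg_posterior_mean(x_t, bases, weights, alpha_t)
    mean_proj = w @ proj
    J = sum(wk * (U @ U.T) for wk, U in zip(w, bases))           # weighted projectors
    J = J + 2.0 * beta * ((proj.T * w) @ proj - np.outer(mean_proj, mean_proj))
    return np.sqrt(alpha_t) * J

rng = np.random.default_rng(0)
n, K, r = 20, 3, 2
Q, _ = np.linalg.qr(rng.standard_normal((n, K * r)))
bases = [Q[:, k * r:(k + 1) * r] for k in range(K)]   # mutually orthogonal subspaces
weights = np.full(K, 1.0 / K)
alpha_t = 0.5

x_t = rng.standard_normal(n)
J = molrg_jacobian(x_t, bases, weights, alpha_t)
rank_J = np.linalg.matrix_rank(J, tol=1e-8)
```

In this toy setting the posterior mean is a convex combination of the projections of x_t onto the subspaces, its Jacobian is symmetric, and its rank is bounded by the total subspace dimension K * r = 6 in an ambient dimension of 20.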
5 Experiments
In this section, we perform extensive experiments to demonstrate the effectiveness and efficiency of LOCO Edit. We first showcase LOCO Edit has strong localized editing ability across a variety of datasets in Section 5.1. Moreover, we conduct comprehensive comparisons with other methods to show the superiority of the LOCO Edit method in Section 5.2. Besides, we provide ablation studies on multiple components in our method in Section 5.3. Further, we visualize and analyze the editing directions from LOCO Edit in Section 5.4. All the experiments can be conducted with a single A40 GPU with 48G memory, with extra experimental details postponed to Appendix D.
5.1 Demonstration on Localized Editing and Other Benign Properties
First, we demonstrate several benign properties of our unsupervised LOCO Edit in Algorithm 1 on a variety of datasets, including LSUN-Church [60], Flower [61], AFHQ [62], CelebA-HQ [52], and FFHQ [63].
As shown in Figure 5 and Figure 1(a), our method enables editing specific localized regions such as eye size/focus, hair curvature, length/amount, and architecture, while preserving the consistency of other regions. Besides the ability of precise local editing, Figure 1 demonstrates the benign properties of the identified editing directions and verifies our analysis in Section 4:
- Linearity. As shown in Figure 1(d), the semantic editing can be strengthened through larger editing scales and can be flipped by negating the scale.
- Homogeneity and transferability. As shown in Figure 1(b), the discovered editing direction can be transferred across samples and timesteps in .
- Composability. As shown in Figure 1(c), the identified disentangled editing directions in the low-rank subspace allow direct composition without influencing each other.
5.2 Comprehensive Comparison with Other Image Editing Methods
Method Name | Pullback | (Jacobian-based baseline) | NoiseCLR | Asyrp | BlendedDiffusion | LOCO (Ours)
Local Edit Success Rate↑ | 0.32 | 0.37 | 0.32 | 0.47 | 0.55 | 0.80
LPIPS↓ | 0.16 | 0.13 | 0.14 | 0.22 | 0.03 | 0.08
SSIM↑ | 0.60 | 0.66 | 0.68 | 0.68 | 0.94 | 0.71
Transfer Success Rate↑ | 0.14 | 0.24 | 0.66 | 0.58 | Can’t Transfer | 0.91
Transfer Edit Time↓ | 4s | 2s | 5s | 3s | Can’t Transfer | 2s
#Images for Learning | 1 | 1 | 100 | 100 | 1 | 1
Learning Time↓ | 8s | 44s | 1 day | 475s | 120s | 79s
One-step Edit? | ✓ | ✓ | ✗ | ✗ | ✗ | ✓
No Additional Supervision? | ✓ | ✓ | ✓ | ✗ | ✗ | ✓
Theoretically Grounded? | ✗ | ✗ | ✗ | ✗ | ✗ | ✓
Localized Edit? | ✗ | ✗ | ✗ | ✗ | ✓ | ✓
We compare LOCO Edit with several notable and recent image editing techniques, including Asyrp [29], Pullback [30] (in the manifold), NoiseCLR [23], and BlendedDiffusion [24]. Additionally, we also compare with an unexplored method using the Jacobians to find the editing direction, named as .
Evaluation Metrics.
We evaluate our method based upon the metrics that we elaborate on below and summarize the results in Table 1. Besides the image generation quality, we also compared other attributes such as the local edit ability, efficiency, the requirement for supervision, and theoretical justifications.
- Local Edit Success Rate evaluates, via human evaluators, whether the edit successfully changes the target semantics and preserves unrelated regions.
- LPIPS and SSIM measure the similarity between the edited image and the original image.
- Transfer Success Rate measures, via human evaluators, whether an edit transferred to other images successfully changes the target semantics and preserves unrelated regions.
- Learning Time measures the time required to identify edit directions or perform other learning/training.
- Transfer Edit Time measures the time required to transfer the edit to other images directly.
- #Images for Learning measures the number of images used to find the editing directions.
- One-step Edit, No Additional Supervision, Theoretically Grounded, and Localized Edit are attributes of the editing methods, each indicating whether the method has the named property.
For fair comparison, we evaluate the methods on randomly selected images without cherry-picking for methods having strong edit ability in Figure 6. The detailed evaluation settings are provided in Section D.2.
Benefits of Our Method.
Based upon the qualitative and quantitative comparisons, our method shows several clear advantages that we summarize as follows.
- Superior local edit ability with one-step edit. Table 1 shows LOCO Edit achieves the best Local Edit Success Rate. Such local edit ability only requires one-step edit at a specific time . For LPIPS and SSIM, our method performs better than global edit methods but worse than BlendedDiffusion. However, BlendedDiffusion sometimes fails the edit within the masks (as visualized in Figure 6, rows 1, 3, 4, and 5). Other methods like NoiseCLR find semantic directions more globally, such as style and race, leading to worse performance in Local Edit Success Rate, LPIPS, and SSIM for localized edits.
- Transferability and efficiency. First, LOCO Edit requires less learning time than most of the other methods and requires learning only for a single time step with a single image. Moreover, LOCO Edit is highly transferable, having the highest Transfer Success Rate in Table 1. In contrast, BlendedDiffusion cannot transfer and requires optimization for each individual image. NoiseCLR has the second-best yet lower transfer success rate, while other methods exhibit worse transferability.
- Theoretically-grounded and supervision-free. LOCO Edit is theoretically grounded. Besides, it is supervision-free, thus integrating no biases from other modules such as CLIP [36]. As shown in [37], CLIP sometimes fails to capture detailed semantics such as color. We can observe failures in capturing detailed semantics for methods that utilize CLIP guidance such as BlendedDiffusion and Asyrp in Figure 6, where there are no edits or wrong edits.
5.3 Ablation Studies
We conduct several important ablation studies on noise levels, the rank of nullspace projection, and editing strength, which demonstrate the robustness of our method.
- Noise levels (i.e., editing time step ). We conducted an ablation study on different noise levels, with representative examples shown in Figure 7(a). The key observations are summarized as follows: (a) larger noise levels (i.e., edit on with larger ) perform coarser edits, while smaller noise levels perform finer edits; (b) LOCO Edit is applicable to a generally large range of noise levels ([0.2T, 0.7T]) for precise edit.
- Rank of nullspace projection . The ablation study on nullspace projection is in Figure 7(b) (the definition of is in Algorithm 1). We present the key observations: (a) the local edit ability with no nullspace projection is weaker than that with nullspace projection; (b) when conducting nullspace projection, an effective low-rank estimation with can already achieve good local edit results.
- Editing strength . The linearity with respect to editing strengths is visualized in Figure 7(c). Beyond linearity, the key observation is that LOCO Edit is applicable to a generally wide range of editing strengths ([-15, 15]) to achieve localized edit.
5.4 Visualization and Analysis of Editing Directions
We visualize the identified editing direction (see Algorithm 1) in Figure 8. The editing directions are semantically meaningful to the region of interest for editing. For example, the editing directions for eyes, lips, nose, etc., have similar shapes to eyes, lips, nose, etc.
Further, since the objects in the Flower, AFHQ, CelebA-HQ, and FFHQ datasets are usually positioned at the center, the identified editing directions also tend to be at the center. Besides, objects can have different shapes, and semantics present in some images do not exist in others. To further study the robustness of transferability for the editing directions, we transfer editing directions to images with objects at different positions, from different datasets, with different shapes, and with no corresponding semantics. We present the results in Figure 9, with the key observations that: (a) the editing directions are generally robust to gender differences, shape differences, moderate position differences, and dataset differences, as illustrated in the first five rows of Figure 9; (b) transferring an editing direction to images without the corresponding semantics results in almost no editing (shown in the last row of Figure 9). Therefore, in practical applications, meaningful transfer editing scenarios for LOCO Edit occur when the transferred editing directions correspond to existing semantics in the target image (e.g., transferring the editing direction of "eyes" is effective only if the target image also contains eyes).
6 Discussion on Related Works
Study of Latent Semantic Space in Generative Models.
Although diffusion models have demonstrated their strengths in state-of-the-art image synthesis, our understanding of them still lags far behind that of other generative models such as Generative Adversarial Networks (GANs) [66, 57], whose analysis can provide both tools and inspiration for understanding diffusion models. Some recent works have identified such gaps, discovered latent semantic spaces in diffusion models [29], and further studied the properties of the latent space from a geometrical perspective [30]. These prior works deepen our understanding of the latent semantic space in diffusion models, and inspire later works to study the structure of the information represented in diffusion models from various angles. However, their semantic space is constrained to diffusion models using the UNet architecture, and cannot represent localized semantics. Our work explores an alternative space to study the semantic expression in diffusion models, inspired by our observation of the low-rank and locally linear Jacobian of the denoiser over the noisy images. We provide a theoretical framework for demonstrating and understanding such properties, which can deepen the interpretation of the learned data distribution in diffusion models.
Image Editing in Unconditional Diffusion Models.
Recent research has significantly improved the understanding of latent semantic spaces in diffusion models, enabling global image editing through either training-free methods [29, 30, 31] or by incorporating an additional lightweight model [30, 67]. However, these methods perform poorly on localized edits. In contrast, our approach achieves localized editing without requiring supervised training. For localized edits, [25] builds on [30], enabling local edits by altering the intermediate layers of UNet. However, these approaches are restricted to UNet-based architectures in diffusion models and have largely ignored intrinsic properties such as linearity and low-rankness. In comparison, our work provides a rigorous theoretical analysis of low-rankness and local linearity in diffusion models, and we are the first to offer a principled justification of the semantic significance of the basis used for editing. Moreover, our method is independent of specific network architectures.
Other recent works, such as [32], introduce training-free global audio and image editing based on a theoretical understanding of the posterior covariance matrix [33], also independent of UNet architectures. However, our proposed LOCO Edit method allows unsupervised and localized editing, and is grounded in the low-rank and locally linear subspaces in diffusion models, which enables several advantageous properties including transferability, composability, and linearity, benign features that have not been explored in prior work. Additionally, while [24] supports localized editing, it requires supervision from CLIP, lacks a theoretical basis, and is time-consuming for editing each image. In contrast, our method is more efficient, theoretically grounded, and free from failures or biases in CLIP. CLIP-supervised methods may also exhibit a bias toward the CLIP score, leading to suboptimal editing results, as shown in Figure 6. In comparison, our method consistently enables high-quality edits without such bias.
Image Editing in T2I Diffusion Models.
T2I image editing usually requires much more complicated sampling and training procedures, such as providing certain learned guidance in the reverse sampling process [11], training an extra neural network [21], or fine-tuning the models for certain attributes [22]. Although effective, these methods often require extra training or even human intervention. Some other T2I image editing methods are training-free [46, 27, 28], and further enable editing by identifying masks [46] or by optimizing the soft combination of text prompts [28]. These methods involve a continuous injection of the edit prompt during the generation process to gradually refine the generated image toward the target semantics. Though effective, all of the above methods (either training-free or not) as well as instruction-guided ones [68, 69, 70, 71] lack clear mathematical interpretations and require text supervision. [23] discovers editing directions in T2I diffusion models through contrastive learning without text supervision, but is not generalizable to editing with text supervision. [30] has some theoretical basis and extends to an editing approach in T2I diffusion models with text supervision, but such supervision is only for unconditional sampling. In contrast, our extended T-LOCO Edit, which originates from the understanding of diffusion models, is the first method exploring single-step editing with or without text supervision for conditional sampling.
7 Conclusion & Future Directions
We proposed a new low-rank controllable image editing method, LOCO Edit, which enables precise, one-step, localized editing using diffusion models. Our approach stems from the discovery of the locally linear posterior mean estimator in diffusion models and the identification of a low-dimensional semantic subspace in its Jacobian, theoretically verified under certain data assumptions. The identified editing directions possess several beneficial properties, such as linearity, homogeneity, and composability. Additionally, our method is versatile across different datasets and models and is applicable to text-supervised editing in T2I diffusion models. Through various experiments, we demonstrate the superiority of our method compared to existing approaches.
We identify several limitations of the current work and directions for future research. The current theoretical framework mainly explains the unsupervised image editing part. A more solid and thorough analysis of text-supervised image editing is of significant importance in understanding T2I diffusion models, and it remains a difficult open problem in the field. For example, there is still a lack of geometric analysis of the relationship between subspaces under different text-prompt conditions [4, 19, 38, 72]. Based on such understanding, it may be possible to further discover benign properties of editing directions in T2I diffusion models, or to design more efficient fine-tuning [73, 74] accordingly. Besides, the current method has the potential to be extended to combine coarse-to-fine editing across different time steps. Furthermore, it is worth exploring the direct manipulation of semantic spaces in flow-matching diffusion models and transformer-architecture diffusion models. Lastly, it is possible to connect the current findings to image or video representation learning in diffusion models [75, 76, 77], or to utilize the low-rank structures to build dictionaries [78].
Acknowledgement
We acknowledge support from NSF CAREER CCF-2143904, NSF CCF-2212066, NSF CCF-2212326, NSF IIS 2312842, NSF IIS 2402950, ONR N00014-22-1-2529, a gift grant from KLA, an Amazon AWS AI Award, and a MICDE Catalyst Grant. The authors acknowledge valuable discussions with Mr. Zekai Zhang (U. Michigan), Dr. Ismail R. Alkhouri (U. Michigan and MSU), Mr. Jinfan Zhou (U. Michigan), and Mr. Xiao Li (U. Michigan).
References
- Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
- Song et al. [2021a] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021a. URL https://openreview.net/forum?id=PxTIG12RRHS.
- Song et al. [2021b] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021b. URL https://openreview.net/forum?id=St1giarCHLP.
- Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
- Zhang et al. [2024a] Huijie Zhang, Yifu Lu, Ismail Alkhouri, Saiprasad Ravishankar, Dogyoon Song, and Qing Qu. Improving training efficiency of diffusion models via multi-stage framework and tailored multi-decoder architectures. In Conference on Computer Vision and Pattern Recognition 2024, 2024a. URL https://openreview.net/forum?id=YtptmpZQOg.
- Alkhouri et al. [2023a] Ismail Alkhouri, Shijun Liang, Rongrong Wang, Qing Qu, and Saiprasad Ravishankar. Diffusion-based adversarial purification for robust deep mri reconstruction. ArXiv preprint arXiv:2309.05794, 2023a.
- Kong et al. [2021] Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diffwave: A versatile diffusion model for audio synthesis. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=a-xFK8Ymz5J.
- Chen et al. [2021] Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan. Wavegrad: Estimating gradients for waveform generation. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=NsMLjcFaO8O.
- Chung et al. [2022] Hyungjin Chung, Byeongsu Sim, Dohoon Ryu, and Jong Chul Ye. Improving diffusion models for inverse problems using manifold constraints. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=nJJjv0JDJju.
- Song et al. [2023] Jiaming Song, Arash Vahdat, Morteza Mardani, and Jan Kautz. Pseudoinverse-guided diffusion models for inverse problems. In International Conference on Learning Representations, 2023.
- Chung et al. [2023] Hyungjin Chung, Jeongsol Kim, Michael Thompson Mccann, Marc Louis Klasky, and Jong Chul Ye. Diffusion posterior sampling for general noisy inverse problems. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=OnD9zGAGT0k.
- Li et al. [2024a] Xiang Li, Soo Min Kwon, Ismail R Alkhouri, Saiprasad Ravishankar, and Qing Qu. Decoupled data consistency with diffusion purification for image restoration. ArXiv preprint arXiv:2403.06054, 2024a.
- Alkhouri et al. [2023b] Ismail Alkhouri, Shijun Liang, Rongrong Wang, Qing Qu, and Saiprasad Ravishankar. Robust physics-based deep mri reconstruction via diffusion purification. In Conference on Parsimony and Learning (Recent Spotlight Track), 2023b.
- Song et al. [2024] Bowen Song, Soo Min Kwon, Zecheng Zhang, Xinyu Hu, Qing Qu, and Liyue Shen. Solving inverse problems with latent diffusion models via hard data consistency. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=j8hdRqOUhN.
- Yu et al. [2023] Sihyun Yu, Kihyuk Sohn, Subin Kim, and Jinwoo Shin. Video probabilistic diffusion models in projected latent space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18456–18466, 2023.
- Blattmann et al. [2023] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. ArXiv preprint arXiv:2311.15127, 2023.
- Khachatryan et al. [2023] Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15954–15964, 2023.
- Ramesh et al. [2022] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. ArXiv preprint arXiv:2204.06125, 2022.
- Saharia et al. [2022] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
- Karras et al. [2018] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4396–4405, 2018. URL https://api.semanticscholar.org/CorpusID:54482423.
- Zhang et al. [2023] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
- Ruiz et al. [2023] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023.
- Dalva and Yanardag [2024] Yusuf Dalva and Pinar Yanardag. Noiseclr: A contrastive learning approach for unsupervised discovery of interpretable directions in diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24209–24218, 2024.
- Avrahami et al. [2022] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18208–18218, June 2022.
- Kouzelis et al. [2024] Theodoros Kouzelis, Manos Plitsis, Mihalis A. Nicolaou, and Yannis Panagakis. Enabling local editing in diffusion models by joint and individual component analysis, 2024.
- Couairon et al. [2023] Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord. Diffedit: Diffusion-based semantic image editing with mask guidance. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=3lge0p5o-M-.
- Brack et al. [2023] Manuel Brack, Felix Friedrich, Dominik Hintersdorf, Lukas Struppek, Patrick Schramowski, and Kristian Kersting. SEGA: Instructing text-to-image models using semantic guidance. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=KIPAIy329j.
- Wu et al. [2023] Qiucheng Wu, Yujian Liu, Handong Zhao, Ajinkya Kale, Trung Bui, Tong Yu, Zhe Lin, Yang Zhang, and Shiyu Chang. Uncovering the disentanglement capability in text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1900–1910, 2023.
- Kwon et al. [2023] Mingi Kwon, Jaeseok Jeong, and Youngjung Uh. Diffusion models already have a semantic latent space. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=pd1P2eUBVfq.
- Park et al. [2023a] Yong-Hyun Park, Mingi Kwon, Jaewoong Choi, Junghyo Jo, and Youngjung Uh. Understanding the latent space of diffusion models through the lens of riemannian geometry. In Thirty-seventh Conference on Neural Information Processing Systems, 2023a. URL https://openreview.net/forum?id=VUlYp3jiEI.
- Zhu et al. [2023] Ye Zhu, Yu Wu, Zhiwei Deng, Olga Russakovsky, and Yan Yan. Boundary guided learning-free semantic control with diffusion models. In Conference on Neural Information Processing Systems (NeurIPS), 2023.
- Manor and Michaeli [2024a] Hila Manor and Tomer Michaeli. Zero-shot unsupervised and text-based audio editing using DDPM inversion. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 34603–34629. PMLR, 21–27 Jul 2024a. URL https://proceedings.mlr.press/v235/manor24a.html.
- Manor and Michaeli [2024b] Hila Manor and Tomer Michaeli. On the posterior distribution in denoising: Application to uncertainty quantification. In The Twelfth International Conference on Learning Representations, 2024b. URL https://openreview.net/forum?id=adSGeugiuj.
- Saad [2011] Yousef Saad. Numerical methods for large eigenvalue problems: revised edition. SIAM, 2011.
- Park et al. [2023b] Yong-Hyun Park, Mingi Kwon, Jaewoong Choi, Junghyo Jo, and Youngjung Uh. Understanding the latent space of diffusion models through the lens of riemannian geometry. Advances in Neural Information Processing Systems, 36:24129–24142, 2023b.
- Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- Tong et al. [2024] Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9568–9578, 2024.
- Luo et al. [2023] Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference. ArXiv preprint arXiv:2310.04378, 2023.
- Karras et al. [2022a] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35:26565–26577, 2022a.
- Mokady et al. [2023] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6038–6047, 2023.
- Ho and Salimans [2021] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021. URL https://openreview.net/forum?id=qw8AKxfYbI.
- Lu et al. [2022] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775–5787, 2022.
- Karras et al. [2022b] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35:26565–26577, 2022b.
- Zhang et al. [2024b] Huijie Zhang, Jinfan Zhou, Yifu Lu, Minzhe Guo, Peng Wang, Liyue Shen, and Qing Qu. The emergence of reproducibility and consistency in diffusion models. In Forty-first International Conference on Machine Learning, 2024b. URL https://openreview.net/forum?id=HsliOqZkc0.
- Luo [2022] Calvin Luo. Understanding diffusion models: A unified perspective. ArXiv preprint arXiv:2208.11970, 2022.
- Kim et al. [2022] Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. Diffusionclip: Text-guided diffusion models for robust image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2426–2435, 2022.
- Haas et al. [2024a] René Haas, Inbar Huberman-Spiegelglas, Rotem Mulayoff, and Tomer Michaeli. Discovering interpretable directions in the semantic latent space of diffusion models. International Conference on Automatic Face and Gesture Recognition, abs/2303.11073, 2024a. URL https://api.semanticscholar.org/CorpusID:257631803.
- Esser et al. [2024] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=FPnUhsQJ5B.
- Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-assisted Intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015.
- Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
- Bao et al. [2023] Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A vit backbone for diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22669–22679, 2023.
- Liu et al. [2015] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pages 3730–3738, 2015.
- Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. Ieee, 2009.
- Schuhmann et al. [2022] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
- Kirillov et al. [2023] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollar, and Ross Girshick. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4015–4026, October 2023.
- Banerjee and Roy [2014] S. Banerjee and A. Roy. Linear Algebra and Matrix Analysis for Statistics. Chapman & Hall/CRC Texts in Statistical Science. CRC Press, 2014. ISBN 9781482248241. URL https://books.google.com/books?id=WDTcBQAAQBAJ.
- Zhu et al. [2021] Jiapeng Zhu, Ruili Feng, Yujun Shen, Deli Zhao, Zhengjun Zha, Jingren Zhou, and Qifeng Chen. Low-rank subspaces in gans. In Neural Information Processing Systems, 2021. URL https://api.semanticscholar.org/CorpusID:235367855.
- Pope et al. [2021] Phil Pope, Chen Zhu, Ahmed Abdelkader, Micah Goldblum, and Tom Goldstein. The intrinsic dimension of images and its impact on learning. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=XJk19XzGq2J.
- Wang and Vastola [2023] Binxu Wang and John J Vastola. The hidden linear structure in score-based models and its application. ArXiv preprint arXiv:2311.10892, 2023.
- Yu et al. [2015] Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. ArXiv, abs/1506.03365, 2015. URL https://api.semanticscholar.org/CorpusID:8317437.
- Nilsback and Zisserman [2008] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages 722–729, 2008. doi: 10.1109/ICVGIP.2008.47.
- Choi et al. [2020] Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. Stargan v2: Diverse image synthesis for multiple domains. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8188–8197, 2020.
- Karras et al. [2019] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
- Zhang et al. [2018] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.
- Wang et al. [2004] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
- Goodfellow et al. [2014] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In Neural Information Processing Systems, 2014. URL https://api.semanticscholar.org/CorpusID:261560300.
- Haas et al. [2024b] René Haas, Inbar Huberman-Spiegelglas, Rotem Mulayoff, and Tomer Michaeli. Discovering interpretable directions in the semantic latent space of diffusion models. International Conference on Automatic Face and Gesture Recognition, abs/2303.11073, 2024b. URL https://api.semanticscholar.org/CorpusID:257631803.
- Hertz et al. [2023] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-or. Prompt-to-prompt image editing with cross-attention control. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=_CDixzkzeyb.
- Wang et al. [2023a] Qian Wang, Biao Zhang, Michael Birsak, and Peter Wonka. Instructedit: Improving automatic masks for diffusion-based image editing with user instructions. ArXiv, abs/2305.18047, 2023a. URL https://api.semanticscholar.org/CorpusID:258959425.
- Brooks et al. [2022] Tim Brooks, Aleksander Holynski, and Alexei A. Efros. Instructpix2pix: Learning to follow image editing instructions. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18392–18402, 2022. URL https://api.semanticscholar.org/CorpusID:253581213.
- Li et al. [2024b] Shanglin Li, Bo-Wen Zeng, Yutang Feng, Sicheng Gao, Xuhui Liu, Jiaming Liu, Li Lin, Xu Tang, Yao Hu, Jianzhuang Liu, and Baochang Zhang. Zone: Zero-shot instruction-guided local editing. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024b.
- Wang et al. [2024] Peng Wang, Huikang Liu, Druv Pai, Yaodong Yu, Zhihui Zhu, Qing Qu, and Yi Ma. A global geometric analysis of maximal coding rate reduction. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=u9qmjV2khT.
- Hu et al. [2022] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9.
- Yaras et al. [2024] Can Yaras, Peng Wang, Laura Balzano, and Qing Qu. Compressible dynamics in deep overparameterized low-rank learning & adaptation. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=uDkXoZMzBv.
- Fuest et al. [2024] Michael Fuest, Pingchuan Ma, Ming Gui, Johannes S Fischer, Vincent Tao Hu, and Bjorn Ommer. Diffusion models and representation learning: A survey. ArXiv preprint arXiv:2407.00783, 2024.
- Li et al. [2024c] Xiang Li, Yixiang Dai, and Qing Qu. Understanding generalizability of diffusion models requires rethinking the hidden gaussian structure. 2024c.
- Wang et al. [2023b] Peng Wang, Xiao Li, Can Yaras, Zhihui Zhu, Laura Balzano, Wei Hu, and Qing Qu. Understanding deep representation learning via layerwise feature compression and discrimination. ArXiv preprint arXiv:2311.02960, 2023b.
- Luo et al. [2024] Jinqi Luo, Tianjiao Ding, Kwan Ho Ryan Chan, Darshan Thaker, Aditya Chattopadhyay, Chris Callison-Burch, and Rene Vidal. Pace: Parsimonious concept engineering for large language models. ArXiv preprint arXiv:2406.04331, 2024.
- Efron [2011] Bradley Efron. Tweedie’s formula and selection bias. Journal of the American Statistical Association, 106(496):1602–1614, 2011.
- Woodbury [1950] Max A. Woodbury. Inverting modified matrices. In Memorandum Rept. 42, Statistical Research Group, page 4. Princeton Univ., 1950.
- Davis and Kahan [1970] Chandler Davis and William Morton Kahan. The rotation of eigenvectors by a perturbation. iii. SIAM Journal on Numerical Analysis, 7(1):1–46, 1970.
- Karras et al. [2020] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546.
- Choi et al. [2022] Jooyoung Choi, Jungbeom Lee, Chaehun Shin, Sungwon Kim, Hyunwoo J. Kim, and Sung-Hoon Yoon. Perception prioritized training of diffusion models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11462–11471, 2022. URL https://api.semanticscholar.org/CorpusID:247922317.
- Karras et al. [2022c] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022c. URL https://openreview.net/forum?id=k7FuTOWMOc7.
Organization.
In Appendix A, we provide the experiment details and more results for the empirical study on low-rankness and local linearity. In Appendix B, we show extra details of LOCO Edit and T-LOCO Edit. In Appendix C, we present the proofs for Section 4. In Appendix D, we discuss image editing experiment details.
Appendix A More Empirical Study on Low-rankness & Local Linearity
A.1 Experiment Setup for Section 3.1
We evaluate the numerical rank of the denoiser function for DDPM (U-Net [49] architecture) on the CIFAR-10 dataset [50] (), U-ViT [51] (a Transformer-based network) on the CelebA [52] () and ImageNet [53] () datasets, and DeepFloyd IF [19] trained on the LAION-5B [54] dataset (). Notably, the U-ViT architecture uses an autoencoder to compress the image to an embedding vector , and adds noise to for the diffusion forward process; the reverse process replaces with in Equation 1, and the generated image . The PMP defined for U-ViT is:
(9) |
The for defined above. For DeepFloyd IF, there are three diffusion models, one for generation and the other two for super-resolution. Here we only evaluate the diffusion model that generates the images.
Given a random initial noise , the diffusion model generates an image sequence following the reverse sampler in Equation 1. Along the sampling trajectory , for each , we calculate and compute its numerical rank via
(10) |
where denotes the th largest singular value of . In our experiments, we set . We randomly generate initial noise ( for U-ViT). We use only one prompt for DeepFloyd IF. We use DDIM with 100 steps for DDPM and DeepFloyd IF and DPM-Solver with 20 steps for U-ViT, select some of the steps to calculate , and report the averaged rank in Figure 2. To report the norm ratio and cosine similarity, we select the closest to 0.7 along the sampling trajectory and report the results in Figure 2, i.e., for DDPM, for U-ViT, and for DeepFloyd IF. The norm ratio and cosine similarity are also averaged over 15 samples.
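The rank computation described above can be sketched as follows. Because the exact form of Equation 10 and the threshold value are not reproduced here, the snippet assumes one common convention: the numerical rank is the smallest r whose leading singular values capture a fraction `eta**2` of the squared spectral energy. The function name `numerical_rank`, the default `eta`, and the toy Jacobian are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def numerical_rank(J, eta=0.99):
    """Smallest r whose top-r singular values hold eta**2 of the squared energy.

    Assumed convention; the rule in Equation 10 may be stated differently.
    """
    s = np.linalg.svd(J, compute_uv=False)    # singular values, descending
    energy = np.cumsum(s**2) / np.sum(s**2)   # cumulative energy fraction
    r = int(np.searchsorted(energy, eta**2)) + 1
    return min(r, len(s))                     # guard against rounding at the tail

# Toy Jacobian: a rank-3 signal plus a tiny full-rank perturbation.
rng = np.random.default_rng(0)
A = rng.standard_normal((64, 3))
B = rng.standard_normal((3, 64))
J_toy = A @ B + 1e-6 * rng.standard_normal((64, 64))

rank_ratio = numerical_rank(J_toy) / J_toy.shape[0]  # rank divided by dimension
```

For a real denoiser the Jacobian would come from automatic differentiation at a chosen timestep, and the resulting rank would be averaged over several noise samples as described above.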
A.2 More Experiments for Section 3.1
We illustrate the norm ratio and cosine similarity for more timesteps in Figure 10, and for more text prompts and a flow-matching-based diffusion model in Figure 11. More specifically, for the plot of , we use exactly for DDPM, for U-ViT, and for DeepFloyd IF; for the plot of , we use exactly for DDPM, for U-ViT, and for DeepFloyd IF. The results align with Theorem 1: when is closer to 1, the linearity of is better.
A.3 Comparison of Low-rankness & Local Linearity for Different Manifolds
This section is an extension of Section 3.1. We study the low-rankness and local linearity of more mappings between the spaces of diffusion models. The sampling process of a diffusion model involves the following spaces: , , , , where is the h-space of the U-Net’s bottleneck features [29] and is the predicted noise space. First, we explore the rank ratio and Frobenius norm of the Jacobians: . We use DDPM with the U-Net architecture, trained on the CIFAR-10 dataset; other experiment settings are the same as in Section A.1. Results are shown in Figure 12. The conclusions can be summarized as follows:
- are low-rank Jacobians when . As shown on the left of Figure 12, the rank ratio for is less than 0.1. It should be noted that:
  - . This is because
    Therefore, is high rank when is low rank.
  - This is because and
- When is fixed, changes little when changing . As shown on the right of Figure 12, and . This means that when is fixed, changes little when changing .
Then, we also study the linearity of and given , using DDPM with the U-Net architecture trained on the CIFAR-10 dataset. We vary the step size defined in Equation 4. Results are shown in Figure 13: both and have good linearity with respect to .
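A linearity check of this kind can be sketched as below. The snippet assumes that linearity is measured by comparing the mapping evaluated at a perturbed input with its first-order expansion, using a norm ratio and a cosine similarity; the precise definitions tied to Equation 4 may differ. The helper names `jvp_fd` and `linearity_metrics`, the finite-difference Jacobian-vector product, and the toy mapping `f_toy` are all illustrative assumptions.

```python
import numpy as np

def jvp_fd(f, x, v, h=1e-4):
    """Central finite-difference approximation of J_f(x) @ v."""
    return (f(x + h * v) - f(x - h * v)) / (2.0 * h)

def linearity_metrics(f, x, v, lam):
    """Compare f(x + lam*v) with the expansion f(x) + lam * J_f(x) v.

    Returns (norm_ratio, cosine_similarity); both equal 1 for a linear map.
    """
    exact = f(x + lam * v)
    linearized = f(x) + lam * jvp_fd(f, x, v)
    norm_ratio = np.linalg.norm(linearized) / np.linalg.norm(exact)
    cosine = (exact @ linearized) / (
        np.linalg.norm(exact) * np.linalg.norm(linearized))
    return norm_ratio, cosine

# Toy stand-in for a learned mapping: a mildly nonlinear function.
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 16)) / 4.0
f_toy = lambda z: np.tanh(W @ z)
x = rng.standard_normal(16)
v = rng.standard_normal(16)
v /= np.linalg.norm(v)  # unit-norm perturbation direction

# Sweep the step size, as in the experiment described above.
results = {lam: linearity_metrics(f_toy, x, v, lam) for lam in (0.01, 0.1, 1.0)}
```

With a real model, `jvp_fd` would normally be replaced by an automatic-differentiation Jacobian-vector product, which avoids the choice of `h`.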
In Theorem 1, the Jacobian is a symmetric matrix. Therefore, we also verify the symmetry of the Jacobian of the PMP . We use DDPM with the U-Net architecture trained on the CIFAR-10 dataset. At different timesteps , we measure . Results are shown on the right of Figure 13. has good symmetry when and . Additionally, is low rank when . These observations align with Theorem 1.
Finally, based on the experiments in Figure 12 and Figure 13, we select the best space for our image editing method. is a high-rank matrix and is not suitable for efficiently estimating the nullspace; and has too small a Frobenius norm to edit the image. Therefore, only and are low-rank and linear enough for image editing. Moreover, the space is restricted to the U-Net architecture, whereas the property of does not depend on the U-Net architecture and is verified in diffusion models using transformer architectures. Additionally, we can apply masks on but not on . Therefore, the PMP is the best mapping for image editing.
Appendix B Extra Details of LOCO Edit and T-LOCO Edit
B.1 Generalized Power Method
The Generalized Power Method [34, 30] for calculating the top- singular vectors of the Jacobian is summarized in Algorithm 2. It efficiently computes the top- singular values and singular vectors of the Jacobian with a randomly initialized orthonormal .
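Since Algorithm 2 is not reproduced in this section, the following is a minimal sketch of a standard subspace (block power) iteration in the same spirit; it may differ from Algorithm 2 in its details. It assumes the Jacobian is only accessed through products with blocks of vectors, represented here by the hypothetical callables `jvp` and `vjp`, and it finishes with a small SVD to separate the individual singular triplets.

```python
import numpy as np

def top_k_singular_subspace(jvp, vjp, n, k, iters=50, seed=0):
    """Subspace (block power) iteration for the top-k singular triplets of J.

    jvp(V) must return J @ V and vjp(U) must return J.T @ U, so J itself is
    never formed explicitly.
    """
    rng = np.random.default_rng(seed)
    V, _ = np.linalg.qr(rng.standard_normal((n, k)))  # random orthonormal start
    for _ in range(iters):
        U = jvp(V)                      # push the basis through J
        V, _ = np.linalg.qr(vjp(U))     # pull back through J^T, re-orthonormalize
    # Rayleigh-Ritz step: small SVD of J restricted to span(V).
    U_small, S, Wt = np.linalg.svd(jvp(V), full_matrices=False)
    return U_small, S, V @ Wt.T

# Toy Jacobian with a clear spectral gap after the third singular value.
rng = np.random.default_rng(1)
Q1, _ = np.linalg.qr(rng.standard_normal((40, 40)))
Q2, _ = np.linalg.qr(rng.standard_normal((40, 40)))
spectrum = np.concatenate([[10.0, 8.0, 6.0], np.full(37, 0.1)])
J_toy = Q1 @ np.diag(spectrum) @ Q2.T

U_k, S_k, V_k = top_k_singular_subspace(
    lambda V: J_toy @ V, lambda U: J_toy.T @ U, n=40, k=3)
```

For a neural network, the two callables would typically be Jacobian-vector and vector-Jacobian products from an automatic-differentiation library, so each iteration costs a few forward and backward passes rather than a full Jacobian.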
B.2 Unsupervised T-LOCO Edit
The overall method for DeepFloyd is summarized in Algorithm 3. For T2I diffusion models in the latent space such as Stable Diffusion and Latent Consistency Model, at time , we additionally decode into the image space to enable masking and nullspace projection. The editing is still in the space of .
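The masking and nullspace projection mentioned above can be illustrated with a small sketch. It assumes that the editing direction is taken from the Jacobian restricted to a region of interest and is then projected away from the leading right singular vectors of the Jacobian restricted to the remaining region; Algorithm 3 is not reproduced here and may differ. The names `mask`, `V_bar`, `r`, and the toy Jacobian are hypothetical.

```python
import numpy as np

def nullspace_project(v, V_bar):
    """Remove from v its component in span(V_bar); V_bar has orthonormal columns."""
    return v - V_bar @ (V_bar.T @ v)

rng = np.random.default_rng(0)
d, r = 32, 4
# Toy low-rank Jacobian (rows: output coordinates, columns: input coordinates).
J = rng.standard_normal((d, 6)) @ rng.standard_normal((6, d))

mask = np.zeros(d, dtype=bool)
mask[:8] = True  # hypothetical region of interest in the output

# Direction that most changes the outputs inside the mask.
_, _, Vt_in = np.linalg.svd(J[mask], full_matrices=False)
v = Vt_in[0]

# Leading directions that change the outputs outside the mask.
_, _, Vt_out = np.linalg.svd(J[~mask], full_matrices=False)
V_bar = Vt_out[:r].T

v_p = nullspace_project(v, V_bar)

# First-order change outside the mask, before and after the projection.
outside_before = np.linalg.norm(J[~mask] @ v)
outside_after = np.linalg.norm(J[~mask] @ v_p)
```

In practice the projected direction would usually be renormalized before being scaled by the editing strength.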
B.3 Text-supervised T-LOCO Edit
Before introducing the algorithm, we define:
(11) |
and
(12) |
to be the posterior mean predictors when using classifier-free guidance on the original prompt , and both the original prompt and the edit prompt accordingly.
Algorithm.
The overall method for DeepFloyd is summarized in Algorithm 4. For T2I diffusion models in the latent space such as Stable Diffusion and Latent Consistency Model, at time , we additionally decode into the image space to enable masking and nullspace projection. The editing is in the space of for Stable Diffusion and Latent Consistency Model. We do not present this method as one that outperforms other T2I editing methods, but as a way both to understand semantic correspondences in the low-rank subspaces of T2I diffusion models and to utilize these subspaces for semantic control in a more interpretable way. We hope it inspires new directions in understanding T2I diffusion models and in applying that understanding to versatile applications.
Here, we want to find a specific change direction in the space that can provide target edited images in the space of by directly moving along : the whole generation is not conditioned on at all, except that we utilize in finding the editing direction . This is in contrast to the method proposed in [30], where additional semantic information is injected via indirect x-space guidance conditioned on the edit prompt at time . We hope to discover an editing direction that is expressive enough by itself to perform semantic editing.
Intuition.
Let be the estimated posterior mean conditioned on the original prompt , and be the estimated posterior mean conditioned on both the original prompt and the edit prompt . Let and be their respective Jacobians over the noisy image . The key intuitions, inspired by the unconditional case, are: i) the target editing direction in the space is homogeneous between the subspaces in and ; ii) the editing direction found can effectively reside in the direction of a right singular vector for both and ; iii) and are locally linear.
Define as the change of estimated posterior mean. Let , then for some . Besides, we have and due to homogeneity and linearity. Hence, and then , which is along the desired direction . And this identified through the subspace in can be effectively transferred in for controlling the editing of target semantics. We further apply nullspace projection based on to obtain the final editing direction .
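The linear-algebra fact behind this argument can be checked numerically. The sketch below is one reading of the intuition, not the procedure of Algorithm 4: it assumes the change of the posterior mean is a scaled left singular vector of the Jacobian, and shows that applying the transposed Jacobian to that change returns a vector along the matching right singular vector. All names (`J_o`, `alpha`, `i`) and the toy matrices are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, k = 24, 5

# Toy Jacobian J_o = U diag(S) V^T with orthonormal U, V and distinct singular values.
U, _ = np.linalg.qr(rng.standard_normal((dim, k)))
V, _ = np.linalg.qr(rng.standard_normal((dim, k)))
S = np.array([5.0, 4.0, 3.0, 2.0, 1.0])
J_o = U @ np.diag(S) @ V.T

# Assumed posterior-mean change: moving the input by alpha * V[:, i]
# changes the output by alpha * S[i] * U[:, i] under local linearity.
i, alpha = 2, 0.7
d = alpha * S[i] * U[:, i]

# Pulling the change back through J_o^T gives alpha * S[i]**2 * V[:, i].
v_raw = J_o.T @ d
v_hat = v_raw / np.linalg.norm(v_raw)

alignment = abs(v_hat @ V[:, i])  # 1 when v_hat is parallel to V[:, i]
```

With a real model, the product with the transposed Jacobian would be computed as a vector-Jacobian product rather than by forming the Jacobian.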
Appendix C Proofs in Section 4
C.1 Proofs of Lemma 1
Proof of Lemma 1.
Under Assumption 1, we can calculate the noised distribution at any timestep ,
Because , . From the relationship between conditional Gaussian distribution and marginal Gaussian distribution, it is easy to show that
Then, we have
Next, we compute the score function as follows:
C.2 Proofs of Theorem 1
Lemma 2.
The Jacobian of the posterior mean is
(14) | ||||
where
Proof of Lemma 2.
Let , so we have:
So:
∎
Lemma 3.
Assume the second-order partial derivatives of exist for any ; then the posterior mean satisfies .
Proof of Lemma 3.
Taking the gradient of both sides of Equation 13 with respect to , and because the second-order partial derivatives of exist for any , we have:
The Hessian of is symmetric, so we have:
Notably, the symmetry of holds without Assumption 1. ∎
Proof of Theorem 1.
First, let’s prove the low-rankness of the posterior mean. From Lemma 2,
where the second equation is obtained due to the fact that . Therefore, we have:
(15) | ||||
Then, we prove the linearity:
where the first equation plugs in the formula of and the second equation uses the mean value theorem , .
where the third inequality uses the fact that , we simplify as in this proof, and defined in the last inequality is independent of . Similarly, we can prove that:
Here, . Plugging this into ①, we obtain:
Finally, let’s prove the property of the left singular vector of :
From Lemma 3, the eigenvalue decomposition of could be written as , where , and the relation between eigenvalue decomposition and singular value decomposition of could be summarized as for all :
where is the sign function. Therefore, we have:
(16) |
given . From Lemma 2, we define:
From the full singular value decomposition of and :
where:
From Equation 15, we know that . It is easy to show that:
where satisfies . And . Based on the Davis-Kahan theorem [81], we have:
.
Because , , so:
And from Equation 16, we have:
∎
Appendix D Image Editing Experiment Details
D.1 Editing in Unconditional Diffusion Models of Different Datasets
Datasets.
Models.
Following [30], we use DDPM [1] for CelebA-HQ and LSUN-church, and DDPM trained with P2 weighting [83] for FFHQ, AFHQ, Flowers, and MetFaces. We download the official pre-trained checkpoints of resolution , and keep all model parameters frozen. We use the same linear schedule, including 100 DDIM inversion steps [3], as [30]. Further, we apply quality boosting after as proposed in [84].
Edit Time Steps.
We empirically choose the edit time step for different datasets in the range . In practice, we found that time steps within this range give similar editing results. In most of the experiments, the edit time steps chosen are: for FFHQ, for CelebA-HQ and LSUN-church, and for AFHQ, Flowers, and MetFaces.
Editing Strength.
In the empirical study of local linearity, we observed that the local linearity is well-preserved even with a strength of . In practice, we choose the edit strength in the range of , where a larger leads to stronger semantic editing and a negative leads to the change of semantics in the opposite direction.
D.2 Comparing with Alternative Manifolds and Methods
Existing Methods
Alternative Manifolds.
There are two alternative manifolds where similar training-free approaches can be applied; the alternatives involve evaluating the Jacobians (equivalently ) and , respectively.
- (or equivalently up to a scale) calculates the Jacobian of the noise residual with respect to the bottleneck feature of .
- calculates the Jacobian of the noise residual with respect to the input .
Notably, produces hardly noticeable editing results on images, and hence we present the editing results of . Besides, with masking and nullspace projection, also leads to hardly noticeable changes in images; thus the final comparison is without masking and nullspace projection.
Evaluation Dataset Setup.
In human evaluation, for each method, we randomly select editing directions on images. Each direction is transferred to other images along both the negative and positive directions, for a total of transferability test cases. Learning time and transfer edit time are averaged over 100 examples. LPIPS [64] and SSIM [65] are calculated over images for each method.
Human Evaluation Metrics.
We measure both Local Edit Success Rate and Transfer Success Rate via human evaluation on CelebA-HQ. i) Local Edit Success Rate: the subject is given the source image together with the edited one; if the subject judges that only one major feature among {"eyes", "nose", "hair", "skin", "mouth", "views", "Eyebrows"} is edited, the subject responds with a success, otherwise a failure. ii) Transfer Success Rate: the subject is given the source image with its edited version, and another image with its edited version obtained by transferring the editing direction from the source image. The subject responds with a success if the two edited images have the same features changed, otherwise a failure. We calculate the average success rate among all subjects for both Local Edit Success Rate and Transfer Success Rate. Lastly, we have ensured that no harmful content is generated or presented to the human subjects.
Learning Time.
Learning time measures the time it takes to compute a local basis (training-free approaches), to train an implicit function, or to optimize certain variables that help achieve editing for a specific edit method.
D.3 Editing in T2I Diffusion Models
Models.
We generalize our method to three types of T2I diffusion models: DeepFloyd [19], Stable Diffusion [4], and Latent Consistency Model [38]. We download the official checkpoints and keep all model parameters frozen. The same scheduling as that in the unconditional models is applied to DeepFloyd and Stable Diffusion, except that no quality boosting is applied. We follow the original schedule for Latent Consistency Model [38] with the number of inference steps set as .
Edit Time Steps.
We empirically choose the edit time step as for DeepFloyd and for Stable Diffusion. As for Latent Consistency Model, image editing is performed at the second inference step.
Editing Strength.
For unsupervised image editing, we choose in Stable Diffusion, in DeepFloyd, and in Latent Consistency Model. For text-supervised image editing, we choose in Stable Diffusion, in DeepFloyd, and in Latent Consistency Model.