Exploring Low-Dimensional Subspaces in Diffusion Models for Controllable Image Editing

Siyi Chen*, Huijie Zhang*, Minzhe Guo, Yifu Lu, Peng Wang, Qing Qu
Department of Electrical Engineering & Computer Science, University of Michigan
*The first two authors contributed to this work equally.
(September 10, 2024)
Abstract

Recently, diffusion models have emerged as a powerful class of generative models. Despite their success, there is still limited understanding of their semantic spaces. This makes it challenging to achieve precise and disentangled image generation without additional training, especially in an unsupervised way. In this work, we improve the understanding of their semantic spaces through two intriguing observations: within a certain range of noise levels, (1) the learned posterior mean predictor (PMP) in the diffusion model is locally linear, and (2) the singular vectors of its Jacobian lie in low-dimensional semantic subspaces. We provide a solid theoretical basis to justify the linearity and low-rankness of the PMP. These insights allow us to propose an unsupervised, single-step, training-free LOw-rank COntrollable image editing (LOCO Edit) method for precise local editing in diffusion models. LOCO Edit identifies editing directions with desirable properties: homogeneity, transferability, composability, and linearity. These properties benefit greatly from the low-dimensional semantic subspace. Our method can further be extended to unsupervised or text-supervised editing in various text-to-image diffusion models (T-LOCO Edit). Finally, extensive empirical experiments demonstrate the effectiveness and efficiency of LOCO Edit. The code will be released at https://github.com/ChicyChen/LOCO-Edit.

Key words: diffusion model, controlled generation, precise local edit, low-rank, interpretability

Figure 1: LOCO Edit. (a) Precise and localized image editing: the proposed method can perform precise localized editing in the region of interest. The editing direction is (b) homogeneous and transferable, (c) composable with other disentangled directions, and (d) linear.

1 Introduction

Recently, diffusion models have emerged as a powerful new family of deep generative models with remarkable performance in many applications, such as image generation across various domains [1, 2, 3, 4, 5, 6], audio synthesis [7, 8], solving inverse problems [9, 10, 11, 12, 13, 14], and video generation [15, 16, 17]. For example, recent advances in AI-based image generation, driven by diffusion models such as DALL-E 2 [18], Imagen [19], and Stable Diffusion [4], have taken the world of “AI art generation” by storm, enabling the generation of images directly from descriptive text inputs. These models corrupt images by adding noise over multiple steps of a forward process and then generate samples by progressively denoising over multiple steps of the reverse generative process.

Although modern diffusion models are capable of generating photorealistic images from text prompts, manipulating the content they generate remains challenging in practice. Unlike for generative adversarial networks [20], the understanding of semantic spaces in diffusion models is still limited. Thus, achieving disentangled and localized control over content generation by directly manipulating the semantic spaces remains a difficult task for diffusion models. Although effective, some existing editing methods for diffusion models demand additional training procedures and are limited to global control of content generation [21, 22, 23]. Some methods are training-free or localized but are still based on heuristics, lack clear mathematical interpretations, or support text-supervised editing only [24, 25, 26, 27, 28]. Others provide analysis of diffusion models [29, 30, 31, 32, 33], but also have difficulty with local edits such as hair color.

In this study, we address the above problem by studying the low-rank semantic subspaces in diffusion models and proposing the LOw-rank COntrollable edit (LOCO Edit) approach. LOCO is the first local editing method that is single-step, training-free, and free of text supervision, and it has other intriguing properties (see Figure 1 for an illustration). Our method is highly intuitive and theoretically grounded, originating from a simple yet intriguing observation about the learned posterior mean predictor (PMP) in diffusion models: for a large portion of denoising time steps,

The PMP is a locally linear mapping between the noise image and the estimated clean image, and the singular vectors of its Jacobian reside within low-dimensional subspaces.

The empirical evidence in Figure 2 consistently shows that this phenomenon occurs when training diffusion models using different network architectures on a range of real-world image datasets. Theoretically, we validate this observation under the assumption that the data follow a mixture of low-rank Gaussian distributions: we prove the local linearity of the PMP, the low-rank nature of its Jacobian, and that the singular vectors of the Jacobian span the low-dimensional subspaces.

By utilizing the linearity of the PMP, we can edit within the singular vector subspace of its Jacobian to achieve linear control of the image content with no label or text supervision. The editing direction can be efficiently computed using the generalized power method (GPM) [30, 34]. Furthermore, we can manipulate specific regions of interest in the image along a disentangled direction through efficient nullspace projection, taking advantage of the low-rank properties of the Jacobian.

Benefits of LOCO Edit.

Compared to existing editing methods (e.g., [29, 35, 23, 24]) based on diffusion models, the proposed LOCO Edit offers several benefits that we highlight below:

  • Precise, single-step, training-free, and unsupervised editing. LOCO enables precise localized editing (Figure 1(a)) in a single timestep without any training. Further, it requires no text supervision based on CLIP [36], thus integrating no intrinsic biases or flaws from CLIP [37]. LOCO is applicable to various diffusion models and datasets (Figure 5).

  • Linear, transferable, and composable editing directions. The identified editing direction is linear, meaning that changes along this direction produce proportional changes in a semantic feature in the image space (Figure 1(d)). These editing directions are homogeneous and can be transferred across various images and noise levels (Figure 1(b)). Moreover, combining disentangled editing directions leads to simultaneous semantic changes in the respective region, while maintaining consistency in other areas (Figure 1(c)).

  • An intuitive and theoretically grounded approach. Unlike previous works, by leveraging the local linearity of the PMP and the low-rankness of its Jacobian, our method is highly interpretable. The identified properties are well supported by both our empirical observation (Figure 2) and theoretical justifications in Section 4.

Moreover, LOCO Edit is generalizable to T-LOCO Edit for T2I diffusion models including DeepFloyd IF [19], Stable Diffusion [4], and Latent Consistency Models [38], with or without text supervision (Figure 4). A more detailed discussion of the relationship with prior art can be found in Section 6.

Notations.

Throughout the paper, we use $\mathcal{X}_{t}\subseteq\mathbb{R}^{d}$ to denote the noise-corrupted image space at time step $t\in[0,1]$. In particular, $\mathcal{X}_{0}$ denotes the clean image space with data distribution $p_{\text{data}}(\bm{x})$, and $\bm{x}_{0}\in\mathcal{X}_{0}$ denotes an image. $\mathcal{X}_{0,t}$ denotes the posterior mean space at time step $t\in(0,1]$. Here, $\mathbb{S}^{d-1}$ denotes the unit hypersphere in $\mathbb{R}^{d}$, and $\mathrm{St}(d,r):=\{\bm{Z}\in\mathbb{R}^{d\times r}\mid\bm{Z}^{\top}\bm{Z}=\bm{I}_{r}\}$ denotes the Stiefel manifold. $\widetilde{\operatorname{rank}}(\bm{A})$ denotes the numerical rank of $\bm{A}$. $\mathbb{E}_{\bm{x}_{0}\sim p_{\rm data}(\bm{x})}[\bm{x}_{0}|\bm{x}_{t}]$ denotes the posterior mean and is written as $\mathbb{E}[\bm{x}_{0}|\bm{x}_{t}]$. $\operatorname{range}(\bm{A})$ denotes the span of the columns of $\bm{A}$. $\operatorname{null}(\bm{A})$ denotes the set of all solutions to the equation $\bm{A}\bm{x}=0$. $\operatorname{proj}_{\operatorname{null}(\bm{A})}(\bm{x})$ denotes the projection of $\bm{x}$ onto $\operatorname{null}(\bm{A})$.
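As a small illustration of the last piece of notation, and assuming the projection is the orthogonal one, the projection onto the nullspace of a matrix can be computed with the Moore-Penrose pseudoinverse as x - A^+ A x. The sketch below uses NumPy on a random toy matrix; the function name `proj_null` is hypothetical and is not taken from the paper's code.

```python
import numpy as np

def proj_null(A: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Orthogonal projection of x onto null(A), computed as x - A^+ A x."""
    return x - np.linalg.pinv(A) @ (A @ x)

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 8))   # toy matrix; generically its nullspace has dimension 5
x = rng.standard_normal(8)
x_null = proj_null(A, x)          # component of x that A maps to zero
```

Applying `proj_null` twice gives the same vector as applying it once, which is the defining property of a projection.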

Organization.

In Section 2, we introduce preliminaries on diffusion models. In Section 3, we present the exploration of local linearity and low-rankness in PMP, the method intuition based on the insights, and method details of LOCO and T-LOCO Edits. In Section 4, we theoretically justify the low-rankness, linearity, and semantic subspace in the PMP. In Section 5, we conduct comprehensive experiments to demonstrate the superiority of LOCO Edit and investigate its robustness. In Section 6, we thoroughly discuss related works. In Section 7, we conclude with potential future directions.

2 Preliminaries on Diffusion Models

In this section, we start by reviewing the basics of diffusion models [1, 2, 39], followed by several key techniques that will be used in our approach, such as Denoising Diffusion Implicit Models (DDIM) [3] and its inversion [40], T2I diffusion model, and classifier-free guidance [41].

Basics of Diffusion Models.

In general, diffusion models consist of two processes:

  • The forward diffusion process. The forward process progressively perturbs the original data $\bm{x}_{0}$ to a noisy sample $\bm{x}_{t}$ for $t\in[0,1]$ with Gaussian noise. As in [1], this can be characterized by a conditional Gaussian distribution $p_{t}(\bm{x}_{t}|\bm{x}_{0})=\mathcal{N}(\bm{x}_{t};\sqrt{\alpha_{t}}\bm{x}_{0},(1-\alpha_{t})\textbf{I}_{d})$. In particular, the parameters $\{\alpha_{t}\}_{t=0}^{1}$ satisfy: (i) $\alpha_{0}=1$, and thus $p_{0}=p_{\rm data}$, and (ii) $\alpha_{1}=0$, and thus $p_{1}=\mathcal{N}(\bm{0},\textbf{I}_{d})$.

  • The reverse sampling process. To generate a new sample, previous works [1, 3, 42, 43] have proposed various methods to approximate the reverse process of diffusion models. Typically, these methods involve estimating the noise $\bm{\epsilon}_{t}$ and removing the estimated noise from $\bm{x}_{t}$ recursively to obtain an estimate of $\bm{x}_{0}$. Specifically, the sampling step from $\bm{x}_{t}$ to $\bm{x}_{t-\Delta t}$ with a small $\Delta t>0$ can be described as:

    $$\bm{x}_{t-\Delta t}=\sqrt{\alpha_{t-\Delta t}}\left(\frac{\bm{x}_{t}-\sqrt{1-\alpha_{t}}\bm{\epsilon}_{\bm{\theta}}(\bm{x}_{t},t)}{\sqrt{\alpha_{t}}}\right)+\sqrt{1-\alpha_{t-\Delta t}}\bm{\epsilon}_{\bm{\theta}}(\bm{x}_{t},t), \tag{1}$$

    where $\bm{\epsilon}_{\bm{\theta}}(\bm{x}_{t},t)$ is parameterized by a neural network and trained to predict the noise at time $t$.
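The two processes above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the cosine schedule `alpha` and the oracle denoiser (which returns the exact noise that was added) are hypothetical stand-ins, not the schedule or network used in the paper.

```python
import numpy as np

def alpha(t: float) -> float:
    """Toy noise schedule (an assumption): decreases from 1 at t=0 to ~0 at t=1."""
    return float(np.cos(0.5 * np.pi * t) ** 2)

def forward_noise(x0, t, eps):
    """Draw x_t from N(sqrt(alpha_t) x0, (1 - alpha_t) I) using the noise eps."""
    a = alpha(t)
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

def ddim_step(x_t, t, dt, eps_model):
    """One reverse step x_t -> x_{t - dt} following Equation (1)."""
    a_t, a_prev = alpha(t), alpha(t - dt)
    eps_hat = eps_model(x_t, t)
    x0_hat = (x_t - np.sqrt(1.0 - a_t) * eps_hat) / np.sqrt(a_t)
    return np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps_hat

rng = np.random.default_rng(0)
x0 = rng.standard_normal(16)
eps = rng.standard_normal(16)
t, dt = 0.6, 0.1

x_t = forward_noise(x0, t, eps)
# Oracle denoiser (hypothetical): returns the exact noise that was added.
x_prev = ddim_step(x_t, t, dt, lambda x, s: eps)
```

With the oracle denoiser, the reverse step lands exactly on the sample that the forward process would produce at time t - dt from the same noise, which is a convenient sanity check on an implementation of Equation (1).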

Denoiser and Posterior Mean Predictor (PMP).

According to [1], the denoiser $\bm{\epsilon}_{\bm{\theta}}(\bm{x}_{t},t)$ is optimized by solving the following problem:

$$\min_{\bm{\theta}}\ell(\bm{\theta})\coloneqq\mathbb{E}_{t\sim[0,1],\,\bm{x}_{t}\sim p_{t}(\bm{x}_{t}|\bm{x}_{0}),\,\bm{\epsilon}\sim\mathcal{N}(\textbf{0},\textbf{I})}\left[\|\bm{\epsilon}_{\bm{\theta}}(\bm{x}_{t},t)-\bm{\epsilon}\|_{2}^{2}\right],$$

where $\bm{\theta}$ denotes the network parameters of the denoiser. Once $\bm{\epsilon}_{\bm{\theta}}$ is well trained, recent studies [44, 45] show that the posterior mean $\mathbb{E}[\bm{x}_{0}|\bm{x}_{t}]$, i.e., the predicted clean image at time $t$, can be estimated as follows:

$$\hat{\bm{x}}_{0,t}=\bm{f}_{\bm{\theta},t}(\bm{x}_{t};t)\coloneqq\frac{\bm{x}_{t}-\sqrt{1-\alpha_{t}}\bm{\epsilon}_{\bm{\theta}}(\bm{x}_{t},t)}{\sqrt{\alpha_{t}}}. \tag{2}$$

Here, $\bm{f}_{\bm{\theta},t}(\bm{x}_{t};t)$ denotes the posterior mean predictor (PMP) [45, 44], and $\hat{\bm{x}}_{0,t}\in\mathcal{X}_{0,t}$ denotes the estimated posterior mean output from the PMP given $\bm{x}_{t}$ and $t$ as the input. For simplicity, we denote $\bm{f}_{\bm{\theta},t}(\bm{x}_{t};t)$ as $\bm{f}_{\bm{\theta},t}(\bm{x}_{t})$.
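To make Equation (2) concrete, the sketch below evaluates the PMP on a toy problem where the ideal denoiser is known in closed form. It assumes the data are drawn from a single low-rank Gaussian N(0, U U^T) with U in St(d, r); standard Gaussian conditioning then gives E[x0 | x_t] = sqrt(alpha_t) U U^T x_t, so the PMP is linear with a rank-r Jacobian. The names `pmp` and `ideal_eps` are hypothetical, and the toy distribution is chosen for illustration only.

```python
import numpy as np

def pmp(x_t, alpha_t, eps_model):
    """Posterior mean predictor from Equation (2) at a fixed time step."""
    return (x_t - np.sqrt(1.0 - alpha_t) * eps_model(x_t)) / np.sqrt(alpha_t)

rng = np.random.default_rng(0)
d, r, alpha_t = 32, 4, 0.5
U, _ = np.linalg.qr(rng.standard_normal((d, r)))  # orthonormal columns, U in St(d, r)
P = U @ U.T                                       # projector onto range(U)

def ideal_eps(x_t):
    """E[eps | x_t] when x0 ~ N(0, U U^T) (toy assumption, not a trained network)."""
    return (x_t - alpha_t * (P @ x_t)) / np.sqrt(1.0 - alpha_t)

x0 = U @ rng.standard_normal(r)
x_t = np.sqrt(alpha_t) * x0 + np.sqrt(1.0 - alpha_t) * rng.standard_normal(d)
x0_hat = pmp(x_t, alpha_t, ideal_eps)  # equals sqrt(alpha_t) * P @ x_t for this toy
```

In this toy setting the estimate always lies in range(U), which gives a simple picture of how a posterior mean can be confined to a low-dimensional subspace.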

DDIM and DDIM Inversion.

Given a noisy sample $\bm{x}_{t}$ at time $t$, DDIM [3] can generate clean images through multiple denoising steps. Given a clean sample $\bm{x}_{0}$, DDIM inversion [3] can generate a noisy $\bm{x}_{t}$ at time $t$ by adding multiple steps of noise following the reversed trajectory of DDIM. DDIM inversion has been widely used in image editing methods [40, 46, 29, 35, 47, 26] to obtain $\bm{x}_{t}$ from the original $\bm{x}_{0}$ and then perform editing starting from $\bm{x}_{t}$. In our work, after obtaining $\bm{x}_{t}$ from $\bm{x}_{0}$ via DDIM inversion, we edit $\bm{x}_{t}$ to $\bm{x}^{\prime}_{t}$ only at the single time step $t$ with the help of the PMP, and then use DDIM to generate the edited image $\bm{x}^{\prime}_{0}$.

For ease of exposition, for any $t_{1}$ and $t_{2}$ with $t_{2}>t_{1}$, we denote the DDIM operator and its inversion as

$$\bm{x}_{t_{1}}=\texttt{DDIM}(\bm{x}_{t_{2}},t_{1})\quad\text{and}\quad\bm{x}_{t_{2}}=\texttt{DDIM-Inv}(\bm{x}_{t_{1}},t_{2}).$$
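The two operators can be sketched as repeated applications of the same deterministic update, run in opposite time directions. The example below is a toy under explicit assumptions: a cosine schedule, and data drawn from N(0, I), for which the exact noise predictor is sqrt(1 - alpha_t) x_t. The helper names (`alpha`, `eps_model`, `move`) and the step count are illustrative choices, not the paper's settings.

```python
import numpy as np

def alpha(t):
    """Toy noise schedule (an assumption) with alpha_0 = 1."""
    return np.cos(0.5 * np.pi * t) ** 2

def eps_model(x, t):
    """Exact noise predictor when x0 ~ N(0, I) (toy stand-in for a network)."""
    return np.sqrt(1.0 - alpha(t)) * x

def move(x, t_from, t_to):
    """Predict x0 and the noise at t_from, then re-noise to level t_to."""
    a_from, a_to = alpha(t_from), alpha(t_to)
    eps_hat = eps_model(x, t_from)
    x0_hat = (x - np.sqrt(1.0 - a_from) * eps_hat) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0_hat + np.sqrt(1.0 - a_to) * eps_hat

def ddim(x_t2, t2, t1, n_steps=500):
    """x_{t1} = DDIM(x_{t2}, t1) for t2 > t1: denoise in n_steps small steps."""
    ts = np.linspace(t2, t1, n_steps + 1)
    x = x_t2
    for s, s_next in zip(ts[:-1], ts[1:]):
        x = move(x, s, s_next)
    return x

def ddim_inv(x_t1, t1, t2, n_steps=500):
    """x_{t2} = DDIM-Inv(x_{t1}, t2) for t2 > t1: follow the trajectory in reverse."""
    ts = np.linspace(t1, t2, n_steps + 1)
    x = x_t1
    for s, s_next in zip(ts[:-1], ts[1:]):
        x = move(x, s, s_next)
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal(16)
x_t = ddim_inv(x0, 0.0, 0.7)      # invert the clean sample to time t = 0.7
x0_rec = ddim(x_t, 0.7, 0.0)      # denoise back to time 0
rel_err = np.linalg.norm(x0_rec - x0) / np.linalg.norm(x0)
```

The round trip is only approximately the identity, because inversion evaluates the denoiser at the current point rather than the next one; in this toy the relative error shrinks as `n_steps` grows.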

Text-to-image (T2I) Diffusion Models & Classifier-Free Guidance.

So far, our discussion has focused only on unconditional diffusion models. Our approach can also be generalized from unconditional diffusion models to T2I diffusion models [38, 4, 48, 19], which enable controllable generation of an image $\bm{x}_{0}$ guided by a text prompt $c$. In more detail, when training T2I diffusion models, we optimize a conditional denoising function $\bm{\epsilon}_{\bm{\theta}}(\bm{x}_{t},t,c)$. For sampling, we employ a technique called classifier-free guidance [41], which substitutes the unconditional denoiser $\bm{\epsilon}_{\bm{\theta}}(\bm{x}_{t},t)$ in Equation 1 with its conditional counterpart $\bm{\tilde{\epsilon}}_{\bm{\theta}}(\bm{x}_{t},t,c)$, described as follows:

$$\bm{\tilde{\epsilon}}_{\bm{\theta}}(\bm{x}_{t},t,c)=\bm{\epsilon}_{\bm{\theta}}(\bm{x}_{t},t,\varnothing)+\eta\left(\bm{\epsilon}_{\bm{\theta}}(\bm{x}_{t},t,c)-\bm{\epsilon}_{\bm{\theta}}(\bm{x}_{t},t,\varnothing)\right). \tag{3}$$

Here, $\varnothing$ denotes the empty prompt and $\eta>0$ denotes the strength of the classifier-free guidance.
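Equation (3) is a simple extrapolation between two denoiser outputs, as the following sketch shows. The random vectors stand in for the outputs of the empty-prompt and prompted denoiser calls, and the guidance strength 7.5 is an arbitrary illustrative value; both are assumptions made only for this example.

```python
import numpy as np

def cfg_eps(eps_uncond, eps_cond, eta):
    """Classifier-free guidance as in Equation (3)."""
    return eps_uncond + eta * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
eps_uncond = rng.standard_normal(8)  # stand-in for eps_theta(x_t, t, empty prompt)
eps_cond = rng.standard_normal(8)    # stand-in for eps_theta(x_t, t, c)
eps_guided = cfg_eps(eps_uncond, eps_cond, eta=7.5)
```

Setting `eta=1.0` returns the conditional prediction unchanged, while larger values push the prediction further along the direction from the unconditional output to the conditional one.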

3 Exploring Linearity & Low-Dimensionality for Image Editing

In this section, we formally introduce the identified low-rank subspace in diffusion models and the proposed LOCO Edit method with the underlying intuitions. In Section 3.1, we present the benign properties of the PMP that our method utilizes. Following this, in Section 3.3 we provide a detailed description of our method.

Figure 2: Low-rankness of the Jacobian $\bm{J}_{\bm{\theta},t}(\bm{x}_{t})$ and local linearity of the PMP $\bm{f}_{\bm{\theta},t}(\bm{x}_{t})$. We evaluate DDPM (U-Net [49] architecture) on the CIFAR-10 dataset [50], U-ViT [51] (Transformer-based networks) on the CelebA [52] and ImageNet [53] datasets, and DeepFloyd IF [19] trained on the LAION-5B [54] dataset. (a) The rank ratio of $\bm{J}_{\bm{\theta},t}(\bm{x}_{t})$ against timestep $t$. (b) The norm ratio (top) and cosine similarity (bottom) between $\bm{f}_{\bm{\theta},t}(\bm{x}_{t}+\lambda\Delta\bm{x})$ and $\bm{l}_{\bm{\theta}}(\bm{x}_{t};\lambda\Delta\bm{x})$ against step size $\lambda$ at timestep $t=0.7$.

3.1 Local Linearity and Intrinsic Low-Dimensionality in PMP

First, let us delve into the key intuitions behind the proposed LOCO Edit method, which lie in the benign properties of the PMP $\bm{f}_{\bm{\theta},t}(\bm{x}_{t})$. At a given timestep $t\in[0,1]$, let us consider the first-order Taylor expansion of $\bm{f}_{\bm{\theta},t}(\bm{x}_{t}+\lambda\Delta\bm{x})$ at the point $\bm{x}_{t}$:

$$\boxed{\quad\bm{l}_{\bm{\theta}}(\bm{x}_{t};\lambda\Delta\bm{x})\;:=\;\bm{f}_{\bm{\theta},t}(\bm{x}_{t})+\lambda\bm{J}_{\bm{\theta},t}(\bm{x}_{t})\cdot\Delta\bm{x},\quad} \tag{4}$$

where $\Delta\bm{x}\in\mathbb{S}^{d-1}$ is a perturbation direction with unit length, $\lambda\in\mathbb{R}$ is the perturbation strength, and $\bm{J}_{\bm{\theta},t}(\bm{x}_{t})=\nabla_{\bm{x}_{t}}\bm{f}_{\bm{\theta},t}(\bm{x}_{t})$ is the Jacobian of $\bm{f}_{\bm{\theta},t}(\bm{x}_{t})$. Interestingly, we discovered that within a certain range of noise levels, the learned PMP $\bm{f}_{\bm{\theta},t}$ exhibits local linearity, and the singular subspace of its Jacobian $\bm{J}_{\bm{\theta},t}$ is low-dimensional. Notably, these properties are universal across various network architectures (e.g., U-Net and Transformers) and datasets.
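The linearization in Equation (4) only needs a Jacobian-vector product, never the full Jacobian. The sketch below approximates that product with central finite differences on a small smooth function; the map `f` (a tanh layer with random weights) is a hypothetical stand-in for the PMP, and in practice an automatic-differentiation Jacobian-vector product would typically replace the finite difference.

```python
import numpy as np

def jvp_fd(f, x, v, h=1e-5):
    """Approximate J(x) @ v by central finite differences."""
    return (f(x + h * v) - f(x - h * v)) / (2.0 * h)

def linearize(f, x, dx, lam):
    """l(x; lam * dx) = f(x) + lam * J(x) dx, as in Equation (4)."""
    return f(x) + lam * jvp_fd(f, x, dx)

rng = np.random.default_rng(0)
d = 16
W = rng.standard_normal((d, d)) / np.sqrt(d)
f = lambda x: np.tanh(W @ x)          # hypothetical smooth stand-in for the PMP

x = rng.standard_normal(d)
dx = rng.standard_normal(d)
dx /= np.linalg.norm(dx)              # unit-length perturbation direction

# Gap between the true output and its linearization at two step sizes.
err_small = np.linalg.norm(f(x + 1e-3 * dx) - linearize(f, x, dx, 1e-3))
err_large = np.linalg.norm(f(x + 1.0 * dx) - linearize(f, x, dx, 1.0))
```

For a generic nonlinear map such as this one, the gap grows with the step size; the observation reported in this section is that for the learned PMP the gap stays small even at large perturbation strengths.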

We measure the low-rankness with the rank ratio and the local linearity with the norm ratio and cosine similarity. Specifically, (i) the rank ratio is the ratio of $\widetilde{\operatorname{rank}}(\bm{J}_{\bm{\theta},t}(\bm{x}_{t}))$ to the ambient dimension $d$; (ii) the norm ratio is the ratio of $\|\bm{f}_{\bm{\theta},t}(\bm{x}_{t}+\lambda\Delta\bm{x})\|_{2}$ to $\|\bm{l}_{\bm{\theta}}(\bm{x}_{t};\lambda\Delta\bm{x})\|_{2}$; (iii) the cosine similarity is computed between $\bm{f}_{\bm{\theta},t}(\bm{x}_{t}+\lambda\Delta\bm{x})$ and $\bm{l}_{\bm{\theta}}(\bm{x}_{t};\lambda\Delta\bm{x})$. The detailed experiment settings are provided in Section A.1, and results are illustrated in Figure 2, from which we observe:

  • Low-rankness of the Jacobian $\bm{J}_{\bm{\theta},t}(\bm{x}_t)$. As shown in Figure 2(a), the rank ratio for $t\in[0,1]$ consistently displays a U-shaped pattern across various network architectures and datasets: (i) it is close to $1$ near either the pure noise $t=1$ or the clean image $t=0$; (ii) $\bm{J}_{\bm{\theta},t}(\bm{x}_t)$ is low-rank (i.e., rank ratio less than $10^{-1}$) for all diffusion models within the range $t\in[0.2,0.7]$; (iii) it achieves its lowest value around a mid-to-late timestep, which differs slightly depending on the architecture and dataset.

  • Local linearity of the PMP $\bm{f}_{\bm{\theta},t}(\bm{x}_t)$. Moreover, the mapping $\bm{f}_{\bm{\theta},t}(\bm{x}_t)$ exhibits strong linearity across a large portion of the timesteps; see Figure 2(b) and Figure 10. Specifically, in Figure 2(b), we evaluate the linearity of $\bm{f}_{\bm{\theta},t}(\bm{x}_t)$ at $t=0.7$ where the rank ratio is close to the lowest value. We can see that $\bm{f}_{\bm{\theta},t}(\bm{x}_t+\lambda\Delta\bm{x})\approx\bm{l}_{\bm{\theta}}(\bm{x}_t;\lambda\Delta\bm{x})$ even when $\lambda=40$, which is consistently true among different architectures trained on different datasets.

In addition to comprehensive experimental studies, we will also demonstrate in Section 4 that both properties can be theoretically justified.
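
The three metrics above can be illustrated on a small problem. The following is a minimal sketch, not the experimental code of Section A.1: `toy_pmp` is a hypothetical stand-in for the PMP (a smooth map through a 3-dimensional bottleneck), the Jacobian is obtained by finite differences rather than automatic differentiation, and the energy threshold `eta` used to define the numerical rank is an assumption.

```python
import numpy as np

def jacobian_fd(f, x, eps=1e-5):
    """Central finite-difference Jacobian of f at x (adequate for a toy map only)."""
    cols = []
    for i in range(x.size):
        e = np.zeros(x.size)
        e[i] = eps
        cols.append((f(x + e) - f(x - e)) / (2.0 * eps))
    return np.stack(cols, axis=1)  # shape: (output_dim, input_dim)

def rank_ratio(J, eta=0.99):
    """Numerical rank divided by the ambient dimension d.
    Assumed definition: smallest r whose top-r singular values hold an eta share of the energy."""
    s = np.linalg.svd(J, compute_uv=False)
    energy = np.cumsum(s**2) / np.sum(s**2)
    num_rank = min(int(np.searchsorted(energy, eta)) + 1, s.size)
    return num_rank / J.shape[1]

def linearity_metrics(f, x, dx, lam, J=None):
    """Norm ratio and cosine similarity between f(x + lam*dx) and its linearization."""
    if J is None:
        J = jacobian_fd(f, x)
    actual = f(x + lam * dx)
    linear = f(x) + lam * (J @ dx)  # first-order approximation l(x; lam*dx)
    n_actual, n_linear = np.linalg.norm(actual), np.linalg.norm(linear)
    return n_actual / n_linear, float(actual @ linear) / (n_actual * n_linear)

# Hypothetical toy stand-in for the PMP (not a diffusion model).
rng = np.random.default_rng(0)
d, bottleneck = 20, 3
W1 = rng.standard_normal((bottleneck, d)) / np.sqrt(d)
W2 = rng.standard_normal((d, bottleneck))
toy_pmp = lambda x: W2 @ np.tanh(W1 @ x)

x_t = rng.standard_normal(d)
dx = rng.standard_normal(d)
dx /= np.linalg.norm(dx)  # unit-length perturbation direction

J = jacobian_fd(toy_pmp, x_t)
print("rank ratio:", rank_ratio(J))
print("norm ratio, cosine:", linearity_metrics(toy_pmp, x_t, dx, lam=0.01, J=J))
```

A norm ratio and cosine similarity both close to 1 indicate that the map is well approximated by its linearization at the chosen perturbation strength.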

3.2 Key Intuitions for Our Image Editing Method

The two benign properties offer valuable insights for image editing with precise control. Here, we first present the high-level intuitions behind our method, with further details postponed to Section 3.3. Specifically, for any given time-step $t\in[0,1]$, let us denote the compact singular value decomposition (SVD) of the Jacobian $\bm{J}_{\bm{\theta},t}(\bm{x}_t)$ as

$$\bm{J}_{\bm{\theta},t}(\bm{x}_t) \;=\; \bm{U}\bm{\Sigma}\bm{V}^{\top} \;=\; \sum_{i=1}^{r}\sigma_i\bm{u}_i\bm{v}_i^{\top}, \tag{5}$$

where $r$ is the rank of $\bm{J}_{\bm{\theta},t}(\bm{x}_t)$, $\bm{U}=\begin{bmatrix}\bm{u}_1&\cdots&\bm{u}_r\end{bmatrix}\in\mathrm{St}(d,r)$ and $\bm{V}=\begin{bmatrix}\bm{v}_1&\cdots&\bm{v}_r\end{bmatrix}\in\mathrm{St}(d,r)$ denote the left and right singular vectors, and $\bm{\Sigma}=\operatorname{diag}(\sigma_1,\cdots,\sigma_r)$ denotes the singular values. We write $\bm{J}_{\bm{\theta},t}(\bm{x}_t)=\bm{J}_{\bm{\theta},t}$ in short for a specific $\bm{x}_t$, and denote $\operatorname{range}(\bm{J}_{\bm{\theta},t}^{\top})=\operatorname{span}(\bm{V})$ and $\operatorname{null}(\bm{J}_{\bm{\theta},t})=\{\bm{w}\mid\bm{J}_{\bm{\theta},t}\bm{w}=\bm{0}\}$.

  • Local linearity of PMP for one-step, training-free, and supervision-free editing. Given that the PMP $\bm{f}_{\bm{\theta},t}(\bm{x}_t)$ is locally linear at the $t$-th timestep, if we perturb $\bm{x}_t$ by $\Delta\bm{x}=\lambda\bm{v}_i$, using one right singular vector $\bm{v}_i$ of $\bm{J}_{\bm{\theta},t}(\bm{x}_t)$ as an example editing direction, then by orthogonality

    $$\bm{f}_{\bm{\theta},t}(\bm{x}_t+\lambda\bm{v}_i) \;\approx\; \bm{f}_{\bm{\theta},t}(\bm{x}_t)+\lambda\bm{J}_{\bm{\theta},t}(\bm{x}_t)\bm{v}_i \;=\; \bm{f}_{\bm{\theta},t}(\bm{x}_t)+\lambda\sigma_i\bm{u}_i \;=\; \bm{\hat{x}}_{0,t}+\rho_i\bm{u}_i. \tag{6}$$

    This implies we can achieve one-step editing along the semantic direction $\bm{u}_i$. Notably, the method is training-free and supervision-free since the editing direction $\bm{v}$ can be simply found via the SVD of $\bm{J}_{\bm{\theta},t}(\bm{x}_t)$.

  • Local linearity of PMP for linear, homogeneous, and composable image editing. (i) First, the editing direction $\bm{v}=\bm{v}_i$ is linear: any change of strength $\lambda\in\mathbb{R}$ along $\bm{v}_i$ results in a proportional change $\rho_i=\lambda\sigma_i$ along $\bm{u}_i$ in the edited image. (ii) Second, the editing direction $\bm{v}=\bm{v}_i$ is homogeneous due to its independence of $\bm{\hat{x}}_{0,t}$: it can be applied to any image from the same data distribution and results in the same semantic edit. (iii) Third, editing directions are composable. Any linearly combined editing direction $\bm{v}=\sum_{i\in\mathcal{I}}\lambda_i\bm{v}_i\in\operatorname{range}\left(\bm{J}_{\bm{\theta},t}^{\top}\right)$ is a valid editing direction, which results in a composed change $\sum_{i\in\mathcal{I}}\rho_i\bm{u}_i$ with $\rho_i=\lambda_i\sigma_i$ in the edited image. On the contrary, $\bm{w}\in\operatorname{null}\left(\bm{J}_{\bm{\theta},t}\right)$ results in no editing since $\bm{f}_{\bm{\theta},t}(\bm{x}_t+\lambda\bm{w})\approx\bm{f}_{\bm{\theta},t}(\bm{x}_t)$.

  • Low-rankness of Jacobian for localized and efficient editing. $\bm{J}_{\bm{\theta},t}(\bm{x}_t)$ is computed for the entire predicted clean image, so its singular vectors give editing directions over the entire image. Denote by $\bm{\tilde{J}}_{\bm{\theta},t}$ the Jacobian only for a certain region of interest (ROI), and by $\bm{\bar{J}}_{\bm{\theta},t}$ the Jacobian for regions outside the ROI. Similarly, $\bm{v}\in\operatorname{range}\left(\bm{\tilde{J}}_{\bm{\theta},t}^{\top}\right)$ edits mainly regions within the ROI, and $\operatorname{null}\left(\bm{\bar{J}}_{\bm{\theta},t}\right)$ contains directions that do not edit regions outside of the ROI. Further projection of $\bm{v}$ onto $\operatorname{null}\left(\bm{\bar{J}}_{\bm{\theta},t}\right)$ can result in a more localized editing direction for the ROI. Computing the full SVD to perform such a nullspace projection can be very expensive. However, we can greatly reduce the computation through a low-rank estimation of the Jacobians with rank $r'\ll d$. The estimation is efficient yet effective for $t\in[0.5,0.7]$, where the rank of the Jacobian is near its lowest value.
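
The linearity, composability, and null-direction claims in the list above can be checked numerically on a toy map. This is a minimal sketch under stated assumptions: `toy_pmp` is a hypothetical smooth function with a low-rank Jacobian, not a diffusion model, and the Jacobian is formed densely by finite differences, which would be impractical at image scale.

```python
import numpy as np

def jacobian_fd(f, x, eps=1e-5):
    """Central finite-difference Jacobian of f at x (toy-scale only)."""
    cols = []
    for i in range(x.size):
        e = np.zeros(x.size)
        e[i] = eps
        cols.append((f(x + e) - f(x - e)) / (2.0 * eps))
    return np.stack(cols, axis=1)

# Hypothetical toy stand-in for the PMP with a rank-3 Jacobian.
rng = np.random.default_rng(1)
d, bottleneck = 20, 3
W1 = rng.standard_normal((bottleneck, d)) / np.sqrt(d)
W2 = rng.standard_normal((d, bottleneck))
toy_pmp = lambda x: W2 @ np.tanh(W1 @ x)
x_t = rng.standard_normal(d)

J = jacobian_fd(toy_pmp, x_t)
U, S, Vt = np.linalg.svd(J)  # full SVD; trailing rows of Vt span null(J)

def edit_residual(i, lam):
    """Error of Eq. (6): || f(x_t + lam*v_i) - (f(x_t) + lam*sigma_i*u_i) ||."""
    predicted = toy_pmp(x_t) + lam * S[i] * U[:, i]
    return np.linalg.norm(toy_pmp(x_t + lam * Vt[i]) - predicted)

# Composability: a combined direction maps to the combined change under J.
lam0, lam1 = 0.5, -0.3
v_combo = lam0 * Vt[0] + lam1 * Vt[1]
combo_change = lam0 * S[0] * U[:, 0] + lam1 * S[1] * U[:, 1]

# A direction in null(J) leaves the prediction (almost) unchanged.
w = Vt[-1]
null_change = np.linalg.norm(toy_pmp(x_t + 1.0 * w) - toy_pmp(x_t))

print("Eq. (6) residual at lam=1e-3:", edit_residual(0, 1e-3))
print("composition error:", np.linalg.norm(J @ v_combo - combo_change))
print("change along a null direction:", null_change)
```

The residual of Eq. (6) grows with $\lambda$ for a genuinely nonlinear map, so the sketch only illustrates the first-order relationship, not the range of $\lambda$ over which a trained PMP stays linear.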

Upon publishing our results, we encountered a concurrent study [32] that introduced a method for editing audio and images similar to Equation 6, drawing on an interesting analysis of the posterior distribution presented in [33]. However, our approach offers a distinct perspective, providing complementary insights and new findings. Specifically: (i) We explore the low-rank nature and local linearity in PMP, offering rigorous theoretical analyses of these characteristics and shedding light on the semantic meanings of the low-rank subspaces. (ii) These insights give rise to favorable properties in our editing method, such as transferability and composability; see Figure 1. (iii) Furthermore, we enable localized editing through null space projection (see Section 3) and demonstrate the robustness of the method across a variety of models (see Figure 5). (iv) Finally, we extend the method to unsupervised and text-supervised editing in various text-to-image models; see Figure 4.

Figure 3: Illustration of the unsupervised LOCO Edit for unconditional diffusion models. Given an image $\bm{x}_0$, we perform DDIM-Inv until time $t$ to get $\bm{x}_t$, and estimate $\bm{\hat{x}}_{0,t}$ from $\bm{x}_t$. After masking to get the region of interest (ROI) $\bm{\tilde{x}}_{0,t}$ and its counterpart $\bm{\bar{x}}_{0,t}$, we find the disentangled edit direction $\bm{v}_p$ via SVD and nullspace projection based on their Jacobians (Algorithm 1). By denoising $\bm{x}_t+\lambda\bm{v}_p$, an image $\bm{x}'_0$ with a localized edit is generated. In this paper, the variables and notions related to ROI, nullspace, and final direction are respectively highlighted by green, blue, and red colors.

3.3 Low-rank Controllable Image Editing Method with Nullspace Projection

In this subsection, we provide a detailed introduction to LOCO Edit, expanding on the discussion in Section 3.1. We first introduce the supervision-free LOCO Edit, where we further enable localized image editing through nullspace projection with masks. Second, we present how to generalize to T-LOCO Edit for T2I diffusion models, with or without text supervision, to define the semantic editing directions.

LOCO Edit.

We first introduce the general pipeline of LOCO Edit. As illustrated in Figure 3, given an original image $\bm{x}_0$, we first use $\bm{x}_t=\texttt{DDIM-Inv}(\bm{x}_0,t)$ to generate a noisy image $\bm{x}_t$. In particular, we choose $t\in[0.5,0.7]$ so that the PMP $\bm{f}_{\bm{\theta},t}(\bm{x}_t)$ is locally linear and its Jacobian $\bm{J}_{\bm{\theta},t}(\bm{x}_t)$ is close to its lowest rank. From Section 3.1, we know that we can edit the image by setting $\bm{x}_t'=\bm{x}_t+\lambda\bm{v}_p$, where $\bm{v}_p$ is the identified editing direction. After editing $\bm{x}_t$ to $\bm{x}_t'$, we use $\bm{x}_0'=\texttt{DDIM}\left(\bm{x}_t',0\right)$ to generate the edited image.

Algorithm 1 Unsupervised LOCO Edit
1: Input: original image $\bm{x}_0$, the mask $\Omega$, pretrained diffusion model $\bm{\epsilon}_{\bm{\theta}}$, editing strength $\lambda$, semantic index $k$, number of semantic directions $r$, editing timestep $t\in[0.5,0.7]$, the rank $r'=5$.
2: Output: edited image $\bm{x}'_0$.
3: Generate $\bm{x}_t\leftarrow\texttt{DDIM-Inv}(\bm{x}_0,t)$ $\triangleright$ noisy image at the $t$-th timestep
4: Compute the top-$r$ SVD $(\bm{\tilde{U}},\bm{\tilde{\Sigma}},\bm{\tilde{V}})$ of $\bm{\tilde{J}}_{\bm{\theta},t}=\nabla_{\bm{x}_t}\mathcal{P}_{\Omega}(\bm{f}_{\bm{\theta},t}(\bm{x}_t))$
5: Compute the top-$r'$ SVD $(\bm{\bar{U}},\bm{\bar{\Sigma}},\bm{\bar{V}})$ of $\bm{\bar{J}}_{\bm{\theta},t}=\nabla_{\bm{x}_t}\mathcal{P}_{\Omega^C}(\bm{f}_{\bm{\theta},t}(\bm{x}_t))$
6: Pick direction $\bm{v}\leftarrow\bm{\tilde{V}}[:,k]$ $\triangleright$ (1) Pick the $k^{th}$ singular vector for the editing direction
7: Compute $\bm{v}_p\leftarrow(\bm{I}-\bm{\bar{V}}\bm{\bar{V}}^{\top})\cdot\bm{v}$ $\triangleright$ (2) Nullspace projection for editing within the mask $\Omega$
8: $\bm{v}_p\leftarrow\bm{v}_p/\|\bm{v}_p\|_2$ $\triangleright$ Normalize the editing direction
9: Return: $\bm{x}'_0\leftarrow\texttt{DDIM}(\bm{x}_t+\lambda\bm{v}_p,0)$ $\triangleright$ Editing with forward DDIM along the direction $\bm{v}_p$
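
Steps 4 to 8 of Algorithm 1 can be sketched in a few lines for a toy problem. This is an illustrative sketch, not the released implementation: `toy_pmp` and `loco_direction` are hypothetical names, the Jacobians are formed densely by finite differences and decomposed with a full SVD instead of a low-rank estimate, and the DDIM inversion and sampling of steps 3 and 9 are only indicated in comments.

```python
import numpy as np

def jacobian_fd(f, x, eps=1e-5):
    """Central finite-difference Jacobian of f at x (toy-scale only)."""
    cols = []
    for i in range(x.size):
        e = np.zeros(x.size)
        e[i] = eps
        cols.append((f(x + e) - f(x - e)) / (2.0 * eps))
    return np.stack(cols, axis=1)

def loco_direction(pmp, x_t, mask, k=0, r_prime=5):
    """Localized editing direction v_p (steps 4-8), using dense SVDs.
    pmp: callable x_t -> predicted clean image; mask: boolean array marking the ROI."""
    J = jacobian_fd(pmp, x_t)
    J_roi, J_out = J[mask], J[~mask]            # output rows inside / outside the mask
    Vt_roi = np.linalg.svd(J_roi, full_matrices=False)[2]
    Vt_out = np.linalg.svd(J_out, full_matrices=False)[2]
    v = Vt_roi[k]                               # step 6: k-th right singular vector in the ROI
    V_bar = Vt_out[:r_prime].T                  # top-r' right singular vectors outside the ROI
    v_p = v - V_bar @ (V_bar.T @ v)             # step 7: (I - V_bar V_bar^T) v
    return v_p / np.linalg.norm(v_p)            # step 8: normalize

# Hypothetical toy stand-in for the PMP on a 16-"pixel" image.
rng = np.random.default_rng(2)
d, bottleneck = 16, 8
W1 = rng.standard_normal((bottleneck, d)) / np.sqrt(d)
W2 = rng.standard_normal((d, bottleneck))
toy_pmp = lambda x: W2 @ np.tanh(W1 @ x)

x_t = rng.standard_normal(d)        # stands in for DDIM-Inv(x_0, t) of step 3
mask = np.zeros(d, dtype=bool)
mask[:4] = True                     # ROI = first four pixels

v_p = loco_direction(toy_pmp, x_t, mask, k=0, r_prime=3)
x_t_edited = x_t + 2.0 * v_p        # step 9 would pass this to DDIM(., 0)
```

Selecting rows of the full Jacobian gives the same right singular vectors as differentiating the masked output, which is why a single Jacobian suffices in this toy setting.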

In many practical applications, we often need to edit only specific local regions of an image while leaving the rest unchanged. As discussed in Section 3.2, we can achieve this task by finding a precise local editing direction with localized Jacobians and nullspace projection. Overall, the complete method is in Algorithm 1. We describe the key details as follows.

  • Finding localized Jacobians via masking. To enable local editing, we use a mask $\Omega$ (i.e., an index set of pixels) to select the region of interest, with $\mathcal{P}_{\Omega}(\cdot)$ denoting the projection onto the index set $\Omega$. (Footnote 2: For datasets that have predefined masks, we can use them directly. For datasets that lack predefined masks, as well as for generated images, we can utilize Segment Anything (SAM) to generate masks [55].) For picking a local editing direction, we calculate the Jacobian of $\bm{f}_{\bm{\theta},t}(\bm{x}_t)$ restricted to the region of interest, $\bm{\tilde{J}}_{\bm{\theta},t}=\nabla_{\bm{x}_t}\mathcal{P}_{\Omega}(\bm{f}_{\bm{\theta},t}(\bm{x}_t))=\bm{\tilde{U}}\bm{\tilde{\Sigma}}\bm{\tilde{V}}^{\top}$, and select the localized editing direction $\bm{v}$ from the top-$r$ singular vectors of $\bm{\tilde{V}}$ (e.g., $\bm{v}=\bm{\tilde{V}}[:,k]\in\operatorname{range}\left(\bm{\tilde{J}}_{\bm{\theta},t}^{\top}\right)$ for some index $k\in[r]$). In practice, a rank-$r$ estimate of $\bm{\tilde{V}}$ is computed with the generalized power method (GPM, Algorithm 2) with $r=5$ to improve efficiency.

  • Better semantic disentanglement via nullspace projection. However, the projection $\mathcal{P}_{\Omega}(\cdot)$ introduces extra nonlinearity into the mapping $\mathcal{P}_{\Omega}(\bm{f}_{\bm{\theta},t}(\bm{x}_t))$, so the identified direction can remain semantically entangled with the area $\Omega^{C}$ outside the mask, where $\Omega^{C}$ denotes the complement of $\Omega$. To address this issue, we use the nullspace projection method [56, 57]. Specifically, given $\bar{\bm{J}}_{\bm{\theta},t} = \nabla_{\bm{x}_t} \mathcal{P}_{\Omega^{C}}(\bm{f}_{\bm{\theta},t}(\bm{x}_t)) = \bar{\bm{U}} \bar{\bm{\Sigma}} \bar{\bm{V}}^{\top}$, nullspace projection projects $\bm{v}$ onto $\operatorname{null}(\bar{\bm{J}}_{\bm{\theta},t})$. The projection is computed as $\bm{v}_p = \operatorname{proj}_{\operatorname{null}(\bar{\bm{J}}_{\bm{\theta},t})}(\bm{v}) = (\bm{I} - \bar{\bm{V}} \bar{\bm{V}}^{\top}) \bm{v}$, so that the modified editing direction $\bm{v}_p$ does not change the image in $\Omega^{C}$. In practice, we compute a rank-$r'$ estimate of $\bar{\bm{V}}$ with the generalized power method (GPM, Algorithm 2) with $r' = 5$.
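
The two steps above (estimating the leading right singular vectors, then projecting onto the nullspace) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the paper's Algorithm 2: it uses a plain block power (subspace) iteration, and the Jacobians are replaced by small random low-rank matrices. The helper names `top_right_singular_vectors`, `nullspace_project`, and `random_low_rank` are hypothetical. In a real diffusion model the products with $\bm{J}$ and $\bm{J}^{\top}$ would come from automatic differentiation rather than explicit matrices.

```python
import numpy as np

def top_right_singular_vectors(jvp, vjp, dim, rank, iters=50, seed=0):
    """Block power iteration on J^T J using only products with J and J^T.

    jvp(u) should return J @ u and vjp(w) should return J.T @ w.
    Returns a (dim, rank) matrix with orthonormal columns.
    """
    rng = np.random.default_rng(seed)
    V, _ = np.linalg.qr(rng.standard_normal((dim, rank)))
    for _ in range(iters):
        # Apply J^T J to every column, then re-orthonormalize with QR.
        W = np.column_stack([vjp(jvp(V[:, i])) for i in range(rank)])
        V, _ = np.linalg.qr(W)
    return V

def nullspace_project(v, V_bar):
    """Compute v_p = (I - V_bar V_bar^T) v and rescale it to unit length."""
    v_p = v - V_bar @ (V_bar.T @ v)
    return v_p / np.linalg.norm(v_p)

def random_low_rank(rows, cols, singular_values, rng):
    """Toy stand-in for a Jacobian with prescribed singular values."""
    k = len(singular_values)
    U, _ = np.linalg.qr(rng.standard_normal((rows, k)))
    V, _ = np.linalg.qr(rng.standard_normal((cols, k)))
    return U @ np.diag(singular_values) @ V.T

rng = np.random.default_rng(1)
dim = 12
J_tilde = random_low_rank(4, dim, [3.0, 1.0], rng)      # assumed: Jacobian inside the mask
J_bar = random_low_rank(8, dim, [2.0, 1.5, 1.0], rng)   # assumed: Jacobian outside the mask

V_tilde = top_right_singular_vectors(lambda u: J_tilde @ u, lambda w: J_tilde.T @ w, dim, rank=2)
V_bar = top_right_singular_vectors(lambda u: J_bar @ u, lambda w: J_bar.T @ w, dim, rank=3)

v = V_tilde[:, 0]                    # candidate editing direction
v_p = nullspace_project(v, V_bar)    # direction with no first-order effect outside the mask

print("effect outside mask before projection:", np.linalg.norm(J_bar @ v))
print("effect outside mask after projection: ", np.linalg.norm(J_bar @ v_p))
```

Because the toy `J_bar` has exact rank 3 and the estimate uses rank 3, the projected direction is annihilated by `J_bar` up to floating-point error. With a truncated estimate of a higher-rank Jacobian, the projection only removes the components along the leading directions.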

Figure 4: T-LOCO Edit on T2I diffusion models. (a) Unsupervised editing direction is found only via the given mask without editing prompt. (b) Text-supervised editing direction is found with both a mask and an editing prompt such as "with glasses". Experiment details can be found in Section D.3.

T-LOCO Edit.

The unsupervised edit method can be seamlessly applied to T2I diffusion models with classifier-free guidance (3) (Algorithm 3); see results in Figure 4(a). Besides, we can further enable text-supervised image editing with an editing prompt (Algorithm 4). This is useful because the additional text prompt allows us to enforce a specified editing direction that cannot be found easily in the semantic subspace of the vanilla Jacobian $\bm{J}_{\bm{\theta},t}$. As illustrated in Figure 4(b), this includes adding glasses or changing the curly hair of a human face. For simplicity, we introduce the key ideas of text-supervised T-LOCO Edit based upon DeepFloyd IF [19]. Similar procedures also generalize to Stable Diffusion and Latent Consistency Models with an additional decoding step [4, 38]. We discuss the key intuition below; see Section B.2 and Section B.3 for method details.

We first introduce some notation. Let $c_o$ denote the original prompt and $c_e$ denote the editing prompt. For example, in Figure 4(b), $c_o$ can be “portrait of a man”, while $c_e$ can be “portrait of a man with glasses”. Correspondingly, given the noisy image $\bm{x}_t$ for the clean image $\bm{x}_0$ generated with $c_o$, let $\bm{f}_{\bm{\theta},t}^{o}(\bm{x}_t)$ and $\bm{J}_{\bm{\theta},t}^{o}(\bm{x}_t)$ be the estimated posterior mean and its Jacobian conditioned on the original prompt $c_o$, and let $\bm{f}_{\bm{\theta},t}^{e}(\bm{x}_t)$ and $\bm{J}_{\bm{\theta},t}^{e}(\bm{x}_t)$ be the estimated posterior mean and its Jacobian conditioned on both the editing prompt $c_e$ and the original prompt $c_o$.

According to the classifier-free guidance (3), we can estimate the difference of estimated posterior means caused by the editing prompt as $\bm{d} = \bm{f}_{\bm{\theta},t}^{e}(\bm{x}_t) - \bm{f}_{\bm{\theta},t}^{o}(\bm{x}_t)$, and then set $\bm{v} = \bm{J}_{\bm{\theta},t}^{e}(\bm{x}_t)^{\top} \bm{d}$ as an initial estimator of the editing direction. (Footnote 3: The idea is to identify the editing direction in the $\mathcal{X}_t$ space based on changes in the estimated posterior mean caused by the editing prompt. More details are provided in Section B.3.)
Based upon this, to enable localized editing, similar to the unsupervised case, we can apply the mask $\Omega$ to select the ROI in $\bm{d}$ and calculate the localized Jacobian to get $\bm{v}$. After that, we can similarly perform nullspace projection of $\bm{v}$ for better disentanglement to get the final editing direction $\bm{v}_p$.
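
The initial estimator $\bm{v} = \bm{J}^{e\top} \bm{d}$ can be sketched as follows. This is a toy illustration under explicit assumptions: `f_orig` and `f_edit` are hypothetical stand-ins for the two conditioned posterior mean predictors, the mask is a 0/1 vector, and the product $\bm{J}^{\top}\bm{d}$ is approximated by central differences using the identity $\bm{J}^{\top}\bm{d} = \nabla_{\bm{x}} \langle \bm{f}(\bm{x}), \bm{d} \rangle$ with $\bm{d}$ held fixed. Treating the localized step as $\bm{J}^{e\top}(\text{mask} \odot \bm{d})$ is one reading of the description above. The exact procedure is in Section B.3, which is not reproduced here.

```python
import numpy as np

def vjp_central_diff(f, x, d, eps=1e-5):
    """Approximate J(x)^T d, where J is the Jacobian of f at x.

    Uses J^T d = grad_x <f(x), d> with d held fixed, one coordinate at a time.
    """
    v = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        v[i] = (f(x + e) - f(x - e)) @ d / (2.0 * eps)
    return v

def text_supervised_direction(f_orig, f_edit, x_t, mask=None):
    """Unit-norm initial editing direction from the change in posterior mean."""
    d = f_edit(x_t) - f_orig(x_t)
    if mask is not None:
        d = mask * d  # keep only the region of interest (assumed 0/1 mask)
    v = vjp_central_diff(f_edit, x_t, d)
    return v / np.linalg.norm(v)

# Toy predictors (assumptions for illustration only).
rng = np.random.default_rng(0)
dim = 8
A_o = rng.standard_normal((dim, dim)) / np.sqrt(dim)
A_e = A_o + 0.3 * rng.standard_normal((dim, dim)) / np.sqrt(dim)
f_orig = lambda x: np.tanh(A_o @ x)
f_edit = lambda x: np.tanh(A_e @ x)

x_t = rng.standard_normal(dim)
mask = np.zeros(dim)
mask[:4] = 1.0  # first four coordinates play the role of the ROI

v = text_supervised_direction(f_orig, f_edit, x_t, mask)
print("editing direction:", np.round(v, 3))
```

With an automatic differentiation framework, the loop in `vjp_central_diff` would be replaced by a single vector-Jacobian product, which avoids one pair of forward passes per input coordinate.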

4 Justification of Local Linearity, Low-rankness, & Semantic Direction

In this section, we provide theoretical justification for the benign properties in Section 3.1. First, we assume that the image distribution $p_{\text{data}}$ follows a mixture of low-rank Gaussians, defined as follows.

Assumption 1.

The data $\bm{x}_0 \in \mathbb{R}^{d}$ generated from the distribution $p_{\text{data}}$ lie on a union of $K$ subspaces. The bases $\{\bm{M}_k \in \mathrm{St}(d, r_k)\}_{k=1}^{K}$ of the subspaces are orthogonal to each other, with $\bm{M}_i^{\top} \bm{M}_j = \bm{0}$ for all $1 \leq i \neq j \leq K$, and each subspace dimension $r_k$ is much smaller than the ambient dimension $d$.
Moreover, for each $k \in [K]$, $\bm{x}_0$ follows a degenerate Gaussian with $\mathbb{P}(\bm{x}_0 = \bm{M}_k \bm{a}_k) = 1/K$, $\bm{a}_k \sim \mathcal{N}(\bm{0}, \bm{I}_{r_k})$. Without loss of generality, suppose $\bm{x}_t$ is from the $h$-th class, that is, $\bm{x}_t = \sqrt{\alpha_t}\, \bm{x}_0 + \sqrt{1 - \alpha_t}\, \bm{\epsilon}$, where $\bm{x}_0 \in \operatorname{range}(\bm{M}_h)$, i.e., $\bm{x}_0 = \bm{M}_h \bm{a}_h$.
Both $\|\bm{x}_0\|_2$ and $\|\bm{\epsilon}\|_2$ are bounded.

Our data assumption is motivated by the intrinsic low-dimensionality of real-world image datasets [58]. Additionally, Wang et al. [59] demonstrated that images generated by an analytical score function derived from a mixture of Gaussians distribution exhibit conceptual similarities to those produced by practically trained diffusion models. Given that $\bm{f}_{\bm{\theta},t}(\bm{x}_t)$ is an estimator of the posterior mean $\mathbb{E}[\bm{x}_0 | \bm{x}_t]$, we show that the posterior mean $\mathbb{E}[\bm{x}_0 | \bm{x}_t]$ can be analytically derived as follows.

Lemma 1.

Under Assumption 1, for $t \in (0, 1]$, the posterior mean is

$$\mathbb{E}[\bm{x}_0 | \bm{x}_t] = \sqrt{\alpha_t}\, \frac{\sum_{k=1}^{K} \exp\left(\dfrac{\alpha_t}{2(1 - \alpha_t)} \|\bm{M}_k^{\top} \bm{x}_t\|^2\right) \bm{M}_k \bm{M}_k^{\top} \bm{x}_t}{\sum_{k=1}^{K} \exp\left(\dfrac{\alpha_t}{2(1 - \alpha_t)} \|\bm{M}_k^{\top} \bm{x}_t\|^2\right)}. \tag{7}$$
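
Equation (7) is a softmax-weighted sum of the projections $\bm{M}_k \bm{M}_k^{\top} \bm{x}_t$, which the following NumPy sketch evaluates directly. The function name `posterior_mean` and the toy bases are assumptions for illustration. The exponent is shifted by its maximum before exponentiating, which leaves the ratio unchanged but avoids overflow when $\alpha_t$ is close to 1.

```python
import numpy as np

def posterior_mean(x_t, bases, alpha_t):
    """Evaluate Equation (7) for orthonormal bases M_k and 0 < alpha_t < 1."""
    coeff = alpha_t / (2.0 * (1.0 - alpha_t))
    logits = np.array([coeff * np.sum((M.T @ x_t) ** 2) for M in bases])
    weights = np.exp(logits - logits.max())  # shift by the max for numerical stability
    weights /= weights.sum()                 # softmax weights, one per subspace
    projections = [M @ (M.T @ x_t) for M in bases]
    return np.sqrt(alpha_t) * sum(w * p for w, p in zip(weights, projections))

# Toy setup: two mutually orthogonal 2-dimensional subspaces of R^6.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((6, 4)))
bases = [Q[:, :2], Q[:, 2:]]

x_t = rng.standard_normal(6)
x0_hat = posterior_mean(x_t, bases, alpha_t=0.5)
print("posterior mean estimate:", np.round(x0_hat, 3))
```

With a single subspace ($K = 1$) the weights collapse to 1 and the expression reduces to $\sqrt{\alpha_t}\, \bm{M}_1 \bm{M}_1^{\top} \bm{x}_t$.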

Lemma 1 shows that the posterior mean $\mathbb{E}[\bm{x}_0 | \bm{x}_t]$ can be viewed as a convex combination (scaled by $\sqrt{\alpha_t}$) of $\bm{M}_k \bm{M}_k^{\top} \bm{x}_t$, i.e., $\bm{x}_t$ projected onto the subspace spanned by each $\bm{M}_k$. This lemma leads to the following theorem:

Theorem 1.

Based upon Assumption 1, we can show the following three properties for the posterior mean $\mathbb{E}[\bm{x}_0 | \bm{x}_t]$:

  • The Jacobian of the posterior mean satisfies $\mathrm{rank}\left(\nabla_{\bm{x}_t} \mathbb{E}[\bm{x}_0 | \bm{x}_t]\right) \leq r := \sum_{k=1}^{K} r_k$ for all $t \in (0, 1]$.

  • The posterior mean $\mathbb{E}[\bm{x}_0 | \bm{x}_t]$ has local linearity such that

    $$\left\| \mathbb{E}[\bm{x}_0 | \bm{x}_t + \lambda \Delta\bm{x}] - \mathbb{E}[\bm{x}_0 | \bm{x}_t] - \lambda \nabla_{\bm{x}_t} \mathbb{E}[\bm{x}_0 | \bm{x}_t] \cdot \Delta\bm{x} \right\| = \lambda \frac{\alpha_t}{1 - \alpha_t} \mathcal{O}(\lambda), \tag{8}$$

    where $\Delta\bm{x} \in \mathbb{S}^{d-1}$ and $\lambda \in \mathbb{R}$ is the step size.

  • $\nabla_{\bm{x}_t} \mathbb{E}[\bm{x}_0 | \bm{x}_t]$ is symmetric, and its full SVD can be written as $\nabla_{\bm{x}_t} \mathbb{E}[\bm{x}_0 | \bm{x}_t] = \bm{U}_t \bm{\Sigma}_t \bm{V}_t^{\top}$, where $\bm{U}_t = [\bm{u}_{t,1}, \bm{u}_{t,2}, \ldots, \bm{u}_{t,d}] \in \mathrm{St}(d, d)$, $\bm{\Sigma}_t = \mathrm{diag}(\sigma_{t,1}, \dots, \sigma_{t,r}, \dots, 0)$ with $\sigma_{t,1} \geq \dots \geq \sigma_{t,r} \geq 0$, and $\bm{V}_t = [\bm{v}_{t,1}, \bm{v}_{t,2}, \ldots, \bm{v}_{t,d}] \in \mathrm{St}(d, d)$. Let $\bm{U}_{t,1} := [\bm{u}_{t,1}, \bm{u}_{t,2}, \ldots, \bm{u}_{t,r}]$ and $\bm{M} := [\bm{M}_1, \bm{M}_2, \ldots, \bm{M}_K]$. It holds that $\lim_{t \rightarrow 1} \left\| \left(\bm{I}_d - \bm{U}_{t,1} \bm{U}_{t,1}^{\top}\right) \bm{M} \right\|_F = 0$.

The proof is deferred to Appendix C. Admittedly, there are gaps between our theory and practice, such as the approximation error between $\bm{f}_{\bm{\theta},t}(\bm{x}_t)$ and $\mathbb{E}[\bm{x}_0 | \bm{x}_t]$, the assumptions about the data distribution, and the high rank of $\bm{J}_{\bm{\theta},t}$ for $t < 0.2$ and $t > 0.9$ in Figure 2. Nonetheless, Theorem 1 largely supports the empirical observations in Section 3, as we discuss below:

  • Low-rankness of the Jacobian. The first property in Theorem 1 demonstrates that the rank of $\nabla_{\bm{x}_t} \mathbb{E}[\bm{x}_0 | \bm{x}_t]$ is never greater than the intrinsic dimension of the data distribution. Given that the intrinsic dimension of real data distributions is usually much lower than the ambient dimension [58], the rank of $\bm{J}_{\bm{\theta},t}$ on real datasets should also be low. This aligns with our empirical observations in Figure 2 when $t \in [0.2, 0.7]$.

  • Linearity of the posterior mean. The second property in Theorem 1 shows that the linear approximation error is of order $\lambda \cdot \alpha_t / (1 - \alpha_t) \cdot \mathcal{O}(\lambda)$. This implies that when $t$ approaches 1, $\alpha_t / (1 - \alpha_t)$ becomes small, resulting in a small approximation error even for large $\lambda$. Empirically, Figure 2 shows that the linear approximation error of $\bm{f}_{\bm{\theta},t}(\bm{x}_t)$ is small when $t = 0.7$ and $\lambda = 40$, whereas Figure 10 shows a much larger error for $t = 0.0$ under the same $\lambda$. These observations align well with our theory.

  • Low-dimensional semantic subspace. The third property in Theorem 1 shows that, when $t$ is close to 1, the left singular vectors associated with the top-$r$ singular values form a basis of the subspaces that support the image distribution. Thus, if the editing direction is one of these basis vectors, the edited image will remain within the image distribution. This explains why $\bm{u}_i$ found in Equation 6 is a semantic direction for image editing.
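
The three properties discussed above can be checked numerically on a small synthetic mixture. The sketch below is a sanity check under stated assumptions, not part of the paper's experiments: the bases are random orthonormal matrices, the Jacobian is approximated by central differences, and a small value of `alpha_t` is used to mimic $t$ close to 1. All function and variable names are hypothetical.

```python
import numpy as np

def posterior_mean(x_t, bases, alpha_t):
    """Posterior mean of Equation (7) for orthonormal bases M_k."""
    c = alpha_t / (2.0 * (1.0 - alpha_t))
    logits = np.array([c * np.sum((M.T @ x_t) ** 2) for M in bases])
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return np.sqrt(alpha_t) * sum(wk * (M @ (M.T @ x_t)) for wk, M in zip(w, bases))

def numerical_jacobian(f, x, eps=1e-5):
    """Central-difference Jacobian; column i approximates d f / d x_i."""
    cols = []
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        cols.append((f(x + e) - f(x - e)) / (2.0 * eps))
    return np.column_stack(cols)

rng = np.random.default_rng(0)
d, ranks = 10, (2, 3)
r = sum(ranks)
Q, _ = np.linalg.qr(rng.standard_normal((d, r)))
bases = [Q[:, :2], Q[:, 2:5]]   # mutually orthogonal subspaces
M = np.hstack(bases)

alpha_t = 0.05                   # small alpha_t, i.e. t close to 1
a_h = rng.standard_normal(ranks[0])
x_t = np.sqrt(alpha_t) * (bases[0] @ a_h) + np.sqrt(1.0 - alpha_t) * rng.standard_normal(d)

f = lambda x: posterior_mean(x, bases, alpha_t)
J = numerical_jacobian(f, x_t)

# Property 1: the numerical rank of the Jacobian is at most r = sum of r_k.
rank_J = np.linalg.matrix_rank(J, tol=1e-6)

# Property 2: the linearization error shrinks as the step size shrinks.
dx = rng.standard_normal(d)
dx /= np.linalg.norm(dx)
def linearization_error(lam):
    return np.linalg.norm(f(x_t + lam * dx) - f(x_t) - lam * (J @ dx))

# Property 3: the top-r left singular vectors span the union of subspaces.
U, S, Vt = np.linalg.svd(J)
U1 = U[:, :r]
residual = np.linalg.norm(M - U1 @ (U1.T @ M))

print("rank of J:", rank_J, "(bound r =", r, ")")
print("linearization error at 0.05 and 0.5:", linearization_error(0.05), linearization_error(0.5))
print("subspace residual:", residual)
```

Repeating the check with `alpha_t` close to 1 is a useful contrast: the softmax weights then concentrate on a single subspace, so the trailing singular values collapse and the residual in the third check is no longer expected to be small.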

5 Experiments

In this section, we perform extensive experiments to demonstrate the effectiveness and efficiency of LOCO Edit. We first show that LOCO Edit has strong localized editing ability across a variety of datasets in Section 5.1. We then conduct comprehensive comparisons with other methods to show the superiority of LOCO Edit in Section 5.2, and provide ablation studies on multiple components of our method in Section 5.3. Finally, we visualize and analyze the editing directions from LOCO Edit in Section 5.4. All experiments can be conducted on a single A40 GPU with 48 GB of memory; additional experimental details are deferred to Appendix D.

5.1 Demonstration on Localized Editing and Other Benign Properties

Figure 5: Benchmarking LOCO Edit across various datasets. For each group of three images, the original image is in the center, and the images on the left and right are edited along the negative and positive directions, respectively.

First, we demonstrate several benign properties of our unsupervised LOCO Edit in Algorithm 1 on a variety of datasets, including LSUN-Church [60], Flower [61], AFHQ [62], CelebA-HQ [52], and FFHQ [63].

As shown in Figure 5 and Figure 1(a), our method enables editing specific localized regions, such as eye size/focus, hair curvature, length/amount, and architecture, while preserving the consistency of other regions. Besides precise local editing, Figure 1 demonstrates the benign properties of the identified editing directions and verifies our analysis in Section 4:

  • Linearity. As shown in Figure 1(d), the semantic editing can be strengthened through larger editing scales and can be flipped by negating the scale.

  • Homogeneity and transferability. As shown in Figure 1(b), the discovered editing direction can be transferred across samples and timesteps in $\mathcal{X}_t$.

  • Composability. As shown in Figure 1(c), the identified disentangled editing directions in the low-rank subspace allow direct composition without influencing each other.

5.2 Comprehensive Comparison with Other Image Editing Methods

| Method Name | Pullback | $\partial\bm{\epsilon}_t/\partial\bm{x}_t$ | NoiseCLR | Asyrp | BlendedDiffusion | LOCO (Ours) |
|---|---|---|---|---|---|---|
| Local Edit Success Rate ↑ | 0.32 | 0.37 | 0.32 | 0.47 | 0.55 | 0.80 |
| LPIPS ↓ | 0.16 | 0.13 | 0.14 | 0.22 | 0.03 | 0.08 |
| SSIM ↑ | 0.60 | 0.66 | 0.68 | 0.68 | 0.94 | 0.71 |
| Transfer Success Rate ↑ | 0.14 | 0.24 | 0.66 | 0.58 | Can't Transfer | 0.91 |
| Transfer Edit Time ↓ | 4s | 2s | 5s | 3s | Can't Transfer | 2s |
| #Images for Learning | 1 | 1 | 100 | 100 | 1 | 1 |
| Learning Time ↓ | 8s | 44s | 1 day | 475s | 120s | 79s |
| One-step Edit? | | | | | | |
| No Additional Supervision? | | | | | | |
| Theoretically Grounded? | | | | | | |
| Localized Edit? | | | | | | |

Table 1: Comparisons with existing methods. Our LOCO Edit excels in localized editing, transferability, and efficiency, with other intriguing properties such as being one-step, supervision-free, and theoretically grounded.

We compare LOCO Edit with several notable and recent image editing techniques, including Asyrp [29], Pullback [30] (in the $\frac{\partial \bm{h}_t}{\partial \bm{x}_t}$ manifold), NoiseCLR [23], and BlendedDiffusion [24]. Additionally, we compare with an unexplored method that uses the Jacobian $\frac{\partial \bm{\epsilon}_t}{\partial \bm{x}_t}$ to find the editing direction, which we refer to as $\frac{\partial \bm{\epsilon}_t}{\partial \bm{x}_t}$.

Evaluation Metrics.

We evaluate our method using the metrics described below and summarize the results in Table 1. Besides image generation quality, we also compare other attributes such as local edit ability, efficiency, the requirement for supervision, and theoretical justification.

  • Local Edit Success Rate measures, as judged by human evaluators, whether the edit successfully changes the target semantics while preserving unrelated regions.

  • LPIPS [64] and SSIM [65] measure the consistency between edited and original images.

  • Transfer Success Rate measures, as judged by human evaluators, whether an edit transferred to other images successfully changes the target semantics while preserving unrelated regions.

  • Learning Time measures the time required to identify edit directions or perform other learning/training.

  • Transfer Edit Time measures the time required to transfer the edit directly to other images.

  • #Images for Learning measures the number of images used to find the editing directions.

  • One-step Edit, No Additional Supervision, Theoretically Grounded, and Localized Edit are attributes of the editing methods, where each of them measures a specific property for the method.
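To make the consistency metrics concrete, the following is a minimal sketch of a simplified, single-window SSIM computed over the whole image. It is an illustration only: the SSIM of [65] is computed over local windows and averaged, and the function name `global_ssim`, the toy images, and the constants `k1`, `k2` (set to the commonly used defaults) are assumptions of this sketch, not the evaluation code behind Table 1.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """Simplified SSIM using one window that covers the whole image."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    # Stabilizing constants (assumed defaults, scaled by the data range).
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den

# Toy data (hypothetical): a "local" edit changes a small patch,
# a "global" edit changes every pixel.
rng = np.random.default_rng(0)
original = rng.random((64, 64))
local_edit = original.copy()
local_edit[24:40, 24:40] = rng.random((16, 16))
global_edit = rng.random((64, 64))

ssim_same = global_ssim(original, original)
ssim_local = global_ssim(original, local_edit)
ssim_global = global_ssim(original, global_edit)
```

In this toy setting, an edit confined to a small patch keeps the score close to that of the unedited image, whereas changing every pixel drives it toward zero, which is the intuition behind using SSIM (and LPIPS) as a proxy for how well unrelated regions are preserved.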

Figure 6: Comparison of local edit ability with other works on non-cherry-picked images. LOCO Edit has consistent and accurate local edit ability, while other methods may produce unnoticeable, wrong, or global edits.

For a fair comparison, we evaluate all methods on randomly selected images without cherry-picking, and visualize the results for the methods with strong edit ability in Figure 6. The detailed evaluation settings are provided in Section D.2.

Benefits of Our Method.

Based upon the qualitative and quantitative comparisons, our method shows several clear advantages that we summarize as follows.

  • Superior local edit ability with one-step edit. Table 1 shows that LOCO Edit achieves the best Local Edit Success Rate. This local edit ability requires only a one-step edit at a specific time step $t$. For LPIPS and SSIM, our method performs better than global edit methods but worse than BlendedDiffusion. However, BlendedDiffusion sometimes fails to edit within the masks (as visualized in Figure 6, rows 1, 3, 4, and 5). Other methods such as NoiseCLR find more global semantic directions, such as style and race, leading to worse Local Edit Success Rate, LPIPS, and SSIM for localized edits.

  • Transferability and efficiency. First, LOCO Edit requires less learning time than most of the other methods and requires learning only at a single time step with a single image. Moreover, LOCO Edit is highly transferable, having the highest Transfer Success Rate in Table 1. In contrast, BlendedDiffusion cannot transfer and requires optimization for each individual image. NoiseCLR has the second-best transfer success rate, while the other methods exhibit worse transferability.

  • Theoretically grounded and supervision-free. LOCO Edit is theoretically grounded. In addition, it is supervision-free and thus inherits no biases from other modules such as CLIP [36]. As shown in [37], CLIP sometimes fails to capture detailed semantics such as color. Such failures can be observed for methods that rely on CLIP guidance, such as BlendedDiffusion and Asyrp, in Figure 6, where they produce no edits or wrong edits.
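The efficiency and transferability statements above can be checked directly against the numbers in Table 1. The short sketch below does this arithmetic; the dictionary keys are shorthand labels introduced here (e.g., `eps_jacobian` for the $\partial\bm{\epsilon}_t/\partial\bm{x}_t$ baseline), and the "1 day" entry is converted to seconds as an assumption of exactly 24 hours.

```python
# Learning time per method from Table 1, in seconds ("1 day" taken as 24 h).
learning_time_s = {
    "Pullback": 8,
    "eps_jacobian": 44,
    "NoiseCLR": 24 * 3600,
    "Asyrp": 475,
    "BlendedDiffusion": 120,
    "LOCO": 79,
}

# Transfer Success Rate from Table 1; BlendedDiffusion cannot transfer,
# so it is omitted rather than assigned a number.
transfer_success = {
    "Pullback": 0.14,
    "eps_jacobian": 0.24,
    "NoiseCLR": 0.66,
    "Asyrp": 0.58,
    "LOCO": 0.91,
}

# Methods that need more learning time than LOCO Edit.
slower_than_loco = sorted(
    m for m, t in learning_time_s.items() if t > learning_time_s["LOCO"]
)

# Methods ranked by transfer success rate, best first.
transfer_ranking = sorted(transfer_success, key=transfer_success.get, reverse=True)
```

Three of the five baselines (NoiseCLR, Asyrp, and BlendedDiffusion) require more learning time than LOCO Edit, and the transfer ranking places LOCO Edit first with NoiseCLR second, consistent with the summary above.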

5.3 Ablation Studies

(a) Ablation on time step.
(b) Ablation on nullspace and rank.
(c) Ablation on edit strengths.
Figure 7: Ablation Study. (a) Effects of one-step edit time. (b) Effects of using nullspace projection and rank. (c) Effects of editing strengths.

We conduct ablation studies on the noise level, the rank of the nullspace projection, and the editing strength, which demonstrate the robustness of our method.

  • Noise levels (i.e., editing time step $t$). We conducted an ablation study on different noise levels, with representative examples shown in Figure 7(a). The key observations are summarized as follows: (a) larger noise levels (i.e., editing $x_t$ with larger $t$) produce coarser edits, while smaller noise levels produce finer edits; (b) LOCO Edit is applicable to a fairly large range of noise levels ($[0.2T, 0.7T]$) for precise edits.

  • Rank of nullspace projection $r'$. The ablation study on nullspace projection is shown in Figure 7(b) (the definition of $r'$ is given in Algorithm 1). The key observations are: (a) the local edit ability without nullspace projection is weaker than that with nullspace projection; (b) when conducting nullspace projection, an effective low-rank estimation with $r'=5$ already achieves good local edit results.

  • Editing strength $\lambda$. The linearity with respect to editing strength is visualized in Figure 7(c). In addition to linearity, we observe that LOCO Edit is applicable to a fairly wide range of editing strengths ($[-15, 15]$) for localized edits.
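The roles of $r'$ and $\lambda$ in the ablations above can be illustrated with a small numerical sketch. Algorithm 1 is not reproduced in this section, so the code below is an assumption-laden toy: it supposes that the nullspace projection removes from a candidate direction its components along the top-$r'$ right singular vectors of a Jacobian associated with the region to be preserved, and that the edit is the linear update $x_t + \lambda \bm{v}_p$. The names `nullspace_project`, `edit`, and `J_bar`, as well as the random toy matrix, are hypothetical.

```python
import numpy as np

def nullspace_project(v, J_bar, r_prime=5):
    """Project v away from the top-r' right singular vectors of J_bar.

    J_bar stands in for the Jacobian of the region that should stay
    unchanged (an assumption of this sketch).
    """
    _, _, Vt = np.linalg.svd(J_bar, full_matrices=False)
    V_r = Vt[:r_prime].T                  # shape (d, r'), orthonormal columns
    v_p = v - V_r @ (V_r.T @ v)           # remove components inside span(V_r)
    return v_p / np.linalg.norm(v_p)      # unit-norm editing direction

def edit(x_t, v_p, lam):
    """One-step linear edit with strength lam."""
    return x_t + lam * v_p

rng = np.random.default_rng(0)
d, r_prime = 32, 5

# Toy Jacobian of exact rank r' for the region to preserve.
J_bar = rng.standard_normal((d, r_prime)) @ rng.standard_normal((r_prime, d))
v = rng.standard_normal(d)
x_t = rng.standard_normal(d)

v_p = nullspace_project(v, J_bar, r_prime)

# First-order change induced in the preserved region, before and after projection.
leak_before = np.linalg.norm(J_bar @ v)
leak_after = np.linalg.norm(J_bar @ v_p)

# The edit is linear in the strength lam by construction.
step_1 = edit(x_t, v_p, 1.0) - x_t
step_2 = edit(x_t, v_p, 2.0) - x_t
```

In this toy example the projected direction induces essentially no first-order change in the preserved region, which mirrors observation (a) for the rank ablation. For a Jacobian that is only approximately low-rank, a small $r'$ would remove most, but not all, of the leakage.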

Figure 8: Visualizing edit directions identified via LOCO Edit. The edit directions are semantically meaningful.

5.4 Visualization and Analysis of Editing Directions

We visualize the identified editing direction $\bm{v}_p$ (see Algorithm 1) in Figure 8. The editing directions are semantically meaningful to the region of interest for editing. For example, the editing directions for eyes, lips, nose, etc., have similar shapes to eyes, lips, nose, etc.

Figure 9: Analyzing transferability of edit directions to objects with different positions and shapes, images from different datasets, or images with no corresponding semantics.

Further, since the objects in the Flower, AFHQ, CelebA-HQ, and FFHQ datasets are usually positioned at the center, the identified editing directions also tend to be at the center. Moreover, objects can have different shapes, and semantics present in some images may not exist in others. To further study the robustness of transferability for the editing directions, we transfer editing directions to images with objects at different positions, from different datasets, with different shapes, and with no corresponding semantics. We present the results in Figure 9, with the following key observations: (a) the edit directions are generally robust to gender differences, shape differences, moderate position differences, and dataset differences, as illustrated in the first five rows of Figure 9; (b) transferring an editing direction to images without the corresponding semantics results in almost no editing (shown in the last row of Figure 9). Therefore, in practical applications, transfer editing with LOCO Edit is meaningful when the transferred editing directions correspond to existing semantics in the target image (e.g., transferring the editing direction of "eyes" is effective only if the target image also contains eyes).

6 Discussion on Related Works

Study of Latent Semantic Space in Generative Models.

Although diffusion models have demonstrated their strengths in state-of-the-art image synthesis, the understanding of diffusion models is still far behind that of other generative models such as Generative Adversarial Networks (GANs) [66, 57]. The understanding of GANs can provide tools as well as inspiration for the understanding of diffusion models. Some recent works have identified such gaps, discovered latent semantic spaces in diffusion models [29], and further studied the properties of the latent space from a geometrical perspective [30]. These prior works deepen our understanding of the latent semantic space in diffusion models, and inspire later works to study the structures of information represented in diffusion models from various angles. However, their semantic space is constrained to diffusion models using the UNet architecture, and cannot represent localized semantics. Our work explores an alternative space to study the semantic expression in diffusion models, inspired by our observation of the low-rank and locally linear Jacobian of the denoiser over the noisy images. We provide a theoretical framework for demonstrating and understanding such properties, which can deepen the interpretation of the learned data distribution in diffusion models.
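As a brief illustration of why a low-rank, locally linear Jacobian yields usable editing directions, consider the following first-order sketch. The notation is assumed here for illustration and may differ from the formal statements in the paper: $\bm{f}_t$ denotes the posterior mean predictor at time step $t$, $\bm{J}_t(\bm{x}_t)$ its Jacobian at $\bm{x}_t$ with rank $r$, and $(\sigma_i, \bm{u}_i, \bm{v}_i)$ its singular triplets.

```latex
% Assumed notation: f_t is the PMP, J_t its Jacobian at x_t, rank r.
\begin{align*}
\bm{J}_t(\bm{x}_t) &= \bm{U}\bm{\Sigma}\bm{V}^\top
  = \sum_{i=1}^{r} \sigma_i \bm{u}_i \bm{v}_i^\top, \\
\bm{f}_t(\bm{x}_t + \lambda \bm{v}_i)
  &\approx \bm{f}_t(\bm{x}_t) + \lambda \bm{J}_t(\bm{x}_t)\bm{v}_i
  = \bm{f}_t(\bm{x}_t) + \lambda \sigma_i \bm{u}_i, \qquad i \le r.
\end{align*}
```

Under local linearity, moving $\bm{x}_t$ along a right singular vector $\bm{v}_i$ therefore changes the predicted clean image along the corresponding left singular vector $\bm{u}_i$, with a magnitude proportional to $\lambda$. Any direction orthogonal to $\bm{v}_1, \dots, \bm{v}_r$ leaves the prediction unchanged to first order, so only an $r$-dimensional subspace of perturbations has a visible effect.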

Image Editing in Unconditional Diffusion Models.

Recent research has significantly improved the understanding of latent semantic spaces in diffusion models, enabling global image editing through either training-free methods [29, 30, 31] or by incorporating an additional lightweight model [30, 67]. However, these methods perform poorly on localized edits. In contrast, our approach achieves localized editing without requiring supervised training. For localized edits, [25] builds on [30], enabling local edits by altering the intermediate layers of the UNet. However, these approaches are restricted to UNet-based architectures in diffusion models and have largely ignored intrinsic properties like linearity and low-rankness. In comparison, our work provides a rigorous theoretical analysis of low-rankness and local linearity in diffusion models, and we are the first to offer a principled justification of the semantic significance of the basis used for editing. Moreover, our method is independent of specific network architectures.

Other recent works, such as [32], introduce training-free global audio and image editing based on a theoretical understanding of the posterior covariance matrix [33], and are also independent of UNet architectures. However, our proposed LOCO Edit method allows unsupervised and localized editing, and is grounded in the low-rank and locally linear subspaces in diffusion models, which enables several advantageous properties including transferability, composability, and linearity, benign features that have not been explored in prior work. Additionally, while [24] supports localized editing, it requires supervision from CLIP, lacks a theoretical basis, and is time-consuming for editing each image. In contrast, our method is more efficient, theoretically grounded, and free from failures or biases in CLIP. CLIP-supervised methods may also exhibit a bias toward the CLIP score, leading to suboptimal editing results, as shown in Figure 6. In comparison, our method consistently enables high-quality edits without such bias.

Image Editing in T2I Diffusion Models.

T2I image editing usually requires much more complicated sampling and training procedures, such as providing certain learned guidance in the reverse sampling process [11], training an extra neural network [21], or fine-tuning the models for certain attributes [22]. Although effective, these methods often require extra training or even human intervention. Some other T2I image editing methods are training-free [46, 27, 28], and further enable editing by identifying masks [46] or optimizing the soft combination of text prompts [28]. These methods involve a continuous injection of the edit prompt during the generation process to gradually refine the generated image toward the target semantics. Though effective, all of the above methods (either training-free or not), as well as instruction-guided ones [68, 69, 70, 71], lack clear mathematical interpretations and require text supervision. [23] discovers editing directions in T2I diffusion models through contrastive learning without text supervision, but is not generalizable to editing with text supervision. [30] has some theoretical basis and extends to an editing approach in T2I diffusion models with text supervision, but such supervision is only for unconditional sampling. In contrast, our extended T-LOCO Edit, which originates from the understanding of diffusion models, is the first method exploring single-step editing with or without text supervision for conditional sampling.

7 Conclusion & Future Directions

We proposed a new low-rank controllable image editing method, LOCO Edit, which enables precise, one-step, localized editing using diffusion models. Our approach stems from the discovery of the locally linear posterior mean estimator in diffusion models and the identification of a low-dimensional semantic subspace in its Jacobian, theoretically verified under certain data assumptions. The identified editing directions possess several beneficial properties, such as linearity, homogeneity, and composability. Additionally, our method is versatile across different datasets and models and is applicable to text-supervised editing in T2I diffusion models. Through various experiments, we demonstrate the superiority of our method compared to existing approaches.

We identify several future directions and limitations of the current work. The current theoretical framework mainly explains the unsupervised image editing part. A more solid and thorough analysis of text-supervised image editing is of significant importance in understanding T2I diffusion models, which remains a difficult open problem in the field. For example, there is still a lack of geometric analysis of the relationship between subspaces under different text-prompt conditions [4, 19, 38, 72]. Based on such understandings, it may be possible to further discover benign properties of editing directions in T2I diffusion models, or design more efficient fine-tuning [73, 74] accordingly. Besides, the current method has the potential to be extended to combine coarse-to-fine editing across different time steps. Furthermore, it is worth exploring the direct manipulation of semantic spaces in flow-matching diffusion models and transformer-architecture diffusion models. Lastly, it is possible to connect the current findings to image or video representation learning in diffusion models [75, 76, 77], or utilize the low-rank structures to build dictionaries [78].

Acknowledgement

We acknowledge support from NSF CAREER CCF-2143904, NSF CCF-2212066, NSF CCF-2212326, NSF IIS 2312842, NSF IIS 2402950, ONR N00014-22-1-2529, a gift grant from KLA, an Amazon AWS AI Award, and an MICDE Catalyst Grant. The authors acknowledge valuable discussions with Mr. Zekai Zhang (U. Michigan), Dr. Ismail R. Alkhouri (U. Michigan and MSU), Mr. Jinfan Zhou (U. Michigan), and Mr. Xiao Li (U. Michigan).

References

  • Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  • Song et al. [2021a] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021a. URL https://openreview.net/forum?id=PxTIG12RRHS.
  • Song et al. [2021b] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021b. URL https://openreview.net/forum?id=St1giarCHLP.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
  • Zhang et al. [2024a] Huijie Zhang, Yifu Lu, Ismail Alkhouri, Saiprasad Ravishankar, Dogyoon Song, and Qing Qu. Improving training efficiency of diffusion models via multi-stage framework and tailored multi-decoder architectures. In Conference on Computer Vision and Pattern Recognition 2024, 2024a. URL https://openreview.net/forum?id=YtptmpZQOg.
  • Alkhouri et al. [2023a] Ismail Alkhouri, Shijun Liang, Rongrong Wang, Qing Qu, and Saiprasad Ravishankar. Diffusion-based adversarial purification for robust deep mri reconstruction. ArXiv preprint arXiv:2309.05794, 2023a.
  • Kong et al. [2021] Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diffwave: A versatile diffusion model for audio synthesis. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=a-xFK8Ymz5J.
  • Chen et al. [2021] Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan. Wavegrad: Estimating gradients for waveform generation. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=NsMLjcFaO8O.
  • Chung et al. [2022] Hyungjin Chung, Byeongsu Sim, Dohoon Ryu, and Jong Chul Ye. Improving diffusion models for inverse problems using manifold constraints. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=nJJjv0JDJju.
  • Song et al. [2023] Jiaming Song, Arash Vahdat, Morteza Mardani, and Jan Kautz. Pseudoinverse-guided diffusion models for inverse problems. In International Conference on Learning Representations, 2023.
  • Chung et al. [2023] Hyungjin Chung, Jeongsol Kim, Michael Thompson Mccann, Marc Louis Klasky, and Jong Chul Ye. Diffusion posterior sampling for general noisy inverse problems. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=OnD9zGAGT0k.
  • Li et al. [2024a] Xiang Li, Soo Min Kwon, Ismail R Alkhouri, Saiprasad Ravishankar, and Qing Qu. Decoupled data consistency with diffusion purification for image restoration. ArXiv preprint arXiv:2403.06054, 2024a.
  • Alkhouri et al. [2023b] Ismail Alkhouri, Shijun Liang, Rongrong Wang, Qing Qu, and Saiprasad Ravishankar. Robust physics-based deep mri reconstruction via diffusion purification. In Conference on Parsimony and Learning (Recent Spotlight Track), 2023b.
  • Song et al. [2024] Bowen Song, Soo Min Kwon, Zecheng Zhang, Xinyu Hu, Qing Qu, and Liyue Shen. Solving inverse problems with latent diffusion models via hard data consistency. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=j8hdRqOUhN.
  • Yu et al. [2023] Sihyun Yu, Kihyuk Sohn, Subin Kim, and Jinwoo Shin. Video probabilistic diffusion models in projected latent space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18456–18466, 2023.
  • Blattmann et al. [2023] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. ArXiv preprint arXiv:2311.15127, 2023.
  • Khachatryan et al. [2023] Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15954–15964, 2023.
  • Ramesh et al. [2022] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. ArXiv preprint arXiv:2204.06125, 2022.
  • Saharia et al. [2022] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
  • Karras et al. [2018] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4396–4405, 2018. URL https://api.semanticscholar.org/CorpusID:54482423.
  • Zhang et al. [2023] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
  • Ruiz et al. [2023] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023.
  • Dalva and Yanardag [2024] Yusuf Dalva and Pinar Yanardag. Noiseclr: A contrastive learning approach for unsupervised discovery of interpretable directions in diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24209–24218, 2024.
  • Avrahami et al. [2022] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18208–18218, June 2022.
  • Kouzelis et al. [2024] Theodoros Kouzelis, Manos Plitsis, Mihalis A. Nicolaou, and Yannis Panagakis. Enabling local editing in diffusion models by joint and individual component analysis, 2024.
  • Couairon et al. [2023] Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord. Diffedit: Diffusion-based semantic image editing with mask guidance. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=3lge0p5o-M-.
  • Brack et al. [2023] Manuel Brack, Felix Friedrich, Dominik Hintersdorf, Lukas Struppek, Patrick Schramowski, and Kristian Kersting. SEGA: Instructing text-to-image models using semantic guidance. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=KIPAIy329j.
  • Wu et al. [2023] Qiucheng Wu, Yujian Liu, Handong Zhao, Ajinkya Kale, Trung Bui, Tong Yu, Zhe Lin, Yang Zhang, and Shiyu Chang. Uncovering the disentanglement capability in text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1900–1910, 2023.
  • Kwon et al. [2023] Mingi Kwon, Jaeseok Jeong, and Youngjung Uh. Diffusion models already have a semantic latent space. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=pd1P2eUBVfq.
  • Park et al. [2023a] Yong-Hyun Park, Mingi Kwon, Jaewoong Choi, Junghyo Jo, and Youngjung Uh. Understanding the latent space of diffusion models through the lens of riemannian geometry. In Thirty-seventh Conference on Neural Information Processing Systems, 2023a. URL https://openreview.net/forum?id=VUlYp3jiEI.
  • Zhu et al. [2023] Ye Zhu, Yu Wu, Zhiwei Deng, Olga Russakovsky, and Yan Yan. Boundary guided learning-free semantic control with diffusion models. In Conference on Neural Information Processing Systems (NeurIPS), 2023.
  • Manor and Michaeli [2024a] Hila Manor and Tomer Michaeli. Zero-shot unsupervised and text-based audio editing using DDPM inversion. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 34603–34629. PMLR, 21–27 Jul 2024a. URL https://proceedings.mlr.press/v235/manor24a.html.
  • Manor and Michaeli [2024b] Hila Manor and Tomer Michaeli. On the posterior distribution in denoising: Application to uncertainty quantification. In The Twelfth International Conference on Learning Representations, 2024b. URL https://openreview.net/forum?id=adSGeugiuj.
  • Saad [2011] Yousef Saad. Numerical methods for large eigenvalue problems: revised edition. SIAM, 2011.
  • Park et al. [2023b] Yong-Hyun Park, Mingi Kwon, Jaewoong Choi, Junghyo Jo, and Youngjung Uh. Understanding the latent space of diffusion models through the lens of riemannian geometry. Advances in Neural Information Processing Systems, 36:24129–24142, 2023b.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  • Tong et al. [2024] Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9568–9578, 2024.
  • Luo et al. [2023] Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference. ArXiv preprint arXiv:2310.04378, 2023.
  • Karras et al. [2022a] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35:26565–26577, 2022a.
  • Mokady et al. [2023] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6038–6047, 2023.
  • Ho and Salimans [2021] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021. URL https://openreview.net/forum?id=qw8AKxfYbI.
  • Lu et al. [2022] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775–5787, 2022.
  • Karras et al. [2022b] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35:26565–26577, 2022b.
  • Zhang et al. [2024b] Huijie Zhang, Jinfan Zhou, Yifu Lu, Minzhe Guo, Peng Wang, Liyue Shen, and Qing Qu. The emergence of reproducibility and consistency in diffusion models. In Forty-first International Conference on Machine Learning, 2024b. URL https://openreview.net/forum?id=HsliOqZkc0.
  • Luo [2022] Calvin Luo. Understanding diffusion models: A unified perspective. ArXiv preprint arXiv:2208.11970, 2022.
  • Kim et al. [2022] Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. Diffusionclip: Text-guided diffusion models for robust image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2426–2435, 2022.
  • Haas et al. [2024a] René Haas, Inbar Huberman-Spiegelglas, Rotem Mulayoff, and Tomer Michaeli. Discovering interpretable directions in the semantic latent space of diffusion models. International Conference on Automatic Face and Gesture Recognition, abs/2303.11073, 2024a. URL https://api.semanticscholar.org/CorpusID:257631803.
  • Esser et al. [2024] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=FPnUhsQJ5B.
  • Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-assisted Intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015.
  • Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • Bao et al. [2023] Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A vit backbone for diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22669–22679, 2023.
  • Liu et al. [2015] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pages 3730–3738, 2015.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. Ieee, 2009.
  • Schuhmann et al. [2022] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
  • Kirillov et al. [2023] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollar, and Ross Girshick. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4015–4026, October 2023.
  • Banerjee and Roy [2014] S. Banerjee and A. Roy. Linear Algebra and Matrix Analysis for Statistics. Chapman & Hall/CRC Texts in Statistical Science. CRC Press, 2014. ISBN 9781482248241. URL https://books.google.com/books?id=WDTcBQAAQBAJ.
  • Zhu et al. [2021] Jiapeng Zhu, Ruili Feng, Yujun Shen, Deli Zhao, Zhengjun Zha, Jingren Zhou, and Qifeng Chen. Low-rank subspaces in gans. In Neural Information Processing Systems, 2021. URL https://api.semanticscholar.org/CorpusID:235367855.
  • Pope et al. [2021] Phil Pope, Chen Zhu, Ahmed Abdelkader, Micah Goldblum, and Tom Goldstein. The intrinsic dimension of images and its impact on learning. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=XJk19XzGq2J.
  • Wang and Vastola [2023] Binxu Wang and John J Vastola. The hidden linear structure in score-based models and its application. ArXiv preprint arXiv:2311.10892, 2023.
  • Yu et al. [2015] Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. ArXiv, abs/1506.03365, 2015. URL https://api.semanticscholar.org/CorpusID:8317437.
  • Nilsback and Zisserman [2008] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages 722–729, 2008. doi: 10.1109/ICVGIP.2008.47.
  • Choi et al. [2020] Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. Stargan v2: Diverse image synthesis for multiple domains. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8188–8197, 2020.
  • Karras et al. [2019] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
  • Zhang et al. [2018] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.
  • Wang et al. [2004] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
  • Goodfellow et al. [2014] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In Neural Information Processing Systems, 2014. URL https://api.semanticscholar.org/CorpusID:261560300.
  • Haas et al. [2024b] René Haas, Inbar Huberman-Spiegelglas, Rotem Mulayoff, and Tomer Michaeli. Discovering interpretable directions in the semantic latent space of diffusion models. International Conference on Automatic Face and Gesture Recognition, abs/2303.11073, 2024b. URL https://api.semanticscholar.org/CorpusID:257631803.
  • Hertz et al. [2023] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-or. Prompt-to-prompt image editing with cross-attention control. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=_CDixzkzeyb.
  • Wang et al. [2023a] Qian Wang, Biao Zhang, Michael Birsak, and Peter Wonka. Instructedit: Improving automatic masks for diffusion-based image editing with user instructions. ArXiv, abs/2305.18047, 2023a. URL https://api.semanticscholar.org/CorpusID:258959425.
  • Brooks et al. [2022] Tim Brooks, Aleksander Holynski, and Alexei A. Efros. Instructpix2pix: Learning to follow image editing instructions. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18392–18402, 2022. URL https://api.semanticscholar.org/CorpusID:253581213.
  • Li et al. [2024b] Shanglin Li, Bo-Wen Zeng, Yutang Feng, Sicheng Gao, Xuhui Liu, Jiaming Liu, Li Lin, Xu Tang, Yao Hu, Jianzhuang Liu, and Baochang Zhang. Zone: Zero-shot instruction-guided local editing. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024b.
  • Wang et al. [2024] Peng Wang, Huikang Liu, Druv Pai, Yaodong Yu, Zhihui Zhu, Qing Qu, and Yi Ma. A global geometric analysis of maximal coding rate reduction. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=u9qmjV2khT.
  • Hu et al. [2022] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9.
  • Yaras et al. [2024] Can Yaras, Peng Wang, Laura Balzano, and Qing Qu. Compressible dynamics in deep overparameterized low-rank learning & adaptation. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=uDkXoZMzBv.
  • Fuest et al. [2024] Michael Fuest, Pingchuan Ma, Ming Gui, Johannes S Fischer, Vincent Tao Hu, and Bjorn Ommer. Diffusion models and representation learning: A survey. ArXiv preprint arXiv:2407.00783, 2024.
  • Li et al. [2024c] Xiang Li, Yixiang Dai, and Qing Qu. Understanding generalizability of diffusion models requires rethinking the hidden gaussian structure. 2024c.
  • Wang et al. [2023b] Peng Wang, Xiao Li, Can Yaras, Zhihui Zhu, Laura Balzano, Wei Hu, and Qing Qu. Understanding deep representation learning via layerwise feature compression and discrimination. ArXiv preprint arXiv:2311.02960, 2023b.
  • Luo et al. [2024] Jinqi Luo, Tianjiao Ding, Kwan Ho Ryan Chan, Darshan Thaker, Aditya Chattopadhyay, Chris Callison-Burch, and Rene Vidal. Pace: Parsimonious concept engineering for large language models. ArXiv preprint arXiv:2406.04331, 2024.
  • Efron [2011] Bradley Efron. Tweedie’s formula and selection bias. Journal of the American Statistical Association, 106(496):1602–1614, 2011.
  • Woodbury [1950] Max A. Woodbury. Inverting modified matrices. In Memorandum Rept. 42, Statistical Research Group, page 4. Princeton Univ., 1950.
  • Davis and Kahan [1970] Chandler Davis and William Morton Kahan. The rotation of eigenvectors by a perturbation. iii. SIAM Journal on Numerical Analysis, 7(1):1–46, 1970.
  • Karras et al. [2020] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546.
  • Choi et al. [2022] Jooyoung Choi, Jungbeom Lee, Chaehun Shin, Sungwon Kim, Hyunwoo J. Kim, and Sung-Hoon Yoon. Perception prioritized training of diffusion models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11462–11471, 2022. URL https://api.semanticscholar.org/CorpusID:247922317.
  • Karras et al. [2022c] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022c. URL https://openreview.net/forum?id=k7FuTOWMOc7.

Organization.

In Appendix A, we provide the experiment details and more results for the empirical study on low-rankness and local linearity. In Appendix B, we show extra details of LOCO Edit and T-LOCO Edit. In Appendix C, we present the proofs for Section 4. In Appendix D, we discuss image editing experiment details.

Appendix A More Empirical Study on Low-rankness & Local Linearity

A.1 Experiment Setup for Section 3.1

We evaluate the numerical rank of the Jacobian of the denoiser function $\bm{x}_{\bm{\theta}}(\bm{x}_t,t)$ for DDPM (U-Net [49] architecture) on the CIFAR-10 dataset [50] ($d=32\times 32\times 3$), U-ViT [51] (Transformer-based networks) on the CelebA [52] ($d=64\times 64\times 3$) and ImageNet [53] ($d=64\times 64\times 3$) datasets, and DeepFloyd IF [19] trained on the LAION-5B [54] dataset ($d=64\times 64\times 3$). Notably, the U-ViT architecture uses an autoencoder to compress the image $\bm{x}_0$ into an embedding vector $\bm{z}_0=\texttt{Encoder}(\bm{x}_0)$ and runs the forward diffusion process on the noisy embeddings $\bm{z}_t$; the reverse process replaces $\bm{x}_t,\bm{x}_{t-\Delta t}$ with $\bm{z}_t,\bm{z}_{t-\Delta t}$ in Equation 1. The generated image is $\bm{x}_0=\texttt{Decoder}(\bm{z}_0)$. The PMP defined for U-ViT is:

$$\bm{\hat{x}}_{0,t}=f_{\bm{\theta},t}(\bm{z}_t;t)\coloneqq\texttt{Decoder}\left(\frac{\bm{z}_t-\sqrt{1-\alpha_t}\,\bm{\epsilon}_{\bm{\theta}}(\bm{z}_t,t)}{\sqrt{\alpha_t}}\right). \tag{9}$$

The corresponding Jacobian is $\bm{J}_{\bm{\theta},t}(\bm{z}_t;t)=\nabla_{\bm{z}_t}f_{\bm{\theta},t}(\bm{z}_t;t)$, with $f_{\bm{\theta},t}(\bm{z}_t;t)$ defined above. DeepFloyd IF consists of three diffusion models, one for generation and two for super-resolution. Here we only evaluate $\bm{J}_{\bm{\theta},t}(\bm{z}_t;t)$ for the diffusion model that generates the images.

Given a random initial noise $\bm{x}_T$, the diffusion model $\bm{x}_{\bm{\theta}}$ generates an image sequence $\{\bm{x}_t\}$ following the reverse sampler in Equation 1. Along the sampling trajectory $\{\bm{x}_t\}$, for each $\bm{x}_t$, we calculate $\bm{J}_{\bm{\theta},t}(\bm{x}_t;t)$ ($\bm{J}_{\bm{\theta},t}(\bm{z}_t;t)$ for U-ViT) and compute its numerical rank via

$$\widetilde{\operatorname{rank}}(\bm{J}_{\bm{\theta},t}(\bm{x}_t))=\arg\min_{r}\left\{r:\frac{\sum_{i=1}^{r}\sigma_i^2\left(\bm{J}_{\bm{\theta},t}(\bm{x}_t;t)\right)}{\sum_{i=1}^{n}\sigma_i^2\left(\bm{J}_{\bm{\theta},t}(\bm{x}_t;t)\right)}>\eta^2\right\}, \tag{10}$$

where $\sigma_i(\bm{A})$ denotes the $i$th largest singular value of $\bm{A}$. In our experiments, we set $\eta=0.99$. We randomly generate $15$ initial noise samples $\bm{x}_t$ ($\bm{z}_t$ for U-ViT). We only use one prompt for DeepFloyd IF. We use DDIM with 100 steps for DDPM and DeepFloyd IF and DPM-Solver with 20 steps for U-ViT, select some of the steps to calculate $\operatorname{rank}(\bm{J}_{\bm{\theta},t}(\bm{x}_t;t))$, and report the averaged rank in Figure 2. To report the norm ratio and cosine similarity, we select the $t$ closest to 0.7 along the sampling trajectory and report the results in Figure 2, i.e., $t=0.71$ for DDPM, $t=0.66$ for U-ViT, and $t=0.69$ for DeepFloyd IF. The norm ratio and cosine similarity are also averaged over 15 samples.
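The numerical rank in Equation 10 can be illustrated with a minimal NumPy sketch. The function name `numerical_rank` is hypothetical, and the Jacobian is assumed to be available as an explicit matrix, which is only practical for small $d$; this is not the authors' implementation.

```python
import numpy as np

def numerical_rank(J, eta=0.99):
    """Smallest r whose top-r squared singular values exceed eta^2 of the total (Equation 10)."""
    s = np.linalg.svd(J, compute_uv=False)   # singular values in descending order
    total = np.sum(s ** 2)
    if total == 0.0:                         # all-zero matrix: define the rank as 0
        return 0
    energy = np.cumsum(s ** 2) / total       # fraction of squared spectrum kept by the top-r values
    # np.argmax on a boolean array returns the first index where the condition holds
    return int(np.argmax(energy > eta ** 2) + 1)
```

For a matrix with one dominant singular value the function returns 1, while for an identity matrix it returns the full dimension, since every direction carries the same energy.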

Figure 10: More results on the linearity of $\bm{f}_{\bm{\theta},t}(\bm{x}_t,t)$. Panels: (a) $t=0.0$; (b) $t=0.5$.
Figure 11: More empirical study on low-rankness and local linearity on more prompts and models trained with flow-matching objectives.

A.2 More Experiments for Section 3.1

We illustrate the norm ratio and cosine similarity for more timesteps in Figure 10, and for more text prompts and a flow-matching-based diffusion model in Figure 11. More specifically, for the plot of $t=0.0$, we actually use $t=0.04$ for DDPM, $t=0.005$ for U-ViT, and $t=0.09$ for DeepFloyd IF; for the plot of $t=0.5$, we actually use $t=0.49$ for DDPM, $t=0.50$ for U-ViT, and $t=0.49$ for DeepFloyd IF. The results align with Theorem 1: the closer $t$ is to 1, the better the linearity of $\bm{f}_{\bm{\theta},t}(\bm{x}_t,t)$.

A.3 Comparison for Low-rankness & Local Linearity for Different Manifold

Figure 12: (Left) Numerical rank of different Jacobians $\bm{J}$ at different timesteps $t$. (Right) Frobenius norm of different Jacobians $\bm{J}$ at different timesteps $t$.
Figure 13: (Left, Middle) Cosine similarity and norm ratio of different mappings with respect to $\lambda$. (Right) Symmetric property of $\dfrac{\partial\hat{\bm{x}}_{0,t}}{\partial\bm{x}_t}$ with respect to timestep $t$.

This section is an extension of Section 3.1. We study the low-rankness and local linearity of more mappings between spaces of diffusion models. The sampling process of a diffusion model involves the following spaces: $\bm{x}_t\in\mathcal{X}_t$, $\hat{\bm{x}}_{0,t}\in\mathcal{X}_{0,t}$, $\bm{h}_t\in\mathcal{H}_t$, $\bm{\epsilon}_t\in\mathcal{E}_t$, where $\mathcal{H}_t$ is the h-space, i.e., the U-Net's bottleneck feature space [29], and $\mathcal{E}_t$ is the predicted noise space.
First, we explore the rank ratio of the Jacobian $\bm{J}_{\bm{\theta},t}$ and its Frobenius norm $\|\bm{J}_{\bm{\theta},t}\|_F$ for: $\dfrac{\partial\bm{h}_t}{\partial\bm{x}_t},\dfrac{\partial\bm{\epsilon}_t}{\partial\bm{h}_t},\dfrac{\partial\hat{\bm{x}}_{0,t}}{\partial\bm{h}_t},\dfrac{\partial\bm{\epsilon}_t}{\partial\bm{x}_t},\dfrac{\partial\hat{\bm{x}}_{0,t}}{\partial\bm{x}_t}$. We use DDPM with U-Net architecture trained on the CIFAR-10 dataset; other experiment settings are the same as in Section A.1. Results are shown in Figure 12. The conclusions can be summarized as follows:

  • $\dfrac{\partial\bm{h}_t}{\partial\bm{x}_t},\dfrac{\partial\bm{\epsilon}_t}{\partial\bm{h}_t},\dfrac{\partial\hat{\bm{x}}_{0,t}}{\partial\bm{h}_t},\dfrac{\partial\hat{\bm{x}}_{0,t}}{\partial\bm{x}_t}$ are low-rank Jacobians when $t\in[0.2,0.7]$.
As shown on the left of Figure 12, the rank ratio for $\dfrac{\partial\bm{h}_t}{\partial\bm{x}_t},\dfrac{\partial\bm{\epsilon}_t}{\partial\bm{h}_t},\dfrac{\partial\hat{\bm{x}}_{0,t}}{\partial\bm{h}_t},\dfrac{\partial\hat{\bm{x}}_{0,t}}{\partial\bm{x}_t}$ is less than 0.1. It should be noted that:

    • $\widetilde{\operatorname{rank}}\left(\dfrac{\partial\bm{\epsilon}_t}{\partial\bm{x}_t}\right)\geq d-\widetilde{\operatorname{rank}}\left(\dfrac{\partial\hat{\bm{x}}_{0,t}}{\partial\bm{x}_t}\right)$. This is because

      $$\widetilde{\operatorname{rank}}\left(\frac{\sqrt{1-\alpha_t}}{\sqrt{\alpha_t}}\dfrac{\partial\bm{\epsilon}_t}{\partial\bm{x}_t}\right)\geq\widetilde{\operatorname{rank}}\left(\frac{1}{\sqrt{\alpha_t}}\bm{I}_d\right)-\widetilde{\operatorname{rank}}\left(\dfrac{\partial\hat{\bm{x}}_{0,t}}{\partial\bm{x}_t}\right).$$

      Therefore, $\dfrac{\partial\bm{\epsilon}_t}{\partial\bm{x}_t}$ is high rank when $\dfrac{\partial\hat{\bm{x}}_{0,t}}{\partial\bm{x}_t}$ is low rank.

    • $\widetilde{\operatorname{rank}}\left(\dfrac{\partial\hat{\bm{x}}_{0,t}}{\partial\bm{h}_t}\right)=\widetilde{\operatorname{rank}}\left(\dfrac{\partial\bm{\epsilon}_t}{\partial\bm{h}_t}\right)$. This is because $\hat{\bm{x}}_{0,t}=\dfrac{\bm{x}_t-\sqrt{1-\alpha_t}\,\bm{\epsilon}_{\bm{\theta}}(\bm{x}_t,t)}{\sqrt{\alpha_t}}$ and $\dfrac{\partial\bm{x}_t}{\partial\bm{h}_t}=0$, so that $\dfrac{\partial\hat{\bm{x}}_{0,t}}{\partial\bm{h}_t}=-\dfrac{\sqrt{1-\alpha_t}}{\sqrt{\alpha_t}}\dfrac{\partial\bm{\epsilon}_t}{\partial\bm{h}_t}$, and a nonzero scaling does not change the numerical rank.

  • When $\bm{x}_t$ is fixed, $\hat{\bm{x}}_{0,t}$ and $\bm{\epsilon}_t$ change little when $\bm{h}_t$ changes. As shown on the right of Figure 12, $\left\|\dfrac{\partial\hat{\bm{x}}_{0,t}}{\partial\bm{h}_t}\right\|_F\ll\left\|\dfrac{\partial\hat{\bm{x}}_{0,t}}{\partial\bm{x}_t}\right\|_F$ and $\left\|\dfrac{\partial\bm{\epsilon}_t}{\partial\bm{h}_t}\right\|_F\ll\left\|\dfrac{\partial\bm{\epsilon}_t}{\partial\bm{x}_t}\right\|_F$.
This means that, with $\bm{x}_t$ held fixed, perturbing $\bm{h}_t$ has only a small effect on $\hat{\bm{x}}_{0,t}$ and $\bm{\epsilon}_t$.
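The rank inequality in the first item above can be traced back to the definition of $\hat{\bm{x}}_{0,t}$. The short derivation below is a sketch: it uses the subadditivity of the exact matrix rank, $\operatorname{rank}(\bm{A}-\bm{B})\geq\operatorname{rank}(\bm{A})-\operatorname{rank}(\bm{B})$, assumes $0<\alpha_t<1$ so that the scalar factors do not change the rank, and assumes that the same relation carries over approximately to the numerical rank $\widetilde{\operatorname{rank}}$.

```latex
\begin{align*}
\hat{\bm{x}}_{0,t}
  &= \frac{\bm{x}_t-\sqrt{1-\alpha_t}\,\bm{\epsilon}_{\bm{\theta}}(\bm{x}_t,t)}{\sqrt{\alpha_t}}
  \;\Longrightarrow\;
  \frac{\partial\hat{\bm{x}}_{0,t}}{\partial\bm{x}_t}
  = \frac{1}{\sqrt{\alpha_t}}\bm{I}_d
    - \frac{\sqrt{1-\alpha_t}}{\sqrt{\alpha_t}}\frac{\partial\bm{\epsilon}_t}{\partial\bm{x}_t}, \\
\frac{\sqrt{1-\alpha_t}}{\sqrt{\alpha_t}}\frac{\partial\bm{\epsilon}_t}{\partial\bm{x}_t}
  &= \frac{1}{\sqrt{\alpha_t}}\bm{I}_d - \frac{\partial\hat{\bm{x}}_{0,t}}{\partial\bm{x}_t}, \\
\operatorname{rank}\left(\frac{\partial\bm{\epsilon}_t}{\partial\bm{x}_t}\right)
  &\geq \operatorname{rank}(\bm{I}_d)
    - \operatorname{rank}\left(\frac{\partial\hat{\bm{x}}_{0,t}}{\partial\bm{x}_t}\right)
  = d - \operatorname{rank}\left(\frac{\partial\hat{\bm{x}}_{0,t}}{\partial\bm{x}_t}\right).
\end{align*}
```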

Then, we also study the linearity of $\bm{h}_t$ and $\hat{\bm{x}}_{0,t}$ given $\bm{x}_t$, using DDPM with U-Net architecture trained on the CIFAR-10 dataset. We vary the step size $\lambda$ defined in Equation 4. As shown in Figure 13, both $\bm{h}_t$ and $\hat{\bm{x}}_{0,t}$ have good linearity with respect to $\bm{x}_t$.
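As an illustration of how such a linearity check might be set up, the sketch below compares a mapping evaluated at a perturbed input with its first-order approximation. It assumes that the two metrics compare $\bm{f}(\bm{x}+\lambda\Delta\bm{x})$ with $\bm{f}(\bm{x})+\lambda\bm{J}\Delta\bm{x}$; the exact definitions are those of Section 3.1 and Equation 4, which are not restated here, and the function name `linearity_metrics` is hypothetical.

```python
import numpy as np

def linearity_metrics(f, J, x, dx, lam):
    """Norm ratio and cosine similarity between f(x + lam*dx) and its linearization.

    Assumed definitions (not taken verbatim from the paper):
      exact  = f(x + lam * dx)
      linear = f(x) + lam * J @ dx
    """
    exact = f(x + lam * dx)
    linear = f(x) + lam * (J @ dx)
    norm_ratio = np.linalg.norm(linear) / np.linalg.norm(exact)
    cosine = float(linear @ exact / (np.linalg.norm(linear) * np.linalg.norm(exact)))
    return norm_ratio, cosine
```

For an exactly linear mapping both quantities equal 1 for every $\lambda$; deviations from 1 as $\lambda$ grows indicate how quickly the local linear model breaks down.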

In Theorem 1, the Jacobian $\nabla_{\bm{x}_t}\mathbb{E}[\bm{x}_0|\bm{x}_t]$ is a symmetric matrix. Therefore, we also verify the symmetry of the Jacobian of the PMP, $\bm{J}_{\bm{\theta},t}$. We use DDPM with U-Net architecture trained on the CIFAR-10 dataset. At different timesteps $t$, we measure $\|\bm{J}_{\bm{\theta},t}-\bm{J}^{\top}_{\bm{\theta},t}\|_F$. Results are shown on the right of Figure 13. $\bm{J}_{\bm{\theta},t}$ is close to symmetric when $t<0.1$ and when $t\in[0.6,0.7]$. Additionally, $\bm{J}_{\bm{\theta},t}$ is low rank when $t\in[0.6,0.7]$. Hence $\bm{J}_{\bm{\theta},t}$ aligns with Theorem 1 for $t\in[0.6,0.7]$.
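The symmetry measurement above amounts to a single Frobenius norm. A minimal NumPy sketch, assuming the Jacobian is available as an explicit square matrix and using the hypothetical name `asymmetry`:

```python
import numpy as np

def asymmetry(J):
    """Frobenius norm of J - J^T; zero exactly when J is symmetric."""
    return float(np.linalg.norm(J - J.T, ord="fro"))
```

A value near zero indicates that the learned Jacobian behaves like the symmetric Jacobian of the posterior mean in Theorem 1.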

Finally, based on the experiments in Figure 12 and Figure 13, we select the best space for our image editing method. $\dfrac{\partial\bm{\epsilon}_t}{\partial\bm{x}_t}$ is a high-rank matrix and is therefore not suitable for efficiently estimating the nullspace; $\dfrac{\partial\bm{\epsilon}_t}{\partial\bm{h}_t}$ and $\dfrac{\partial\hat{\bm{x}}_{0,t}}{\partial\bm{h}_t}$ have Frobenius norms that are too small to edit the image. Therefore, only $\dfrac{\partial\bm{h}_t}{\partial\bm{x}_t}$ and $\dfrac{\partial\hat{\bm{x}}_{0,t}}{\partial\bm{x}_t}$ are low-rank and linear, and thus suitable for image editing.
Moreover, the $\bm{h}_t$ space is restricted to the U-Net architecture, whereas the property of $\dfrac{\partial\hat{\bm{x}}_{0,t}}{\partial\bm{x}_t}$ does not depend on the U-Net architecture and is also verified in diffusion models using transformer architectures. Additionally, masks can be applied to $\hat{\bm{x}}_{0,t}$ but not to $\bm{h}_t$. Therefore, the PMP $\bm{f}_{\bm{\theta},t}$ is the best mapping for image editing.

Appendix B Extra Details of LOCO Edit and T-LOCO Edit

B.1 Generalized Power Method

The Generalized Power Method [34, 30] for calculating the top-$k$ singular vectors of the Jacobian is summarized in Algorithm 2. It efficiently computes the top-$k$ singular values and singular vectors of the Jacobian, starting from a randomly initialized orthonormal $\bm{V}\in\mathbb{R}^{d\times k}$.

Algorithm 2 Generalized Power Method
1: Input: $\bm{f}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$, $\bm{x}\in\mathbb{R}^{d}$, and $\bm{V}\in\mathbb{R}^{d\times k}$
2: Output: $\left(\bm{U},\bm{\Sigma},\bm{V}^{\top}\right)$, the top-$k$ singular values and vectors of the Jacobian $\dfrac{\partial\bm{f}}{\partial\bm{x}}$
3: $\bm{y}\leftarrow\bm{f}(\bm{x})$
4: if $\bm{V}$ is empty then
5:     $\bm{V}\leftarrow$ i.i.d. standard Gaussian samples
6: end if
7: $\bm{Q},\bm{R}\leftarrow\mathrm{QR}(\bm{V})$ $\triangleright$ Reduced $\mathrm{QR}$ decomposition
8: $\bm{V}\leftarrow\bm{Q}$ $\triangleright$ Ensures $\bm{V}^{\top}\bm{V}=\bm{I}$
9: while stopping criteria do
10:     $\bm{U}\leftarrow\dfrac{\partial\bm{f}(\bm{x}+a\bm{V})}{\partial a}$ at $a=0$ $\triangleright$ Batch forward
11:     $\hat{\bm{V}}\leftarrow\dfrac{\partial\left(\bm{U}^{\top}\bm{y}\right)}{\partial\bm{x}}$
12:     $\bm{V},\bm{\Sigma}^{2},\bm{R}\leftarrow\operatorname{SVD}(\hat{\bm{V}})$ $\triangleright$ Reduced SVD
13: end while
14: Orthonormalize $\bm{U}$
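
The iteration in Algorithm 2 can be sketched in NumPy as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the callables `jvp` and `vjp` are hypothetical stand-ins for the Jacobian-vector and vector-Jacobian products that steps 10 and 11 would obtain from automatic differentiation, and a small explicit matrix `J` plays the role of the Jacobian. The stopping rule (change of the estimated subspace below `tol`) and the final recomputation of $\bm{U}$ are choices made here for the example.

```python
import numpy as np

def generalized_power_method(jvp, vjp, d, k, num_iters=100, tol=1e-10, seed=0):
    """Approximate the top-k SVD of a d x d Jacobian J using only products J @ V and J.T @ U."""
    rng = np.random.default_rng(seed)
    # Steps 4-8: random Gaussian start, orthonormalized by a reduced QR.
    V, _ = np.linalg.qr(rng.standard_normal((d, k)))
    for _ in range(num_iters):
        U = jvp(V)                      # step 10: J V
        V_hat = vjp(U)                  # step 11: J^T J V
        # Step 12: reduced SVD; the left factor is the new orthonormal V.
        V_new, _, _ = np.linalg.svd(V_hat, full_matrices=False)
        # Assumed stopping rule: the old V already lies in the span of the new V.
        change = np.linalg.norm(V_new @ (V_new.T @ V) - V)
        V = V_new
        if change < tol:
            break
    # Step 14: at convergence the columns of J V are orthogonal, so
    # normalizing them gives U, and their norms are the singular values.
    U = jvp(V)
    S = np.linalg.norm(U, axis=0)       # assumes the top-k singular values are nonzero
    U = U / S
    return U, S, V

# Toy Jacobian with known singular values (an assumption for illustration only).
rng = np.random.default_rng(1)
d, k = 6, 3
U0, _ = np.linalg.qr(rng.standard_normal((d, d)))
V0, _ = np.linalg.qr(rng.standard_normal((d, d)))
s0 = np.array([10.0, 5.0, 2.0, 0.1, 0.05, 0.01])
J = U0 @ np.diag(s0) @ V0.T

U, S, V = generalized_power_method(lambda M: J @ M, lambda M: J.T @ M, d, k)
```

Because only matrix products with $\bm{V}$ and $\bm{U}$ are needed, the full $d\times d$ Jacobian never has to be formed, which is what makes the method practical for image-sized inputs.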

B.2 Unsupervised T-LOCO Edit

The overall method for DeepFloyd is summarized in Algorithm 3. For T2I diffusion models in the latent space such as Stable Diffusion and Latent Consistency Model, at time $t$, we additionally decode $\bm{\hat{z}}_{0}$ into the image space $\bm{\hat{x}}_{0}$ to enable masking and nullspace projection. The editing is still in the space of $\bm{z}_{t}$.

Algorithm 3 Unsupervised T-LOCO Edit for T2I diffusion models
1: Input: Random noise $\bm{x}_{T}$, the mask $\Omega$, edit timestep $t$, pretrained diffusion model $\bm{\epsilon}_{\bm{\theta}}$, editing scale $\lambda$, noise scheduler $\alpha_{t},\sigma_{t}$, selected semantic index $k$, nullspace approximate rank $r$, original prompt $c_{o}$, null prompt $c_{n}$, classifier-free guidance scale $s$.
2: Output: Edited image $\bm{x}^{\prime}_{0}$.
3: $\bm{x}_{t}\leftarrow\text{DDIM}(\bm{x}_{T},1,t,\bm{\epsilon}_{\bm{\theta}}(\bm{x}_{T},t,c_{n})+s(\bm{\epsilon}_{\bm{\theta}}(\bm{x}_{T},t,c_{o})-\bm{\epsilon}_{\bm{\theta}}(\bm{x}_{T},t,c_{n})))$
4: $\bm{\hat{x}}_{0,t}\leftarrow\bm{f}^{o}_{\bm{\theta},t}(\bm{x}_{t})$
5: Masking by $\bm{\tilde{x}}_{0,t}\leftarrow\mathcal{P}_{\Omega}(\bm{\hat{x}}_{0,t})$ and $\bm{\bar{x}}_{0,t}\leftarrow\bm{\hat{x}}_{0,t}-\bm{\tilde{x}}_{0,t}$ $\triangleright$ Use the mask for local image editing
6: The top-$k$ SVD $(\bm{\tilde{U}}_{t,k},\bm{\tilde{\Sigma}}_{t,k},\bm{\tilde{V}}_{t,k})$ of $\bm{\tilde{J}}_{\bm{\theta},t}=\dfrac{\partial\bm{\tilde{x}}_{0,t}}{\partial\bm{x}_{t}}$ $\triangleright$ Efficiently computed via generalized power method
7: The top-$r$ SVD $(\bm{\bar{U}}_{t,r},\bm{\bar{\Sigma}}_{t,r},\bm{\bar{V}}_{t,r})$ of $\bm{\bar{J}}_{\bm{\theta},t}=\dfrac{\partial\bm{\bar{x}}_{0,t}}{\partial\bm{x}_{t}}$ $\triangleright$ Efficiently computed via generalized power method
8: Pick direction $\bm{v}\leftarrow\bm{\tilde{V}}_{t,k}[:,i]$ $\triangleright$ Pick the $i^{th}$ singular vector for editing within the mask $\Omega$
9: Compute $\bm{v}_{p}\leftarrow(\bm{I}-\bm{\bar{V}}_{t,r}\bm{\bar{V}}_{t,r}^{\top})\cdot\bm{v}$ $\triangleright$ Nullspace projection for editing within the mask $\Omega$
10: $\bm{v}_{p}\leftarrow\dfrac{\bm{v}_{p}}{\|\bm{v}_{p}\|_{2}}$ $\triangleright$ Normalize the editing direction
11: $\bm{x}_{t}^{\prime}\leftarrow\bm{x}_{t}+\lambda\bm{v}_{p}$
12: $\bm{x}^{\prime}_{0}\leftarrow\text{DDIM}(\bm{x}_{t}^{\prime},t,0,\bm{\epsilon}_{\bm{\theta}}(\bm{x}_{t},t,c_{n})+s(\bm{\epsilon}_{\bm{\theta}}(\bm{x}_{t},t,c_{o})-\bm{\epsilon}_{\bm{\theta}}(\bm{x}_{t},t,c_{n})))$
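
Steps 9 to 11 of Algorithm 3 (nullspace projection, normalization, and the single-step edit) can be illustrated with a short NumPy sketch. It assumes images are flattened into vectors and that `V_bar` already holds the top-$r$ right singular vectors of the Jacobian outside the mask; the function name `project_and_edit` and the random toy inputs are hypothetical and only serve the illustration.

```python
import numpy as np

def project_and_edit(x_t, v, V_bar, lam):
    """Project v onto the orthogonal complement of span(V_bar), normalize, and shift x_t."""
    # Step 9: (I - V_bar V_bar^T) v, computed without forming the d x d projector.
    v_p = v - V_bar @ (V_bar.T @ v)
    norm = np.linalg.norm(v_p)
    if norm < 1e-12:
        raise ValueError("v lies (numerically) inside span(V_bar); nothing left to edit with")
    v_p = v_p / norm                    # step 10: unit-norm editing direction
    return x_t + lam * v_p, v_p         # step 11: single-step edit of the noisy image

# Toy inputs (assumptions for illustration): d = 16 pixels, r = 4 directions to protect.
rng = np.random.default_rng(0)
d, r = 16, 4
V_bar, _ = np.linalg.qr(rng.standard_normal((d, r)))   # orthonormal columns
v = rng.standard_normal(d)
v /= np.linalg.norm(v)
x_t = rng.standard_normal(d)
lam = 2.5

x_t_edit, v_p = project_and_edit(x_t, v, V_bar, lam)
```

After the projection, `v_p` has no component along any column of `V_bar`, so to first order the edit leaves the region outside the mask unchanged.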

B.3 Text-supervised T-LOCO Edit

Before introducing the algorithm, we define:

$$\bm{f}^{o}_{\bm{\theta},t}(\bm{x}_{t})=\dfrac{\bm{x}_{t}-\alpha_{t}\sigma_{t}\left(\bm{\epsilon}_{\bm{\theta}}(\bm{x}_{t},t,c_{n})+s(\bm{\epsilon}_{\bm{\theta}}(\bm{x}_{t},t,c_{o})-\bm{\epsilon}_{\bm{\theta}}(\bm{x}_{t},t,c_{n}))\right)}{\alpha_{t}},\tag{11}$$

and

$$\bm{f}^{e}_{\bm{\theta},t}(\bm{x}_{t})=\bm{f}^{o}_{\bm{\theta},t}(\bm{x}_{t})+\dfrac{m(\bm{\epsilon}_{\bm{\theta}}(\bm{x}_{t},t,c_{e})-\bm{\epsilon}_{\bm{\theta}}(\bm{x}_{t},t,c_{n}))}{\alpha_{t}},\tag{12}$$

to be the posterior mean predictors when using classifier-free guidance on the original prompt $c_{o}$, and on both the original prompt $c_{o}$ and the edit prompt $c_{e}$, respectively.

Algorithm.

The overall method for DeepFloyd is summarized in Algorithm 4. For T2I diffusion models in the latent space such as Stable Diffusion and Latent Consistency Model, at time $t$, we additionally decode $\bm{\hat{z}}_{0}$ into the image space $\bm{\hat{x}}_{0}$ to enable masking and nullspace projection. The editing is in the space of $\bm{z}_{t}$ for Stable Diffusion and Latent Consistency Model. The method is not intended to outperform other T2I editing methods, but to serve as a way both to understand semantic correspondences in the low-rank subspaces of T2I diffusion models and to utilize these subspaces for semantic control in a more interpretable way. We hope to inspire and open up directions for understanding T2I diffusion models and for using that understanding in versatile applications.

Here, we want to find a specific change direction $\bm{v}_{p}$ in the $\bm{x}_{t}$ space that can provide target edited images in the space of $\bm{x}_{0}$ by directly moving $\bm{x}_{t}$ along $\bm{v}_{p}$: the whole generation is not conditioned on $c_{e}$ at all, except that we utilize $c_{e}$ in finding the editing direction $\bm{v}_{p}$. This is in contrast to the method proposed in [30], where additional semantic information is injected via indirect x-space guidance conditioned on the edit prompt at time $t$. We hope to discover an editing direction that is expressive enough by itself to perform semantic editing.

Intuition.

Let $\bm{\hat{x}}_{0,t}^{o}$ be the estimated posterior mean conditioned on the original prompt $c_{o}$, and $\bm{\hat{x}}_{0,t}^{e}$ be the estimated posterior mean conditioned on both the original prompt $c_{o}$ and the edit prompt $c_{e}$. Let $\bm{J}_{\bm{\theta},t}^{o}$ and $\bm{J}_{\bm{\theta},t}^{e}$ be their respective Jacobians with respect to the noisy image $\bm{x}_{t}$.
The key intuitions, inspired by the unconditional case, are: i) the target editing direction $\bm{v}$ in the $\bm{x}_{t}$ space is homogeneous between the subspaces of $\bm{J}_{\bm{\theta},t}^{o}$ and $\bm{J}_{\bm{\theta},t}^{e}$; ii) the editing direction $\bm{v}$ that is found can effectively reside in the direction of a right singular vector of both $\bm{J}_{\bm{\theta},t}^{o}$ and $\bm{J}_{\bm{\theta},t}^{e}$; iii) $\bm{\hat{x}}_{0,t}^{e}$ and $\bm{\hat{x}}_{0,t}^{o}$ are locally linear.

Define $\bm{d}=\bm{\hat{x}}_{0,t}^{e}-\bm{\hat{x}}_{0,t}^{o}$ as the change of the estimated posterior mean. Let $\bm{J}_{\bm{\theta},t}^{e}=\bm{U}_{t}^{e}\bm{S}_{t}^{e}\bm{V}_{t}^{e\top}$ be the SVD of $\bm{J}_{\bm{\theta},t}^{e}$, with $s_{i}^{e}$, $\bm{u}_{i}^{e}$, and $\bm{v}_{i}^{e}$ denoting its $i$-th singular value, left singular vector, and right singular vector; then $\bm{v}=\pm\bm{v}^{e}_{i}$ for some $i$.
Besides, we have $\bm{\hat{x}}_{0,t}^{e}=\bm{\hat{x}}_{0,t}^{o}+\lambda^{o}\bm{J}_{\bm{\theta},t}^{o}\bm{v}$ and $\bm{\hat{x}}_{0,t}^{o}=\bm{\hat{x}}_{0,t}^{e}+\lambda^{e}\bm{J}_{\bm{\theta},t}^{e}\bm{v}$ due to homogeneity and linearity.
Hence, $\bm{d}=-\lambda^{e}\bm{J}_{\bm{\theta},t}^{e}\bm{v}=\pm\lambda^{e}s_{i}^{e}\bm{u}_{i}^{e}$, and then $\bm{J}_{\bm{\theta},t}^{e\top}\bm{d}=\pm\lambda^{e}(s_{i}^{e})^{2}\bm{v}_{i}^{e}=\pm\lambda^{e}(s_{i}^{e})^{2}\bm{v}$, which is along the desired direction $\bm{v}$. This $\bm{v}$, identified through the subspace of $\bm{J}_{\bm{\theta},t}^{e}$, can be effectively transferred to $\bm{J}_{\bm{\theta},t}^{o}$ for controlling the editing of target semantics. We further apply nullspace projection based on $\bm{J}_{\bm{\theta},t}^{o}$ to obtain the final editing direction $\bm{v}_{p}$.
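
The key step of this argument, that $\bm{J}_{\bm{\theta},t}^{e\top}\bm{d}$ is parallel to $\bm{v}$ when $\bm{v}$ is a right singular vector, can be checked numerically. The sketch below is a toy verification under stated assumptions: `J_e` is a hypothetical random low-rank matrix standing in for the Jacobian, `lam_e` is an arbitrary scalar, and `vjp_direction` is an illustrative helper, not part of the authors' code.

```python
import numpy as np

def vjp_direction(J_e, v, lam_e):
    """Form d = -lam_e * J_e v and return J_e^T d, the vector-Jacobian product of d."""
    d = -lam_e * (J_e @ v)
    return J_e.T @ d

# Hypothetical low-rank Jacobian J_e = U S V^T built from random orthonormal factors.
rng = np.random.default_rng(0)
dim, rank = 8, 3
U, _ = np.linalg.qr(rng.standard_normal((dim, rank)))
V, _ = np.linalg.qr(rng.standard_normal((dim, rank)))
S = np.array([4.0, 2.0, 1.0])
J_e = U @ np.diag(S) @ V.T

i, lam_e = 1, 0.7
v = V[:, i]                              # editing direction: i-th right singular vector
g = vjp_direction(J_e, v, lam_e)         # equals -lam_e * S[i]**2 * v

# Cosine similarity between g and v; its magnitude is 1 when they are parallel.
cosine = g @ v / (np.linalg.norm(g) * np.linalg.norm(v))
```

In this idealized setting the recovered vector is an exact multiple of $\bm{v}$; with a real network the relation holds only approximately, to the extent that the homogeneity and local linearity assumptions above are satisfied.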

Algorithm 4 Text-supervised T-LOCO Edit for T2I diffusion models
1: Input: Random noise $\bm{x}_{T}$, the mask $\Omega$, edit timestep $t$, pretrained diffusion model $\bm{\epsilon}_{\bm{\theta}}$, editing scale $\lambda$, noise scheduler $\alpha_{t},\sigma_{t}$, selected semantic index $k$, nullspace approximate rank $r$, original prompt $c_{o}$, edit prompt $c_{e}$, null prompt $c_{n}$, classifier-free guidance scale $s$.
2: Output: Edited image $\bm{x}^{\prime}_{0}$.
3: $\bm{x}_{t}\leftarrow\text{DDIM}(\bm{x}_{T},1,t,\bm{\epsilon}_{\bm{\theta}}(\bm{x}_{T},t,c_{n})+s(\bm{\epsilon}_{\bm{\theta}}(\bm{x}_{T},t,c_{o})-\bm{\epsilon}_{\bm{\theta}}(\bm{x}_{T},t,c_{n})))$
4: $\bm{\hat{x}}_{0,t}^{o}\leftarrow\bm{f}^{o}_{\bm{\theta},t}(\bm{x}_{t})$
5: $\bm{\hat{x}}_{0,t}^{e}\leftarrow\bm{f}^{e}_{\bm{\theta},t}(\bm{x}_{t})$
6: $\bm{d}\leftarrow\mathcal{P}_{\Omega}\left(\bm{\hat{x}}_{0,t}^{e}-\bm{\hat{x}}_{0,t}^{o}\right)$
7: $\bm{\tilde{x}}_{0,t}\leftarrow\mathcal{P}_{\Omega}(\bm{\hat{x}}_{0,t}^{e})$
8: $\bm{v}\leftarrow\dfrac{\partial(\bm{d}^{\top}\bm{\tilde{x}}_{0,t})}{\partial\bm{x}_{t}}$ $\triangleright$ Get text-supervised editing direction within the mask
9: $\bm{\bar{x}}_{0,t}\leftarrow\bm{\hat{x}}_{0,t}^{o}-\mathcal{P}_{\Omega}(\bm{\hat{x}}_{0,t}^{o})$
10:The top-r𝑟ritalic_r SVD (𝑼¯t,r,𝚺¯t,r,𝑽¯t,r)subscriptbold-¯𝑼𝑡𝑟subscriptbold-¯𝚺𝑡𝑟subscriptbold-¯𝑽𝑡𝑟(\bm{\bar{U}}_{t,r},\bm{\bar{\Sigma}}_{t,r},{\color[rgb]{0,0,1}\definecolor[% named]{pgfstrokecolor}{rgb}{0,0,1}\bm{\bar{V}}_{t,r}})( overbold_¯ start_ARG bold_italic_U end_ARG start_POSTSUBSCRIPT italic_t , italic_r end_POSTSUBSCRIPT , overbold_¯ start_ARG bold_Σ end_ARG start_POSTSUBSCRIPT italic_t , italic_r end_POSTSUBSCRIPT , overbold_¯ start_ARG bold_italic_V end_ARG start_POSTSUBSCRIPT italic_t , italic_r end_POSTSUBSCRIPT ) of 𝑱¯𝜽,t=𝒙¯0,t𝒙tsubscriptbold-¯𝑱𝜽𝑡subscriptbold-¯𝒙0𝑡subscript𝒙𝑡{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}\bm{\bar{J}}% _{\bm{\theta},t}}=\dfrac{\partial{\color[rgb]{0,0,1}\definecolor[named]{% pgfstrokecolor}{rgb}{0,0,1}\bm{\bar{x}}_{0,t}}}{\partial\bm{x}_{t}}overbold_¯ start_ARG bold_italic_J end_ARG start_POSTSUBSCRIPT bold_italic_θ , italic_t end_POSTSUBSCRIPT = divide start_ARG ∂ overbold_¯ start_ARG bold_italic_x end_ARG start_POSTSUBSCRIPT 0 , italic_t end_POSTSUBSCRIPT end_ARG start_ARG ∂ bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG \triangleright Efficiently computed via generalized power method
11:𝒗p(𝑰𝑽¯t,r𝑽¯t,r)𝒗subscript𝒗𝑝𝑰subscriptbold-¯𝑽𝑡𝑟superscriptsubscriptbold-¯𝑽𝑡𝑟top𝒗{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}\bm{v}_{p}}% \leftarrow(\bm{I}-{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{% 0,0,1}\bm{\bar{V}}_{t,r}\bm{\bar{V}}_{t,r}^{\top}})\cdot{\color[rgb]{0,0.88,0}% \definecolor[named]{pgfstrokecolor}{rgb}{0,0.88,0}\pgfsys@color@cmyk@stroke{0.% 91}{0}{0.88}{0.12}\pgfsys@color@cmyk@fill{0.91}{0}{0.88}{0.12}\bm{v}}bold_italic_v start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ← ( bold_italic_I - overbold_¯ start_ARG bold_italic_V end_ARG start_POSTSUBSCRIPT italic_t , italic_r end_POSTSUBSCRIPT overbold_¯ start_ARG bold_italic_V end_ARG start_POSTSUBSCRIPT italic_t , italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ) ⋅ bold_italic_v \triangleright nullspace projection for editing within the mask
12:𝒗p𝒗p𝒗p2subscript𝒗𝑝subscript𝒗𝑝subscriptnormsubscript𝒗𝑝2{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}\bm{v}_{p}}% \leftarrow\frac{{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{% 1,0,0}\bm{v}_{p}}}{\|{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{% rgb}{1,0,0}\bm{v}_{p}}\|_{2}}bold_italic_v start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ← divide start_ARG bold_italic_v start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT end_ARG start_ARG ∥ bold_italic_v start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG \triangleright Normalize the editing direction
13:𝒙t𝒙t+λ𝒗psuperscriptsubscript𝒙𝑡subscript𝒙𝑡𝜆subscript𝒗𝑝\bm{x}_{t}^{\prime}\leftarrow\bm{x}_{t}+{\color[rgb]{1,0,0}\definecolor[named]% {pgfstrokecolor}{rgb}{1,0,0}\lambda\bm{v}_{p}}bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ← bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + italic_λ bold_italic_v start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT
14:𝒙𝟎DDIM(𝒙t,t,0,ϵ𝜽(𝒙t,t,cn)+s(ϵ𝜽(𝒙t,t,co)ϵ𝜽(𝒙t,t,cn)))subscriptsuperscript𝒙bold-′0DDIMsuperscriptsubscript𝒙𝑡𝑡0subscriptbold-italic-ϵ𝜽subscript𝒙𝑡𝑡subscript𝑐𝑛𝑠subscriptbold-italic-ϵ𝜽subscript𝒙𝑡𝑡subscript𝑐𝑜subscriptbold-italic-ϵ𝜽subscript𝒙𝑡𝑡subscript𝑐𝑛\bm{x^{\prime}_{0}}\leftarrow\text{{DDIM}}(\bm{x}_{t}^{\prime},t,0,\bm{% \epsilon}_{\bm{\theta}}(\bm{x}_{t},t,c_{n})+s(\bm{\epsilon}_{\bm{\theta}}(\bm{% x}_{t},t,c_{o})-\bm{\epsilon}_{\bm{\theta}}(\bm{x}_{t},t,c_{n})))bold_italic_x start_POSTSUPERSCRIPT bold_′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT bold_0 end_POSTSUBSCRIPT ← DDIM ( bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_t , 0 , bold_italic_ϵ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t , italic_c start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) + italic_s ( bold_italic_ϵ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t , italic_c start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT ) - bold_italic_ϵ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t , italic_c start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) ) )
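
Steps 11 to 13 of the algorithm can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function name `project_and_edit`, the toy dimensions, and the random stand-ins for $\bm{v}$ and $\bm{\bar{V}}_{t,r}$ are all assumptions made for illustration, and the diffusion model, mask, and SVD of the Jacobian are not included.

```python
import numpy as np

def project_and_edit(x_t, v, V_bar, lam):
    """Sketch of steps 11-13: nullspace projection, normalization, linear edit.

    x_t   : (d,) flattened noisy image
    v     : (d,) raw editing direction
    V_bar : (d, r) matrix with orthonormal columns (stand-in for top-r right singular vectors)
    lam   : scalar edit strength
    """
    # (I - V V^T) v, computed without forming the d x d projector.
    v_p = v - V_bar @ (V_bar.T @ v)
    norm = np.linalg.norm(v_p)
    if norm < 1e-12:
        raise ValueError("v lies (numerically) in span(V_bar); no direction is left")
    v_p = v_p / norm               # unit-norm editing direction
    return x_t + lam * v_p, v_p    # edited noisy image and the direction used

# Toy usage with random stand-ins (hypothetical sizes, not from the paper).
rng = np.random.default_rng(0)
d, r = 16, 3
V_bar, _ = np.linalg.qr(rng.standard_normal((d, r)))  # orthonormal columns
x_t = rng.standard_normal(d)
v = rng.standard_normal(d)
x_edit, v_p = project_and_edit(x_t, v, V_bar, lam=2.0)
```

After the projection, `v_p` is orthogonal to every column of `V_bar`, which is what keeps the edit from moving the image along the directions associated with the region outside the mask.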

Appendix C Proofs in Section 4

C.1 Proofs of Lemma 1

Proof of Lemma 1.

Under Assumption 1, we can compute the noised distribution $p_t(\bm{x}_t)$ at any timestep $t$:

\begin{align*}
p_t(\bm{x}_t) &= \frac{1}{K}\sum_{k=1}^{K} p_t(\bm{x}_t \mid \text{``}\bm{x}_0\text{ belongs to class }k\text{''}) \\
&= \frac{1}{K}\sum_{k=1}^{K} \int p_t(\bm{x}_t \mid \bm{x}_0 = \bm{M}_k\bm{a}_k, \text{``}\bm{x}_0\text{ belongs to class }k\text{''})\, \mathcal{N}(\bm{a}_k; \bm{0}, \bm{I}_{r_k})\, d\bm{a}_k.
\end{align*}

Because $\bm{a}_k \sim \mathcal{N}(\bm{0}, \bm{I}_{r_k})$ and $p_t(\bm{x}_t \mid \bm{x}_0 = \bm{M}_k\bm{a}_k, \text{``}\bm{x}_0\text{ belongs to class }k\text{''}) = \mathcal{N}(\sqrt{\alpha_t}\bm{M}_k\bm{a}_k, (1-\alpha_t)\bm{I}_d)$, the relationship between conditional and marginal Gaussian distributions gives $p_t(\bm{x}_t \mid \text{``}\bm{x}_0\text{ belongs to class }k\text{''}) = \mathcal{N}(\bm{0}, \alpha_t\bm{M}_k\bm{M}_k^{\top} + (1-\alpha_t)\bm{I}_d)$.

Then, we have

\begin{align*}
p_t(\bm{x}_t) &= \frac{1}{K}\sum_{k=1}^{K} \mathcal{N}(\bm{0}, \alpha_t\bm{M}_k\bm{M}_k^{\top} + (1-\alpha_t)\bm{I}_d).
\end{align*}
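
The per-class covariance $\alpha_t\bm{M}_k\bm{M}_k^{\top} + (1-\alpha_t)\bm{I}_d$ can be checked by simulation. The sketch below is only a sanity check under stated assumptions: a single class, a hypothetical basis `M` with orthonormal columns, the forward process $\bm{x}_t = \sqrt{\alpha_t}\bm{x}_0 + \sqrt{1-\alpha_t}\bm{\epsilon}$, and toy values for `d`, `r`, `alpha_t`, and the sample count `n`.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha_t, n = 4, 2, 0.6, 200_000   # toy sizes, chosen for illustration

# Hypothetical class basis M_k with orthonormal columns.
M, _ = np.linalg.qr(rng.standard_normal((d, r)))

a = rng.standard_normal((n, r))      # a_k ~ N(0, I_r), one row per sample
eps = rng.standard_normal((n, d))    # forward-process noise
x_t = np.sqrt(alpha_t) * a @ M.T + np.sqrt(1.0 - alpha_t) * eps

# The samples have zero mean by construction, so the second moment is the covariance.
empirical_cov = x_t.T @ x_t / n
predicted_cov = alpha_t * M @ M.T + (1.0 - alpha_t) * np.eye(d)
max_abs_err = np.abs(empirical_cov - predicted_cov).max()
```

With this many samples, `max_abs_err` should be small and shrink roughly like $1/\sqrt{n}$ as `n` grows.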

Next, we compute the score function as follows:

\begin{align*}
\nabla_{\bm{x}_t}\log p_t(\bm{x}_t) &= \frac{\nabla_{\bm{x}_t} p_t(\bm{x}_t)}{p_t(\bm{x}_t)} \\
&= \frac{\sum_{k=1}^{K} \mathcal{N}(\bm{0}, \alpha_t\bm{M}_k\bm{M}_k^{\top} + (1-\alpha_t)\bm{I}_d)\left(-\dfrac{1}{1-\alpha_t}\bm{x}_t + \dfrac{\alpha_t}{1-\alpha_t}\bm{M}_k\bm{M}_k^{\top}\bm{x}_t\right)}{\sum_{k=1}^{K} \mathcal{N}(\bm{0}, \alpha_t\bm{M}_k\bm{M}_k^{\top} + (1-\alpha_t)\bm{I}_d)} \\
&= -\dfrac{1}{1-\alpha_t}\bm{x}_t + \dfrac{\alpha_t}{1-\alpha_t}\,\frac{\sum_{k=1}^{K} \mathcal{N}(\bm{0}, \alpha_t\bm{M}_k\bm{M}_k^{\top} + (1-\alpha_t)\bm{I}_d)\,\bm{M}_k\bm{M}_k^{\top}\bm{x}_t}{\sum_{k=1}^{K} \mathcal{N}(\bm{0}, \alpha_t\bm{M}_k\bm{M}_k^{\top} + (1-\alpha_t)\bm{I}_d)}.
\end{align*}

Based on Tweedie's formula [45, 79], the relationship between the score function and the posterior mean is

\begin{equation}
\mathbb{E}[\bm{x}_0 \mid \bm{x}_t] = \frac{\bm{x}_t + (1-\alpha_t)\nabla_{\bm{x}_t}\log p_t(\bm{x}_t)}{\sqrt{\alpha_t}}. \tag{13}
\end{equation}

Therefore, the posterior mean is

\begin{align*}
\mathbb{E}[\bm{x}_0 \mid \bm{x}_t] &= \sqrt{\alpha_t}\,\frac{\sum_{k=1}^{K} \mathcal{N}(\bm{0}, \alpha_t\bm{M}_k\bm{M}_k^{\top} + (1-\alpha_t)\bm{I}_d)\,\bm{M}_k\bm{M}_k^{\top}\bm{x}_t}{\sum_{k=1}^{K} \mathcal{N}(\bm{0}, \alpha_t\bm{M}_k\bm{M}_k^{\top} + (1-\alpha_t)\bm{I}_d)} \\
&= \sqrt{\alpha_t}\,\frac{\sum_{k=1}^{K} \exp\left(-\dfrac{1}{2}\bm{x}_t^{\top}\left(\alpha_t\bm{M}_k\bm{M}_k^{\top} + (1-\alpha_t)\bm{I}_d\right)^{-1}\bm{x}_t\right)\bm{M}_k\bm{M}_k^{\top}\bm{x}_t}{\sum_{k=1}^{K} \exp\left(-\dfrac{1}{2}\bm{x}_t^{\top}\left(\alpha_t\bm{M}_k\bm{M}_k^{\top} + (1-\alpha_t)\bm{I}_d\right)^{-1}\bm{x}_t\right)} \\
&= \sqrt{\alpha_t}\,\frac{\sum_{k=1}^{K} \exp\left(-\dfrac{1}{2(1-\alpha_t)}\left(\|\bm{x}_t\|^2 - \alpha_t\|\bm{M}_k^{\top}\bm{x}_t\|^2\right)\right)\bm{M}_k\bm{M}_k^{\top}\bm{x}_t}{\sum_{k=1}^{K} \exp\left(-\dfrac{1}{2(1-\alpha_t)}\left(\|\bm{x}_t\|^2 - \alpha_t\|\bm{M}_k^{\top}\bm{x}_t\|^2\right)\right)} \\
&= \sqrt{\alpha_t}\,\frac{\sum_{k=1}^{K} \exp\left(\dfrac{\alpha_t}{2(1-\alpha_t)}\|\bm{M}_k^{\top}\bm{x}_t\|^2\right)\bm{M}_k\bm{M}_k^{\top}\bm{x}_t}{\sum_{k=1}^{K} \exp\left(\dfrac{\alpha_t}{2(1-\alpha_t)}\|\bm{M}_k^{\top}\bm{x}_t\|^2\right)},
\end{align*}

where the third equality follows from the Woodbury formula [80], $(\alpha_t\bm{M}_k\bm{M}_k^{\top} + (1-\alpha_t)\bm{I}_d)^{-1} = \dfrac{1}{1-\alpha_t}\left(\bm{I}_d - \alpha_t\bm{M}_k\bm{M}_k^{\top}\right)$, and the last equality follows by cancelling the common factor $\exp\left(-\dfrac{\|\bm{x}_t\|^2}{2(1-\alpha_t)}\right)$ from the numerator and the denominator. ∎
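
The closed form derived in this proof can be compared numerically with a direct evaluation of Tweedie's formula (13). The following is a sketch under assumptions that are not taken from the paper's code: each $\bm{M}_k$ is a random matrix with orthonormal columns, all classes share the same rank `r` (so the Gaussian normalizing constants cancel), and the function names and toy sizes are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, cov):
    """Density of N(0, cov) evaluated at x."""
    d = x.shape[0]
    quad = x @ np.linalg.solve(cov, x)
    return np.exp(-0.5 * quad) / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))

def posterior_mean_tweedie(x_t, Ms, alpha_t):
    """Evaluate (13) with the score of the Gaussian mixture computed directly."""
    d = x_t.shape[0]
    covs = [alpha_t * M @ M.T + (1.0 - alpha_t) * np.eye(d) for M in Ms]
    dens = np.array([gaussian_pdf(x_t, c) for c in covs])
    grads = np.stack([-np.linalg.solve(c, x_t) for c in covs])  # grad of log N_k
    score = (dens[:, None] * grads).sum(axis=0) / dens.sum()    # uniform 1/K cancels
    return (x_t + (1.0 - alpha_t) * score) / np.sqrt(alpha_t)

def posterior_mean_closed_form(x_t, Ms, alpha_t):
    """Softmax-weighted sum of subspace projections (last line of the derivation)."""
    scale = alpha_t / (2.0 * (1.0 - alpha_t))
    logits = np.array([scale * np.sum((M.T @ x_t) ** 2) for M in Ms])
    w = np.exp(logits - logits.max())   # subtract the max for numerical stability
    w /= w.sum()
    return np.sqrt(alpha_t) * sum(wk * (M @ (M.T @ x_t)) for wk, M in zip(w, Ms))

rng = np.random.default_rng(0)
d, r, K, alpha_t = 6, 2, 3, 0.7         # toy sizes, chosen for illustration
Ms = [np.linalg.qr(rng.standard_normal((d, r)))[0] for _ in range(K)]
x_t = rng.standard_normal(d)
mean_tweedie = posterior_mean_tweedie(x_t, Ms, alpha_t)
mean_closed = posterior_mean_closed_form(x_t, Ms, alpha_t)
```

Under these assumptions the two vectors should agree up to floating-point error; the agreement relies on the orthonormality of each basis, which is what makes the Woodbury inverse take the simple form used above.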

C.2 Proofs of Theorem 1

Lemma 2.

The jacobian of the poster mean is

𝒙t𝔼[𝒙0|𝒙t]subscriptsubscript𝒙𝑡𝔼delimited-[]conditionalsubscript𝒙0subscript𝒙𝑡\displaystyle\nabla_{\bm{x}_{t}}\mathbb{E}\left[\bm{x}_{0}|\bm{x}_{t}\right]∇ start_POSTSUBSCRIPT bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT roman_𝔼 [ bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT | bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ] =αtk=1Kωk(𝒙t)𝑴k𝑴k𝑨absentsubscript𝛼𝑡subscriptsuperscriptsubscript𝑘1𝐾subscript𝜔𝑘subscript𝒙𝑡subscript𝑴𝑘superscriptsubscript𝑴𝑘top𝑨absent\displaystyle=\sqrt{\alpha_{t}}\underbrace{\sum_{k=1}^{K}\omega_{k}(\bm{x}_{t}% )\bm{M}_{k}\bm{M}_{k}^{\top}}_{\bm{A}\coloneqq}= square-root start_ARG italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG under⏟ start_ARG ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_ω start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) bold_italic_M start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT bold_italic_M start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT end_ARG start_POSTSUBSCRIPT bold_italic_A ≔ end_POSTSUBSCRIPT (14)
+αtαt(1αt)k=1Kωk(𝒙t)𝑴k𝑴k𝒙t𝒙t𝑴k𝑴k𝑩subscript𝛼𝑡subscript𝛼𝑡1subscript𝛼𝑡subscriptsuperscriptsubscript𝑘1𝐾subscript𝜔𝑘subscript𝒙𝑡subscript𝑴𝑘superscriptsubscript𝑴𝑘topsubscript𝒙𝑡subscriptsuperscript𝒙top𝑡subscript𝑴𝑘superscriptsubscript𝑴𝑘top𝑩absent\displaystyle+\dfrac{\alpha_{t}\sqrt{\alpha_{t}}}{\left(1-\alpha_{t}\right)}% \underbrace{\sum_{k=1}^{K}\omega_{k}(\bm{x}_{t})\bm{M}_{k}\bm{M}_{k}^{\top}\bm% {x}_{t}\bm{x}^{\top}_{t}\bm{M}_{k}\bm{M}_{k}^{\top}}_{\bm{B}\coloneqq}+ divide start_ARG italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT square-root start_ARG italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG end_ARG start_ARG ( 1 - italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) end_ARG under⏟ start_ARG ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_ω start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) bold_italic_M start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT bold_italic_M start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT bold_italic_x start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT bold_italic_M start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT bold_italic_M start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT end_ARG start_POSTSUBSCRIPT bold_italic_B ≔ end_POSTSUBSCRIPT
αtαt(1αt)(k=1Kωk(𝒙t)𝑴k𝑴k)𝒙t𝒙t(k=1Kωk(𝒙t)𝑴k𝑴k)𝑪,subscript𝛼𝑡subscript𝛼𝑡1subscript𝛼𝑡subscriptsuperscriptsubscript𝑘1𝐾subscript𝜔𝑘subscript𝒙𝑡subscript𝑴𝑘superscriptsubscript𝑴𝑘topsubscript𝒙𝑡superscriptsubscript𝒙𝑡topsuperscriptsuperscriptsubscript𝑘1𝐾subscript𝜔𝑘subscript𝒙𝑡subscript𝑴𝑘superscriptsubscript𝑴𝑘toptop𝑪absent\displaystyle-\dfrac{\alpha_{t}\sqrt{\alpha_{t}}}{\left(1-\alpha_{t}\right)}% \underbrace{\left(\sum_{k=1}^{K}\omega_{k}(\bm{x}_{t})\bm{M}_{k}\bm{M}_{k}^{% \top}\right)\bm{x}_{t}\bm{x}_{t}^{\top}\left(\sum_{k=1}^{K}\omega_{k}(\bm{x}_{% t})\bm{M}_{k}\bm{M}_{k}^{\top}\right)^{\top}}_{\bm{C}\coloneqq},- divide start_ARG italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT square-root start_ARG italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG end_ARG start_ARG ( 1 - italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) end_ARG under⏟ start_ARG ( ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_ω start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) bold_italic_M start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT bold_italic_M start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ) bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ( ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_ω start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) bold_italic_M start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT bold_italic_M start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT end_ARG start_POSTSUBSCRIPT bold_italic_C ≔ end_POSTSUBSCRIPT ,

where $\omega_{k}(\bm{x}_{t})\coloneqq\dfrac{\exp\left(\dfrac{\alpha_{t}}{2\left(1-\alpha_{t}\right)}\|\bm{M}_{k}^{\top}\bm{x}_{t}\|^{2}\right)}{\sum_{l=1}^{K}\exp\left(\dfrac{\alpha_{t}}{2\left(1-\alpha_{t}\right)}\|\bm{M}_{l}^{\top}\bm{x}_{t}\|^{2}\right)}$.

Proof of Lemma 2.

Let $\omega_{k}(\bm{x}_{t})\coloneqq\dfrac{\exp\left(\dfrac{\alpha_{t}}{2\left(1-\alpha_{t}\right)}\|\bm{M}_{k}^{\top}\bm{x}_{t}\|^{2}\right)}{\sum_{l=1}^{K}\exp\left(\dfrac{\alpha_{t}}{2\left(1-\alpha_{t}\right)}\|\bm{M}_{l}^{\top}\bm{x}_{t}\|^{2}\right)}$, so we have:

\begin{align*}
\mathbb{E}\left[\bm{x}_{0}|\bm{x}_{t}\right] &= \sqrt{\alpha_{t}}\sum_{k=1}^{K}\omega_{k}(\bm{x}_{t})\bm{M}_{k}\bm{M}_{k}^{\top}\bm{x}_{t},\\
\nabla_{\bm{x}_{t}}\omega_{k}(\bm{x}_{t}) &= \dfrac{\alpha_{t}}{\left(1-\alpha_{t}\right)}\omega_{k}(\bm{x}_{t})\left[\bm{M}_{k}\bm{M}_{k}^{\top}\bm{x}_{t}-\sum_{l=1}^{K}\omega_{l}(\bm{x}_{t})\bm{M}_{l}\bm{M}_{l}^{\top}\bm{x}_{t}\right].
\end{align*}
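The gradient of $\omega_{k}$ is the standard softmax derivative. As a sketch of the intermediate step, write $a_{k}$ for the exponent (a shorthand introduced here only for illustration; it does not appear in the original derivation):

```latex
\begin{align*}
a_{k}(\bm{x}_{t}) &\coloneqq \frac{\alpha_{t}}{2(1-\alpha_{t})}\|\bm{M}_{k}^{\top}\bm{x}_{t}\|^{2},
&\nabla_{\bm{x}_{t}}a_{k}(\bm{x}_{t}) &= \frac{\alpha_{t}}{1-\alpha_{t}}\bm{M}_{k}\bm{M}_{k}^{\top}\bm{x}_{t},\\
\omega_{k} &= \frac{e^{a_{k}}}{\sum_{l=1}^{K}e^{a_{l}}},
&\nabla_{\bm{x}_{t}}\omega_{k} &= \omega_{k}\Big(\nabla_{\bm{x}_{t}}a_{k}-\sum_{l=1}^{K}\omega_{l}\nabla_{\bm{x}_{t}}a_{l}\Big).
\end{align*}
```

Substituting $\nabla_{\bm{x}_{t}}a_{k}$ into the last identity gives the stated expression for $\nabla_{\bm{x}_{t}}\omega_{k}(\bm{x}_{t})$.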

So:

\begin{align*}
\nabla_{\bm{x}_{t}}\mathbb{E}\left[\bm{x}_{0}|\bm{x}_{t}\right] &= \sqrt{\alpha_{t}}\sum_{k=1}^{K}\omega_{k}(\bm{x}_{t})\bm{M}_{k}\bm{M}_{k}^{\top}+\sqrt{\alpha_{t}}\sum_{k=1}^{K}\nabla_{\bm{x}_{t}}\omega_{k}(\bm{x}_{t})\bm{x}_{t}^{\top}\bm{M}_{k}\bm{M}_{k}^{\top}\\
&= \sqrt{\alpha_{t}}\sum_{k=1}^{K}\omega_{k}(\bm{x}_{t})\bm{M}_{k}\bm{M}_{k}^{\top}\\
&\quad +\dfrac{\alpha_{t}\sqrt{\alpha_{t}}}{\left(1-\alpha_{t}\right)}\sum_{k=1}^{K}\omega_{k}(\bm{x}_{t})\bm{M}_{k}\bm{M}_{k}^{\top}\bm{x}_{t}\bm{x}_{t}^{\top}\bm{M}_{k}\bm{M}_{k}^{\top}\\
&\quad -\dfrac{\alpha_{t}\sqrt{\alpha_{t}}}{\left(1-\alpha_{t}\right)}\left(\sum_{k=1}^{K}\omega_{k}(\bm{x}_{t})\bm{M}_{k}\bm{M}_{k}^{\top}\right)\bm{x}_{t}\bm{x}_{t}^{\top}\left(\sum_{k=1}^{K}\omega_{k}(\bm{x}_{t})\bm{M}_{k}\bm{M}_{k}^{\top}\right)^{\top}.
\end{align*}
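The closed form above can be sanity-checked numerically. The following is a minimal sketch in plain Python, not the authors' code: the dimension, the two orthonormal bases, the point `x_t`, and `alpha_t` are arbitrary illustrative choices, and all function names are hypothetical. It compares the closed-form Jacobian with a central finite-difference Jacobian of the posterior mean and also measures how far the finite-difference Jacobian is from being symmetric.

```python
import math

def projector(columns):
    """Return P = M M^T for a basis M given as a list of column vectors."""
    d = len(columns[0])
    return [[sum(c[i] * c[j] for c in columns) for j in range(d)] for i in range(d)]

def matvec(P, v):
    return [sum(p * x for p, x in zip(row, v)) for row in P]

def weights(Ps, x, alpha):
    """Softmax weights omega_k(x_t), using ||M_k^T x||^2 = x^T P_k x."""
    c = alpha / (2.0 * (1.0 - alpha))
    logits = [c * sum(a * b for a, b in zip(x, matvec(P, x))) for P in Ps]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def posterior_mean(Ps, x, alpha):
    """E[x_0 | x_t] = sqrt(alpha_t) * sum_k omega_k P_k x_t."""
    w = weights(Ps, x, alpha)
    Pxs = [matvec(P, x) for P in Ps]
    return [math.sqrt(alpha) * sum(wk * Px[i] for wk, Px in zip(w, Pxs))
            for i in range(len(x))]

def jacobian_closed_form(Ps, x, alpha):
    """sqrt(a) * A + a * sqrt(a) / (1 - a) * (B - C), entry by entry."""
    d = len(x)
    w = weights(Ps, x, alpha)
    Pxs = [matvec(P, x) for P in Ps]
    # m = (sum_k omega_k P_k) x_t, so that C = m m^T
    m = [sum(wk * Px[i] for wk, Px in zip(w, Pxs)) for i in range(d)]
    s = math.sqrt(alpha)
    c = alpha * s / (1.0 - alpha)
    J = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(d):
            A = sum(wk * P[i][j] for wk, P in zip(w, Ps))
            B = sum(wk * Px[i] * Px[j] for wk, Px in zip(w, Pxs))
            J[i][j] = s * A + c * (B - m[i] * m[j])
    return J

def jacobian_finite_diff(Ps, x, alpha, h=1e-5):
    """Central differences: column j holds d E[x_0 | x_t] / d x_j."""
    d = len(x)
    J = [[0.0] * d for _ in range(d)]
    for j in range(d):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        fp = posterior_mean(Ps, xp, alpha)
        fm = posterior_mean(Ps, xm, alpha)
        for i in range(d):
            J[i][j] = (fp[i] - fm[i]) / (2.0 * h)
    return J

# Illustrative setup (assumed, not from the paper): d = 5, K = 2, r_1 = 2, r_2 = 1.
r = 1.0 / math.sqrt(2.0)
M1 = [[r, r, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0, 0.0]]  # columns of M_1
M2 = [[r, -r, 0.0, 0.0, 0.0]]                            # column of M_2
Ps = [projector(M1), projector(M2)]
x_t = [0.7, -0.3, 0.5, 1.1, -0.9]
alpha_t = 0.6

J = jacobian_closed_form(Ps, x_t, alpha_t)
J_fd = jacobian_finite_diff(Ps, x_t, alpha_t)
d = len(x_t)
max_err = max(abs(J[i][j] - J_fd[i][j]) for i in range(d) for j in range(d))
asym_fd = max(abs(J_fd[i][j] - J_fd[j][i]) for i in range(d) for j in range(d))
print(f"max |closed form - finite difference| = {max_err:.2e}")
print(f"max asymmetry of finite-difference Jacobian = {asym_fd:.2e}")
```

Both printed quantities should be at the level of finite-difference error. The bases here are chosen mutually orthogonal with orthonormal columns, which is assumed to match the mixture model used in the lemma.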

Lemma 3.

Assume the second-order partial derivatives of $p_{t}(\bm{x}_{t})$ exist for any $\bm{x}_{t}$. Then the Jacobian of the posterior mean, $\nabla_{\bm{x}_{t}}\mathbb{E}\left[\bm{x}_{0}|\bm{x}_{t}\right]$, is symmetric, i.e., it satisfies $\nabla_{\bm{x}_{t}}\mathbb{E}\left[\bm{x}_{0}|\bm{x}_{t}\right]=\nabla_{\bm{x}_{t}}\mathbb{E}^{\top}\left[\bm{x}_{0}|\bm{x}_{t}\right]$.

Proof of Lemma 3.

Taking the gradient of both sides of Equation 13 with respect to $\bm{x}_{t}$, which is valid because the second-order partial derivatives of $p_{t}(\bm{x}_{t})$ exist for any $\bm{x}_{t}$, we have:

\begin{equation*}
\nabla_{\bm{x}_{t}}\mathbb{E}[\bm{x}_{0}|\bm{x}_{t}]=\frac{\bm{I}+(1-\alpha_{t})\nabla^{2}_{\bm{x}_{t}}\log p_{t}(\bm{x}_{t})}{\sqrt{\alpha_{t}}}.
\end{equation*}

The Hessian of $\log p_{t}(\bm{x}_{t})$ is symmetric, so we have:

\begin{equation*}
\nabla_{\bm{x}_{t}}\mathbb{E}^{\top}[\bm{x}_{0}|\bm{x}_{t}]=\frac{\bm{I}+(1-\alpha_{t})\left(\nabla^{2}_{\bm{x}_{t}}\log p_{t}(\bm{x}_{t})\right)^{\top}}{\sqrt{\alpha_{t}}}=\frac{\bm{I}+(1-\alpha_{t})\nabla^{2}_{\bm{x}_{t}}\log p_{t}(\bm{x}_{t})}{\sqrt{\alpha_{t}}}=\nabla_{\bm{x}_{t}}\mathbb{E}[\bm{x}_{0}|\bm{x}_{t}].
\end{equation*}

Notably, the symmetry of $\nabla_{\bm{x}_{t}}\mathbb{E}[\bm{x}_{0}|\bm{x}_{t}]$ holds without Assumption 1. ∎
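As a consistency check between the two lemmas, consider the single-component case $K=1$. The sketch below assumes the data model $\bm{x}_{0}\sim\mathcal{N}(\bm{0},\bm{M}\bm{M}^{\top})$ with $\bm{M}^{\top}\bm{M}=\bm{I}$ and the forward process $\bm{x}_{t}=\sqrt{\alpha_{t}}\bm{x}_{0}+\sqrt{1-\alpha_{t}}\bm{\epsilon}$ with standard Gaussian $\bm{\epsilon}$; these are stated here as assumptions for illustration.

```latex
\begin{align*}
p_{t}(\bm{x}_{t}) &= \mathcal{N}\left(\bm{0},\;\alpha_{t}\bm{M}\bm{M}^{\top}+(1-\alpha_{t})\bm{I}\right),\\
\nabla^{2}_{\bm{x}_{t}}\log p_{t}(\bm{x}_{t}) &= -\left(\alpha_{t}\bm{M}\bm{M}^{\top}+(1-\alpha_{t})\bm{I}\right)^{-1}
= -\bm{M}\bm{M}^{\top}-\frac{1}{1-\alpha_{t}}\left(\bm{I}-\bm{M}\bm{M}^{\top}\right),\\
\frac{\bm{I}+(1-\alpha_{t})\nabla^{2}_{\bm{x}_{t}}\log p_{t}(\bm{x}_{t})}{\sqrt{\alpha_{t}}}
&= \frac{\bm{I}-(1-\alpha_{t})\bm{M}\bm{M}^{\top}-\left(\bm{I}-\bm{M}\bm{M}^{\top}\right)}{\sqrt{\alpha_{t}}}
= \sqrt{\alpha_{t}}\,\bm{M}\bm{M}^{\top}.
\end{align*}
```

This agrees with Lemma 2 for $K=1$: there $\omega_{1}=1$, the two terms with coefficient $\alpha_{t}\sqrt{\alpha_{t}}/(1-\alpha_{t})$ cancel, and the Jacobian reduces to the symmetric matrix $\sqrt{\alpha_{t}}\bm{M}\bm{M}^{\top}$ of rank $r_{1}$.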

Proof of Theorem 1.

First, we prove that the Jacobian of the posterior mean is low-rank. From Lemma 2,

\begin{align*}
\nabla_{\bm{x}_{t}}\mathbb{E}\left[\bm{x}_{0}|\bm{x}_{t}\right] &= \sqrt{\alpha_{t}}\bm{A}+\dfrac{\alpha_{t}\sqrt{\alpha_{t}}}{\left(1-\alpha_{t}\right)}\bm{B}-\dfrac{\alpha_{t}\sqrt{\alpha_{t}}}{\left(1-\alpha_{t}\right)}\bm{C}\\
&= \sum_{k=1}^{K}\bm{M}_{k}\bm{M}_{k}^{\top}\left(\sqrt{\alpha_{t}}\bm{A}+\dfrac{\alpha_{t}\sqrt{\alpha_{t}}}{\left(1-\alpha_{t}\right)}\bm{B}-\dfrac{\alpha_{t}\sqrt{\alpha_{t}}}{\left(1-\alpha_{t}\right)}\bm{C}\right),
\end{align*}

where the second equality holds because $\sum_{k=1}^{K}\bm{M}_{k}\bm{M}_{k}^{\top}\bm{A}=\bm{A}$, $\sum_{k=1}^{K}\bm{M}_{k}\bm{M}_{k}^{\top}\bm{B}=\bm{B}$, and $\sum_{k=1}^{K}\bm{M}_{k}\bm{M}_{k}^{\top}\bm{C}=\bm{C}$. Therefore, we have:

\begin{align*}
\operatorname{rank}\left(\nabla_{\bm{x}_{t}}\mathbb{E}\left[\bm{x}_{0}|\bm{x}_{t}\right]\right) &= \operatorname{rank}\left(\sum_{k=1}^{K}\bm{M}_{k}\bm{M}_{k}^{\top}\left(\sqrt{\alpha_{t}}\bm{A}+\dfrac{\alpha_{t}\sqrt{\alpha_{t}}}{\left(1-\alpha_{t}\right)}\bm{B}-\dfrac{\alpha_{t}\sqrt{\alpha_{t}}}{\left(1-\alpha_{t}\right)}\bm{C}\right)\right) \tag{15}\\
&\leq \operatorname{rank}\left(\sum_{k=1}^{K}\bm{M}_{k}\bm{M}_{k}^{\top}\right)=\sum_{k=1}^{K}r_{k}.
\end{align*}
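The two rank facts used in this bound can be spelled out as follows. This is a sketch under the assumption that each $\bm{M}_{k}\in\mathbb{R}^{d\times r_{k}}$ has orthonormal columns and that different bases are mutually orthogonal, $\bm{M}_{k}^{\top}\bm{M}_{l}=\bm{0}$ for $k\neq l$; the stacked matrix $\bm{M}$ and the symbol $\bm{G}$ are notation introduced here for illustration.

```latex
\begin{align*}
\bm{M} &\coloneqq \begin{bmatrix}\bm{M}_{1} & \cdots & \bm{M}_{K}\end{bmatrix}\in\mathbb{R}^{d\times\sum_{k}r_{k}},
\qquad \bm{M}^{\top}\bm{M}=\bm{I},
\qquad \sum_{k=1}^{K}\bm{M}_{k}\bm{M}_{k}^{\top}=\bm{M}\bm{M}^{\top},\\
\operatorname{rank}\left(\bm{M}\bm{M}^{\top}\bm{G}\right) &\leq \min\left\{\operatorname{rank}\left(\bm{M}\bm{M}^{\top}\right),\operatorname{rank}(\bm{G})\right\}
\leq \operatorname{rank}(\bm{M})=\sum_{k=1}^{K}r_{k}
\quad\text{for any }\bm{G}\in\mathbb{R}^{d\times d}.
\end{align*}
```

Under the same assumption, $\bm{M}\bm{M}^{\top}\bm{M}_{k}=\bm{M}_{k}$ for every $k$, which is why left-multiplying $\bm{A}$, $\bm{B}$, and $\bm{C}$ by $\sum_{k}\bm{M}_{k}\bm{M}_{k}^{\top}$ leaves them unchanged.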

Then, we prove the linearity:

\begin{align*}
\raisebox{0.5pt}{\textcircled{\tiny 1}}:\; &\left\|\mathbb{E}\left[\bm{x}_{0}|\bm{x}_{t}+\lambda\Delta\bm{x}\right]-\mathbb{E}\left[\bm{x}_{0}|\bm{x}_{t}\right]-\lambda\nabla_{\bm{x}_{t}}\mathbb{E}[\bm{x}_{0}|\bm{x}_{t}]\Delta\bm{x}\right\|_{2}\\
=\; &\left\|\sqrt{\alpha_{t}}\sum_{k=1}^{K}\left(\omega_{k}(\bm{x}_{t}+\lambda\Delta\bm{x})-\omega_{k}(\bm{x}_{t})\right)\bm{M}_{k}\bm{M}_{k}^{\top}\left(\bm{x}_{t}+\lambda\Delta\bm{x}\right)-\lambda\sum_{k=1}^{K}\nabla_{\bm{x}_{t}}\omega_{k}(\bm{x}_{t})\bm{x}_{t}^{\top}\bm{M}_{k}\bm{M}_{k}^{\top}\Delta\bm{x}\right\|_{2}\\
=\; &\left\|\sqrt{\alpha_{t}}\sum_{k=1}^{K}\left(\lambda\nabla^{\top}_{\bm{x}_{t}}\omega_{k}(\bm{x}_{t}+\lambda_{1}\Delta\bm{x})\Delta\bm{x}\right)\bm{M}_{k}\bm{M}_{k}^{\top}\left(\bm{x}_{t}+\lambda\Delta\bm{x}\right)-\lambda\sum_{k=1}^{K}\nabla_{\bm{x}_{t}}\omega_{k}(\bm{x}_{t})\bm{x}_{t}^{\top}\bm{M}_{k}\bm{M}_{k}^{\top}\Delta\bm{x}\right\|_{2}\\
\leq\; &\lambda\left(\sum_{k=1}^{K}\sqrt{\alpha_{t}}\nabla^{\top}_{\bm{x}_{t}}\omega_{k}(\bm{x}_{t}+\lambda_{1}\Delta\bm{x})\Delta\bm{x}\left\|\bm{M}_{k}^{\top}\left(\bm{x}_{t}+\lambda\Delta\bm{x}\right)\right\|_{2}+\bm{x}_{t}^{\top}\bm{M}_{k}\bm{M}_{k}^{\top}\Delta\bm{x}\left\|\nabla^{\top}_{\bm{x}_{t}}\omega_{k}(\bm{x}_{t})\right\|_{2}\right)\\
\leq\; &\lambda\sum_{k=1}^{K}\left(\sqrt{\alpha_{t}}\left\|\nabla_{\bm{x}_{t}}\omega_{k}(\bm{x}_{t}+\lambda_{1}\Delta\bm{x})\right\|_{2}\left\|\bm{M}_{k}^{\top}\left(\bm{x}_{t}+\lambda\Delta\bm{x}\right)\right\|_{2}+\left\|\nabla_{\bm{x}_{t}}\omega_{k}(\bm{x}_{t})\right\|_{2}\left\|\bm{M}_{k}^{\top}\bm{x}_{t}\right\|_{2}\right),
\end{align*}

where the first equation plugs in the formula $\nabla_{\bm{x}_t}\mathbb{E}\left[\bm{x}_0|\bm{x}_t\right]=\sqrt{\alpha_t}\sum_{k=1}^{K}\omega_k(\bm{x}_t)\bm{M}_k\bm{M}_k^{\top}+\sqrt{\alpha_t}\sum_{k=1}^{K}\nabla_{\bm{x}_t}\omega_k(\bm{x}_t)\bm{x}_t^{\top}\bm{M}_k\bm{M}_k^{\top}$, and the second equation uses the mean value theorem $\omega_k(\bm{x}_t+\lambda\Delta\bm{x})-\omega_k(\bm{x}_t)=\lambda\nabla^{\top}_{\bm{x}_t}\omega_k(\bm{x}_t+\lambda_1\Delta\bm{x})\Delta\bm{x}$ for some $\lambda_1\in(0,\lambda)$.

\begin{align*}
\text{\textcircled{2}}:\ &\|\nabla_{\bm{x}_t}\omega_k(\bm{x}_t+\lambda_1\Delta\bm{x})\|_2\\
={}&\dfrac{\alpha_t}{(1-\alpha_t)}\,\omega_k\,\Big\|\bm{M}_k\bm{M}_k^{\top}(\bm{x}_t+\lambda_1\Delta\bm{x})-\sum_{l=1}^{K}\omega_l\bm{M}_l\bm{M}_l^{\top}(\bm{x}_t+\lambda_1\Delta\bm{x})\Big\|_2\\
\leq{}&\dfrac{\alpha_t}{(1-\alpha_t)}\,\omega_k\left(\|\bm{M}_k^{\top}\bm{x}_t\|_2+\sum_{l=1}^{K}\omega_l\|\bm{M}_l^{\top}\bm{x}_t\|_2+\lambda_1\|\bm{M}_k^{\top}\Delta\bm{x}\|_2+\lambda_1\sum_{l=1}^{K}\omega_l\|\bm{M}_l^{\top}\Delta\bm{x}\|_2\right)\\
\leq{}&\dfrac{\alpha_t}{(1-\alpha_t)}\,\omega_k\left(\|\bm{M}_k^{\top}\|_F\|\bm{x}_t\|_2+\sum_{l=1}^{K}\omega_l\|\bm{M}_l^{\top}\|_F\|\bm{x}_t\|_2+\lambda_1\|\bm{M}_k^{\top}\|_F+\lambda_1\sum_{l=1}^{K}\omega_l\|\bm{M}_l^{\top}\|_F\right)\\
\leq{}&\dfrac{\alpha_t}{(1-\alpha_t)}\,\omega_k\left(r_k+\sum_{l=1}^{K}\omega_l r_l\right)\left(\sqrt{2}\max\{\|\bm{x}_0\|_2,\|\bm{\epsilon}\|_2\}+\lambda_1\right)\\
\leq{}&\dfrac{\alpha_t}{(1-\alpha_t)}\,\omega_k(\bm{x}_t+\lambda_1\Delta\bm{x})\cdot\underbrace{2\cdot\max_k r_k\cdot\left(\sqrt{2}\max\{\|\bm{x}_0\|_2,\|\bm{\epsilon}\|_2\}+\lambda_1\right)}_{=:\,C_1},
\end{align*}

where the third inequality uses the fact that $\|\bm{x}_t\|_2=\|\sqrt{\alpha_t}\bm{x}_0+\sqrt{1-\alpha_t}\bm{\epsilon}\|_2\leq\|\sqrt{\alpha_t}\bm{x}_0\|_2+\|\sqrt{1-\alpha_t}\bm{\epsilon}\|_2\leq\sqrt{2}\max\{\|\bm{x}_0\|_2,\|\bm{\epsilon}\|_2\}$ (the last step holds because $\sqrt{\alpha_t}+\sqrt{1-\alpha_t}\leq\sqrt{2}$), we abbreviate $\omega_k(\bm{x}_t+\lambda_1\Delta\bm{x})$ as $\omega_k$ in this proof, and $C_1$, defined in the last inequality, is independent of $t$. Similarly, we can prove that:

\begin{align*}
\text{\textcircled{3}}:\ &\|\bm{M}_k\bm{M}_k^{\top}\left(\bm{x}_t+\lambda\Delta\bm{x}\right)\|_2\leq\underbrace{\max_k r_k\cdot\left(\sqrt{2}\max\{\|\bm{x}_0\|_2,\|\bm{\epsilon}\|_2\}+\lambda\right)}_{=:\,C_2},\\
\text{\textcircled{4}}:\ &\|\nabla_{\bm{x}_t}\omega_k(\bm{x}_t)\|_2\leq\dfrac{\alpha_t}{(1-\alpha_t)}\,\omega_k(\bm{x}_t)\cdot\underbrace{2\sqrt{2}\cdot\max_k r_k\cdot\max\{\|\bm{x}_0\|_2,\|\bm{\epsilon}\|_2\}}_{=:\,C_3},\\
\text{\textcircled{5}}:\ &\|\bm{M}_k\bm{M}_k^{\top}\bm{x}_t\|_2\leq\underbrace{\sqrt{2}\max_k r_k\cdot\max\{\|\bm{x}_0\|_2,\|\bm{\epsilon}\|_2\}}_{=:\,C_4}.
\end{align*}

Here, $C_1=\mathcal{O}(\lambda)$, $C_2=\mathcal{O}(\lambda)$, $C_3=\mathcal{O}(\lambda)$, and $C_4=\mathcal{O}(\lambda)$. Plugging \textcircled{2}, \textcircled{3}, \textcircled{4}, and \textcircled{5} into \textcircled{1}, we obtain:

\begin{align*}
&\|\mathbb{E}\left[\bm{x}_0|\bm{x}_t+\lambda\Delta\bm{x}\right]-\mathbb{E}\left[\bm{x}_0|\bm{x}_t\right]-\lambda\nabla_{\bm{x}_t}\mathbb{E}[\bm{x}_0|\bm{x}_t]\Delta\bm{x}\|_2\\
\leq{}&\lambda\sqrt{\alpha_t}\sum_{k=1}^{K}\dfrac{\alpha_t}{(1-\alpha_t)}\,\omega_k(\bm{x}_t+\lambda_1\Delta\bm{x})\,C_1C_2+\lambda\sum_{k=1}^{K}\dfrac{\alpha_t}{(1-\alpha_t)}\,\omega_k(\bm{x}_t)\,C_3C_4\\
={}&\lambda\,\dfrac{\alpha_t}{(1-\alpha_t)}\,\mathcal{O}(\lambda).
\end{align*}
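The local-linearity bound above can be probed numerically. The following is a minimal sketch, not the authors' code: it assumes the mixture weights take the softmax form $\omega_k(\bm{x}_t)\propto\exp\left(\frac{\alpha_t}{2(1-\alpha_t)}\|\bm{M}_k^{\top}\bm{x}_t\|_2^2\right)$ with uniform mixing proportions (a form consistent with the gradient expression in \textcircled{2}, but an assumption here), and the dimensions, noise level, and function names are hypothetical. It evaluates the linearization error at two step sizes so the scaling in $\lambda$ can be inspected.

```python
import numpy as np

def mixture_weights(x, bases, alpha_t):
    """Assumed softmax form of omega_k(x_t) with uniform mixing proportions."""
    phi = alpha_t / (2.0 * (1.0 - alpha_t))
    logits = np.array([phi * np.sum((M.T @ x) ** 2) for M in bases])
    w = np.exp(logits - logits.max())  # shift logits for numerical stability
    return w / w.sum()

def pmp(x, bases, alpha_t):
    """E[x_0 | x_t] = sqrt(alpha_t) * sum_k omega_k(x_t) M_k M_k^T x_t."""
    w = mixture_weights(x, bases, alpha_t)
    return np.sqrt(alpha_t) * sum(wk * (M @ (M.T @ x)) for wk, M in zip(w, bases))

def pmp_jacobian(x, bases, alpha_t):
    """Analytic Jacobian: weighted projectors plus the term coming from grad omega_k."""
    w = mixture_weights(x, bases, alpha_t)
    proj = [M @ (M.T @ x) for M in bases]              # M_k M_k^T x_t
    mean_proj = sum(wk * p for wk, p in zip(w, proj))  # sum_l omega_l M_l M_l^T x_t
    scale = alpha_t / (1.0 - alpha_t)
    jac = np.zeros((x.size, x.size))
    for wk, M, p in zip(w, bases, proj):
        grad_wk = scale * wk * (p - mean_proj)         # gradient of omega_k
        # The outer-product terms sum to a symmetric matrix over k,
        # so the ordering of the two factors does not matter.
        jac += wk * (M @ M.T) + np.outer(p, grad_wk)
    return np.sqrt(alpha_t) * jac

rng = np.random.default_rng(0)
n, K, r, alpha_t = 20, 3, 3, 0.5  # hypothetical sizes and noise level
bases = [np.linalg.qr(rng.standard_normal((n, r)))[0] for _ in range(K)]
x_t = rng.standard_normal(n)
dx = rng.standard_normal(n)
dx /= np.linalg.norm(dx)          # unit-norm perturbation direction

jac = pmp_jacobian(x_t, bases, alpha_t)

def linearization_error(lam):
    """|| E[x_0|x_t + lam*dx] - E[x_0|x_t] - lam * J dx ||_2."""
    exact = pmp(x_t + lam * dx, bases, alpha_t) - pmp(x_t, bases, alpha_t)
    return np.linalg.norm(exact - lam * (jac @ dx))

err_small = linearization_error(1e-2)
err_large = linearization_error(2e-2)
print(err_small, err_large, err_large / err_small)
```

If the error behaves as $\lambda\cdot\mathcal{O}(\lambda)$, doubling $\lambda$ should roughly quadruple it, so the printed ratio is the quantity to look at.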

Finally, let's prove the property of the left singular vectors of $\nabla_{\bm{x}_t}\mathbb{E}\left[\bm{x}_0|\bm{x}_t\right]$:

From Lemma 3, the eigenvalue decomposition of $\nabla_{\bm{x}_t}\mathbb{E}\left[\bm{x}_0|\bm{x}_t\right]$ can be written as $\nabla_{\bm{x}_t}\mathbb{E}\left[\bm{x}_0|\bm{x}_t\right]=\bm{U}_t\Lambda_t\bm{U}_t^{\top}$, where $\Lambda_t=\mathrm{diag}(\lambda_{t,1},\dots,\lambda_{t,r},\dots,0)$, and the relation between the eigenvalue decomposition and the singular value decomposition of $\nabla_{\bm{x}_t}\mathbb{E}\left[\bm{x}_0|\bm{x}_t\right]$ can be summarized as follows: for all $i\in[r]$,

\[
\sigma_{t,i}=|\lambda_{t,i}|,\;\;\bm{v}_i=\mathrm{sign}\left(\lambda_{t,i}\right)\bm{u}_i,
\]

where $\mathrm{sign}\left(\cdot\right)$ is the sign function. Therefore, we have:

\[
\bm{U}_{t,1}\bm{U}_{t,1}^{\top}=\bm{V}_{t,1}\bm{V}_{t,1}^{\top}, \tag{16}
\]

given $\bm{V}_{t,1}:=\left[\bm{v}_{t,1},\bm{v}_{t,2},\ldots,\bm{v}_{t,r}\right]$. From Lemma 2, we define:

\[
\nabla_{\bm{x}_t}\mathbb{E}\left[\bm{x}_0|\bm{x}_t\right]=\sqrt{\alpha_t}\sum_{k=1}^{K}\omega_k(\bm{x}_t)\bm{M}_k\bm{M}_k^{\top}+\underbrace{\sqrt{\alpha_t}\sum_{k=1}^{K}\nabla_{\bm{x}_t}\omega_k(\bm{x}_t)\bm{x}_t^{\top}\bm{M}_k\bm{M}_k^{\top}}_{=:\,\bm{\Delta}_t}.
\]
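The relation between the eigenvalue decomposition and the singular value decomposition of a symmetric matrix, which underlies (16), can be checked on a toy example. This is an illustrative sketch only: the matrix `A` below is a hypothetical symmetric rank-$r$ matrix with one negative eigenvalue, standing in for the Jacobian, and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 8, 3  # hypothetical ambient dimension and rank

# Symmetric rank-r matrix with eigenvalues of mixed sign.
Q, _ = np.linalg.qr(rng.standard_normal((n, r)))
eigvals = np.array([2.0, -1.5, 0.5])
A = Q @ np.diag(eigvals) @ Q.T

# Full SVD; keep the leading r left/right singular vectors.
U, S, Vt = np.linalg.svd(A)
U1, V1 = U[:, :r], Vt[:r, :].T

# sigma_i = |lambda_i| (singular values come back in descending order).
sigma_matches = np.allclose(S[:r], np.sort(np.abs(eigvals))[::-1])

# v_i = sign(lambda_i) u_i, so u_i^T v_i recovers the sign of each eigenvalue.
signs = np.sum(U1 * V1, axis=0)

# The column signs cancel in the projectors: U1 U1^T = V1 V1^T.
projectors_match = np.allclose(U1 @ U1.T, V1 @ V1.T)

print(sigma_matches, signs, projectors_match)
```

The sign flips affect individual singular vectors but not the subspace they span, which is why the statement is made in terms of the projectors $\bm{U}_{t,1}\bm{U}_{t,1}^{\top}$ and $\bm{V}_{t,1}\bm{V}_{t,1}^{\top}$.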

We write the full singular value decompositions of $\nabla_{\bm{x}_t}\mathbb{E}\left[\bm{x}_0|\bm{x}_t\right]$ and $\sqrt{\alpha_t}\sum_{k=1}^{K}\omega_k(\bm{x}_t)\bm{M}_k\bm{M}_k^{\top}$ as:

\begin{align*}
\nabla_{\bm{x}_t}\mathbb{E}\left[\bm{x}_0|\bm{x}_t\right]&=\begin{bmatrix}\bm{U}_{t,1}&\bm{U}_{t,2}\end{bmatrix}\begin{bmatrix}\bm{\Sigma}_{t,1}&\bm{0}\\ \bm{0}&\bm{\Sigma}_{t,2}\end{bmatrix}\begin{bmatrix}\bm{V}_{t,1}&\bm{V}_{t,2}\end{bmatrix}^{\top},\\
\sqrt{\alpha_t}\sum_{k=1}^{K}\omega_k(\bm{x}_t)\bm{M}_k\bm{M}_k^{\top}&=\begin{bmatrix}\hat{\bm{U}}_{t,1}&\hat{\bm{U}}_{t,2}\end{bmatrix}\begin{bmatrix}\hat{\bm{\Sigma}}_{t,1}&\bm{0}\\ \bm{0}&\hat{\bm{\Sigma}}_{t,2}\end{bmatrix}\begin{bmatrix}\hat{\bm{V}}_{t,1}&\hat{\bm{V}}_{t,2}\end{bmatrix}^{\top},
\end{align*}

where:

\begin{gather*}
\bm{\Sigma}_{t,1}=\begin{bmatrix}\sigma_{t,1} & & \\ & \ddots & \\ & & \sigma_{t,r}\end{bmatrix},\quad \bm{\Sigma}_{t,2}=\begin{bmatrix}\sigma_{t,r+1} & & \\ & \ddots & \\ & & \sigma_{t,d}\end{bmatrix},\\
\hat{\bm{\Sigma}}_{t,1}=\begin{bmatrix}\hat{\sigma}_{t,1} & & \\ & \ddots & \\ & & \hat{\sigma}_{t,r}\end{bmatrix},\quad \hat{\bm{\Sigma}}_{t,2}=\begin{bmatrix}\hat{\sigma}_{t,r+1} & & \\ & \ddots & \\ & & \hat{\sigma}_{t,d}\end{bmatrix},\\
\sigma_{t,1}\geq\sigma_{t,2}\geq\ldots\geq\sigma_{t,r}\geq\ldots\geq\sigma_{t,d},\quad \hat{\sigma}_{t,1}\geq\hat{\sigma}_{t,2}\geq\ldots\geq\hat{\sigma}_{t,r}\geq\ldots\geq\hat{\sigma}_{t,d},\quad r=\sum_{k=1}^{K}r_{k}.
\end{gather*}

From Equation 15, we know that $\sigma_{t,r+1}=\ldots=\sigma_{t,d}=0$. It is easy to show that:

\[
\bm{M}\coloneqq\hat{\bm{V}}_{t,1}=\begin{bmatrix}\bm{M}_{s_{1}} & \bm{M}_{s_{2}} & \ldots & \bm{M}_{s_{K}}\end{bmatrix},
\]

where $\{s_{1},s_{2},\ldots,s_{K}\}=\{1,2,\ldots,K\}$ satisfies $\omega_{s_{1}}(\bm{x}_{t})\geq\omega_{s_{2}}(\bm{x}_{t})\geq\ldots\geq\omega_{s_{K}}(\bm{x}_{t})$, and $\hat{\sigma}_{t,r}=\sqrt{\alpha_{t}}\,\omega_{s_{K}}(\bm{x}_{t})=\sqrt{\alpha_{t}}\min_{k}\omega_{k}(\bm{x}_{t})$. Based on the Davis-Kahan theorem [81], we have:

\begin{align*}
\left\|\left(\bm{I}_{d}-\bm{V}_{t,1}\bm{V}_{t,1}^{\top}\right)\bm{M}\right\|_{F} &\leq \frac{\|\bm{\Delta}_{t}\|_{F}}{\min_{1\leq i\leq r,\, r+1\leq j\leq d}|\hat{\sigma}_{t,i}-\sigma_{t,j}|}\\
&= \frac{\left\|\sqrt{\alpha_{t}}\sum_{k=1}^{K}\nabla_{\bm{x}_{t}}\omega_{k}(\bm{x}_{t})\bm{x}_{t}^{\top}\bm{M}_{k}\bm{M}_{k}^{\top}\right\|_{F}}{\sqrt{\alpha_{t}}\min_{k}\omega_{k}(\bm{x}_{t})}\\
&\leq \frac{\sum_{k=1}^{K}\|\nabla_{\bm{x}_{t}}\omega_{k}(\bm{x}_{t})\|_{F}\,\|\bm{x}_{t}^{\top}\bm{M}_{k}\bm{M}_{k}^{\top}\|_{F}}{\min_{k}\omega_{k}(\bm{x}_{t})}\\
&= \frac{\alpha_{t}}{1-\alpha_{t}}\frac{C_{3}C_{4}}{\min_{k}\omega_{k}(\bm{x}_{t})}.
\end{align*}

Because $\lim_{t\rightarrow 1}\min_{k}\omega_{k}(\bm{x}_{t})=\dfrac{1}{K}$ and $\lim_{t\rightarrow 1}\dfrac{\alpha_{t}}{1-\alpha_{t}}=0$, we have:

\[
\lim_{t\rightarrow 1}\left\|\left(\bm{I}_{d}-\bm{V}_{t,1}\bm{V}_{t,1}^{\top}\right)\bm{M}\right\|_{F}=0.
\]

And from Equation 16, we have:

\[
\lim_{t\rightarrow 1}\left\|\left(\bm{I}_{d}-\bm{U}_{t,1}\bm{U}_{t,1}^{\top}\right)\bm{M}\right\|_{F}=0.
\]
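As a sanity check on the Davis-Kahan step, the bound can be evaluated numerically on a toy instance. The sketch below is an illustration under simplifying assumptions, not the setting of the proof: it takes $d=2$, $r=1$, and symmetric matrices, so that the SVD coincides with an eigendecomposition. The names `eig_sym_2x2` and `davis_kahan_2d` are hypothetical helpers introduced only for this example.

```python
import math

def eig_sym_2x2(a, b, c):
    """Eigenpairs of the symmetric matrix [[a, b], [b, c]], largest eigenvalue first."""
    mean = (a + c) / 2.0
    radius = math.hypot((a - c) / 2.0, b)
    theta = 0.5 * math.atan2(2.0 * b, a - c)  # angle of the leading eigenvector
    v1 = (math.cos(theta), math.sin(theta))
    v2 = (-math.sin(theta), math.cos(theta))
    return (mean + radius, v1), (mean - radius, v2)

def davis_kahan_2d(sigma_hat, phi, delta):
    """Return (lhs, rhs) of the subspace bound for a rank-1 toy model.

    A_hat = sigma_hat * m m^T with m = (cos phi, sin phi) plays the role of the
    low-rank approximation, and A = A_hat + Delta plays the role of the Jacobian.
    delta = (e11, e12, e22) holds the entries of the symmetric perturbation Delta.
    """
    m = (math.cos(phi), math.sin(phi))
    e11, e12, e22 = delta
    a = sigma_hat * m[0] * m[0] + e11
    b = sigma_hat * m[0] * m[1] + e12
    c = sigma_hat * m[1] * m[1] + e22
    (_, _), (lam2, v2) = eig_sym_2x2(a, b, c)
    # In 2D, I - v1 v1^T = v2 v2^T, so the projection residual is |<v2, m>|.
    lhs = abs(v2[0] * m[0] + v2[1] * m[1])
    gap = abs(sigma_hat - lam2)  # |sigma_hat_1 - sigma_2|
    rhs = math.sqrt(e11 ** 2 + 2.0 * e12 ** 2 + e22 ** 2) / gap  # ||Delta||_F / gap
    return lhs, rhs
```

Calling `davis_kahan_2d(0.5, 0.3, (0.01, -0.02, 0.005))` returns a residual that does not exceed the bound, and shrinking the perturbation drives both quantities toward zero, mirroring the limit $t\rightarrow 1$.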

Appendix D Image Editing Experiment Details

D.1 Editing in Unconditional Diffusion Models of Different Datasets

Datasets.

We demonstrate the unconditional editing method on a variety of datasets: FFHQ [63], CelebA-HQ [52], AFHQ [62], Flowers [61], MetFaces [82], and LSUN-church [60].

Models.

Following [30], we use DDPM [1] for CelebA-HQ and LSUN-church, and DDPM trained with P2 weighting [83] for FFHQ, AFHQ, Flowers, and MetFaces. We download the official pre-trained checkpoints at resolution $256\times 256$ and keep all model parameters frozen. We use the same linear schedule as [30], including 100 DDIM inversion steps [3]. Further, we apply quality boosting after $t=0.2$, as proposed in [84].

Edit Time Steps.

We empirically choose the edit time step $t$ for each dataset in the range $[0.5,0.8]$. In practice, we found that time steps within this range give similar editing results. In most of the experiments, the edit time steps are $0.5$ for FFHQ, $0.6$ for CelebA-HQ and LSUN-church, and $0.7$ for AFHQ, Flowers, and MetFaces.

Editing Strength.

In the empirical study of local linearity, we observed that local linearity is well preserved even with a strength of $300$. In practice, we choose the edit strength $\lambda$ in the range $[-15.0,15.0]$, where a larger $\lambda$ leads to stronger semantic editing and a negative $\lambda$ changes the semantics in the opposite direction.
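To make the role of the edit strength concrete, the following is a minimal sketch assuming the edit takes the additive form $\bm{x}_t+\lambda\bm{v}$, with $\bm{v}$ a leading right singular vector of the Jacobian of the predictor. Everything here is a stand-in: `toy_pmp` and its weights `W` are hypothetical and not a diffusion model, the Jacobian is formed by finite differences, and plain power iteration is used for simplicity rather than as a claim about the released implementation.

```python
import math

# Hypothetical 3x3 weights for a toy predictor; NOT a diffusion model.
W = [[0.9, 0.1, 0.0],
     [0.1, 0.5, 0.0],
     [0.0, 0.0, 0.1]]

def toy_pmp(x):
    """Stand-in for the posterior mean predictor: tanh(W x)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]

def jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian with J[j][i] = d f_j / d x_i."""
    columns = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += eps
        x_minus[i] -= eps
        f_plus, f_minus = f(x_plus), f(x_minus)
        columns.append([(p - m) / (2.0 * eps) for p, m in zip(f_plus, f_minus)])
    return [[columns[i][j] for i in range(len(x))] for j in range(len(columns[0]))]

def top_right_singular_vector(J, iters=200):
    """Power iteration on J^T J; returns a unit-norm vector."""
    d = len(J[0])
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        Jv = [sum(row[i] * v[i] for i in range(d)) for row in J]
        w = [sum(J[j][i] * Jv[j] for j in range(len(J))) for i in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v

def apply_edit(x_t, v, lam):
    """Additive edit x_t + lam * v; a negative lam moves in the opposite direction."""
    return [xi + lam * vi for xi, vi in zip(x_t, v)]
```

For example, with `v = top_right_singular_vector(jacobian(toy_pmp, [0.0, 0.0, 0.0]))`, the calls `apply_edit(x_t, v, 15.0)` and `apply_edit(x_t, v, -15.0)` displace `x_t` by the same amount in opposite directions.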

D.2 Comparing with Alternative Manifolds and Methods

Existing Methods.

We compare with four existing methods: NoiseCLR [23], BlendedDiffusion [24], Pullback [30], and Asyrp [29].

Alternative Manifolds.

There are two alternative manifolds where similar training-free approaches can be applied. They involve evaluating the Jacobians $\dfrac{\partial\bm{\epsilon}_{t}}{\partial\bm{h}_{t}}$ (equivalently $\dfrac{\partial\bm{\hat{x}}_{0}}{\partial\bm{h}_{t}}$) and $\dfrac{\partial\bm{\epsilon}_{t}}{\partial\bm{x}_{t}}$, respectively.

  • $\dfrac{\partial\bm{\epsilon}_{t}}{\partial\bm{h}_{t}}$ (or equivalently $\dfrac{\partial\bm{\hat{x}}_{0,t}}{\partial\bm{h}_{t}}$ up to a scale) is the Jacobian of the noise residual $\bm{\epsilon}_{t}$ with respect to the bottleneck feature of $\bm{x}_{t}$.

  • $\dfrac{\partial\bm{\epsilon}_{t}}{\partial\bm{x}_{t}}$ is the Jacobian of the noise residual $\bm{\epsilon}_{t}$ with respect to the input $\bm{x}_{t}$.

Notably, $\dfrac{\partial\bm{\epsilon}_{t}}{\partial\bm{h}_{t}}$ yields barely noticeable edits on images, and hence we present the editing results of $\dfrac{\partial\bm{\epsilon}_{t}}{\partial\bm{x}_{t}}$. Besides, with masking and nullspace projection, $\dfrac{\partial\bm{\epsilon}_{t}}{\partial\bm{x}_{t}}$ also leads to barely noticeable changes in images, so the final comparison is performed without masking and nullspace projection.
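The "up to a scale" equivalence can be made explicit under the standard noise-prediction parameterization, which is assumed here rather than taken from this appendix: $\bm{\hat{x}}_{0,t}=\left(\bm{x}_{t}-\sqrt{1-\alpha_{t}}\,\bm{\epsilon}_{t}\right)/\sqrt{\alpha_{t}}$. Differentiating with $\bm{x}_{t}$ held fixed in the first line gives:

```latex
\begin{align*}
\frac{\partial\bm{\hat{x}}_{0,t}}{\partial\bm{h}_{t}}
  &= -\frac{\sqrt{1-\alpha_{t}}}{\sqrt{\alpha_{t}}}\,
     \frac{\partial\bm{\epsilon}_{t}}{\partial\bm{h}_{t}},\\
\frac{\partial\bm{\hat{x}}_{0,t}}{\partial\bm{x}_{t}}
  &= \frac{1}{\sqrt{\alpha_{t}}}\left(\bm{I}_{d}
     -\sqrt{1-\alpha_{t}}\,\frac{\partial\bm{\epsilon}_{t}}{\partial\bm{x}_{t}}\right).
\end{align*}
```

Under this assumption, the two Jacobians with respect to $\bm{h}_{t}$ differ only by a scalar factor and therefore share singular vectors, whereas the Jacobians with respect to $\bm{x}_{t}$ are related by an affine map rather than a pure rescaling.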

Evaluation Dataset Setup.

In the human evaluation, for each method we randomly select $15$ editing directions on $15$ images. Each direction is transferred to $3$ other images along both the negative and positive directions, giving $90$ transferability test cases in total. Learning time and transfer edit time are averaged over 100 examples. LPIPS [64] and SSIM [65] are calculated over $400$ images for each method.
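The count of transferability test cases follows from enumerating every (direction, target image, sign) triple. A small sketch of that bookkeeping, where the function name and index layout are illustrative assumptions:

```python
from itertools import product

def transfer_test_cases(num_directions=15, targets_per_direction=3, signs=(-1, +1)):
    """Enumerate (direction index, target image index, sign) triples."""
    return list(product(range(num_directions), range(targets_per_direction), signs))

cases = transfer_test_cases()
print(len(cases))  # 15 directions x 3 target images x 2 signs
```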

Human Evaluation Metrics.

We measure both Local Edit Success Rate and Transfer Success Rate via human evaluation on CelebA-HQ. i) Local Edit Success Rate: The subject is shown the source image together with the edited one. If the subject judges that only one major feature among {"eyes", "nose", "hair", "skin", "mouth", "views", "Eyebrows"} is edited, the subject responds with a success, otherwise with a failure. ii) Transfer Success Rate: The subject is shown the source image with its edited version, and another image with its edited version obtained by transferring the editing direction from the source image. The subject responds with a success if the two edited images have the same features changed, otherwise with a failure. We calculate the average success rate among all subjects for both Local Edit Success Rate and Transfer Success Rate. Lastly, we have ensured that no harmful content is generated or presented to the human subjects.
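One way to read "average success rate among all subjects" is as a per-subject success rate that is then averaged over subjects. The sketch below implements that reading; the function name, the data layout, and the example responses are hypothetical, and a pooled average over all responses would be an equally plausible alternative.

```python
def average_success_rate(responses):
    """Average of per-subject success rates.

    responses maps a subject identifier to a list of booleans,
    where True marks a "success" judgment and False a "failure".
    """
    per_subject = [sum(votes) / len(votes) for votes in responses.values() if votes]
    if not per_subject:
        raise ValueError("no responses to average")
    return sum(per_subject) / len(per_subject)

# Hypothetical responses from two subjects on four test cases each.
example = {
    "subject_1": [True, True, False, True],
    "subject_2": [True, False, False, False],
}
rate = average_success_rate(example)  # (0.75 + 0.25) / 2
```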

Learning Time.

Learning time measures the time an edit method takes to compute a local basis (for training-free approaches), to train an implicit function, or to optimize certain variables that enable editing.

D.3 Editing in T2I Diffusion Models

Models.

We generalize our method to three types of T2I diffusion models: DeepFloyd [DeepFloyd], Stable Diffusion [4], and Latent Consistency Model [38]. We download the official checkpoints and keep all model parameters frozen. The same scheduling as that in the unconditional models is applied to DeepFloyd and Stable Diffusion, except that no quality boosting is applied. We follow the original schedule for Latent Consistency Model [38] with the number of inference steps set to $4$.

Edit Time Steps.

We empirically choose the edit time step $t$ as $0.75$ for DeepFloyd and $0.7$ for Stable Diffusion. For Latent Consistency Model, image editing is performed at the second inference step.

Editing Strength.

For unsupervised image editing, we choose $\lambda\in[-5.0,5.0]$ in Stable Diffusion, $\lambda\in[-15.0,15.0]$ in DeepFloyd, and $\lambda\in[-5.0,5.0]$ in Latent Consistency Model. For text-supervised image editing, we choose $\lambda\in[-10.0,10.0]$ in Stable Diffusion, $\lambda\in[-50.0,50.0]$ in DeepFloyd, and $\lambda\in[-10.0,10.0]$ in Latent Consistency Model.
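For reference, the T2I settings above can be collected into a small lookup table. The numeric values are those stated in this section; the dictionary layout, key names, and the `check_strength` helper are illustrative assumptions and not part of any released code.

```python
# Edit settings for the T2I models, as stated in this section.
# Each strength entry is the symmetric bound B of the range [-B, B].
T2I_EDIT_SETTINGS = {
    "DeepFloyd": {
        "edit_step": 0.75, "unsupervised": 15.0, "text_supervised": 50.0,
    },
    "Stable Diffusion": {
        "edit_step": 0.7, "unsupervised": 5.0, "text_supervised": 10.0,
    },
    "Latent Consistency Model": {
        # Editing happens at the second of 4 inference steps, not at a time t.
        "edit_step": "second inference step",
        "unsupervised": 5.0, "text_supervised": 10.0,
    },
}

def check_strength(model, mode, lam):
    """Return lam if it lies in the stated range for (model, mode), else raise."""
    bound = T2I_EDIT_SETTINGS[model][mode]
    if abs(lam) > bound:
        raise ValueError(f"lambda={lam} is outside [-{bound}, {bound}] for {model} ({mode})")
    return lam
```

For instance, `check_strength("DeepFloyd", "text_supervised", -50.0)` passes, while a strength of `6.0` for unsupervised editing in Stable Diffusion is rejected.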