License: arXiv.org perpetual non-exclusive license
arXiv:2403.02827v1 [cs.CV] 05 Mar 2024

Tuning-Free Noise Rectification for High Fidelity Image-to-Video Generation

Weijie Li, Litong Gong, Yiran Zhu, Fanda Fan, Biao Wang, Tiezheng Ge, Bo Zheng
Alimama Tech, Alibaba Group
Beijing, China
{weijie.lwj0, gonglitong.glt, yizhu.zyr, fanda.ffd,
eric.wb, tiezheng.gtz, bozheng}@alibaba-inc.com
Abstract

Image-to-video (I2V) generation often struggles to maintain high fidelity in open domains. Traditional image animation techniques primarily focus on specific domains such as faces or human poses, which makes them difficult to generalize to open domains. Several recent I2V frameworks based on diffusion models can generate dynamic content for open-domain images but fail to maintain fidelity. We find that two main causes of low fidelity are the loss of image details and the noise prediction biases during the denoising process. To this end, we propose an effective method that can be applied to mainstream video diffusion models. This method achieves high fidelity by supplementing more precise image information and by rectifying the predicted noise. Specifically, given a specified image, our method first adds noise to the input image latent to keep more details, then denoises the noisy latent with proper rectification to alleviate the noise prediction biases. Our method is tuning-free and plug-and-play. The experimental results demonstrate the effectiveness of our approach in improving the fidelity of generated videos. For more image-to-video generated results, please refer to the project website: https://noise-rectification.github.io/.

1 Introduction

With the remarkable breakthroughs of diffusion models in generating exquisite images [7, 37, 36, 40, 39], researchers are exploring the further potential of diffusion models to achieve more coherent video generation. Some recent works [16, 43, 15, 49, 51] have made incremental progress in the text-to-video (T2V) task to generate videos that align with the input text. However, a textual description can correspond to various imaginable videos, which may not necessarily meet people’s specific expectations. Therefore, the reference image is proposed to guide the video generation process, aiming to generate videos that closely align with the given image or even strictly start from the still image, which is called the image-to-video (I2V) task.

The concept of image-to-video is not novel and has long existed in traditional computer vision tasks, such as facial animation [50, 42], body motion synthesis [33], nature-driven animation [28, 17], and video prediction [24, 9, 18], which can all be considered I2V tasks. However, these tasks were either limited to specific domains (such as faces, human poses, and simple natural scenes) or focused on relatively simple scenarios (such as animating fonts [9], drawings [44] or moving rigid objects [19]). The solutions proposed for these specific tasks are difficult to apply to open-domain images. Moreover, previous studies [28, 53, 19, 18, 48, 23] adopted the autoregressive approach to generate the video sequences, which is computationally expensive and still faces challenges in complex open-domain scenarios. Recently, diffusion models have emerged and demonstrated strong generative capabilities and significant extensibility by learning the data distribution from noise. As a video can be considered a temporal sequence (batch) of highly correlated images, it is feasible to process videos in batches using diffusion models. Consequently, there is a growing focus on leveraging diffusion models for the image-to-video task, attracting significant attention from both research and industry.

However, current I2V research [11, 4, 51, 59] primarily relies on enhancing the supervision of image signals to guide the video diffusion model. As a result, the generated videos are only able to resemble the given image. In our view, existing video diffusion models in these works already exhibit strong capabilities for generating dynamic motions, but they struggle to maintain fidelity, which can be attributed to two main factors. One is the loss of image details: for example, adopting IP-Adapter [54] or ControlNet [58] extracts only a partial image representation. The other is the noise prediction bias during the denoising process, which arises because a perfect zero loss is unattainable when training the video diffusion model, even when the entire image information has been injected or concatenated. Inspired by the transition refinement with the pivotal noise vector in recent image editing work [6, 29, 30], we propose to set the direction of initial noise as the pivotal reference in the denoising process. Specifically, we design an image-to-video pipeline, which adopts the “noising and rectified denoising” reverse process to improve the fidelity of generated video. Our method utilizes pre-trained video latent diffusion models (VLDM) to generate fluent motion between frames. During inference, we first add initial noise to the given image to alleviate the loss of detail information, denoted as the “noising” stage. Then we properly rectify the predicted noise using the pivotal reference noise in the reverse denoising timesteps to alleviate the noise prediction biases, denoted as the “rectified-denoising” stage. Additionally, in order to control the retention degree of the reference image, we further introduce a practical step-adaptive intervention strategy based on noise rectification.

In general, we propose an effective method to utilize the existing pre-trained video diffusion models for image-to-video tasks. The comparison experiments with current public I2V works and several I2V attempts in the active community have demonstrated the effectiveness of our methods in generating videos with higher fidelity. Moreover, our method does not require extra training and is simple to implement, which can be seamlessly integrated with current pre-trained open-domain video diffusion models in a plug-and-play manner, enabling high-fidelity I2V generation in open domains.

2 Related Work

The diffusion models have achieved great success in generative tasks in recent years. Due to the high correlation between image and video modalities, many ideas and insights in current video generation work have been inspired by extensive image generation work. Therefore, we introduce the related work on image and video generation here.

2.1 Image Generation with Diffusion Model

Compared to the traditional GAN [10] and VAE [22] based methods, the diffusion models  [45, 14, 7, 47, 46, 36, 40] have demonstrated more powerful capabilities to produce high-quality images with realistic textures and fine details. The U-Net [38] with the attention layer is the widely adopted structure in image diffusion models to predict noise. To save computation costs, Stable Diffusion (SD) [37] proposed the latent diffusion model (LDM), which utilized VAE [22] to encode the image into a latent space and perform the diffusion process on the latent space. To enhance the controllability and support various control conditions such as depth, reference image, normal map and canny map, ControlNet [58] and T2I-Adapter [31] introduced a flexible adapter based on the SD [37] for controlled generation. Recently, IP-Adapter [54] also proposed an image prompt adapter for T2I models to guide the image generation with the reference image. Besides, SDEdit [29] added noise to the input stroke image and progressively denoised the resulting image to increase the realism of the synthesized image. In our image-to-video work, we adopt an inflated 3D U-Net similar to the T2I task and our noise rectification also takes inspiration from the transition refinement in the image editing works [6, 29, 30].

2.2 Video Generation with Diffusion Model

Thanks to the significant progress of text-to-image generation, video generation has also started to develop from the text-to-video (T2V) task. VDM [16] introduced a pioneering video diffusion model that extends the 2D U-Net to a 3D U-Net structure, jointly training both images and videos in the pixel space. Subsequent methods [60, 13, 49, 1, 12, 57, 3, 52, 26] mostly adopted the latent space to reduce memory requirements and speed up the training and inference. To optimize the running time required for video generation, most works (Make-A-Video [43], ModelScopeT2V [49], Latent-Shift [1], AnimateDiff [12]) were built upon the pre-trained T2I models and incorporated temporal modules, enabling batch generation of all video frames simultaneously. Particularly, AnimateDiff [12] only trains a motion module that can be adapted to various personalized T2I models. Text2Video-Zero [21] proposed a training-free sampling method to enrich motion dynamics and maintain temporal consistency with cross-frame attention. Besides, the cascade framework of video diffusion models is also used to generate high-resolution [15, 3, 52] and longer videos [13, 56].

Similar to image generation, introducing more control conditions in video generation is also crucial. To make the generated videos more controllable, recent work has introduced various conditions into the video diffusion models, including depth [43, 8], pose [27, 20], guided motion from trajectory [55], stroke painting [5] or frequency analysis [25]. As to the image condition, existing video generation work mainly draws on experiences from the image generation field, e.g., enhancing the image guidance using ControlNet [58] and IP-Adapter [54]. Besides, Seer [11] concatenated the conditional image latent with the noisy latent and employed causal attention in the temporal module of 3D U-Net for image-to-video tasks. VideoComposer [51] proposed to concatenate the image embedding with the noisy latent along the feature channel, as well as support forwarding the style of the given image into the video latent diffusion model (VLDM). Recently, VideoCrafter [4] extracted the image feature into the VLDM for the image-to-video task. Similarly, I2VGen-XL [59] added the image latent to the noisy latent in the input layer and also built a global encoder to extract the image CLIP feature into the VLDM. However, these image-to-video works either have limited fidelity or require fine-tuning the whole VLDM. In comparison, our noise rectification method is tuning-free and maintains high fidelity.

3 Preliminary

Figure 1: The general framework of image diffusion model and video diffusion model with inflated 3D U-Net structure.

3.1 Image-to-Video Task Definition

All video generation tasks require generating coherent frames that maintain both visual consistency and logical motion. Specifically, image-to-video (I2V) task is defined as generating a video from a specified reference image. Its goal is to transform the static nature of an image into the dynamic visual representation, adding motion and fluidity to the content. Compared to the text-to-video (T2V) task, I2V prioritizes high fidelity with the conditional image, while dynamic motion in the video can be learned through common prior knowledge or driven by the given conditions like the text description or other data forms. Here we focus on the text-conditioned image-to-video task, and this definition can be formulated as, given a still image $I$ and a text description $c$, the generative system outputs a predicted video $V^{0:L-1} = \{\bar{I}^{0}, \dots, \bar{I}^{L-1}\}$, where $L$ represents the video length. The objective is to keep appearance consistent with the given initial image $I$, as well as ensure the generated video aligns with the text description $c$.

3.2 Video Latent Diffusion Models

The diffusion models [45, 14, 47, 46] are a class of generative models inspired by non-equilibrium thermodynamics, which define the Markov chain to perturb data to noise in the diffusion process and then learn to convert noise back to data in the reverse process. Formally, in the diffusion process, given a data distribution $z_0 \sim q(z_0)$, the forward Markov chain gradually adds Gaussian noise to the data sample in the $T$ timesteps, thus obtaining a sequence of noisy data $\{z_1, z_2, \dots, z_T\}$ conditioned on $z_0$, following the transition formula, which can be denoted as:

$$q(z_{1:T} \mid z_0) = \prod_{t=1}^{T} q(z_t \mid z_{t-1}), \tag{1}$$

$$q(z_t \mid z_{t-1}) = \mathcal{N}\left(z_t; \sqrt{\alpha_t}\, z_{t-1}, (1-\alpha_t)\mathbf{I}\right), \tag{2}$$

where $\{\alpha_t \in (0,1)\}_{t=1}^{T}$ is a variance schedule to control the step size. In the reverse process, a model $p_\theta$ is learned to denoise from the noisy prior $z_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ to gradually generate the desired data iteratively following:

$$p_\theta(z_{t-1} \mid z_t) = \mathcal{N}\left(z_{t-1}; \mu_\theta(z_t, t), \Sigma_\theta(z_t, t)\right), \tag{3}$$

where $\theta$ is the model parameters, $\mu_\theta(z_t, t)$ and $\Sigma_\theta(z_t, t)$ denote the predicted mean and variance by the model.
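As a minimal sketch of the forward process in Eq. (2), the snippet below applies the single-step transition repeatedly with NumPy. The linear schedule for $\alpha_t$, the number of steps, the toy latent shape, and the function names `forward_step` and `diffuse` are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def forward_step(z_prev, alpha_t, rng):
    """One transition q(z_t | z_{t-1}) = N(sqrt(alpha_t) z_{t-1}, (1 - alpha_t) I)."""
    noise = rng.standard_normal(z_prev.shape)
    return np.sqrt(alpha_t) * z_prev + np.sqrt(1.0 - alpha_t) * noise

def diffuse(z0, alphas, rng):
    """Run the forward Markov chain and return the list [z_1, ..., z_T]."""
    trajectory = []
    z = z0
    for alpha_t in alphas:
        z = forward_step(z, alpha_t, rng)
        trajectory.append(z)
    return trajectory

rng = np.random.default_rng(0)
T = 1000
# Assumed schedule: alpha_t = 1 - beta_t with beta_t growing linearly.
alphas = 1.0 - np.linspace(1e-4, 2e-2, T)
z0 = np.ones((4, 8, 8))  # toy latent with C = 4 and H' = W' = 8
trajectory = diffuse(z0, alphas, rng)
z_T = trajectory[-1]     # close to a sample from N(0, I) under this schedule
```

Under this assumed schedule the product of all $\alpha_t$ is close to zero, so almost nothing of $z_0$ survives in $z_T$, which is why the reverse process can start from a standard Gaussian prior.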

In the image generative tasks, the denoising model is usually designed as the U-Net network architecture and learned with the objective function

$$\min_{\theta} \mathbb{E}_{z_0 \sim p_{data},\, t,\, \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})} \left[ \left\| \epsilon - \epsilon_\theta(z_t, c, t) \right\|_2^2 \right], \tag{4}$$

where $\epsilon$ and $\epsilon_\theta$ are the actual and predicted noise respectively, $c$ represents various conditions like text, image, or other control signals. Furthermore, to reduce the computational complexity, diffusion models are utilized in a lower-dimensional latent space rather than the pixel space, which is denoted as the latent diffusion model [37].
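The objective in Eq. (4) can be illustrated with a single-sample estimate. The sketch below assumes the standard closed-form forward sample $z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$ with $\bar{\alpha}_t = \prod_{s \le t} \alpha_s$, which the paper does not write out. The predictor `zero_predictor` is a hypothetical stand-in for the U-Net, and the schedule and shapes are arbitrary.

```python
import numpy as np

def noisy_latent(z0, eps, alpha_bar_t):
    """Closed-form forward sample z_t = sqrt(abar_t) z_0 + sqrt(1 - abar_t) eps."""
    return np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * eps

def noise_prediction_loss(predict_noise, z0, cond, t, alpha_bars, rng):
    """Single-sample estimate of the squared L2 error inside Eq. (4)."""
    eps = rng.standard_normal(z0.shape)          # actual noise
    z_t = noisy_latent(z0, eps, alpha_bars[t])
    eps_pred = predict_noise(z_t, cond, t)       # predicted noise
    return float(np.sum((eps - eps_pred) ** 2))

def zero_predictor(z_t, cond, t):
    """Hypothetical placeholder model that always predicts zero noise."""
    return np.zeros_like(z_t)

rng = np.random.default_rng(0)
alphas = 1.0 - np.linspace(1e-4, 2e-2, 1000)     # assumed schedule
alpha_bars = np.cumprod(alphas)
z0 = rng.standard_normal((4, 8, 8))
loss = noise_prediction_loss(zero_predictor, z0, None, 500, alpha_bars, rng)
```

In practice the expectation is approximated by averaging this quantity over mini-batches of data, timesteps, and noise samples, and the loss of a trained model stays above zero.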

Figure 2: Two basic approaches in existing research and community regarding image-to-video generation.
Figure 3: The framework of our tuning-free image-to-video method. (a) represents the inference pipeline, where the input image is noised into the initial latent and the predicted noise of the inflated 3D U-Net will be rectified during the denoising process. (b) illustrates the detailed generation process of our method. The visualization of intermediate steps shows our method can effectively refine the denoising direction, making intermediate results closer to the given image.

Similar to image diffusion generation, video diffusion generation can be regarded as dealing with a batch of images together (see Fig. 1). Recently, video latent diffusion models (VLDM) were also developed upon the text-to-image generation and followed the aforementioned diffusion process, aiming to model the video data from Gaussian noise. Formally, a given video data $V^{0:L-1} \in \mathbb{R}^{L \times 3 \times H \times W}$ will be converted to the latent representation $z_0^{0:L-1} \in \mathbb{R}^{L \times C \times H' \times W'}$ through a VAE encoder [22], where $C$ is the number of feature channels. Besides, due to the temporal consistency and content relevance requirements in the video, the VLDM often involves the additional temporal module [16, 49, 1, 12], thus inflating the denoising model from 2D U-Net to the 3D U-Net. Through the diffusion process $z_t^{0:L-1} = q(z_0^{0:L-1}, t)$ and reverse process $z_{t-1}^{0:L-1} = p_\theta(z_t^{0:L-1}, t)$, the finally denoised video latent $\bar{z}_0^{0:L-1}$ will be processed via the VAE decoder to generate the video.

Inspired by the mainstream text-to-video framework, to generate a video from a still image, we also model the video motion with temporal attention in the inflated 3D U-Net [12]. As shown in Fig. 1(b), to improve the computation efficiency, the video’s frame dimension is treated as the batch axis in the forward pass of the spatial modules, and the video’s spatial dimensions are treated as the batch axis in the forward pass of the temporal modules.
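The axis handling described above can be sketched with plain array reshapes. The tensor layout `(B, L, C, H, W)`, the sequence layout `(B*H*W, L, C)` used for the temporal modules, and the helper names are assumptions chosen for illustration; an actual implementation may order the axes differently.

```python
import numpy as np

def to_spatial_batch(x):
    """(B, L, C, H, W) -> (B*L, C, H, W): each frame becomes an independent image."""
    b, l, c, h, w = x.shape
    return x.reshape(b * l, c, h, w)

def to_temporal_batch(x):
    """(B, L, C, H, W) -> (B*H*W, L, C): each pixel location becomes a length-L sequence."""
    b, l, c, h, w = x.shape
    return x.transpose(0, 3, 4, 1, 2).reshape(b * h * w, l, c)

def from_temporal_batch(y, b, h, w):
    """Inverse of to_temporal_batch, restoring the (B, L, C, H, W) layout."""
    _, l, c = y.shape
    return y.reshape(b, h, w, l, c).transpose(0, 3, 4, 1, 2)

# Toy latent: batch 2, 16 frames, 4 channels, 8x8 spatial resolution.
x = np.arange(2 * 16 * 4 * 8 * 8, dtype=np.float32).reshape(2, 16, 4, 8, 8)
spatial = to_spatial_batch(x)    # shape (32, 4, 8, 8), fed to spatial layers
temporal = to_temporal_batch(x)  # shape (128, 16, 4), fed to temporal attention
restored = from_temporal_batch(temporal, 2, 8, 8)
```

With this arrangement the spatial layers of a pre-trained image model can be reused unchanged, because they only ever see a batch of individual frames, while temporal attention mixes information across the $L$ frames at each spatial location.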

4 Method

4.1 Enhance Image Condition Analysis

Although the text-to-video framework can generate a video clip with relatively coherent motion, the semantic content of the generated video is mainly aligned with the given text description at a coarse-grained level. To control the content consistency between the generated video and the reference image, the mainstream I2V works in existing research and the community can be summarized into two basic types (see Fig.2): One is to incorporate the image condition at the beginning of the reverse process. This approach is mainly inspired by the image generation field like the img2img tasks, such as the image editing task [6, 29, 30], whose idea is to inject the image latent into the initial noise latent. In this way, the reverse denoising process could be implicitly guided towards the direction of the image latent in the latent space. However, this approach can only achieve a resemblance to the given image and there is still a certain gap to high fidelity. A different method involves concatenating the full clean image with the initial noise to introduce more fine-grained details [11, 51, 59]. While this approach improves fidelity, the entire generation framework must be retrained, leading to low scalability and challenges in integrating with existing pre-trained modules like ControlNet [58]. Another method to enhance image fidelity introduces more image feature signals and conditions into the internal computation of the diffusion model at each timestep [4], such as using various ControlNets [58] and IP-Adapter [54]. The image features act as strong supervision to improve the fidelity. However, since feature extraction inevitably loses image details, these approaches tend to learn the overall style or general layout from the original image, making it difficult to achieve high fidelity in terms of fine details.

All the above methods aim to enhance the guidance and control of the initial image in video generation to improve fidelity. However, as shown in Fig. 3(b), the denoising process (represented by the gray arrow) cannot restore the given image even when the initial noisy latents are obtained by adding noise to the given image (represented by the dashed blue arrow). Our analysis suggests that the reason why these methods [4, 51, 59, 11] fail to achieve perfect fidelity lies in the accumulated noise biases during the denoising process, which cause the generated frame latents to deviate from the given image latent. In the training process, although the MSE loss function is utilized to make the predicted noise close to the initial input noise, the training cannot reach a perfect loss of 0. Therefore, there will always be a discrepancy between the predicted noise and the true noise. To further improve fidelity, we draw inspiration from the noise latent and aim to alleviate the noise gap during the denoising process.
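To make the effect of a biased prediction concrete, the short derivation below uses the standard closed-form forward sample of DDPM-style models, which is not written out in this paper. The symbols $\bar{\alpha}_t$ (cumulative product of the schedule), $\hat{z}_0$ (clean-latent estimate) and $\delta_t$ (prediction bias) are introduced here only for illustration.

```latex
\[
\begin{aligned}
z_t &= \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon,
\qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s, \\
\hat{z}_0 &= \frac{z_t - \sqrt{1-\bar{\alpha}_t}\, \epsilon_\theta(z_t, c, t)}{\sqrt{\bar{\alpha}_t}}
 = z_0 - \frac{\sqrt{1-\bar{\alpha}_t}}{\sqrt{\bar{\alpha}_t}}\, \delta_t,
\qquad \delta_t = \epsilon_\theta(z_t, c, t) - \epsilon .
\end{aligned}
\]
```

Under these assumptions the clean latent is recovered exactly only when $\delta_t = 0$, and the error is amplified at large $t$, where $\bar{\alpha}_t$ is small.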

4.2 Noise Rectification Strategy

Our method includes the “noising and rectified denoising” process. Similar to [29], our approach starts by injecting the image latent into the initial noise. Without introducing any additional operations, such a setting could generate a coherent video that resembles the given image in overall style and layout. Taking a different perspective, if the denoising process adopted the known initial noise rather than the predicted biased noise at each timestep, it would result in a video sequence that is entirely faithful but lacks any motion or dynamics. Therefore, to strike a balance between complete fidelity and dynamics, we propose a noise rectification method. The pipeline of our inference process is shown in Fig. 3(a). In some intermediate steps of the denoising process, we rectify the predicted noises by adaptively compensating them with the known initial noise, which is formulated as

$$\tilde{\tilde{n}}_t^{0:L-1} = \mathrm{Rectify}\left(\tilde{n}_t^{0:L-1}, n^{0:L-1}, t, \omega^{0:L-1}, \tau\right), \tag{5}$$

where $\tilde{\tilde{n}}_t^{0:L-1}$ denotes the rectified noise at the $t^{th}$ timestep, $\tilde{n}_t^{0:L-1}$ denotes the predicted noise of the 3D U-Net, $n^{0:L-1}$ denotes the initial sampled noise that is added to a given image, and $\omega^{0:L-1}$ and $\tau$ denote the rectification weight and timestep period.

Algorithm 1 Noise Rectification for Image-to-Video
Input: The given image latent $z^0$, optional text embedding $c$, video length $L$, rectification weight $\omega^{0:L-1}$ and timestep period $\tau$.
Output: The generated video latent $z_0^{0:L-1}$.
1:  $n^{0:L-1} \sim \mathcal{N}(0, \mathbf{I})$
2:  $z_T^{0:L-1} \leftarrow \mathrm{AddNoise}(\mathrm{Repeat}(z^0), n^{0:L-1}, T)$
3:  for $t = T, \dots, 1$ do
4:     Predict noise $\tilde{n}_t^{0:L-1} = \epsilon_\theta(z_t^{0:L-1}, c, t)$
5:     Compute noise gap $\Delta_t^{0:L-1} = n^{0:L-1} - \tilde{n}_t^{0:L-1}$
6:     if $t$ in $\tau$ then
7:        Rectify $\tilde{\tilde{n}}_t^{0:L-1} = \tilde{n}_t^{0:L-1} + \omega^{0:L-1} \cdot \mathrm{Repeat}(\Delta_t^0) + (1 - \omega^{0:L-1}) \cdot \Delta_t^{0:L-1}$
8:     else
9:        $\tilde{\tilde{n}}_t^{0:L-1} = \tilde{n}_t^{0:L-1}$
10:     end if
11:     $z_{t-1}^{0:L-1} \leftarrow \mathrm{Sample}(z_t^{0:L-1}, \tilde{\tilde{n}}_t^{0:L-1})$
12:  end for
13:  return $z_0^{0:L-1}$

Concretely, in our noise rectification strategy, the noise $\widetilde{n}_{t}^{0:L-1}$ predicted by the U-Net at each step $t$ is first obtained:

$$\widetilde{n}_{t}^{0:L-1} = \epsilon_{\theta}(z_{t}^{0:L-1}, c, t), \tag{6}$$

where $z_{t}^{0:L-1}$ is the input latent map at step $t$ and $\epsilon_{\theta}(\cdot)$ is the denoising model (an inflated 3D U-Net). $c$ and $L$ are the text embedding and the video length, respectively. Then, we can naturally calculate the noise gap (dubbed $\Delta^{0:L-1}_{t}$) between the noise initially sampled in our noising process and the noise predicted during the denoising process:

$$\Delta^{0:L-1}_{t} = n^{0:L-1} - \widetilde{n}_{t}^{0:L-1}. \tag{7}$$

We further calibrate the predicted biased noise, which is the key procedure of our method. By introducing the rectification weight factor $\omega^{0:L-1}$, we balance the first frame's noise gap and the subsequent frames' noise gaps to obtain the weighted rectification offset, which is then used to update the originally predicted noise frame by frame:

$$\begin{aligned} \widetilde{\widetilde{n}}_{t}^{0:L-1} &= \widetilde{n}_{t}^{0:L-1} + \omega^{0:L-1} \cdot \mathrm{Repeat}(\Delta^{0}_{t}) \\ &\quad + (1-\omega^{0:L-1}) \cdot \Delta^{0:L-1}_{t}, \end{aligned} \tag{8}$$

where $\mathrm{Repeat}(\cdot)$ is the broadcasting operation to align the temporal dimension.
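Substituting Eq. (7) into Eq. (8) gives an equivalent per-frame form that may make the role of the weight easier to read. The following is only an algebraic rearrangement of the two equations as written, with the superscript $i$ introduced here to denote a single frame index:

```latex
\begin{aligned}
\widetilde{\widetilde{n}}_{t}^{\,i}
  &= \widetilde{n}_{t}^{\,i} + \omega^{i}\,\Delta_{t}^{0} + (1-\omega^{i})\,\Delta_{t}^{i} \\
  &= \widetilde{n}_{t}^{\,i} + \Delta_{t}^{i} + \omega^{i}\,(\Delta_{t}^{0} - \Delta_{t}^{i}) \\
  &= n^{i} + \omega^{i}\,(\Delta_{t}^{0} - \Delta_{t}^{i}).
\end{aligned}
```

Read this way, $\omega^{i}=0$ returns the initially sampled noise $n^{i}$ for frame $i$, while $\omega^{i}=1$ yields $\widetilde{n}_{t}^{\,i} + \Delta_{t}^{0}$, i.e. the predicted noise shifted by the first-frame gap. For the first frame ($i=0$) the bracket vanishes, so the rectified noise equals $n^{0}$ for any weight.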

Figure 4: Visual comparison with current image-to-video methods. We use AnimateDiff [12] as the VLDM. “Ctrl.R.O.” means ControlNet Reference-Only method [58]. Our method achieves higher fidelity in the video sequences with the given image.

The whole process of our image-to-video method is detailed in Algorithm 1. Such a noise rectification method is simple but effective. As illustrated in Fig. 3(b), noise rectification (represented by the green arrow) effectively alleviates the accumulated noise gap, so the noisy latents of the generated frames stay closer to the image latent. In this way, the fine-grained content details of the reference image are well preserved in the generated video. In addition, to control how much of the reference image is retained, we further introduce a step-adaptive intervention strategy based on noise rectification. Specifically, by adjusting the rectification steps $\tau$ and the weight $\omega^{0:L-1}$, our method can control the fidelity of the generated video. It is worth mentioning that our method is tuning-free and can be applied to most current video diffusion models.
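The rectified denoising loop described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `predict_noise` and `sample_step` are hypothetical callables standing in for the U-Net and the sampler, `rectify_steps` is any container of step indices playing the role of $\tau$, and step 0 is assumed to be the first (noisiest) reverse step.

```python
import numpy as np


def rectify(pred, init_noise, omega):
    """Apply the noise-gap rectification to a (L, ...) noise prediction."""
    gap = init_noise - pred                      # per-frame noise gap
    # Reshape the per-frame weights to (L, 1, ..., 1) so they broadcast.
    w = np.asarray(omega, dtype=float).reshape(-1, *([1] * (pred.ndim - 1)))
    # gap[:1] keeps the frame axis, so the first-frame gap broadcasts
    # over all L frames (the Repeat operation).
    return pred + w * gap[:1] + (1.0 - w) * gap


def denoise_with_rectification(z, init_noise, omega, num_steps,
                               rectify_steps, predict_noise, sample_step):
    """Run the reverse process, rectifying only on the selected steps.

    predict_noise(z, step) and sample_step(z, noise, step) are
    hypothetical stand-ins for the denoising model and the sampler.
    """
    for step in range(num_steps):                # assumption: 0 = noisiest
        pred = predict_noise(z, step)
        if step in rectify_steps:
            noise = rectify(pred, init_noise, omega)
        else:
            noise = pred                         # leave prediction as is
        z = sample_step(z, noise, step)
    return z
```

Passing an empty `rectify_steps` reduces the loop to ordinary sampling, which is one way to see that the rectification is a plug-in change to the predicted noise rather than to the model or the sampler.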

5 Experiments

5.1 Experimental Setup

Dataset. We use two public datasets, WebVid10M [2] and LAION-Aesthetic [41], to evaluate our method. For the WebVid10M dataset, we randomly sampled 1000 video-text pairs from its validation subset in proportion to the different categories. For quantitative evaluation, to avoid buffering frames at the beginning of the videos, we selected the 10th frame as the image input for video generation, along with the video's text description. For the LAION-Aesthetic dataset, we also randomly chose 1000 image-text pairs for qualitative evaluation.

| Methods | Image fidelity ↑ | Temporal coherence ↑ | Video-text alignment ↑ |
|---|---|---|---|
| VLDM [12] + SDEdit [29] | 0.7425 | 0.9888 | 0.2548 |
| VLDM [12] + ConcateImage | 0.6944 | 0.9427 | 0.2084 |
| VLDM [12] + Ctrl.R.O. [58] | 0.7689 | 0.9919 | 0.2466 |
| VLDM [12] + IP-Adapter [54] | 0.7650 | 0.9918 | 0.2287 |
| VideoComposer [51] | 0.7483 | 0.9352 | 0.2447 |
| VideoCrafter1-I2V [4] | 0.7695 | 0.9689 | 0.2206 |
| I2VGen-XL [59] | 0.7717 | 0.9560 | 0.2208 |
| Ours | 0.7907 | 0.9882 | 0.2517 |
| Ours + IP-Adapter [54] | 0.8042 | 0.9934 | 0.2405 |
Table 1: Quantitative comparison results on the WebVid dataset [2].

Evaluation Metrics. For the image-to-video generation task, the focus lies on the fidelity and smoothness of the generated videos. Therefore, we assess the generated video from three aspects: image fidelity, temporal coherence, and video-text alignment. Specifically, to evaluate the fidelity between the generated video and the given image, we calculate the CLIP [35] image similarity between each generated frame and the given image. To assess temporal consistency in the video, we evaluate the CLIP score between the generated frames. Besides, since the text description is input as a condition, we also calculate the CLIP text-image similarity to evaluate the semantic relevance between the generated video and the text description.
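As a rough illustration of how such similarity scores can be computed, the sketch below works on precomputed embedding vectors and uses cosine similarity. It rests on several assumptions that the text does not spell out: the embeddings are taken to come from CLIP image and text encoders run elsewhere, scores are averaged over frames, and temporal coherence is measured between consecutive frames. The function names are hypothetical.

```python
import numpy as np


def _normalize(x):
    """Scale vectors along the last axis to unit length."""
    x = np.asarray(x, dtype=np.float64)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def mean_similarity_to_reference(frame_embs, ref_emb):
    """Mean cosine similarity between each frame and one reference.

    With an image embedding as reference this plays the role of image
    fidelity; with a text embedding, video-text alignment.
    """
    return float(np.mean(_normalize(frame_embs) @ _normalize(ref_emb)))


def temporal_coherence(frame_embs):
    """Mean cosine similarity between consecutive frames (assumption)."""
    f = _normalize(frame_embs)
    return float(np.mean(np.sum(f[:-1] * f[1:], axis=-1)))
```

Averaging over all frame pairs instead of consecutive ones would be an equally plausible reading of "between the generated frames"; only the pairing inside `temporal_coherence` would change.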

5.2 Comparisons

Comparison Methods. Following Fig. 2, we group the comparison methods into two types. The first type incorporates the image condition at the input layer. (1) SDEdit [29], a semantic image translation method, can also be used for I2V tasks by simply adding noise to the given image and then denoising. (2) ConcateImage, another simple baseline, concatenates the image condition with the initialization noise and needs to be finetuned to learn the structural information of the given image. The second type injects the image condition at each layer of the VLDM. (3) ControlNet Reference-Only [58] is an effective way to directly link the attention layers of the VLDM to the reference image. (4) IP-Adapter [54] uses an additional cross-attention layer for image prompts to achieve semantic and structural preservation. (5) VideoCrafter1-I2V [4], similar to IP-Adapter, is another implementation of image prompt injection into the VLDM. Besides, (6) VideoComposer [51] and (7) I2VGen-XL [59] combine the two types, injecting the image at both the input and middle layers of the VLDM.

Benefiting from its plug-and-play and tuning-free properties, our method can be combined with the other image-condition enhancing modules mentioned above. To make an intuitive and fair comparison, we evaluate our method under both of the above types, denoted as Ours and Ours+IP-Adapter [54]. For fairness, we select AnimateDiff [12] as the pre-trained VLDM.

Figure 5: Ablation study on the weight of noise rectification. We fix the rectification timestep $\tau$ and change the rectification weights $\omega$ for different frames.

Qualitative Comparison. As shown in Fig. 4, SDEdit [29] and ConcateImage, which only incorporate the image condition into the noisy latent at the beginning of the reverse stage, can only maintain a style similar to that of the given image. In contrast, the methods [58, 54, 51, 4, 59] that iteratively use the image information in the model's intermediate computation preserve more visual features of the given image. In comparison, our method maintains more visual details and achieves high fidelity to the input image. For clearer video samples, please refer to the project website.

Figure 6: Ablation study on the timestep of noise rectification. We fix the rectification weight $\omega$, and the green panels show the rectification start and end timesteps $\tau$.
Figure 7: A plug-and-play extension of our method on current T2V frameworks to realize I2V. (a) Text-to-video generation results for different T2V models. (b) Different T2V frameworks combined with our method for high-fidelity image-to-video generation.

Quantitative Comparison. As shown in Tab.1, our noise rectification method effectively improves the fidelity. Combined with the additional image prompt module [54], our method can obtain the highest video-image fidelity value of 0.8042 and temporal coherence value of 0.9934. Besides, our method still obtains acceptable video-text alignment although we mainly focus on the high fidelity image-to-video task.

5.3 Ablation Study

Our method rectifies the predicted noise in the reverse steps and has two adjustable parameters: the rectification weight $\omega$ and the rectification timestep $\tau$, as introduced in Algorithm 1. We therefore conduct an ablation study on each of these parameters. Specifically, $\tau=[s,e]$ indicates that noise rectification is performed from ratio $s$ to ratio $e$ of the total timesteps.
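To make the ratio notation concrete, the sketch below maps a window $[s, e]$ to discrete step indices. The helper name, the rounding, the half-open interval, and the convention that index 0 is the first reverse step are all assumptions made for illustration; the text only fixes the ratios themselves.

```python
def rectification_steps(num_steps, start_ratio, end_ratio):
    """Return the step indices covered by the ratio window [s, e].

    Index 0 is assumed to be the first (noisiest) reverse step, and the
    window is treated as half-open, so [0, 0] selects no steps at all.
    """
    if not 0.0 <= start_ratio <= end_ratio <= 1.0:
        raise ValueError("expected 0 <= start_ratio <= end_ratio <= 1")
    start = round(start_ratio * num_steps)
    end = round(end_ratio * num_steps)
    return range(start, end)


# With a hypothetical 50-step sampler, [0%, 60%] covers the first 30 steps.
early_window = rectification_steps(50, 0.0, 0.6)
```

The returned `range` supports `step in early_window`, which matches the membership test "if $t$ in $\tau$" used in Algorithm 1.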

Rectification Weight. We fix the rectification timestep $\tau=[0\%,60\%]$ and change the rectification weights $\omega$ for different frames. The ablation results on $\omega^{i}$ are shown in Fig. 5, where the plots above the video sequences indicate the rectification weight $\omega^{i}$ for the $i^{th}$ frame. It can be observed that $\omega^{i}$ affects the fidelity and temporal consistency of subsequent frames. For example, the settings in the third or fourth column may cause abrupt changes in image detail or motion. Therefore, we empirically select the setting of the second column to maintain high fidelity and natural motion.

Rectification Timestep. The rectification timestep period $\tau$ determines in which denoising steps the predicted noise is corrected. As shown in Fig. 6, we fix the rectification weights $\omega$, and the green panels show the rectification start and end timesteps. If noise rectification is not performed, i.e., the first column with $\tau=[0\%,0\%]$, the fidelity of the generated video is poor. Starting from the initial denoising step, as the noise rectification period increases (from $\tau=[0\%,20\%]$ to $\tau=[0\%,100\%]$), the fidelity gradually improves. However, if the rectification only happens in the later part of the denoising process (i.e., $\tau=[30\%,70\%]$ or $\tau=[60\%,100\%]$), the generated video still has poor fidelity. These results indicate that accurately predicting noise at the start of the reverse process is crucial for maintaining image fidelity. Considering that perfect fidelity would sacrifice motion intensity, we strike a balance and set $\tau=[0\%,60\%]$ for all experiments.

Extension to More VLDMs. Our method relies on the motion prior of the VLDM to model dynamic motion; it is tuning-free and can be adapted to other video diffusion models. To evaluate how well our method extends, we selected several recent T2V models and applied our noise rectification method to implement I2V. Besides AnimateDiff [12], ModelScopeT2V [49] is a diffusion-based text-to-video model that uses a spatio-temporal block to model dynamics. Hotshot-XL [32] is an open-source text-to-GIF model developed to work alongside the Stable Diffusion XL (SD-XL) model [34]. We evaluate these three T2V models and extend them to I2V using our plug-and-play noise rectification method. As shown in Fig. 7, based on pre-trained motion priors, our method can maintain high fidelity and consistent animation.

6 Conclusion

In this work, we propose a simple but effective noise rectification method for image-to-video generation in open domains. We analyze the challenges in I2V in depth and propose a tuning-free approach that ensures high fidelity through a noising and rectified denoising process. Our method is plug-and-play and can be applied to other video latent diffusion models to realize I2V. Experimental results demonstrate the effectiveness of our method. We hope that our method provides a new idea for improving fidelity in the video synthesis field. Notably, our method achieves higher fidelity at the cost of some motion intensity. Therefore, in future work, we will continue to focus on increasing the motion intensity while maintaining high fidelity.

References

  • An et al. [2023] Jie An, Songyang Zhang, Harry Yang, Sonal Gupta, Jia-Bin Huang, Jiebo Luo, and Xi Yin. Latent-shift: Latent diffusion with temporal shift for efficient text-to-video generation. arXiv preprint arXiv:2304.08477, 2023.
  • Bain et al. [2021] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In ICCV, pages 1728–1738, 2021.
  • Blattmann et al. [2023] Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In CVPR, pages 22563–22575. IEEE, 2023.
  • Chen et al. [2023a] Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, et al. Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512, 2023a.
  • Chen et al. [2023b] Tsai-Shien Chen, Chieh Hubert Lin, Hung-Yu Tseng, Tsung-Yi Lin, and Ming-Hsuan Yang. Motion-conditioned diffusion model for controllable video synthesis. arXiv preprint arXiv:2304.14404, 2023b.
  • Choi et al. [2021] Jooyoung Choi, Sungwon Kim, Yonghyun Jeong, Youngjune Gwon, and Sungroh Yoon. ILVR: conditioning method for denoising diffusion probabilistic models. In ICCV, pages 14347–14356. IEEE, 2021.
  • Dhariwal and Nichol [2021] Prafulla Dhariwal and Alexander Quinn Nichol. Diffusion models beat gans on image synthesis. In NIPS, pages 8780–8794, 2021.
  • Esser et al. [2023] Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. In ICCV, pages 7346–7356, 2023.
  • Franceschi et al. [2020] Jean-Yves Franceschi, Edouard Delasalles, Mickaël Chen, Sylvain Lamprier, and Patrick Gallinari. Stochastic latent residual video prediction. In ICML, pages 3233–3246. PMLR, 2020.
  • Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. NIPS, 27, 2014.
  • Gu et al. [2023] Xianfan Gu, Chuan Wen, Jiaming Song, and Yang Gao. Seer: Language instructed video prediction with latent diffusion models. arXiv preprint arXiv:2303.14897, 2023.
  • Guo et al. [2023] Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023.
  • He et al. [2022] Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity video generation. arXiv preprint arXiv:2211.13221, 2022.
  • Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NIPS, 2020.
  • Ho et al. [2022a] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022a.
  • Ho et al. [2022b] Jonathan Ho, Tim Salimans, Alexey A. Gritsenko, William Chan, Mohammad Norouzi, and David J. Fleet. Video diffusion models. In NIPS, 2022b.
  • Holynski et al. [2021] Aleksander Holynski, Brian L Curless, Steven M Seitz, and Richard Szeliski. Animating pictures with eulerian motion fields. In CVPR, pages 5810–5819, 2021.
  • Höppe et al. [2022] Tobias Höppe, Arash Mehrjou, Stefan Bauer, Didrik Nielsen, and Andrea Dittadi. Diffusion models for video prediction and infilling. arXiv preprint arXiv:2206.07696, 2022.
  • Hu et al. [2022] Yaosi Hu, Chong Luo, and Zhenzhong Chen. Make it move: controllable image-to-video generation with text descriptions. In CVPR, pages 18219–18228, 2022.
  • Karras et al. [2023] Johanna Karras, Aleksander Holynski, Ting-Chun Wang, and Ira Kemelmacher-Shlizerman. Dreampose: Fashion image-to-video synthesis via stable diffusion. arXiv preprint arXiv:2304.06025, 2023.
  • Khachatryan et al. [2023] Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. arXiv preprint arXiv:2303.13439, 2023.
  • Kingma and Welling [2014] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014.
  • Le Moing et al. [2021] Guillaume Le Moing, Jean Ponce, and Cordelia Schmid. Ccvs: context-aware controllable video synthesis. NIPS, 34:14042–14055, 2021.
  • Li et al. [2018] Yijun Li, Chen Fang, Jimei Yang, Zhaowen Wang, Xin Lu, and Ming-Hsuan Yang. Flow-grounded spatial-temporal video prediction from still images. In ECCV, pages 600–615, 2018.
  • Li et al. [2023] Zhengqi Li, Richard Tucker, Noah Snavely, and Aleksander Holynski. Generative image dynamics. arXiv preprint arXiv:2309.07906, 2023.
  • Luo et al. [2023] Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jinren Zhou, and Tieniu Tan. Videofusion: Decomposed diffusion models for high-quality video generation. arXiv preprint arXiv:2303.08320, 2023.
  • Ma et al. [2023] Yue Ma, Yingqing He, Xiaodong Cun, Xintao Wang, Ying Shan, Xiu Li, and Qifeng Chen. Follow your pose: Pose-guided text-to-video generation using pose-free videos. arXiv preprint arXiv:2304.01186, 2023.
  • Mahapatra and Kulkarni [2022] Aniruddha Mahapatra and Kuldeep Kulkarni. Controllable animation of fluid elements in still images. In CVPR, pages 3667–3676, 2022.
  • Meng et al. [2022] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. In ICLR. OpenReview.net, 2022.
  • Mokady et al. [2023] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In CVPR, pages 6038–6047. IEEE, 2023.
  • Mou et al. [2023] Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023.
  • Mullan et al. [2023] John Mullan, Duncan Crawbuck, and Aakash Sastry. Hotshot-XL, 2023.
  • Ni et al. [2023] Haomiao Ni, Changhao Shi, Kai Li, Sharon X Huang, and Martin Renqiang Min. Conditional image-to-video generation with latent flow diffusion models. In CVPR, pages 18444–18455, 2023.
  • Podell et al. [2023] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021.
  • Ramesh et al. [2022] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10674–10685. IEEE, 2022.
  • Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241. Springer, 2015.
  • Ruiz et al. [2023] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In CVPR, pages 22500–22510, 2023.
  • Saharia et al. [2022] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In NIPS, 2022.
  • Schuhmann et al. [2021] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
  • Siarohin et al. [2019] Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. First order motion model for image animation. NIPS, 32, 2019.
  • Singer et al. [2023] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-a-video: Text-to-video generation without text-video data. In ICLR. OpenReview.net, 2023.
  • Smith et al. [2023] Harrison Jesse Smith, Qingyuan Zheng, Yifei Li, Somya Jain, and Jessica K Hodgins. A method for automatically animating children’s drawings of the human figure. arXiv preprint arXiv:2303.12741, 2023.
  • Sohl-Dickstein et al. [2015] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, pages 2256–2265. PMLR, 2015.
  • Song et al. [2020a] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020a.
  • Song et al. [2020b] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020b.
  • Voleti et al. [2022] Vikram Voleti, Alexia Jolicoeur-Martineau, and Chris Pal. Mcvd-masked conditional video diffusion for prediction, generation, and interpolation. NIPS, 35:23371–23385, 2022.
  • Wang et al. [2023a] Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023a.
  • Wang et al. [2021] Ting-Chun Wang, Arun Mallya, and Ming-Yu Liu. One-shot free-view neural talking-head synthesis for video conferencing. In CVPR, pages 10039–10049, 2021.
  • Wang et al. [2023b] Xiang Wang, Hangjie Yuan, Shiwei Zhang, Dayou Chen, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, and Jingren Zhou. Videocomposer: Compositional video synthesis with motion controllability. arXiv preprint arXiv:2306.02018, 2023b.
  • Wang et al. [2023c] Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103, 2023c.
  • Yang et al. [2023] Ruihan Yang, Prakhar Srivastava, and Stephan Mandt. Diffusion probabilistic modeling for video generation. Entropy, 25(10):1469, 2023.
  • Ye et al. [2023] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023.
  • Yin et al. [2023a] Shengming Yin, Chenfei Wu, Jian Liang, Jie Shi, Houqiang Li, Gong Ming, and Nan Duan. Dragnuwa: Fine-grained control in video generation by integrating text, image, and trajectory. arXiv preprint arXiv:2308.08089, 2023a.
  • Yin et al. [2023b] Shengming Yin, Chenfei Wu, Huan Yang, Jianfeng Wang, Xiaodong Wang, Minheng Ni, Zhengyuan Yang, Linjie Li, Shuguang Liu, Fan Yang, Jianlong Fu, Ming Gong, Lijuan Wang, Zicheng Liu, Houqiang Li, and Nan Duan. NUWA-XL: diffusion over diffusion for extremely long video generation. In ACL, pages 1309–1320. Association for Computational Linguistics, 2023b.
  • Zhang et al. [2023a] David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, and Mike Zheng Shou. Show-1: Marrying pixel and latent diffusion models for text-to-video generation. arXiv preprint arXiv:2309.15818, 2023a.
  • Zhang et al. [2023b] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, pages 3836–3847, 2023b.
  • Zhang et al. [2023c] Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu Qin, Xiang Wang, Deli Zhao, and Jingren Zhou. I2vgen-xl: High-quality image-to-video synthesis via cascaded diffusion models. arXiv preprint arXiv:2311.04145, 2023c.
  • Zhou et al. [2022] Daquan Zhou, Weimin Wang, Hanshu Yan, Weiwei Lv, Yizhe Zhu, and Jiashi Feng. Magicvideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018, 2022.