Latent Video Diffusion Models for High-Fidelity Long Video Generation

He, Yingqing; Yang, Tianyu; Zhang, Yong; Shan, Ying; Chen, Qifeng

arXiv:2211.13221 (cs)

[Submitted on 23 Nov 2022 (v1), last revised 20 Mar 2023 (this version, v2)]

Abstract: AI-generated content has attracted considerable attention recently, but photo-realistic video synthesis remains challenging. Although many attempts using GANs and autoregressive models have been made in this area, the visual quality and length of generated videos are far from satisfactory. Diffusion models have shown remarkable results recently but require significant computational resources. To address this, we introduce lightweight video diffusion models that leverage a low-dimensional 3D latent space, significantly outperforming previous pixel-space video diffusion models under a limited computational budget. In addition, we propose hierarchical diffusion in the latent space so that longer videos with more than one thousand frames can be produced. To further overcome the performance degradation in long video generation, we propose conditional latent perturbation and unconditional guidance, which effectively mitigate the errors accumulated as the video length is extended. Extensive experiments on small-domain datasets of different categories suggest that our framework generates more realistic and longer videos than previous strong baselines. We additionally provide an extension to large-scale text-to-video generation to demonstrate the superiority of our work. Our code and models will be made publicly available.
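
The abstract names conditional latent perturbation and unconditional guidance but gives no formulas. As a rough illustration only, the Python sketch below assumes that perturbation means adding Gaussian noise to the conditioning latents, and that guidance takes the familiar classifier-free form: a weighted blend of conditional and unconditional noise predictions. The function names, `noise_scale`, and `guidance_scale` are hypothetical and do not come from the paper or its released code.

```python
import random


def perturb_condition(cond_latent, noise_scale, rng):
    """Add zero-mean Gaussian noise to each conditioning latent value.

    Assumption: the perturbation is plain additive Gaussian noise; the
    abstract does not specify the actual noising scheme.
    """
    return [z + noise_scale * rng.gauss(0.0, 1.0) for z in cond_latent]


def guided_prediction(eps_cond, eps_uncond, guidance_scale):
    """Blend conditional and unconditional noise predictions.

    Assumption: the classifier-free-guidance form
    eps = eps_uncond + w * (eps_cond - eps_uncond).
    With w = 0 the result is purely unconditional; with w = 1 it is
    purely conditional.
    """
    return [u + guidance_scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs match
    noisy_cond = perturb_condition([0.5, -1.0, 2.0], noise_scale=0.1, rng=rng)
    eps = guided_prediction([1.0, 2.0], [0.0, 1.0], guidance_scale=2.0)
    print(noisy_cond)
    print(eps)
```

Under these assumptions, perturbing the conditioning latents during training would expose the model to slightly corrupted conditions, like the ones it meets when conditioning on its own earlier outputs. The guidance scale controls how strongly each denoising step follows those conditions.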

Comments:	Project Page: this https URL Github: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2211.13221 [cs.CV]
	(or arXiv:2211.13221v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2211.13221

Submission history

From: Yingqing He
[v1] Wed, 23 Nov 2022 18:58:39 UTC (24,033 KB)
[v2] Mon, 20 Mar 2023 17:29:45 UTC (4,712 KB)
