Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs

Fei, Hao; Wu, Shengqiong; Ji, Wei; Zhang, Hanwang; Chua, Tat-Seng

arXiv:2308.13812 (cs)

[Submitted on 26 Aug 2023 (v1), last revised 19 Mar 2024 (this version, v2)]

Abstract: Text-to-video (T2V) synthesis has gained increasing attention in the community, and recently emerged diffusion models (DMs) have shown stronger performance than past approaches. While existing state-of-the-art DMs are capable of high-resolution video generation, they can suffer from key limitations (e.g., action occurrence disorders, crude video motions) in modeling intricate temporal dynamics, one of the cruxes of video synthesis. In this work, we investigate strengthening the awareness of video dynamics in DMs for high-quality T2V generation. Inspired by human intuition, we design an innovative dynamic scene manager (dubbed Dysen) module, which includes (step-1) extracting the key actions from the input text with proper time-order arrangement, (step-2) transforming the action schedules into dynamic scene graph (DSG) representations, and (step-3) enriching the scenes in the DSG with sufficient and reasonable details. Taking advantage of existing powerful LLMs (e.g., ChatGPT) via in-context learning, Dysen realizes (nearly) human-level temporal dynamics understanding. Finally, the resulting video DSG with rich action scene details is encoded as fine-grained spatio-temporal features and integrated into the backbone T2V DM for video generation. Experiments on popular T2V datasets suggest that our Dysen-VDM consistently outperforms prior art by significant margins, especially in scenarios with complex actions. Code at this https URL
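
To make the three steps in the abstract concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes the action schedule (step-1) and the enrichment details (step-3) are already available as hand-written inputs, whereas the paper obtains them from an LLM via in-context learning. All names (`Action`, `build_dsg`, `enrich`, the `has_attribute` relation) and the frame-index representation are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Hypothetical action record: a triple plus a frame interval."""
    subject: str
    predicate: str
    obj: str
    start: int  # first frame index (inclusive)
    end: int    # last frame index (exclusive)


def build_dsg(actions, num_frames):
    """Step-2 (sketch): turn a time-ordered action schedule into one
    scene graph per frame, each graph being a list of triples."""
    ordered = sorted(actions, key=lambda a: (a.start, a.end))
    return [
        [(a.subject, a.predicate, a.obj) for a in ordered if a.start <= t < a.end]
        for t in range(num_frames)
    ]


def enrich(dsg, details):
    """Step-3 (sketch): append attribute triples for every entity that
    appears in a frame's graph. `details` stands in for LLM output."""
    enriched = []
    for graph in dsg:
        entities = {e for s, _, o in graph for e in (s, o)}
        extra = [
            (e, "has_attribute", d)
            for e in sorted(entities)
            for d in details.get(e, [])
        ]
        enriched.append(graph + extra)
    return enriched


# Step-1 output is assumed here: two actions given out of order.
schedule = [
    Action("man", "sit on", "sofa", 2, 4),
    Action("man", "walk to", "sofa", 0, 2),
]
dsg = build_dsg(schedule, num_frames=4)
rich = enrich(dsg, {"sofa": ["red"]})
```

In this toy form, a dynamic scene graph is simply a sequence of per-frame triple lists. How the paper encodes the DSG into spatio-temporal features for the diffusion backbone is not covered by this sketch.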

Comments:	CVPR 2024
Subjects:	Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2308.13812 [cs.AI]
	(or arXiv:2308.13812v2 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2308.13812

Submission history

From: Hao Fei
[v1] Sat, 26 Aug 2023 08:31:48 UTC (3,083 KB)
[v2] Tue, 19 Mar 2024 12:29:54 UTC (3,004 KB)