MoDiTalker: Motion-Disentangled Diffusion Model for High-Fidelity Talking Head Generation

Kim, Seyeon; Jin, Siyoon; Park, Jihye; Kim, Kihong; Kim, Jiyoung; Nam, Jisu; Kim, Seungryong

Computer Science > Computer Vision and Pattern Recognition

arXiv:2403.19144 (cs)

[Submitted on 28 Mar 2024]

Abstract: Conventional GAN-based models for talking head generation often suffer from limited quality and unstable training. Recent approaches based on diffusion models aim to address these limitations and improve fidelity. However, they still face challenges, including long sampling times and difficulty maintaining temporal consistency due to the high stochasticity of diffusion models. To overcome these challenges, we propose a novel motion-disentangled diffusion model for high-quality talking head generation, dubbed MoDiTalker. We introduce two modules: audio-to-motion (AToM), designed to generate synchronized lip motion from audio, and motion-to-video (MToV), designed to produce high-quality head video that follows the generated motion. AToM captures subtle lip movements by leveraging an audio attention mechanism, while MToV enhances temporal consistency by leveraging an efficient tri-plane representation. Experiments on standard benchmarks demonstrate that our model outperforms existing models. We also provide comprehensive ablation studies and user study results.
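
The abstract does not say what form the audio attention in AToM takes. As a rough illustration only, the sketch below shows generic scaled dot-product cross-attention, in which each motion frame queries a sequence of audio features. The function names, the toy feature vectors, and the choice of plain single-head attention are all assumptions made for this example and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(motion_queries, audio_keys, audio_values):
    """Generic scaled dot-product cross-attention (illustrative, not the paper's code).

    Each motion query attends over every audio frame and receives a
    weighted average of the audio value vectors.
    """
    scale = 1.0 / math.sqrt(len(audio_keys[0]))
    value_dim = len(audio_values[0])
    outputs, weights = [], []
    for query in motion_queries:
        # Similarity between this motion frame and each audio frame.
        scores = [
            scale * sum(q * k for q, k in zip(query, key))
            for key in audio_keys
        ]
        attn = softmax(scores)
        # Weighted sum of audio values, one channel at a time.
        mixed = [
            sum(a * value[c] for a, value in zip(attn, audio_values))
            for c in range(value_dim)
        ]
        outputs.append(mixed)
        weights.append(attn)
    return outputs, weights


# Hypothetical toy data: 2 motion frames, 3 audio frames, feature size 2.
motion_queries = [[1.0, 0.0], [0.0, 1.0]]
audio_keys = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
audio_values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

attended, attn = cross_attention(motion_queries, audio_keys, audio_values)
```

In this toy setup the first motion frame places most of its attention weight on the first audio frame and the second motion frame on the second, which is the alignment behaviour that an audio-conditioned motion generator would rely on.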

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2403.19144 [cs.CV]
	(or arXiv:2403.19144v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2403.19144

Submission history

From: Seyeon Kim
[v1] Thu, 28 Mar 2024 04:35:42 UTC (33,040 KB)
