Face-StyleSpeech: Enhancing Zero-shot Speech Synthesis from Face Images with Improved Face-to-Speech Mapping

Kang, Minki; Han, Wooseok; Yang, Eunho

Computer Science > Computer Vision and Pattern Recognition

arXiv:2311.05844 (cs)

[Submitted on 25 Sep 2023 (v1), last revised 28 Dec 2024 (this version, v2)]

Title:Face-StyleSpeech: Enhancing Zero-shot Speech Synthesis from Face Images with Improved Face-to-Speech Mapping

Authors:Minki Kang, Wooseok Han, Eunho Yang

View PDF HTML (experimental)

Abstract:Generating speech from a face image is crucial for developing virtual humans capable of interacting using their unique voices, without relying on pre-recorded human speech. In this paper, we propose Face-StyleSpeech, a zero-shot Text-To-Speech (TTS) synthesis model that generates natural speech conditioned on a face image rather than reference speech. We hypothesize that learning entire prosodic features from a face image poses a significant challenge. To address this, our TTS model incorporates both face and prosody encoders. The prosody encoder is specifically designed to model speech style characteristics that are not fully captured by the face image, allowing the face encoder to focus on extracting speaker-specific features such as timbre. Experimental results demonstrate that Face-StyleSpeech effectively generates more natural speech from a face image than baselines, even for unseen faces. Samples are available on our demo page.

Comments:	Accepted by ICASSP 2025
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2311.05844 [cs.CV]
	(or arXiv:2311.05844v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2311.05844

Submission history

From: Wooseok Han [view email]
[v1] Mon, 25 Sep 2023 13:46:00 UTC (656 KB)
[v2] Sat, 28 Dec 2024 12:31:01 UTC (662 KB)

Computer Science > Computer Vision and Pattern Recognition

Title:Face-StyleSpeech: Enhancing Zero-shot Speech Synthesis from Face Images with Improved Face-to-Speech Mapping

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Computer Vision and Pattern Recognition

Title:Face-StyleSpeech: Enhancing Zero-shot Speech Synthesis from Face Images with Improved Face-to-Speech Mapping

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators