Cross-Utterance Conditioned VAE for Speech Generation

Li, Yang; Yu, Cheng; Sun, Guangzhi; Zu, Weiqin; Tian, Zheng; Wen, Ying; Pan, Wei; Zhang, Chao; Wang, Jun; Yang, Yang; Sun, Fanglei

arXiv:2309.04156 (cs)

[Submitted on 8 Sep 2023 (v1), last revised 19 Sep 2024 (this version, v2)]

Abstract: Speech synthesis systems powered by neural networks hold promise for multimedia production, but frequently face issues with producing expressive speech and seamless editing. In response, we present the Cross-Utterance Conditioned Variational Autoencoder speech synthesis (CUC-VAE S2) framework to enhance prosody and ensure natural speech generation. This framework leverages the powerful representational capabilities of pre-trained language models and the re-expression abilities of variational autoencoders (VAEs). The core component of the CUC-VAE S2 framework is the cross-utterance CVAE, which extracts acoustic, speaker, and textual features from surrounding sentences to generate context-sensitive prosodic features, more accurately emulating human prosody generation. We further propose two practical algorithms tailored for distinct speech synthesis applications: CUC-VAE TTS for text-to-speech and CUC-VAE SE for speech editing. The CUC-VAE TTS is a direct application of the framework, designed to generate audio with contextual prosody derived from surrounding texts. On the other hand, the CUC-VAE SE algorithm leverages real mel spectrogram sampling conditioned on contextual information, producing audio that closely mirrors real sound and thereby facilitating flexible speech editing based on text such as deletion, insertion, and replacement. Experimental results on the LibriTTS datasets demonstrate that our proposed models significantly enhance speech synthesis and editing, producing more natural and expressive speech.
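
The abstract gives no equations, so the following is only a generic sketch of the conditional-VAE idea it refers to, not the authors' model. It assumes diagonal Gaussian distributions, and `context_prior` is a hypothetical stand-in for a learned network that maps a cross-utterance context embedding to a prior over a latent prosody variable. During training a latent is drawn from the posterior and a KL term pulls the posterior toward the context-conditioned prior; at synthesis time a latent can be drawn from that prior alone.

```python
import math
import random


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


def sample_latent(mu, logvar, rng):
    """Reparameterized sample: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def context_prior(context, weight=0.5):
    """Hypothetical stand-in for a learned prior network p(z | context).

    A real system would use a trained neural network here; this linear map
    with unit variance is an assumption made purely for illustration.
    """
    mu = [weight * c for c in context]
    logvar = [0.0 for _ in context]
    return mu, logvar


rng = random.Random(0)
context = [0.2, -0.4]                    # assumed cross-utterance embedding
prior_mu, prior_logvar = context_prior(context)
post_mu, post_logvar = [0.3, -0.1], [-1.0, -1.0]  # assumed posterior parameters

z_train = sample_latent(post_mu, post_logvar, rng)    # training: sample the posterior
z_infer = sample_latent(prior_mu, prior_logvar, rng)  # synthesis: sample the prior
kl_term = gaussian_kl(post_mu, post_logvar, prior_mu, prior_logvar)
```

All numeric values above are placeholders; only the structure (a prior that depends on context, a reparameterized sample, and a KL penalty between posterior and prior) reflects the general technique.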

Comments:	13 pages
Subjects:	Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2309.04156 [cs.SD]
	(or arXiv:2309.04156v2 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2309.04156

Submission history

From: Yang Li
[v1] Fri, 8 Sep 2023 06:48:41 UTC (3,619 KB)
[v2] Thu, 19 Sep 2024 13:41:57 UTC (7,802 KB)
