U-DiT TTS: U-Diffusion Vision Transformer for Text-to-Speech

Jing, Xin; Chang, Yi; Yang, Zijiang; Xie, Jiangjian; Triantafyllopoulos, Andreas; Schuller, Bjoern W.

arXiv:2305.13195 (cs)

[Submitted on 22 May 2023]

Abstract: Deep learning has led to considerable advances in text-to-speech synthesis. Most recently, Score-based Generative Models (SGMs), also known as Diffusion Probabilistic Models (DPMs), have gained traction in neural speech synthesis systems due to their ability to produce high-quality synthesized speech. In SGMs, the U-Net architecture and its variants have long dominated as the backbone since their first successful adoption. In this research, we focus on the neural network in diffusion-model-based Text-to-Speech (TTS) systems and propose the U-DiT architecture, exploring the potential of the vision transformer (ViT) architecture as the core component of the diffusion model in a TTS system. The modular design of the U-DiT architecture, which inherits the best parts of U-Net and ViT, allows for great scalability and versatility across different data scales. The proposed U-DiT TTS system is a mel spectrogram-based acoustic model and uses a pretrained HiFi-GAN as the vocoder. The objective (i.e., Fréchet distance) and mean opinion score (MOS) results show that our DiT-TTS system achieves state-of-the-art performance on the single-speaker dataset LJSpeech. Our demos are publicly available at: this https URL
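The abstract names the Fréchet distance as its objective metric but does not say which feature embedding or covariance estimate is used. As a rough illustration only, the sketch below computes the Fréchet distance between two sets of feature vectors under the simplifying assumption that each set is modeled as a Gaussian with diagonal covariance. The function names and the diagonal-covariance shortcut are assumptions made for this example, not the authors' evaluation code.

```python
import math
from statistics import fmean, pvariance


def gaussian_stats(features):
    """Return per-dimension means and population variances.

    `features` is a list of equal-length feature vectors (hypothetical
    embeddings of real or synthesized speech).
    """
    columns = list(zip(*features))
    means = [fmean(column) for column in columns]
    variances = [pvariance(column) for column in columns]
    return means, variances


def frechet_distance_diag(real_features, generated_features):
    """Fréchet distance between two diagonal-covariance Gaussians.

    General form: ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2)).
    With diagonal covariances (an assumption of this sketch) the trace
    term reduces to sum((sqrt(var_r) - sqrt(var_g))^2).
    """
    mu_r, var_r = gaussian_stats(real_features)
    mu_g, var_g = gaussian_stats(generated_features)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    cov_term = sum(
        (math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var_r, var_g)
    )
    return mean_term + cov_term
```

A lower value means the two feature distributions are closer. Full-covariance variants require a matrix square root and are usually computed with a numerical linear algebra library.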

Subjects:	Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2305.13195 [cs.SD]
	(or arXiv:2305.13195v1 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2305.13195

Submission history

From: Xin Jing
[v1] Mon, 22 May 2023 16:25:19 UTC (2,145 KB)
