Fine-Grained and Interpretable Neural Speech Editing

Morrison, Max; Churchwell, Cameron; Pruyne, Nathan; Pardo, Bryan

arXiv:2407.05471 (eess)

[Submitted on 7 Jul 2024]

Abstract: Fine-grained editing of speech attributes, such as prosody (i.e., the pitch, loudness, and phoneme durations), pronunciation, speaker identity, and formants, is useful for fine-tuning and fixing imperfections in human and AI-generated speech recordings for creation of podcasts, film dialogue, and video game dialogue. Existing speech synthesis systems use representations that entangle two or more of these attributes, prohibiting their use in fine-grained, disentangled editing. In this paper, we demonstrate the first disentangled and interpretable representation of speech with comparable subjective and objective vocoding reconstruction accuracy to Mel spectrograms. Our interpretable representation, combined with our proposed data augmentation method, enables training an existing neural vocoder to perform fast, accurate, and high-quality editing of pitch, duration, volume, timbral correlates of volume, pronunciation, speaker identity, and spectral balance.

Comments:	Interspeech 2024
Subjects:	Audio and Speech Processing (eess.AS); Sound (cs.SD)
Cite as:	arXiv:2407.05471 [eess.AS]
	(or arXiv:2407.05471v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2407.05471

Submission history

From: Max Morrison
[v1] Sun, 7 Jul 2024 19:05:52 UTC (56 KB)
