Authored Publications
We propose a multitask training method for attention-based end-to-end speech recognition models. We regularize the decoder in a listen, attend, and spell model by multitask training on both audio-text and text-only data. Trained on the 100-hour subset of LibriSpeech, the proposed method leads to an 11% relative performance improvement over the baseline and is comparable to language model shallow fusion, without requiring an additional neural network during decoding. We observe a similar trend on the whole 960-hour LibriSpeech training set. Analyses of sample output sentences demonstrate that the proposed method can incorporate language-level information, suggesting its effectiveness in real-world applications.
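As a rough illustration of the multitask objective described above (not the paper's actual implementation), the training loss can be viewed as a weighted sum of a cross-entropy ASR loss on audio-text pairs and a language-model-style cross-entropy that trains the decoder on text-only data. The `weight` hyperparameter and all function names here are illustrative assumptions:

```python
import numpy as np

def cross_entropy(logits, targets):
    # Softmax cross-entropy averaged over tokens.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(targets)), targets])

def multitask_loss(asr_logits, asr_targets, lm_logits, lm_targets, weight=0.3):
    """Weighted sum of the ASR loss (audio-text batch) and a decoder
    LM-style loss (text-only batch). `weight` is an assumed hyperparameter."""
    return cross_entropy(asr_logits, asr_targets) + weight * cross_entropy(lm_logits, lm_targets)
```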
Sparse, Efficient, and Semantic MixIT: Taming In-the-Wild Unsupervised Sound Separation
Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) (2021)
Supervised neural network training has led to significant progress on single-channel sound separation. This approach relies on ground truth isolated sources, which precludes scaling to widely available mixture data and limits progress on open-domain tasks. The recent mixture invariant training (MixIT) method enables training on in-the-wild data; however, it suffers from two outstanding problems. First, it produces models which tend to over-separate, producing more output sources than are present in the input. Second, the exponential computational complexity of the MixIT loss limits the number of feasible output sources. In this paper we address both issues. To combat over-separation we introduce new losses: sparsity losses that favor fewer output sources and a covariance loss that discourages correlated outputs. We also experiment with a semantic classification loss by predicting weak class labels for each mixture. To handle larger numbers of sources, we introduce an efficient approximation using a fast least-squares solution, projected onto the MixIT constraint set. Our experiments show that the proposed losses curtail over-separation and improve overall performance. The best performance is achieved using larger numbers of output sources, enabled by our efficient MixIT loss, combined with sparsity losses to prevent over-separation. On the FUSS test set, we achieve over 13 dB in multi-source SI-SNR improvement, while boosting single-source reconstruction SI-SNR by over 17 dB.
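The sparsity and covariance ideas can be sketched roughly as follows. Both functions are illustrative stand-ins, not the exact losses from the paper (which operate on separated waveforms with different normalizations): the sparsity term is an L1/L2 ratio over per-source energies, which shrinks as energy concentrates in fewer outputs, and the covariance term penalizes correlated output pairs.

```python
import numpy as np

def sparsity_loss(sources, eps=1e-8):
    """L1/L2 ratio over per-source energies (sources: [num_sources, time]).
    Smaller when energy is concentrated in fewer output sources."""
    norms = np.sqrt((sources ** 2).sum(axis=-1))  # per-source RMS energy
    return norms.sum() / (np.sqrt((norms ** 2).sum()) + eps)

def covariance_loss(sources):
    """Sum of absolute off-diagonal covariances: discourages correlated outputs."""
    s = sources - sources.mean(axis=-1, keepdims=True)
    cov = s @ s.T / s.shape[-1]
    off_diag = cov - np.diag(np.diag(cov))
    return np.abs(off_diag).sum()
```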
WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis
Nanxin Chen
Yu Zhang
Mohammad Norouzi
Najim Dehak
William Chan
Interspeech (2021)
This paper introduces WaveGrad 2, an end-to-end non-autoregressive generative model for text-to-speech synthesis trained to estimate the gradients of the data density. Unlike recent TTS systems which are a cascade of separately learned models, the proposed model requires only the text or phoneme sequence during training, learns all parameters end-to-end without intermediate features, and can generate natural speech audio with great variety. This is achieved by the score matching objective, which optimizes the network to model the score function of the real data distribution. Output waveforms are generated using an iterative refinement process beginning from a random noise sample. Like our prior work, WaveGrad 2 offers a natural way to trade inference speed for sample quality by adjusting the number of refinement steps. Experiments reveal that the model can generate high-fidelity audio, closing the gap between end-to-end and contemporary systems and approaching the performance of a state-of-the-art neural TTS system. We further carry out various ablations to study the impact of different model configurations.
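The iterative refinement procedure is essentially ancestral sampling from a diffusion model: start from Gaussian noise and repeatedly apply a denoising update, with the step count controlling the speed/quality trade-off. A schematic numpy version, with the learned network replaced by a placeholder `score_fn` and all text conditioning omitted, might look like:

```python
import numpy as np

def refine(score_fn, shape, betas, rng):
    """Schematic iterative refinement (DDPM-style ancestral sampling).
    `score_fn(y, n)` is a stand-in for the model's noise estimate at step n;
    `betas` is the noise schedule, whose length sets the number of steps."""
    y = rng.standard_normal(shape)              # start from pure noise
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    for n in reversed(range(len(betas))):
        eps = score_fn(y, n)                    # estimated noise component
        y = (y - betas[n] / np.sqrt(1 - alpha_bar[n]) * eps) / np.sqrt(alphas[n])
        if n > 0:                               # inject noise except at the last step
            y += np.sqrt(betas[n]) * rng.standard_normal(shape)
    return y
```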
This paper introduces WaveGrad, a conditional model for waveform generation which estimates gradients of the data density. The model is built on prior work on score matching and diffusion probabilistic models. It starts from a Gaussian white noise signal and iteratively refines the signal via a gradient-based sampler conditioned on the mel-spectrogram. WaveGrad offers a natural way to trade inference speed for sample quality by adjusting the number of refinement steps, and bridges the gap between non-autoregressive and autoregressive models in terms of audio quality. We find that it can generate high-fidelity audio samples using as few as six iterations. Experiments show that WaveGrad outperforms adversarial non-autoregressive baselines and matches a strong likelihood-based autoregressive baseline using fewer sequential operations. Audio samples are available at https://wavegrad.github.io/
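The training side of this gradient-estimation setup can be sketched as a denoising objective: corrupt clean audio with noise at a randomly sampled level, then train the network to predict that noise. This is a simplified stand-in (mel-spectrogram conditioning is omitted, and `model_fn` is a placeholder, not the paper's architecture):

```python
import numpy as np

def wavegrad_style_loss(model_fn, audio, alpha_bar, rng):
    """Schematic denoising score-matching loss: add noise at a random level
    and train the model to predict that noise, conditioned on the level."""
    n = rng.integers(len(alpha_bar))            # sample a noise level
    noise = rng.standard_normal(audio.shape)
    noisy = np.sqrt(alpha_bar[n]) * audio + np.sqrt(1 - alpha_bar[n]) * noise
    pred = model_fn(noisy, np.sqrt(alpha_bar[n]))
    return np.mean(np.abs(pred - noise))        # L1 between predicted and true noise
```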
We describe a sequence-to-sequence neural network which can directly generate speech waveforms from text inputs. The architecture extends the Tacotron model by incorporating a normalizing flow in the decoder loop. Output waveforms are modeled as a sequence of non-overlapping fixed-length frames, each one containing hundreds of samples. The inter-dependencies of waveform samples within each frame are modeled using the normalizing flow, enabling parallel training and synthesis. Longer-term dependencies are handled autoregressively by conditioning each flow on its preceding frames. The model allows for straightforward optimization towards the maximum likelihood objective, without intermediate spectral features or additional loss terms. Contemporary state-of-the-art TTS systems use a sequence of separately learned models: one (such as Tacotron) which generates intermediate features (such as spectrograms) from text, followed by a vocoder model (such as WaveRNN) which generates waveform samples from the intermediate features. The proposed system, in contrast, does not use a fixed intermediate representation, and learns all parameters end-to-end. We demonstrate (to the best of our knowledge) the first system in the literature to do so successfully. Experiments show that the quality of speech generated from the proposed model is nearly competitive with the state-of-the-art neural TTS methods, with significantly improved generation speed.
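The per-frame modeling can be illustrated with a single affine flow step conditioned on the previous frame: the forward map is exactly invertible with a cheap log-determinant, which is what makes maximum-likelihood training straightforward. The real model stacks many such steps with learned conditioning networks; the linear conditioning below is purely an assumption for illustration.

```python
import numpy as np

def affine_flow_forward(z, prev_frame, scale_w, shift_w):
    """One affine flow step: map latent z to a waveform frame, conditioned on
    the previous frame (autoregressive across frames, parallel within)."""
    log_scale = np.tanh(prev_frame @ scale_w)   # stand-in conditioning network
    shift = prev_frame @ shift_w
    x = z * np.exp(log_scale) + shift
    log_det = log_scale.sum()                   # log |det Jacobian| for the likelihood
    return x, log_det

def affine_flow_inverse(x, prev_frame, scale_w, shift_w):
    """Exact inverse of the forward step: recover the latent from the frame."""
    log_scale = np.tanh(prev_frame @ scale_w)
    shift = prev_frame @ shift_w
    return (x - shift) * np.exp(-log_scale)
```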
Although neural end-to-end text-to-speech models can synthesize highly natural speech, there is still room for improvement in their efficiency during inference. This paper proposes a non-autoregressive neural text-to-speech model augmented with a variational autoencoder-based residual encoder. This model, called Parallel Tacotron, is highly parallelizable during both training and inference, allowing efficient synthesis on modern parallel hardware. The use of the variational autoencoder helps to relax the one-to-many mapping nature of the text-to-speech problem. To further improve the naturalness, we introduce an iterative spectrogram loss, which is inspired by iterative refinement, and lightweight convolution, which can efficiently capture local contexts. Experimental results show that Parallel Tacotron matches a strong autoregressive baseline in subjective naturalness with significantly decreased inference time.
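The two additions can be sketched schematically. Both functions below are simplified stand-ins rather than the paper's exact formulations: the iterative loss applies a spectrogram L1 to every refinement iteration's output, and the lightweight convolution is a depthwise convolution with a softmax-normalized kernel (shared across channels here for brevity).

```python
import numpy as np

def iterative_spectrogram_loss(predictions, target):
    """Sum an L1 spectrogram loss over every refinement iteration's output,
    so each intermediate prediction is pushed toward the target."""
    return sum(np.mean(np.abs(p - target)) for p in predictions)

def lightweight_conv(x, kernel):
    """Lightweight convolution sketch: depthwise conv over time (x: [T, C])
    whose kernel is softmax-normalized, so it computes a learned local average."""
    w = np.exp(kernel) / np.exp(kernel).sum()   # softmax-normalized weights
    k = len(w)
    pad = np.pad(x, ((k // 2, k // 2), (0, 0)))
    return np.stack([w @ pad[t:t + k] for t in range(x.shape[0])])
```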
Generating diverse and natural text-to-speech samples using quantized fine-grained VAE and autoregressive prosody prior
Guangzhi Sun
Yu Zhang
ICASSP (2020)
Recently proposed approaches for fine-grained prosody control of end-to-end text-to-speech samples enable precise control of the prosody of synthesized speech.
Such models incorporate a fine-grained variational autoencoder (VAE) structure into a sequence-to-sequence model, extracting latent prosody features for each input token (e.g., phonemes).
Generating samples using the standard VAE prior, an independent Gaussian at each time step, results in very unnatural and discontinuous speech, with dramatic variation between phonemes.
In this paper we propose a sequential prior in the discrete latent space which can be used to generate more natural samples.
This is accomplished by discretizing the latent prosody features using vector quantization, and training an autoregressive (AR) prior model over the result.
The AR prior is learned separately from the training of the posterior.
We evaluate the approach using subjective listening tests, objective metrics of automatic speech recognition (ASR) performance, as well as measurements of prosody attributes including volume, pitch, and phoneme duration.
Compared to the fine-grained VAE baseline, the proposed model achieves equally good copy synthesis reconstruction performance, but significantly improves naturalness in sample generation.
The diversity of the prosody in random samples better matches that of the real speech.
Furthermore, initial experiments demonstrate that samples generated from the quantized latent space can be used as an effective data augmentation strategy to improve ASR performance.
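The discretization step above can be illustrated with plain nearest-neighbor vector quantization: each continuous prosody latent is mapped to its closest codebook entry, and the resulting index sequence is what the autoregressive prior is trained on. Codebook learning and the straight-through gradient are omitted in this sketch:

```python
import numpy as np

def vector_quantize(latents, codebook):
    """Map each continuous latent (latents: [N, D]) to its nearest codebook
    entry (codebook: [K, D]). Returns discrete indices (inputs for the AR
    prior) and the quantized vectors."""
    d = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d.argmin(axis=1)
    return idx, codebook[idx]
```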
In recent years, rapid progress has been made on the problem of single-channel sound separation using supervised training of deep neural networks. In such supervised approaches, a model is trained to predict the component sources from synthetic mixtures created by adding up isolated ground-truth sources. Reliance on this synthetic training data is problematic because good performance depends upon the degree of match between the training data and real-world audio, especially in terms of the acoustic conditions and distribution of sources. The acoustic properties can be challenging to accurately simulate, and the distribution of sound types may be hard to replicate. In this paper, we propose a completely unsupervised method, mixture invariant training (MixIT), that requires only single-channel acoustic mixtures. In MixIT, training examples are constructed by mixing together existing mixtures, and the model separates them into a variable number of latent sources, such that the separated sources can be remixed to approximate the original mixtures. We show that MixIT can achieve competitive performance compared to supervised methods on speech separation. Using MixIT in a semi-supervised learning setting enables unsupervised domain adaptation and learning from large amounts of real-world data without ground-truth source waveforms. In particular, we significantly improve reverberant speech separation performance by incorporating reverberant mixtures, train a speech enhancement system from noisy mixtures, and improve universal sound separation by incorporating a large amount of in-the-wild data.
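The core MixIT idea, remixing separated sources to reconstruct the input mixtures under the best assignment, can be sketched with a brute-force search over binary mixing matrices. Note the paper's loss is a negative-SNR criterion rather than the MSE used here; this version just shows the invariance over assignments:

```python
import itertools
import numpy as np

def mixit_loss(est_sources, mix1, mix2):
    """MixIT sketch: search over all 2^M assignments of the M estimated
    sources to the two input mixtures, and score the best remix by MSE."""
    best = np.inf
    for bits in itertools.product([0, 1], repeat=len(est_sources)):
        remix1 = sum(s for s, b in zip(est_sources, bits) if b == 0)
        remix2 = sum(s for s, b in zip(est_sources, bits) if b == 1)
        err = np.mean((remix1 - mix1) ** 2) + np.mean((remix2 - mix2) ** 2)
        best = min(best, err)
    return best
```

The exponential sweep over assignments is exactly the computational bottleneck that the efficient least-squares approximation in the WASPAA paper above is designed to avoid.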
We propose a hierarchical, fine-grained and interpretable latent model for prosody based on Tacotron 2. This model achieves multi-resolution modeling by conditioning finer-level prosody representations on coarser-level ones. In addition, the hierarchical conditioning is also imposed across all latent dimensions using a conditional VAE structure which exploits an auto-regressive structure. Reconstruction performance is evaluated with the F0 frame error (FFE) and the mel-cepstral distortion (MCD), which illustrate that the new structure does not degrade the model. Interpretations of prosody attributes are provided together with the comparison between word-level and phone-level prosody representations. Moreover, both qualitative and quantitative evaluations are used to demonstrate the improvement in the disentanglement of the latent dimensions.
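The FFE metric used above is standard. A minimal implementation, assuming unvoiced frames are encoded as 0 and using the usual 20% pitch tolerance, might be:

```python
import numpy as np

def f0_frame_error(ref_f0, est_f0, tol=0.2):
    """F0 Frame Error: fraction of frames with either a voicing decision
    error or, on frames voiced in both tracks, a pitch deviation exceeding
    `tol` (20%) of the reference F0. Unvoiced frames are encoded as 0."""
    ref_voiced = ref_f0 > 0
    est_voiced = est_f0 > 0
    voicing_err = ref_voiced != est_voiced
    pitch_err = ref_voiced & est_voiced & (np.abs(est_f0 - ref_f0) > tol * ref_f0)
    return np.mean(voicing_err | pitch_err)
```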
Unsupervised Speech Separation Using Mixtures of Mixtures
ICML 2020 Workshop on Self-Supervision for Audio and Speech
Supervised approaches to single-channel speech separation rely on synthetic mixtures, so that the individual sources can be used as targets. Good performance depends upon how well the synthetic mixture data match real mixtures. However, matching synthetic data to the acoustic properties and distribution of sounds in a target domain can be challenging. Instead, we propose an unsupervised method that requires only single-channel acoustic mixtures, without ground-truth source signals. In this method, existing mixtures are mixed together to form a mixture of mixtures, which the model separates into latent sources. We propose a novel loss that allows the latent sources to be remixed to approximate the original mixtures. Experiments show that this method can achieve competitive performance on speech separation compared to supervised methods. In a semi-supervised learning setting, our method enables domain adaptation by incorporating unsupervised mixtures from a matched domain. In particular, we demonstrate that significant improvement to reverberant speech separation performance can be achieved by incorporating reverberant mixtures.