Unsupervised Speech Representation Learning Using Wavenet Autoencoders

Jan Chorowski, Ron J. Weiss, Samy Bengio, Aäron van den Oord
J. Chorowski is with the Institute of Computer Science, University of Wrocław, Poland. R. Weiss and S. Bengio are with Google Research. A. van den Oord is with DeepMind. E-mail: {ronw, bengio, avdnoord}@google.com.
arXiv:1901.08810v1 [cs.LG] 25 Jan 2019

Abstract: We consider the task of unsupervised extraction of meaningful latent representations of speech by applying autoencoding neural networks to speech waveforms. The goal is to learn a representation able to capture high level semantic content from the signal, e.g. phoneme identities, while being invariant to confounding low level details in the signal such as the underlying pitch contour or background noise. The behavior of autoencoder models depends on the kind of constraint that is applied to the latent representation. We compare three variants: a simple dimensionality reduction bottleneck, a Gaussian Variational Autoencoder (VAE), and a discrete Vector Quantized VAE (VQ-VAE). We analyze the quality of learned representations in terms of speaker independence, the ability to predict phonetic content, and the ability to accurately reconstruct individual spectrogram frames. Moreover, for discrete encodings extracted using the VQ-VAE, we measure the ease of mapping them to phonemes. We introduce a regularization scheme that forces the representations to focus on the phonetic content of the utterance and report performance comparable with the top entries in the ZeroSpeech 2017 unsupervised acoustic unit discovery task.

Index Terms: autoencoder, speech representation learning, unsupervised learning, acoustic unit discovery

I. INTRODUCTION

Creating good data representations is important. The deep learning revolution was triggered by the development of hierarchical representation learning algorithms, such as stacked Restricted Boltzmann Machines [1] and Denoising Autoencoders [2]. However, recent breakthroughs in computer vision [3], [4], machine translation [5], [6], speech recognition [7], [8], and language understanding [9], [10] rely on large labeled datasets and make little to no use of unsupervised representation learning. This has two drawbacks: first, the requirement of large human labeled datasets often makes the development of deep learning models expensive. Second, while a deep model may excel at solving a given task, it yields limited insights into the problem domain, with main intuitions typically consisting of visualizations of salient input patterns [11], [12], a strategy that is applicable only to problem domains that are easily solved by humans.
In this paper we focus on evaluating and improving unsupervised speech representations. Specifically, we focus on representations that separate speaker traits from phonetic content, properties which are consistent with internal representations learned by speech recognizers [13], [14]. Such representations are desired in several tasks, such as low resource automatic speech recognition (ASR), where only a small amount of labeled training data is available. In such a scenario, limited amounts of data may be sufficient to learn an acoustic model on the representation discovered without supervision, but insufficient to learn the acoustic model and a data representation in a fully supervised manner [15], [16].
We focus on representations learned with autoencoders applied to raw waveforms and spectrogram features and investigate the quality of learned representations on LibriSpeech [17]. We discover that the best representations arise when typical ASR features, such as mel-frequency cepstral coefficients (MFCCs), are used as inputs, while raw waveforms are used as decoder targets. Furthermore, we observe that the Vector Quantized Variational Autoencoder (VQ-VAE) [18] yields the best separation between the acoustic content and speaker information. We investigate the interpretability of VQ-VAE tokens by mapping them to phonemes, demonstrate the impact of model hyperparameters on interpretability, and propose a new regularization scheme which improves the degree to which the latent representation can be mapped to the phonetic content. Finally, we demonstrate strong performance on the ZeroSpeech 2017 acoustic unit discovery task [19], which measures how discriminative a representation is to minimal phonetic changes within an utterance.

II. REPRESENTATION LEARNING WITH NEURAL NETWORKS

Neural networks are hierarchical information processing models that are typically implemented using layers of computational units. Each layer can be interpreted as a feature extractor whose outputs are passed to upstream units [20]. Especially in the visual domain, features learned with neural networks have been shown to create a hierarchy of visual atoms [11] that match some properties of the visual cortex [21]. Similarly, when applied to audio waveforms, neural networks have been shown to learn auditory-like frequency decompositions on music [22] and speech [23], [24], [25], [26] in their lower layers.

A. Supervised feature learning

Neural networks can learn useful data representations in both supervised and unsupervised manners. In the supervised case, features learned on large datasets are often directly useful in similar but data-poor tasks. For instance, in the visual domain, features discovered on ImageNet [27] are routinely used as input representations in other computer vision tasks [28]. Similarly, the speech community has used bottleneck features extracted from networks trained on phoneme prediction tasks [29], [30] as feature representations for speech recognition
systems. Likewise, in natural language processing, universal text representations can be extracted from networks trained for machine translation [31] or language inference [32], [33].

B. Unsupervised feature learning

In this paper we focus on unsupervised feature learning. Since no training labels are available we investigate autoencoders, i.e., networks which are tasked with reconstructing their inputs. Autoencoders use an encoding network to extract a latent representation, which is then passed through a decoding network to recover the original data. Ideally, the latent representation preserves the salient features of the original data, while being easier to analyze and work with, e.g. by disentangling different factors of variation in the data, and discarding spurious patterns (noise). These desirable qualities are typically obtained through a judicious application of regularization techniques and constraints or bottlenecks (we use the two terms interchangeably). The representation learned by an autoencoder is thus subject to two competing forces. On the one hand, it should provide the decoder with information necessary for perfect reconstruction and thus capture in the latents as much of the input data characteristics as possible. On the other hand, the constraints force some information to be discarded, preventing the latent representation from being trivial to invert, e.g. by exactly passing through the input. Thus the bottleneck is necessary to force the network to learn a non-trivial data transformation.
Reducing the dimensionality of the latent representation can serve as a basic constraint applied to the latent vectors, with the autoencoder acting as a nonlinear variant of linear low-rank data projections, such as PCA or SVD [34]. However, such representations may be difficult to interpret because the reconstruction of an input depends on all latent features [35]. In contrast, dictionary learning techniques, such as sparse [36] and non-negative [35] decompositions, express each input pattern using a combination of a small number of selected features out of a larger pool, which facilitates their interpretability. Discrete feature learning using vector quantization can be seen as an extreme form of sparseness in which the reconstruction uses only one element from the dictionary.
The Variational Autoencoder (VAE) [37] proposes a different interpretation of feature learning which follows a probabilistic framework. The autoencoding network is derived from a latent-variable generative model. First, a latent vector z is sampled from a prior distribution p(z) (typically a multidimensional normal distribution). Then the data sample x is generated using a deep decoder neural network with parameters θ that computes p(x|z; θ). However, computing the exact posterior distribution p(z|x) that is needed during maximum likelihood training is difficult. Instead, the VAE introduces a variational approximation to the posterior, q(z|x; φ), which is modeled using an encoder neural network with parameters φ. Thus the VAE resembles a traditional autoencoder, in which the encoder produces distributions over latent representations, rather than deterministic encodings, while the decoder is trained on samples from this distribution. Encoding and decoding networks are trained jointly to maximize a lower bound on the log-likelihood of data point x [37], [38]:

$J_{VAE}(\theta, \phi; x) = \mathbb{E}_{q(z|x;\phi)}[\log p(x|z;\theta)] - \beta D_{KL}(q(z|x;\phi) \,\|\, p(z))$.   (1)

We can interpret the two terms of Eq. (1) as the autoencoder’s reconstruction cost augmented with a penalty term applied to the hidden representation. In particular, the KL divergence expresses the amount of information in nats which the latent representation carries about the data sample. Thus, it acts as an information bottleneck [39] on the latent representation, where β controls the trade-off between reconstruction quality and the representation simplicity.
An alternative formulation of the VAE objective explicitly constrains the amount of information contained in the latent representation [40]:

$J_{VAE}(\theta, \phi; x) = \mathbb{E}_{q(z|x;\phi)}[\log p(x|z;\theta)] - \max(B, D_{KL}(q(z|x;\phi) \,\|\, p(z)))$,   (2)

where the constant B corresponds to the amount of free information in q, because the model is only penalized if it transmits more than B nats over the prior in the distribution over the latents. Please note that for convenience we will often refer to information content using units of bits instead of nats.
A recently proposed modification of the VAE, called the Vector Quantized VAE [18], replaces the stochastic continuous latent variable with a deterministic discrete latent variable. Inspired by vector quantization, VQ-VAE maintains a number of prototype vectors $\{e_i, i = 1, \ldots, K\}$. During the forward pass, representations produced by the encoder are replaced with their closest prototypes. Formally, let $z_e(x)$ be the output of the encoder prior to quantization. VQ-VAE finds the nearest prototype $q(x) = \arg\min_i \|z_e(x) - e_i\|_2^2$ and uses it as the latent representation $z_q(x) = e_{q(x)}$, which is passed to the decoder.
During the backward pass, the gradient of the loss with respect to the pre-quantized embedding is approximated using the straight-through estimator [41], i.e., $\frac{\partial L}{\partial z_e(x)} \approx \frac{\partial L}{\partial z_q(x)}$.¹ The prototypes are trained by extending the learning objective with terms which optimize quantization. Prototypes are forced to lie close to the vectors which they replace, and an auxiliary cost, dubbed the commitment loss, is introduced to encourage the encoder to produce vectors which lie close to prototypes. Without the commitment loss VQ-VAE training can diverge by emitting representations with unbounded magnitude. Therefore, VQ-VAE is trained using a sum of three loss terms: the negative log-likelihood of the reconstruction, which uses the straight-through estimator to bring the gradient from the decoder to the encoder, and two VQ-related terms: the distance from each prototype to its assigned vectors and the commitment cost [18]:

$L = -\log p(x \mid z_q(x)) + \|\mathrm{sg}(z_e(x)) - e_{q(x)}\|_2^2 + \gamma \|z_e(x) - \mathrm{sg}(e_{q(x)})\|_2^2$,   (3)

where sg(·) denotes the stop-gradient operation which zeros the gradient with respect to its argument during backward pass.
¹ In TensorFlow this can be conveniently implemented using z_q(x) = z_e(x) + stop_gradient(e_{q(x)} − z_e(x)).
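For comparison with the quantized bottleneck, the penalty terms of Eqs. (1) and (2) can be sketched for the common case of a diagonal Gaussian posterior and a standard normal prior. The closed-form KL expression is the standard one for that case; the function names and toy inputs are hypothetical, and the 14-bit budget is the default capacity quoted in Section IV-A, converted to nats.

```python
import math

import numpy as np


def gaussian_kl_nats(mu, sigma):
    """KL(q || p) in nats for q = N(mu, diag(sigma^2)) and p = N(0, I)."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    return 0.5 * float(np.sum(mu ** 2 + sigma ** 2 - 1.0 - 2.0 * np.log(sigma)))


def beta_penalty(kl_nats, beta):
    """Penalty subtracted from the reconstruction term in Eq. (1)."""
    return beta * kl_nats


def free_information_penalty(kl_nats, free_nats):
    """Penalty subtracted in Eq. (2): constant while the KL stays below B."""
    return max(free_nats, kl_nats)


def bits_to_nats(bits):
    """Convert an information budget from bits to nats."""
    return bits * math.log(2.0)


kl_prior = gaussian_kl_nats([0.0, 0.0], [1.0, 1.0])    # posterior equals the prior
kl_shifted = gaussian_kl_nats([1.0, 0.0], [1.0, 1.0])  # mean shifted by one unit
budget = bits_to_nats(14)                              # B for a 14-bit bottleneck
penalty_below_budget = free_information_penalty(kl_shifted, budget)
penalty_above_budget = free_information_penalty(20.0, budget)
```

Because the penalty is flat below B, a posterior that transmits less than the budget is not pushed further toward the prior, which is the sense in which B nats are "free".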
The quantization within the VQ-VAE acts as an information bottleneck. The encoder can be interpreted as a probabilistic model which puts all probability mass on the selected discrete token (prototype id). Assuming a uniform prior distribution over K tokens, the KL divergence is constant and equal to log K. Therefore, the KL term does not need to be included in the VQ-VAE training criterion in Eq. (3) and instead becomes a hyperparameter tied to the size of the prototype inventory.
The VQ-VAE was qualitatively shown to learn a representation which separated the phonetic content within an utterance from the identity of the speaker [18]. Moreover, the discovered tokens could be mapped to phonemes in a limited setting.

C. Autoencoders for sequential data

Sequential data, such as speech or text, often contain local dependencies that can be exploited by generative models. In fact, purely autoregressive models of sequential data, which predict the next observation based on recent history, are very successful. For text, these correspond to n-gram models [42] and convolutional neural language models [43], [44]. Similarly, WaveNet [45] is a state-of-the-art autoregressive model of time-domain waveform samples for text-to-speech synthesis.
A downside of such autoregressive models is that they do not explicitly produce latent representations of the data. However, it is possible to combine an autoregressive sequence generation model with an encoder tasked with extraction of latent representations. Depending on the use case, the encoder can process the whole utterance, emit a single latent vector and feed it to an autoregressive decoder [32], [46] or the encoder can periodically emit vectors of latent features to be consumed by the decoder [18], [47]. We concentrate on the latter solution.
Training mixed latent variable and autoregressive models is prone to latent space collapse, in which the decoder learns to ignore the constrained latent representations and only uses the unconstrained signal coming through the autoregressive path. For the VAE, this collapse can be prevented by annealing the weight of the KL term and using the free-information formulation in Eq. (2). The VQ-VAE is naturally resilient to the latent collapse because the KL term is a hyperparameter which is not optimized using gradient training of a given model. We defer further discussion of this topic to Section V.

III. MODEL DESCRIPTION

The architecture of our model is presented in Figure 1. The encoder reads a sequence of either raw audio samples, or of audio features² and extracts a sequence of hidden vectors, which are passed through a bottleneck to become a sequence of latent representations. The frequency at which the latent vectors are extracted is governed by the number of strided convolutions applied by the encoder.
² To keep the autoencoder viewpoint, the feature extractor can be interpreted as a fixed signal processing layer in the encoder.

[Figure 1: block diagram. Encoder (MFCC + d + a feature extraction, Conv3 and StridedConv4 layers, ReLU layers), bottleneck (VQ-VAE, VAE, or AE), and decoder (jitter, Conv3, upsample, concat with speaker one-hot, two 10-layer WaveNet cycles, ReLU layers, softmax). Probe points: p_enc, p_proj, p_bn, p_cond.]
Fig. 1. The proposed model is conceptually divided into 3 parts: an encoder (green), made of a residual convnet that computes a stream of latent vectors (typically every 10ms or 20ms) from a time-domain waveform sampled at 16 kHz, which are passed through a bottleneck (red) before being used to condition a WaveNet decoder (blue) which reconstructs the waveform using two additional information streams: an autoregressive stream which predicts the next sample based on past samples, and global conditioning which represents the identity of the input speaker (one out of N_s total training speakers). We experiment with three bottleneck variants: a simple dimensionality reduction (AE), a sampling layer with an additional Kullback-Leibler penalty term (VAE), or a discretization layer (VQ-VAE). Intuitively, this bottleneck encourages the encoder to discard portions of the latent representation which the decoder can infer from the two other information streams. For all layers, numbers in parentheses indicate the number of output channels, and subscripts denote the filter length. Locations of “probe” points which are used in Section IV to evaluate the quality of the learned representation are denoted with black dots.

The decoder reconstructs the utterance by conditioning a WaveNet [45] network on the latent representation extracted by the encoder and, separately, on a speaker embedding. Explicitly conditioning the decoder on speaker identity frees the encoder from having to capture speaker-dependent information in the latent representation. Specifically, the decoder (i) takes the encoder’s output, (ii) optionally applies a stochastic regularization to the latent vectors (see Section III-A), (iii) then combines latent vectors extracted at neighboring time steps using convolutions and (iv) upsamples them to the output frequency.
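The conditioning path of steps (ii) and (iv) can be sketched in NumPy as follows; the convolution of step (iii) is omitted for brevity. The jitter rule follows the description in Section III-A, with two assumptions that the text does not spell out: each neighbor is chosen with probability p independently per time step (so a vector is kept with probability 1 − 2p), and upsampling is done by simple repetition. All function names and the toy sizes are hypothetical.

```python
import numpy as np


def time_jitter(latents, p, rng):
    """Randomly replace each latent vector with its left or right neighbor.

    Indices always refer to the original sequence, so no vector is copied by
    more than one time step. At the sequence edges the vector is kept.
    """
    steps = latents.shape[0]
    u = rng.random(steps)
    shift = np.zeros(steps, dtype=int)
    shift[u < p] = -1        # take the vector from the previous time step
    shift[u >= 1.0 - p] = 1  # take the vector from the next time step
    index = np.clip(np.arange(steps) + shift, 0, steps - 1)
    return latents[index]


def build_conditioning(latents, speaker_id, num_speakers, upsample_factor):
    """Upsample latents by repetition and append a one-hot speaker code."""
    upsampled = np.repeat(latents, upsample_factor, axis=0)
    one_hot = np.zeros((upsampled.shape[0], num_speakers))
    one_hot[:, speaker_id] = 1.0
    return np.concatenate([upsampled, one_hot], axis=1)


rng = np.random.default_rng(0)
latents = rng.normal(size=(16, 64))               # 16 latent vectors at 50 Hz (0.32 s)
jittered = time_jitter(latents, p=0.12, rng=rng)  # applied during training only
conditioning = build_conditioning(
    jittered, speaker_id=3, num_speakers=10, upsample_factor=320
)  # 50 Hz -> 16 kHz, one conditioning row per waveform sample
```

With these toy sizes the conditioning array has one row per waveform sample of a 320 ms segment, and each row carries the latent features followed by the speaker code.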
Waveform samples are reconstructed with a WaveNet that combines all conditioning sources: autoregressive information about past samples, global information about the speaker, and latent information about past and future samples extracted by the encoder. We find that the encoder’s bottleneck and the proposed regularization are crucial in extracting nontrivial representations of data. With no bottleneck, the model is prone to learn a simple reconstruction strategy which makes verbatim copies of future samples. We also note that the encoder is speaker independent and requires only speech data, while the decoder also requires speaker information.
We consider three forms of bottleneck: (i) simple dimensionality reduction, (ii) a Gaussian VAE with different latent representation dimensionalities and different capacities expressed in bits using Eq. (2), and (iii) a VQ-VAE with different numbers of discrete tokens. All bottlenecks are optionally followed by the dropout-inspired time-jitter regularization described below. Furthermore, we experiment with different input and output representations, using raw waveforms, log-mel filterbank, and mel-frequency cepstral coefficient (MFCC) features which discard pitch information present in the spectrogram.

A. Time-jitter regularization

We would like the model to learn a representation of speech which corresponds to the slowly-changing phonetic content within an utterance: a mostly constant signal that can abruptly change at phoneme boundaries.
Inspired by slow feature analysis [48] we first experimented with penalizing time differences between encoder representations either before or after the bottleneck. However, this regularization resulted in a collapse of the latent space: the model learned to output a constant encoding. This is a common problem of sequential VAEs that use loss terms to regularize the latent encoding [49].
Reconsidering the problem we realized that we want each frame’s representation to correspond to a meaningful phonetic unit. Thus we want to prevent the system from using consecutive latent vectors as individual units. Put differently, we want to prevent latent vector co-adaptation. We therefore introduce a dropout-inspired [50] time-jitter regularizer, also reminiscent of Zoneout [51] regularization for recurrent networks. During training, each latent vector can replace either one or both of its neighbors. As in dropout, this prevents the model from relying on consistency across groups of tokens. Additionally, this regularization also promotes latent representation stability over time: a latent vector extracted at time step t must strive to also be useful at time steps t − 1 or t + 1. In fact, the regularization was crucial for reaching good performance on ZeroSpeech at higher token extraction frequencies.
The regularization layer is inserted right after the encoder’s bottleneck (i.e., after dimensionality reduction for the regular autoencoder, after sampling a realization of the latent layer for the VAE and after discretization for the VQ-VAE). It is only enabled during training. For each time step we independently sample whether it is to be replaced with the token right after or before it. We do not copy a token more than one timestep.

IV. EXPERIMENTS

We evaluated models on two datasets: LibriSpeech [17] (clean subset) and ZeroSpeech 2017 Contest Track 1 data [19]. Both datasets have similar characteristics: multiple speakers, clean, read speech (sourced from audio books) recorded at a sampling rate of 16 kHz. Moreover the ZeroSpeech challenge controls the amount of per-speaker data with the majority of the data being uttered by only a few speakers.
Initial experiments, presented in section IV-B, compare different bottleneck variants and establish what type of information from the input audio is preserved in the latent representations produced by the model at the four different probe points pictured in Figure 1. Using the representation computed at each probe point, we measure performance on several prediction tasks: phoneme prediction (per-frame accuracy), speaker identity and gender prediction accuracy, and L2 reconstruction error of spectrogram frames. We establish that the VQ-VAE learns latent representations with the strongest disentanglement between the phonetic content and speaker identity, and focus on this architecture in the following experiments.
In section IV-C we analyze the interpretability of VQ-VAE tokens by mapping each discrete token to the most frequent corresponding phoneme in a forced alignment of a small labeled data set (LibriSpeech dev) and report the accuracy of the mapping on a separate set (LibriSpeech test). Intuitively, this captures the interpretability of individual tokens.
We then apply the VQ-VAE to the ZeroSpeech 2017 acoustic unit discovery task [19] in section IV-D. This task evaluates how discriminative the representation is with respect to the phonetic class. Finally, in section IV-E we measure the impact of different hyperparameters on performance.

A. Default model hyperparameters

Our best models used MFCCs as the encoder input, but reconstructed raw waveforms at the decoder output. We used standard 13 MFCC features extracted every 10ms (i.e., at a rate of 100 Hz) and augmented with their temporal first and second derivatives. Such features were originally designed for speech recognition and are mostly invariant to pitch and similar confounding detail in the audio signal. The encoder had 9 layers each using 768 units with ReLU activation, organized into the following groups: 2 preprocessing convolution layers with filter length 3 and residual connections, 1 strided convolution length reduction layer with filter length 4 and stride 2 (downsampling the signal by a factor of two), followed by 2 convolutional layers with length 3 and residual connections, and finally 4 feedforward ReLU layers with residual connections. The resulting latent vectors were extracted at 50 Hz (i.e., every second frame), with each latent vector depending on a receptive field of 16 input frames. We also used an alternative encoder with two length reduction layers, which extracted latent representations at 25 Hz with a receptive field of 30 frames.
When unspecified, the latent representation was 64 dimensional and when applicable constrained to 14 bits. Furthermore, for the VQ-VAE we used the recommended γ = 0.25 [18].
The decoder applied the randomized time-jitter regularization (see Section III-A). During training each latent vector was
replaced with either of its neighbors with probability 0.12. The jittered latent sequence was passed through a single convolutional layer with filter length 3 and 128 hidden units to mix information across neighboring timesteps. The representation was then upsampled 320 times (to match the 16 kHz audio sampling rate) and concatenated with a one-hot vector representing the current speaker to form the conditioning input of an autoregressive WaveNet [45]. The WaveNet was composed of 20 causal dilated convolution layers, each using 368 gated units with residual connections, organized into two “cycles” of 10 layers with dilation rates $1, 2, 4, \ldots, 2^9$. The conditioning signal was passed separately into each layer. The signal from each layer of the WaveNet was passed to the output using skip-connections. Finally, the signal was passed through 2 ReLU layers with 256 units. A Softmax was applied to compute the next sample probability. We used 256 quantization levels after mu-law companding [45].
All models were trained on minibatches of 64 sequences of length 5120 time-domain samples (320 ms) sampled uniformly from the training dataset. We used the Adam optimizer [52] with initial learning rate $4 \times 10^{-4}$ which was halved after 400k, 600k, and 800k steps. Polyak averaging [53] was applied to all checkpoints used for model evaluation.

B. Bottleneck comparison

We train models on LibriSpeech and analyze the information captured in the hidden representations surrounding the autoencoder bottleneck at each of the four probe points shown in Figure 1:
p_enc (768 dim): encoder output prior to the bottleneck,
p_proj (64 dim): within the bottleneck after projecting to lower dimension,
p_bn (64 dim): bottleneck output, corresponding to the quantized representation in VQ-VAE, or a random sample from the variational posterior in VAE, and
p_cond (128 dim): after passing p_bn through a convolution layer which captures a larger receptive field over the latent encoding.
At each probe point, we train separate MLP networks with 2048 hidden units on each of four tasks: classifying speaker gender and identity for the whole segment (after average pooling latent vectors across the full signal), predicting phoneme class at each frame (making several predictions per latent vector³), and reconstructing log-mel filterbank features in each frame (again predicting several consecutive frames from each latent vector). A representation which captures the high level semantic content from the signal, while being invariant to nuisance low-level signal details, will have a high phoneme prediction accuracy, and high spectrogram reconstruction error. A disentangled representation should additionally have low speaker prediction accuracy, since this information is explicitly made available to the decoder conditioning network, and therefore need not be preserved in the latent encoding. Since we are primarily interested in discovering what information is present in the constructed representations we report the training performance and do not tune probing networks for generalization.
³ Ground truth phoneme labels and filterbank features have a frame rate of 100 Hz, while the latent representation is computed at a lower rate.
A comparison of models using each of the three bottlenecks with different hyperparameters (latent dimensionality and degree of regularization) is presented in Figure 3. Each bottleneck type consistently discards information between the p_enc and p_bn probe locations, as evidenced by the reduced performance on each task. The bottleneck also impacts information content in preceding layers. Especially for the vanilla autoencoder (AE), which simply reduces dimensionality, the speaker prediction accuracy and filterbank reconstruction loss at p_enc depend on the width of the bottleneck, with narrower widths causing more information to be discarded in lower layers of the encoder. Likewise, VQ-VAEs and AEs yielded better filterbank reconstructions and speaker identity prediction at p_enc compared to VAEs with matching dimensionality and effective bitrate.
As expected, AE discards the least information. At p_cond the representation remains highly predictive about both speaker and phonemes, and its filterbank reconstructions are the best among all configurations. However, from an unsupervised learning standpoint, the AE latent representation is less useful because it mixes all properties of the source signal.
In contrast, VQ-VAE models produce a representation which is highly predictive of the phonetic content of the signal while it effectively discards speaker identity and gender information. At higher bitrates, phoneme prediction is about as accurate as for the AE. Filterbank reconstructions are also less accurate. We observe that the speaker information is discarded primarily during the quantization step between p_proj and p_bn. Combining several latent vectors in the p_cond representation results in more accurate phoneme predictions, but the additional context does not help to recover speaker information. This phenomenon is clearly visible in Figure 4. VQ-VAE models showed little dependence on the bottleneck dimension, so we present results at the default setting of 64.
Finally, VAE models separate speaker and phonetic information better than simple dimensionality reduction, but not as well as VQ-VAE. The VAE discards phonetic and speaker information more uniformly than VQ-VAE: at p_bn, VAE’s phoneme predictions are less accurate, while its gender predictions are more accurate. Moreover, combining information across a wider receptive field at p_cond does not improve phoneme recognition as much as in VQ-VAE models. The sensitivity to the width is also surprising, with narrower VAE bottlenecks discarding less information than the wider ones. This may be due to the stochastic operation of the VAE: to provide the same KL divergence as at low bottleneck dimensions, more noise needs to be added at high dimensions. This noise may mask information present in the representation.
Based on these results we conclude that the VQ-VAE bottleneck is most appropriate for learning latent representations which capture phonetic content while being invariant to the underlying speaker identity.

C. VQ-VAE token interpretability

Unlike the continuous vector representations learned by other bottlenecks, an advantage of the discrete representation learned by the VQ-VAE lies in the ability to directly interpret the individual tokens. In this section we evaluate how well VQ-VAE
[Figure 2: three time-aligned panels over 0.4 to 1.8 sec. Top: log-mel spectrogram (mel channel). Middle: waveform with ground truth phonemes (M IH S T ER K W IH L T ER IH Z DH IY AH P AA S) and extracted token ids. Bottom: MFCC + d + a features.]
Fig. 2. Example token sequence extracted by the proposed VQ-VAE model. Bottom: input MFCC features. Middle: Target waveform samples overlaid with extracted token ids (bottom), and ground truth phoneme identities (top) with boundaries between plotted in red. Note the transient “T” phoneme at 0.75 and 1.1 seconds is consistently associated with token 11. The corresponding log-mel spectrogram is shown on the top.
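The token-to-phoneme mapping evaluated in Section IV-C (each token is assigned its most frequent phoneme on a small labeled set, and the mapping is scored frame-wise on a separate set) can be sketched in plain Python. The function names, the toy token ids and labels, and the fallback label for tokens never seen while fitting are assumptions made for illustration; they are not taken from the paper's implementation.

```python
from collections import Counter, defaultdict


def fit_token_to_phoneme(tokens, phonemes):
    """Map each token id to the phoneme it co-occurs with most often."""
    counts = defaultdict(Counter)
    for token, phoneme in zip(tokens, phonemes):
        counts[token][phoneme] += 1
    return {token: c.most_common(1)[0][0] for token, c in counts.items()}


def mapping_accuracy(mapping, tokens, phonemes, fallback="SIL"):
    """Frame-wise accuracy; tokens unseen during fitting map to `fallback`."""
    hits = sum(mapping.get(t, fallback) == p for t, p in zip(tokens, phonemes))
    return hits / len(phonemes)


def repeat_tokens(tokens, factor):
    """Repeat each token so a low-rate stream aligns with 100 Hz labels."""
    return [t for t in tokens for _ in range(factor)]


# Toy "dev" data: 25 Hz tokens aligned with 100 Hz phoneme labels (factor 4).
dev_tokens = repeat_tokens([11, 245, 11], 4)
dev_phonemes = ["T"] * 4 + ["M"] * 3 + ["IH"] + ["T"] * 4
mapping = fit_token_to_phoneme(dev_tokens, dev_phonemes)

# Toy "test" data, scored with the mapping fitted on the dev data.
test_tokens = repeat_tokens([245, 11, 88], 4)
test_phonemes = ["M"] * 4 + ["T"] * 3 + ["ER"] + ["S"] * 4
accuracy = mapping_accuracy(mapping, test_tokens, test_phonemes)
```

Fitting and scoring on different data is what makes the resulting number a measure of generalization rather than of cluster purity.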
tokens can be mapped to phonemes, the underlying discrete components of speech sounds. In contrast to experiments in the previous section, we focus on generalization and only utilize the discrete token identities, not the underlying vector representations. Specifically, we measured the frame-wise phoneme recognition accuracy in which each token was mapped to one out of 41 phonemes. We used the 460 hour clean LibriSpeech training set for unsupervised training, and used labels from the clean dev subset to associate each token with the most probable phoneme. We evaluated the mapping by computing frame-wise phone recognition accuracy on the clean test set at a frame rate of 100 Hz. The ground-truth phoneme boundaries were obtained from forced alignments using the Kaldi tri6b model from the s5 LibriSpeech recipe [54].

The performance of the best models on LibriSpeech is shown in Table I. The phoneme prediction accuracy is given at two time points: after 200k training steps, when the relative performance of models can be assessed, and after 900k steps, when the models have converged. We did not observe overfitting with longer training times. Predicting the most frequent silence phoneme for all frames set an accuracy lower bound at 16%. A model discriminatively trained on the full 460 hour training set to predict phonemes with the same architecture as the 25 Hz encoder achieved 80% framewise phoneme recognition accuracy, while a model with no time-reduction layers set the upper bound at 88%.

TABLE I
LibriSpeech frame-wise phoneme recognition accuracy (%). VQ-VAE models consume MFCC features and extract tokens at 25 Hz.

Num tokens     256    512    1024   2048   4096   8192   16384  32768
Bits           8      9      10     11     12     13     14     15
200k steps     56.7   58.3   59.7   60.3   60.7   61.2   61.4   61.7
900k steps     58.6   61.0   61.9   63.3   63.8   63.9   64.3   64.5

The mapping accuracy improved with the number of tokens, with the best model reaching 64.5% accuracy with 32768 tokens. However, we observed the largest gains up to 4096 tokens, with diminishing returns as the number of tokens further increased. This result is in rough correspondence with the 5760 tied triphone states used in the Kaldi tri6b model.

We also note that increasing the number of tokens does not trivially lead to improved accuracies, because we measure generalization, and not cluster purity. In the limit of assigning a different token to each frame, the accuracy will be close to random because of overfitting to the small development set on which we establish the mapping. However, in our experiments we consistently observed improved accuracy.

D. Unsupervised ZeroSpeech 2017 acoustic unit discovery

The ZeroSpeech 2017 phonetic unit discovery task [19] evaluates a representation’s ability to discriminate between different sounds, rather than the ease of mapping the representation to predefined phonetic units. It is therefore complementary to the phoneme classification accuracy metric used in the previous section. The ZeroSpeech evaluation scheme uses the minimal pair ABX test [55], [56] which assesses the model’s ability to discriminate between pairs of three phoneme long segments of speech that differ only in the middle phone (e.g. “get” and “got”). We trained the models on the provided training data (45 hours for English, 24 hours for French and 2.5 hours for Mandarin) and evaluated them on the test data using the official evaluation scripts. To ensure that we do not overfit to the ZeroSpeech task we only considered the best hyperparameter
[Figure 3 plot. Columns: probe points penc, pproj, pbn, pcond. Rows: filterbank reconstruction error, phoneme accuracy, gender accuracy, speaker accuracy. Horizontal axis: VAE free bits / VQ-VAE bits per token. Legend: bottleneck type (AE, VAE with D = 4, 8, 16, 32, VQ-VAE) and latent dimensions (4, 8, 16, 32, 64).]
Fig. 3. Accuracy of predicting signal characteristics at various probe locations in the network. Among the three bottlenecks evaluated, VQ-VAE discards the
most speaker-related information at the bottleneck, while preserving the most phonetic information. For all bottlenecks, the representation coming out of
the encoder yields over 70% accurate framewise phoneme predictions. Both the simple AE and VQ-VAE preserve this information in the bottleneck (the
accuracy drops to 50%-60% depending on the bottleneck’s strength). However, the VQ-VAE discards almost all speaker information (speaker classification
accuracy is close to 0% and gender prediction close to 50%). This causes the VQ-VAE representation to perform best on the acoustic unit discovery task – the
representation captures the phonetic content while being invariant to speaker identity.
[Figure 4 plot. Horizontal axis: gender prediction accuracy (0.6 to 0.9). Vertical axis: phoneme prediction accuracy (0.4 to 0.7). Markers distinguish probe points (penc, pproj, pbn, pcond) and bottleneck types (AE, VAE (D=32), VQ-VAE).]
Fig. 4. Comparison of gender and phoneme prediction accuracy for different bottleneck types and probe points. The decoder is conditioned on the speaker, thus the gender information can be recovered and the bottleneck should discard it. While the information is present at the penc probe, the AE and VAE models tend to similarly discard both gender and phoneme information at other probe points. On the other hand, VQ-VAE selectively discards gender information.

settings found on LibriSpeech⁴ (cf. Section IV-E). Moreover, to maximally abide by the ZeroSpeech convention, we used the same hyperparameters for all languages.

Results are shown in Table II. On English and French, which come with sufficiently large training datasets, we achieve results better than the top contestant [57], despite using a speaker independent encoder.

The results are consistent with our analysis of information separation performed by the VQ-VAE bottleneck: in the more challenging across-speaker evaluation, the best performance uses the pcond representation, which combines several neighboring frames of the bottleneck representation (the VQ-VAE (per lang, pcond) row in Table II). Comparing within- and across-speaker results is similarly consistent with the observations in Section IV-B. In the within-speaker case, it is not necessary to disentangle speaker identity from phonetic content, so the quantization between the pproj and pbn probe points hurts performance (although on English this is corrected by considering the broader context at pcond). In the across-speaker case, quantization improves the scores on English and French because the gain from discarding the confounding speaker information offsets the loss of some phonetic details. Moreover, the discarded phonetic information can be recovered by mixing neighboring timesteps at pcond.

Results of VQ-VAE on Mandarin are worse. This is in part due to the small size of the training data set, which contained only 2.4 h of audio, causing overfitting (see Sec. IV-E7). Some improvements are brought by multilingual training. However, Mandarin may also require modeling changes since it is a tonal language, while our input representation (MFCCs) discards pitch and the VQ-VAE was shown to not encode prosody in the latent representation [18]. This is also visible in the large degradation

4 The comparison with other systems from the challenge is fair, because according to the ZeroSpeech experimental protocol, all participants were encouraged to tune their systems on the three languages that we use (English, French, and Mandarin), while the final evaluation used two surprise languages for which we do not have the labels required for evaluation.
TABLE II
ZeroSpeech 2017 phonetic unit discovery ABX scores reported across- and within-speakers (lower is better). The VQ-VAE encoder is speaker independent and thus its results do not change with the amount of test speaker data (1 s, 10 s, or 2 m), while speaker-adaptive models (e.g. supervised topline) show improvements when there is more target speaker data. We report the two reference points from the challenge, along with the challenge winner [57] and three other submissions that used neural networks in an unsupervised setting [58], [59], [60]. All VQ-VAE models use exactly the same hyperparameter setup (14 bit tokens extracted at 50 Hz with time-jitter probability 0.5), regardless of the amount of unlabeled training data (45 h, 24 h or 2.4 h). The top VQ-VAE results row (VQ-VAE trained on target language, features extracted at the pcond point) gives best results overall. We also include in italics results for different probe points and for VQ-VAEs jointly trained on all languages. Multilingual training helps Mandarin. We also observe that the quantization mostly discards speaker and context influence. The context is however recovered in the conditioning signal which combines information from latent vectors at neighboring timesteps.
Within-speaker Across-speaker
English (45h) French (24h) Mandarin (2.4h) English (45h) French (24h) Mandarin (2.4h)
Model 1s 10s 2m 1s 10s 2m 1s 10s 2m 1s 10s 2m 1s 10s 2m 1s 10s 2m
Unsupervised baseline 12.0 12.1 12.1 12.5 12.6 12.6 11.5 11.5 11.5 23.4 23.4 23.4 25.2 25.5 25.2 21.3 21.3 21.3
Supervised topline 6.5 5.3 5.1 8.0 6.8 6.8 9.5 4.2 4.0 8.6 6.9 6.7 10.6 9.1 8.9 12.0 5.7 5.1
VQ-VAE (per lang, pcond ) 5.6 5.5 5.5 7.3 7.5 7.5 11.2 10.7 10.8 8.1 8.0 8.0 11.0 10.8 11.1 12.2 11.7 11.9
VQ-VAE (per lang, pbn ) 6.2 6.0 6.0 7.5 7.3 7.6 10.8 10.5 10.6 8.9 8.8 8.9 11.3 11.0 11.2 11.9 11.4 11.6
VQ-VAE (per lang, pproj ) 5.9 5.8 5.9 6.7 6.9 6.9 9.9 9.7 9.7 9.1 9.0 9.0 11.9 11.6 11.7 11.0 10.6 10.7
VQ-VAE (all lang, pcond ) 5.8 5.8 5.8 8.0 7.9 7.8 9.2 9.1 9.2 8.8 8.6 8.7 11.8 11.6 11.6 10.3 10.0 9.9
VQ-VAE (all lang, pbn ) 6.3 6.2 6.3 8.0 8.0 7.9 9.0 8.9 9.1 9.4 9.2 9.3 11.8 11.7 11.8 9.9 9.7 9.7
VQ-VAE (all lang, pproj ) 5.8 5.7 5.8 7.1 7.0 6.9 7.4 7.2 7.1 9.3 9.3 9.3 11.9 11.4 11.6 8.6 8.5 8.5
Heck et al. [57] 6.9 6.2 6.0 9.7 8.7 8.4 8.8 7.9 7.8 10.1 8.7 8.5 13.6 11.7 11.3 8.8 7.4 7.3
Chen et al. [58] 8.5 7.3 7.2 11.2 9.4 9.4 10.5 8.7 8.5 12.7 11.0 10.8 17.0 14.5 14.1 11.9 10.3 10.1
Ansari et al. [59] 7.7 6.8 N/A 10.4 N/A 8.8 10.4 9.3 9.1 13.2 12.0 N/A 17.2 N/A 15.4 13.0 12.2 12.3
Yuan et al. [60] 9.0 7.1 7.0 11.9 9.5 9.5 11.1 8.5 8.2 14.0 11.9 11.7 18.6 15.5 14.9 12.7 10.8 10.7
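The ABX scores in Table II are error rates: for a triple (A, B, X) in which X belongs to the same category as A, an error is counted whenever the representation of X lies closer to B than to A. The sketch below illustrates this computation under simplifying assumptions and does not reproduce the official evaluation scripts, which define their own distance between segments. Here each segment is summarized by the mean of its frame vectors and compared using cosine distance; the function names, the tie-handling rule, and the toy vectors are hypothetical.

```python
import math


def mean_pool(frames):
    """Average a list of equal-length frame vectors into a single vector."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def abx_error_rate(triples):
    """Percentage of (A, B, X) triples in which X is not closer to A than to B.

    X is assumed to share its category with A. Ties count as half an error,
    a convention chosen for this sketch.
    """
    errors = 0.0
    for a, b, x in triples:
        d_ax = cosine_distance(mean_pool(a), mean_pool(x))
        d_bx = cosine_distance(mean_pool(b), mean_pool(x))
        if d_ax > d_bx:
            errors += 1.0
        elif d_ax == d_bx:
            errors += 0.5
    return 100.0 * errors / len(triples)


# Toy segments (hypothetical values): lists of 2-dimensional frame vectors.
get_1 = [[1.0, 0.1], [0.9, 0.2]]
get_2 = [[1.0, 0.2], [0.8, 0.1]]
got_1 = [[0.1, 1.0], [0.2, 0.9]]
got_2 = [[0.2, 1.0], [0.1, 0.8]]

triples = [
    (get_1, got_1, get_2),  # X is closer to A: correct
    (got_1, get_1, got_2),  # X is closer to A: correct
    (get_2, got_2, get_1),  # X is closer to A: correct
    (get_1, got_1, got_2),  # X deliberately taken from B's category: error
]
error = abx_error_rate(triples)
```

A representation that separates the two categories well yields a low score, which is why lower values in Table II are better.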
brought by quantization. In fact, the multilingual prequantized features are comparable to [57].

We do not consider the need for more unsupervised training data to be a problem. Unlabeled data is abundant. We believe that a more powerful model that requires and is able to make better use of large amounts of unlabeled training data is to be preferred over a simpler model whose performance saturates on small datasets. However, it remains to be verified if increasing the amount of training data would help the Mandarin VQ-VAE learn to discard less tonal information (the multilingual model might have learned to do this in order to accommodate French and English).

E. Hyperparameter impact

All VQ-VAE autoencoder hyperparameters were tuned on the LibriSpeech task, optimizing for the highest phoneme recognition accuracy. We also validated these design choices on the English part of the ZeroSpeech challenge task. Indeed, we found that the proposed time-jitter regularization improved ZeroSpeech ABX scores for all input representations. Using MFCC or filterbank features yields better scores than using waveforms, and the model consistently obtains better scores when more tokens are used.

1) Time-jitter regularization: In Table III we analyze the effectiveness of the time-jitter regularization on VQ-VAE encodings and compare it to two variants of dropout: regular dropout applied to individual dimensions of the encoding and dropout applied randomly to the full encoding at individual time steps. Time-jitter and per-time step dropout are conceptually similar: the token copy probability of 0.12 keeps a given token with probability 0.88² ≈ 0.77, which roughly corresponds to a 0.23 per-timestep dropout rate.

The proposed time-jitter regularization greatly improves token mapping accuracy and extends the range of token frame rates which perform well to include 50 Hz. While the LibriSpeech token accuracies are comparable at 25 Hz and 50 Hz, higher token emission frequencies are important for the ZeroSpeech AUD task, on which the 50 Hz model was noticeably better. This behavior is due to the fact that the 25 Hz model is prone to omitting short phones (Sec. IV-E6), which impacts the ABX results on the ZeroSpeech task.

We also analyzed information content at the four probe points

[Figure 5 plot. Columns: probe points penc, pproj, pbn, pcond. Rows: filterbank reconstruction error, phoneme accuracy, gender accuracy, speaker accuracy. Horizontal axis: bottleneck (VAE, AE, VQ-VAE). Legend: time-jitter probability (0, 0.12).]
Fig. 5. Impact of the time-jitter regularization on information captured by representations at different probe points.
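The keep probability of 0.77 quoted for a copy probability of 0.12 follows if a token can be overwritten independently from either side: it survives only when neither event occurs, giving (1 - 0.12)² ≈ 0.77. The sketch below shows one way such a time-jitter could be applied to a token sequence during training. The exact sampling scheme (independent left and right copy events, a random choice when both fire, clamping at the sequence edges) is an assumption made to match that figure, not a description of the original implementation, and the function names are hypothetical.

```python
import random


def time_jitter(tokens, p, rng):
    """Randomly overwrite each position with a copy of a neighbouring token.

    Assumed mechanism: position t is overwritten by its left neighbour with
    probability p and, independently, by its right neighbour with
    probability p. If both events fire, one neighbour is picked at random.
    """
    last = len(tokens) - 1
    jittered = []
    for t in range(len(tokens)):
        from_left = rng.random() < p
        from_right = rng.random() < p
        if from_left and from_right:
            src = t + rng.choice((-1, 1))
        elif from_left:
            src = t - 1
        elif from_right:
            src = t + 1
        else:
            src = t
        jittered.append(tokens[min(max(src, 0), last)])  # clamp at the edges
    return jittered


def keep_probability(p):
    """Probability that neither neighbour overwrites a given position."""
    return (1.0 - p) ** 2


rng = random.Random(0)
tokens = list(range(20000))  # distinct ids make surviving tokens easy to count
jittered = time_jitter(tokens, 0.12, rng)
kept_fraction = sum(a == b for a, b in zip(tokens, jittered)) / len(tokens)
```

On a long sequence the empirical `kept_fraction` should land close to `keep_probability(0.12)`, i.e. roughly 0.77. Unlike dropout, a jittered position still carries a valid token, only one taken from a neighbouring time step.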
TABLE III
Effects of input representation and regularization on phoneme recognition accuracy on LibriSpeech, measured after 200k training steps. All models extract 256 tokens.

Input features        Token rate   Regularization                  Accuracy
MFCC                  25 Hz        None                            52.5
MFCC                  25 Hz        Regular dropout p = 0.1         50.7
MFCC                  25 Hz        Regular dropout p = 0.2         49.1
MFCC                  25 Hz        Per-time step dropout p = 0.2   55.3
MFCC                  25 Hz        Per-time step dropout p = 0.3   55.7
MFCC                  25 Hz        Per-time step dropout p = 0.4   55.1
MFCC                  25 Hz        Time-jitter p = 0.08            56.2
MFCC                  25 Hz        Time-jitter p = 0.12            56.2
MFCC                  25 Hz        Time-jitter p = 0.16            56.1
MFCC                  50 Hz        None                            46.5
MFCC                  50 Hz        Time-jitter p = 0.5             56.1
log-mel spectrogram   25 Hz        None                            50.1
log-mel spectrogram   25 Hz        Time-jitter p = 0.12            53.6
raw waveform          30 Hz        None                            37.6
raw waveform          30 Hz        Time-jitter p = 0.12            48.1

for VQ-VAE, VAE, and simple dimensionality reduction AE bottleneck, shown in Figure 5. For all bottleneck mechanisms, the regularization limits the quality of filterbank reconstructions and increases the phoneme recognition accuracy in the constrained representation. However, this benefit is smaller after neighboring timesteps are combined in the pcond probe point. Moreover, for VQ-VAE and VAE the regularization decreases gender prediction accuracy and makes the representation slightly less speaker-sensitive.

2) Input representation: In this set of experiments we compared performance using different input representations: raw waveforms, log-mel spectrograms, or MFCCs. The raw waveform encoder used 9 strided convolutional layers, which resulted in a token extraction frequency of 30 Hz. We then replaced the waveform with a customary ASR data pipeline: 80 log-mel filterbank features extracted every 10 ms from 25 ms-long windows and 13 MFCC features extracted from the mel-filterbank output, both augmented with their first and second temporal derivatives. Using two strided convolution layers in the encoder led to a 25 Hz token rate for these models.

The results are reported in the bottom of Table III. High-level features, especially MFCCs, perform better than waveforms, because by design they discard information about pitch and provide a degree of speaker invariance. Using such a reduced representation forces the encoder to transmit less information to the decoder, acting as an inductive bias toward a more speaker invariant latent encoding.

3) Output representation: We constructed an autoregressive decoder network that reconstructed filterbank features rather than raw waveform samples. Inspired by recent progress in text-to-speech systems, we implemented a Tacotron 2-like decoder [61] with a built-in information bottleneck on the autoregressive information flow, which was found to be critical in TTS applications. Similarly to Tacotron 2, the filterbank features were first processed by a small “pre-net”, we applied generous amounts of dropout, and we configured the decoder to predict up to 4 frames in parallel. However, these modifications yielded at best 42% phoneme recognition accuracy, significantly lower than the other architectures described in this paper. The model was however an order of magnitude faster to train.

Finally, we analyzed the impact of the size of the decoding WaveNet on the representation extracted by the VQ-VAE. We have found that the overall receptive field (RF) has a larger impact than the depth or width of the WaveNet. In particular, a large change in the properties of the latent representation happens when the decoder’s receptive field crosses about 10 ms. For smaller RFs, the conditioning signal contains more speaker information: gender prediction is close to 80%, while framewise phoneme prediction accuracy is only 55%. For larger RFs, gender prediction accuracy is about 60%, while phoneme prediction peaks near 65%. Finally, while the reconstruction log-likelihood improved with WaveNet depth up to 30 layers, the phoneme recognition accuracy plateaued with 20 layers. Since the WaveNet has the largest computational cost we decided to keep the 20 layer configuration.

[Figure 6 plot. Horizontal axis: WaveNet receptive field [ms], logarithmic, 1 to 100. Vertical axis: prediction accuracy (0.60 to 0.75). Legend: prediction target (gender, phonemes).]
Fig. 6. Impact of decoder WaveNet receptive field on the properties of the VQ-VAE conditioning signal. The representation is significantly more gender invariant when the receptive field is larger than 10 ms. Frame-wise phoneme recognition accuracy peaks at about 125 ms. The depth and width of the WaveNet have a secondary effect (cf. points with the same RF).

4) Decoder speaker conditioning: The WaveNet decoder generates samples based on three sources of information: the previously emitted samples (via the autoregressive connection), global conditioning on speaker or other information which is stationary in time, and the time-varying representation extracted from the encoder. We found that disabling global speaker conditioning reduces phoneme classification accuracy by 3 percentage points. This further corroborates our findings about disentanglement induced by the VQ-VAE bottleneck, which biases the model to discard information that is available in a more explicit form. Throughout our experiments we used a speaker-independent encoder. However, adapting the encoder to the speaker might further improve the results. In fact, [57] demonstrates improvements on the ZeroSpeech task using a speaker-adaptive approach.

5) Encoder hyperparameters: We experimented with tuning the number of encoder convolutional layers, as well as the number of filters, and the filter length. In general, performance improved with larger encoders, however we established that the encoder’s receptive field must be carefully controlled, with the best performing encoders seeing about 0.3 seconds of input signal for each generated token.

The effective receptive field can be controlled using two mechanisms: by carefully tuning the encoder architecture, or by
designing an encoder with a wide receptive field, but limiting the duration of signal segments seen during training to the desired receptive field. In this way the model never learns to use its full capacity. When the model was trained on 2.5 s long segments, an encoder with a receptive field of 0.3 s had framewise phoneme recognition accuracy of 56.5%, while an encoder with a receptive field of 0.8 s scored only 54.3%. When trained on segments of 0.3 s, both models performed similarly.

6) Bottleneck bit rate: The speech VQ-VAE encoder can be seen as encoding a signal using a very low bit rate. To achieve a predetermined target bit rate, one can control both the token rate (i.e., by controlling the degree of downsampling in the encoder strided convolutions), and the number of tokens (or equivalently the number of bits) extracted at every step. We found that the token rate is a crucial parameter which must be chosen carefully, with the best results after 200k training steps obtained at 50 Hz (56.0% phoneme recognition accuracy) and 25 Hz (56.3%). Accuracy drops abruptly at higher token rates (49.3% at 100 Hz), while lower rates miss very short phones (53% accuracy at 12.5 Hz).

In contrast to the number of tokens, the dimensionality of the VQ-VAE embedding has a secondary effect on the representation quality. We found 64 to be a good setting, with much smaller dimensions deteriorating performance for models with a small number of tokens and higher dimensionalities negatively affecting performance for models with a large number of tokens.

For completeness, we observe that even for the model with the largest inventory of tokens, the overall encoder bitrate is low: 14 bits at 50 Hz = 700 bps, which is on par with the lowest bitrate of classical speech codecs [62].

7) Training corpus size: We experimented with training models on subsets of the LibriSpeech training set, varying the size from 4.6 hours (1%) to 460 hours (100%). Training on 4.6 hours of data, phoneme recognition accuracy peaked at 50.5% at 100k steps and then deteriorated. Training on 9 hours led to a peak accuracy of 52.5% at 180k steps. When the size of the training set was increased past 23 hours the phoneme recognition accuracy reached 54% after around 900k steps. No further improvements were found by training on the full 460 hours of data. We did not observe any overfitting, and for best results trained models until reaching 900k steps with no early stopping. An interesting future area for research would be investigating methods to increase the model capacity to make better use of larger amounts of unlabeled data.

The influence of the size of the dataset is also visible in the ZeroSpeech Challenge results (Table II): VQ-VAE models obtained good performance on English (45 hours of training data) and French (24 hours), but performed poorly on Mandarin (2.5 hours). Moreover, on English and French we obtained the best results with models trained on monolingual data. On Mandarin slightly better results were obtained using a model trained jointly on data from all languages.

V. RELATED WORK

VAEs for sequential data were introduced in [49]. The model used LSTM encoder and decoder, while the latent representation was formed from the last hidden state of the encoder. The model proved useful for natural language processing tasks. However, it also demonstrated the problem of latent representation collapse: when a powerful autoregressive decoder is used simultaneously with a penalty on the latent encoding, such as the KL prior, the VAE has a tendency to ignore the prior and act as if it were a purely autoregressive sequence model. This issue can be mitigated by changing the weight of the KL term, and limiting the amount of information on the autoregressive path by using word dropout [49]. Latent collapse can also be avoided in deterministic autoencoders, such as [63], which coupled a convolutional encoder to a powerful autoregressive WaveNet decoder [45] to learn a latent representation of music audio consisting of isolated notes from a variety of instruments.

When applied to audio, the VQ-VAE uses the WaveNet decoder to free the latent representation from modeling information that is easily recoverable from the recent past [18]. It avoids the problem of posterior collapse by using a discrete latent code with a uniform prior which results in a constant KL penalty. We employ the same strategy to design the latent representation regularizer: rather than extending the cost function with a penalty term that can cause the latent space to collapse, we rely on random copies of the latent variables to prevent their co-adaptation and promote stability over time.

The randomized time-jitter regularization introduced in this paper is inspired by slow representations of data [48] and by dropout, which randomly removes neurons during training to prevent their co-adaptation [50]. It is also very similar to Zoneout [51] which relies on random time copies of selected neurons to regularize recurrent neural networks.

Several authors have recently proposed to model sequences with VAEs that use a hierarchy of variables. [64] explore a hierarchical latent space which separates sequence-dependent variables from those which are sequence-independent. Their model was shown to perform speaker conversion and to improve automatic speech recognition (ASR) performance in the presence of domain mismatch. [65] introduce a stochastic latent variable model for sequential data which also yields disentangled representations and allows content swapping between generated sequences. These other approaches could possibly benefit from regularizing the latent representation to achieve further information disentanglement.

Acoustic unit discovery systems aim at transducing the acoustic signal into a sequence of interpretable units akin to phones. They often involve clustering of acoustic frames, MFCC or neural network bottleneck features, regularized using a probabilistic prior. DP-GMM [66] imposes a Dirichlet Process prior over a Gaussian Mixture Model. Extending it with an HMM temporal structure for sub-phonetic units leads to the DP-HMM and the HDP-HMM [67], [68], [69]. HMM-VAE proposes the use of a deep neural network instead of a GMM [70], [71]. These approaches enforce top-down constraints via HMM temporal smoothing and temporal modeling. Linguistic unit discovery models detect recurring speech patterns at a word-like level, finding commonly repeated segments with a constrained dynamic time warping [72].

In the segmental unsupervised speech recognition framework, neural autoencoders were used to embed variable length speech
segments into a common vector space where they could be clustered into word types [73]. [74] replace the segmental autoencoder with a model that instead predicts a nearby speech segment and demonstrate that the representation shares many properties with word embeddings. Coupled with an unsupervised word segmentation algorithm and unsupervised mapping of word embeddings discovered on separate corpora [75] the approach yielded an ASR system trained on unpaired speech and text data [76].

Several entries to the ZeroSpeech 2017 challenge relied on neural networks for phonetic unit discovery. [60] trains an autoencoder on pairs of speech segments found using an unsupervised term discovery system [77]. [58] first clustered speech frames, then trained a neural network to predict the cluster ids and used its hidden representation as features. [59] extended this scheme with features discovered by an autoencoder trained on MFCCs.

VI. CONCLUSIONS

We applied sequence autoencoders to speech modeling and compared autoencoders using different information bottlenecks, including VAEs and VQ-VAEs. We carefully evaluated the induced latent representation using interpretability criteria as well as the ability to discriminate between similar speech sounds. The comparison of bottlenecks revealed that discrete representations obtained using the VQ-VAE preserved the most phonetic information while also being the most speaker-invariant. The extracted representation allowed for accurate mapping of the extracted symbols into phonemes and obtained competitive performance on the ZeroSpeech 2017 acoustic unit discovery task.

We established that an information bottleneck is required for the model to learn a representation that separates content from speaker characteristics. Furthermore, we observe that the latent collapse problem induced by using bottlenecks which are too strong can be avoided by making the bottleneck strength a hyperparameter of the model, either removing it completely (as in the VQ-VAE), or by using the free-information formulation of the VAE objective.

To further improve representation quality, we introduced a time-jitter regularization scheme which limits the capacity of the latent code yet does not result in a collapse of the latent space. We hope that this can similarly improve performance of latent variable models used with auto-regressive decoders in other problem domains.

Both the VAE and VQ-VAE constrain the information bandwidth of the latent representation. However, the VQ-VAE uses a quantization mechanism, which deterministically forces the encoding to be equal to a prototype, while the VAE limits the amount of information by injecting noise. In our study, the VQ-VAE resulted in better information separation than the VAE. However, further experiments are needed to fully understand this effect. In particular, is this a consequence of the quantization, or of the deterministic operation?

We also observe that while the VQ-VAE produces a discrete representation, for best results it uses a token set so large that it is impractical to assign a separate meaning to each one. In particular, in our ZeroSpeech experiments we used the dense embedding representation of each token, which provided a more nuanced token similarity measure than simply using the token identity. Perhaps a more structured latent representation is needed, in which a small set of units can be modulated in a continuous fashion.

Extensive hyperparameter evaluation indicated that optimizing the receptive field sizes of the encoder and decoder networks is important for good model performance. A multi-scale modeling approach could furthermore separate the prosodic information. Our autoencoding approach could also be combined with penalties that are more specialized to speech processing. Introducing an HMM prior as in [71] could promote a latent representation which better mimics the temporal phonetic structure of speech.

ACKNOWLEDGMENTS

The authors thank Tara Sainath, Úlfar Erlingsson, Aren Jansen, Sander Dieleman, Jesse Engel, Łukasz Kaiser, Tom Walters, Cristina Garbacea, and the Google Brain team for their helpful discussions and feedback.

REFERENCES

[1] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, 2006.
[2] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proc. International Conference on Machine Learning (ICML), 2008.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[4] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[5] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in Proc. International Conference on Learning Representations, 2015.
[6] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al., “Google’s neural machine translation system: Bridging the gap between human and machine translation,” arXiv preprint arXiv:1609.08144, 2016.
[7] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 6645–6649.
[8] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, K. Gonina, N. Jaitly, B. Li, J. Chorowski, and M. Bacchiani, “State-of-the-art speech recognition with sequence-to-sequence models,” in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
[9] W. Wang, N. Yang, F. Wei, B. Chang, and M. Zhou, “Gated self-matching networks for reading comprehension and question answering,” in Proc. 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, 2017, pp. 189–198.
[10] A. W. Yu, D. Dohan, M.-T. Luong, R. Zhao, K. Chen, M. Norouzi, and Q. V. Le, “QANet: Combining local convolution with global self-attention for reading comprehension,” in Proc. International Conference on Learning Representations, 2018.
[11] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in European Conference on Computer Vision, 2014.
[12] M. Sundararajan, A. Taly, and Q. Yan, “Axiomatic attribution for deep networks,” in Proc. International Conference on Machine Learning (ICML), 2017.
[13] T. Nagamine and N. Mesgarani, “Understanding the representation and computation of multilayer perceptrons: A case study in speech recognition,” in Proc. International Conference on Machine Learning (ICML), 2017.
[14] J. Chorowski, R. J. Weiss, R. A. Saurous, and S. Bengio, “On using [39] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy, “Deep variational
backpropagation for speech texture generation and voice conversion,” information bottleneck,” in Proc. International Conference on Learning
in Proc. International Conference on Acoustics, Speech and Signal Representations, 2017.
Processing (ICASSP), Apr. 2018. [40] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and
[15] P. Swietojanski, A. Ghoshal, and S. Renals, “Unsupervised cross-lingual M. Welling, “Improved variational inference with inverse autoregressive
knowledge transfer in DNN-based LVCSR,” in Proc. Spoken Language flow,” in Advances in Neural Information Processing Systems, 2016.
Technology Workshop (SLT), 2012, pp. 246–251. [41] Y. Bengio, N. Léonard, and A. Courville, “Estimating or propagating
[16] S. Thomas, M. L. Seltzer, K. Church, and H. Hermansky, “Deep neural gradients through stochastic neurons for conditional computation,” arXiv
network features and semi-supervised training for low resource speech preprint arXiv:1308.3432, 2013.
recognition,” in Proc. International Conference on Acoustics, Speech [42] D. Jurafsky and J. H. Martin, Speech and Language Processing (2nd
and Signal Processing (ICASSP), 2013, pp. 6704–6708. Edition). Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 2009.
[17] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “LibriSpeech: an [43] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language modeling
ASR corpus based on public domain audio books,” in Proc. International with gated convolutional networks,” in Proc. International Conference
Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, on Machine Learning (ICML), 2017.
pp. 5206–5210. [44] S. Bai, J. Z. Kolter, and V. Koltun, “An empirical evaluation of generic
[18] A. van den Oord, O. Vinyals, and K. Kavukcuoglu, “Neural discrete convolutional and recurrent networks for sequence modeling,” arXiv
representation learning,” in Advances in Neural Information Processing preprint arXiv:1803.01271, 2018.
Systems, 2017, pp. 6309–6318. [45] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals,
[19] E. Dunbar, X. N. Cao, J. Benjumea, J. Karadayi, M. Bernard, L. Besacier, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet:
X. Anguera, and E. Dupoux, “The zero resource speech challenge 2017,” A generative model for raw audio,” arXiv preprint arXiv:1609.03499,
in Proc. Automatic Speech Recognition and Understanding Workshop 2016.
(ASRU), 2017. [46] W.-N. Hsu, Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Y. Wang, Y. Cao,
[20] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning rep- Y. Jia, Z. Chen, J. Shen, P. Nguyen, and R. Pang, “Hierarchical generative
resentations by back-propagating errors,” Nature, vol. 323, no. 6088, modeling for controllable speech synthesis,” in Proc. International
1986. Conference on Learning Representations, 2019.
[21] H. Lee, C. Ekanadham, and A. Ng, “Sparse deep belief net model for [47] I. Gulrajani, K. Kumar, F. Ahmed, A. A. Taı̈ga, F. Visin, D. Vázquez,
visual area V2,” in Advances in Neural Information Processing Systems, and A. Courville, “PixelVAE: A latent variable model for natural images,”
2008. in Proc. International Conference on Learning Representations, 2017.
[22] S. Dieleman and B. Schrauwen, “End-to-end learning for music audio,” [48] L. Wiskott and T. J. Sejnowski, “Slow feature analysis: Unsupervised
in Proc. International Conference on Acoustics, Speech and Signal learning of invariances,” Neural Computation, vol. 14, no. 4, 2002.
Processing (ICASSP), 2014, pp. 6964–6968. [49] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and
[23] N. Jaitly and G. Hinton, “Learning a better representation of speech S. Bengio, “Generating sentences from a continuous space,” in SIGNLL
soundwaves using restricted Boltzmann machines,” in Proc. International Conference on Computational Natural Language Learning, 2016.
Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011.
[50] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdi-
[24] Z. Tüske, P. Golik, R. Schlüter, and H. Ney, “Acoustic modeling with deep
nov, “Dropout: A simple way to prevent neural networks from overfitting,”
neural networks using raw time signal for LVCSR,” in Proc. Interspeech,
Journal of Machine Learning Research, vol. 15, no. 1, 2014.
2014.
[51] D. Krueger, T. Maharaj, J. Kramár, M. Pezeshki, N. Ballas, N. R. Ke,
[25] D. Palaz, M. Magima Doss, and R. Collobert, “Analysis of CNN-
A. Goyal, Y. Bengio, A. Courville, and C. Pal, “Zoneout: Regularizing
based speech recognition system using raw speech as input,” in Proc.
RNNs by randomly preserving hidden activations,” in Proc. International
Interspeech, Sep. 2015.
Conference on Learning Representations, 2017.
[26] T. N. Sainath, R. J. Weiss, A. Senior, K. W. Wilson, and O. Vinyals,
“Learning the speech front-end with raw waveform CLDNNs,” in Proc. [52] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”
Interspeech, Sep. 2015. in Proc. International Conference on Learning Representations, 2015.
[27] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: [53] B. T. Polyak and A. B. Juditsky, “Acceleration of stochastic approximation
A Large-Scale Hierarchical Image Database,” in Proc. IEEE Conference by averaging,” SIAM Journal on Control and Optimization, vol. 30, no. 4,
on Computer Vision and Pattern Recognition, 2009. pp. 838–855, 1992.
[28] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are [54] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel,
features in deep neural networks?” in Advances in Neural Information M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer,
Processing Systems, 2014, pp. 3320–3328. and K. Vesely, “The Kaldi speech recognition toolkit,” in Proc. Automatic
[29] K. Veselỳ, M. Karafiát, F. Grézl, M. Janda, and E. Egorova, “The Speech Recognition and Understanding Workshop (ASRU), 2011.
language-independent bottleneck features,” in Proc. Spoken Language [55] T. Schatz, V. Peddinti, F. Bach, A. Jansen, H. Hermansky, and E. Dupoux,
Technology Workshop (SLT), 2012, pp. 336–341. “Evaluating speech features with the minimal-pair ABX task: Analysis
[30] D. Yu and M. L. Seltzer, “Improved bottleneck features using pretrained of the classical MFC/PLP pipeline,” in Proc. Interspeech, 2013, pp. 1–5.
deep neural networks,” in Proc. Interspeech, 2011. [56] T. Schatz, V. Peddinti, X.-N. Cao, F. Bach, H. Hermansky, and
[31] B. McCann, J. Bradbury, C. Xiong, and R. Socher, “Learned in translation: E. Dupoux, “Evaluating speech features with the minimal-pair ABX
Contextualized word vectors,” in Advances in Neural Information task (ii): Resistance to noise,” in Proc. Interspeech, 2014.
Processing Systems, 2017, pp. 6294–6305. [57] M. Heck, S. Sakti, and S. Nakamura, “Unsupervised linear discriminant
[32] S. R. Bowman, G. Angeli, C. Potts, and C. Manning, “A large annotated analysis for supporting DPGMM clustering in the zero resource scenario,”
corpus for learning natural language inference,” in Proc. Conference on Procedia Computer Science, vol. 81, pp. 73–79, 2016.
Empirical Methods in Natural Language Processing, 2015. [58] H. Chen, C.-C. Leung, L. Xie, B. Ma, and H. Li, “Multilingual bottle-
[33] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes, neck feature learning from untranscribed speech,” in Proc. Automatic
“Supervised learning of universal sentence representations from natural Speech Recognition and Understanding Workshop (ASRU), 2017.
language inference data,” in Proc. Conference on Empirical Methods in [59] T. Ansari, R. Kumar, S. Singh, and S. Ganapathy, “Deep learning methods
Natural Language Processing (EMNLP), September 2017, pp. 670–680. for unsupervised acoustic modeling—leap submission to zerospeech chal-
[34] C. M. Bishop, “Continuous latent variables,” in Pattern Recognition and lenge 2017,” in Proc. Automatic Speech Recognition and Understanding
Machine Learning. Springer, 2006, ch. 12. Workshop (ASRU), 2017, pp. 754–761.
[35] D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative [60] Y. Yuan, C. C. Leung, L. Xie, H. Chen, B. Ma, and H. Li, “Extracting
matrix factorization,” Nature, vol. 401, no. 6755, p. 788, 1999. bottleneck features and word-like pairs from untranscribed speech for
[36] B. A. Olshausen and D. J. Field, “Emergence of simple-cell receptive feature representation,” in Proc. Automatic Speech Recognition and
field properties by learning a sparse code for natural images,” Nature, Understanding Workshop (ASRU), Dec 2017, pp. 734–739.
vol. 381, no. 6583, p. 607, 1996. [61] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen,
[37] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in Y. Zhang, Y. Wang, R. J. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgian-
Proc. International Conference on Learning Representations, 2014. nakis, and Y. Wu, “Natural TTS synthesis by conditioning WaveNet
[38] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, on mel spectrogram predictions,” in Proc. International Conference on
S. Mohamed, and A. Lerchner, “Beta-VAE: Learning basic visual Acoustics, Speech and Signal Processing (ICASSP), 2018.
concepts with a constrained variational framework,” in Proc. International [62] X. Wang and C.-C. J. Kuo, “An 800 bps VQ-based LPC voice coder,”
Conference on Learning Representations, 2017. Journal of the Acoustical Society of America, vol. 103, no. 5, 1998.
13
[63] J. Engel, C. Resnick, A. Roberts, S. Dieleman, M. Norouzi, D. Eck, and

K. Simonyan, “Neural audio synthesis of musical notes with wavenet
autoencoders,” in Proc. International Conference on Machine Learning
(ICML), 2017, pp. 1068–1077.
[64] W.-N. Hsu, Y. Zhang, and J. Glass, “Unsupervised learning of disentan-
gled and interpretable representations from sequential data,” in Advances
in Neural Information Processing Systems, 2017, pp. 1876–1887.
[65] Y. Li and S. Mandt, “Disentangled sequential autoencoder,” in Proc.
International Conference on Machine Learning (ICML), 2018.
[66] H. Chen, C.-C. Leung, L. Xie, B. Ma, and H. Li, “Parallel Inference of
Dirichlet Process Gaussian Mixture Models for Unsupervised Acoustic
Modeling: A Feasibility Study,” in Proc. Interspeech, 2015.
[67] C.-y. Lee and J. Glass, “A Nonparametric Bayesian Approach to Acoustic
Model Discovery,” in Proc. 50th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), Jul. 2012, pp. 40–49.
[68] L. Ondel, L. Burget, and J. Černocký, “Variational Inference for Acoustic
Unit Discovery,” Procedia Computer Science, vol. 81, Jan. 2016.
[69] R. Marxer and H. Purwins, “Unsupervised Incremental Online Learning
and Prediction of Musical Audio Signals,” IEEE/ACM Transactions on
Audio, Speech, and Language Processing, vol. 24, no. 5, May 2016.
[70] J. Ebbers, J. Heymann, L. Drude, T. Glarner, R. Haeb-Umbach, and
B. Raj, “Hidden Markov Model Variational Autoencoder for Acoustic
Unit Discovery,” in Proc. Interspeech, Aug. 2017, pp. 488–492.
[71] T. Glarner, P. Hanebrink, J. Ebbers, and R. Haeb-Umbach, “Full Bayesian
Hidden Markov Model Variational Autoencoder for Acoustic Unit
Discovery,” in Proc. Interspeech, Sep. 2018, pp. 2688–2692.
[72] A. S. Park and J. R. Glass, “Unsupervised Pattern Discovery in Speech,”
IEEE Transactions on Audio, Speech, and Language Processing, vol. 16,
no. 1, pp. 186–197, Jan. 2008.
[73] H. Kamper, A. Jansen, and S. Goldwater, “A segmental framework
for fully-unsupervised large-vocabulary speech recognition,” Computer
Speech & Language, vol. 46, pp. 154–174, 2017.
[74] Y.-A. Chung and J. Glass, “Learning word embeddings from speech,”
arXiv preprint arXiv:1711.01515, 2017.
[75] G. Lample, A. Conneau, L. Denoyer, and M. Ranzato, “Unsupervised
Machine Translation Using Monolingual Corpora Only,” in Proc. Inter-
national Conference on Learning Representations, 2018.
[76] Y.-A. Chung, W.-H. Weng, S. Tong, and J. Glass, “Unsupervised cross-
modal alignment of speech and text embedding spaces,” Advances in
Neural Information Processing Systems, 2018.
[77] A. Jansen and B. Van Durme, “Efficient spoken term discovery using
randomized algorithms,” in Proc. Automatic Speech Recognition and
Understanding Workshop (ASRU), 2011, pp. 401–406.
