Unsupervised Speech Representation Learning Using Wavenet Autoencoders
Abstract—We consider the task of unsupervised extraction of meaningful latent representations
of speech by applying autoencoding neural networks to speech waveforms. The goal is to
learn a representation able to capture high level semantic content

speech recognition (ASR), where only a small amount of labeled training data is available.
In such a scenario, limited amounts of data may be sufficient to learn an acoustic model
on the representation discovered without supervision, but
arXiv:1901.08810v1 [cs.LG] 25 Jan 2019
systems. Likewise, in natural language processing, universal text representations can be
extracted from networks trained for machine translation [31] or language inference [32], [33].

B. Unsupervised feature learning

In this paper we focus on unsupervised feature learning. Since no training labels are
available we investigate autoencoders, i.e., networks which are tasked with reconstructing
their inputs. Autoencoders use an encoding network to extract a latent representation, which
is then passed through a decoding network to recover the original data. Ideally, the latent
representation preserves the salient features of the original data, while being easier to
analyze and work with, e.g. by disentangling different factors of variation in the data, and
discarding spurious patterns (noise). These desirable qualities are typically obtained through
a judicious application of regularization techniques and constraints or bottlenecks (we use
the two terms interchangeably). The representation learned by an autoencoder is thus subject
to two competing forces. On the one hand, it should provide the decoder with information
necessary for perfect reconstruction and thus capture in the latents as much of the input
data characteristics as possible. On the other hand, the constraints force some information to
be discarded, preventing the latent representation from being trivial to invert, e.g. by
exactly passing through the input. Thus the bottleneck is necessary to force the network to
learn a non-trivial data transformation.

Reducing the dimensionality of the latent representation can serve as a basic constraint
applied to the latent vectors, with the autoencoder acting as a nonlinear variant of linear
low-rank data projections, such as PCA or SVD [34]. However, such representations may be
difficult to interpret because the reconstruction of an input depends on all latent features
[35]. In contrast, dictionary learning techniques, such as sparse [36] and non-negative [35]
decompositions, express each input pattern using a combination of a small number of selected
features out of a larger pool, which facilitates their interpretability. Discrete feature
learning using vector quantization can be seen as an extreme form of sparseness in which the
reconstruction uses only one element from the dictionary.

The Variational Autoencoder (VAE) [37] proposes a different interpretation of feature
learning which follows a probabilistic framework. The autoencoding network is derived from a
latent-variable generative model. First, a latent vector z is sampled from a prior
distribution p(z) (typically a multidimensional normal distribution). Then the data sample x
is generated using a deep decoder neural network with parameters θ that computes p(x|z; θ).
However, computing the exact posterior distribution p(z|x) that is needed during maximum
likelihood training is difficult. Instead, the VAE introduces a variational approximation to
the posterior, q(z|x; φ), which is modeled using an encoder neural network with parameters φ.
Thus the VAE resembles a traditional autoencoder, in which the encoder produces distributions
over latent representations, rather than deterministic encodings, while the decoder is trained
on samples from this distribution. Encoding and decoding networks are trained jointly to
maximize a lower bound on the log-likelihood of data point x [37], [38]:

$$J_{VAE}(\theta, \phi; x) = \mathbb{E}_{q(z|x;\phi)}\left[\log p(x|z;\theta)\right] - \beta\, D_{KL}\left(q(z|x;\phi) \,\|\, p(z)\right). \quad (1)$$

We can interpret the two terms of Eq. (1) as the autoencoder’s reconstruction cost augmented
with a penalty term applied to the hidden representation. In particular, the KL divergence
expresses the amount of information in nats which the latent representation carries about the
data sample. Thus, it acts as an information bottleneck [39] on the latent representation,
where β controls the trade-off between reconstruction quality and the representation
simplicity.

An alternative formulation of the VAE objective explicitly constrains the amount of
information contained in the latent representation [40]:

$$J_{VAE}(\theta, \phi; x) = \mathbb{E}_{q(z|x;\phi)}\left[\log p(x|z;\theta)\right] - \max\left(B,\, D_{KL}\left(q(z|x;\phi) \,\|\, p(z)\right)\right), \quad (2)$$

where the constant B corresponds to the amount of free information in q, because the model is
only penalized if it transmits more than B nats over the prior in the distribution over the
latents. Please note that for convenience we will often refer to information content using
units of bits instead of nats.

A recently proposed modification of the VAE, called the Vector Quantized VAE [18], replaces
the stochastic continuous latent variable with a deterministic discrete latent variable.
Inspired by vector quantization, VQ-VAE maintains a number of prototype vectors
$\{e_i,\ i = 1, \ldots, K\}$. During the forward pass, representations produced by the encoder
are replaced with their closest prototypes. Formally, let $z_e(x)$ be the output of the
encoder prior to quantization. VQ-VAE finds the nearest prototype
$q(x) = \arg\min_i \|z_e(x) - e_i\|_2^2$ and uses it as the latent representation
$z_q(x) = e_{q(x)}$, which is passed to the decoder. During the backward pass, the gradient of
the loss with respect to the pre-quantized embedding is approximated using the
straight-through estimator [41], i.e.,
$\partial L / \partial z_e(x) \approx \partial L / \partial z_q(x)$.¹ The prototypes are
trained by extending the learning objective with terms which optimize quantization. Prototypes
are forced to lie close to vectors which they replace, with an auxiliary cost, dubbed the
commitment loss, introduced to encourage the encoder to produce vectors which lie close to
prototypes. Without the commitment loss VQ-VAE training can diverge by emitting
representations with unbounded magnitude. Therefore, VQ-VAE is trained using a sum of three
loss terms: the negative log-likelihood of the reconstruction, which uses the straight-through
estimator to bring the gradient from the decoder to the encoder, and two VQ-related terms: the
distance from each prototype to its assigned vectors and the commitment cost [18]:

$$L = -\log p\left(x \mid z_q(x)\right) + \left\|\mathrm{sg}\left(z_e(x)\right) - e_{q(x)}\right\|_2^2 + \gamma \left\|z_e(x) - \mathrm{sg}\left(e_{q(x)}\right)\right\|_2^2, \quad (3)$$

where sg(·) denotes the stop-gradient operation which zeros the gradient with respect to its
argument during the backward pass.

¹ In TensorFlow this can be conveniently implemented using
$z_q(x) = z_e(x) + \mathrm{stop\_gradient}(e_{q(x)} - z_e(x))$.
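The two VAE objectives in Eq. (1) and Eq. (2) differ only in how the KL term is penalized. The following minimal sketch makes that difference concrete for a single data point. It assumes a diagonal Gaussian posterior and a standard normal prior, which the text only describes as typical; the function names are illustrative and not taken from the paper.

```python
import math


def gaussian_kl_nats(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats (assumed posterior/prior)."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


def vae_objective(log_likelihood, kl, beta=1.0):
    """Eq. (1): reconstruction log-likelihood minus a beta-weighted KL penalty."""
    return log_likelihood - beta * kl


def free_bits_objective(log_likelihood, kl, free_nats):
    """Eq. (2): the penalty is max(B, KL), so KL values below B cost the same as B."""
    return log_likelihood - max(free_nats, kl)


def nats_to_bits(nats):
    """Convert an information quantity from nats to bits."""
    return nats / math.log(2.0)


# Example values (hypothetical): a 2-dimensional posterior and a fixed reconstruction term.
kl = gaussian_kl_nats([0.5, -0.25], [0.0, -1.0])
objective_eq1 = vae_objective(-10.0, kl, beta=1.0)
objective_eq2 = free_bits_objective(-10.0, kl, free_nats=1.0)
```

With these example values the KL is below B = 1 nat, so Eq. (2) charges exactly B and the encoder gains nothing by shrinking the KL further.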
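The VQ-VAE forward pass and the two VQ-related loss terms of Eq. (3) can be sketched in plain Python for a single encoder vector. This is an illustration only: it has no automatic differentiation, so the stop-gradient is visible only in the comments, and the value gamma = 0.25 and the toy prototypes are hypothetical choices, not values stated in the text.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(z_e, prototypes):
    """Return (q, e_q): index and value of the prototype nearest to z_e."""
    q = min(range(len(prototypes)), key=lambda i: squared_distance(z_e, prototypes[i]))
    return q, prototypes[q]


def vq_terms(z_e, prototypes, gamma=0.25):
    """Codebook and commitment terms of Eq. (3) for one vector.

    Both terms share the same numeric distance; under sg(.) the first one
    would only update the prototype and the second only the encoder.
    """
    _, e_q = quantize(z_e, prototypes)
    distance = squared_distance(z_e, e_q)
    return distance, gamma * distance


prototypes = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]  # toy codebook, K = 3
z_e = [0.9, 0.8]                                   # toy encoder output
index, e_q = quantize(z_e, prototypes)
codebook_term, commitment_term = vq_terms(z_e, prototypes)

# Forward value of the straight-through form z_e + sg(e_q - z_e): numerically e_q.
z_q = [z + (e - z) for z, e in zip(z_e, e_q)]
```

In a framework with autodiff, the last line is where the stop-gradient would be applied, so that the decoder gradient flows to z_e unchanged.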
replaced with either of its neighbors with probability 0.12. The jittered latent sequence was
passed through a single convolutional layer with filter length 3 and 128 hidden units to mix
information across neighboring timesteps. The representation was then upsampled 320 times (to
match the 16 kHz audio sampling rate) and concatenated with a one-hot vector representing the
current speaker to form the conditioning input of an autoregressive WaveNet [45]. The WaveNet
was composed of 20 causal dilated convolution layers, each using 368 gated units with residual
connections, organized into two “cycles” of 10 layers with dilation rates 1, 2, 4, ..., 2^9.
The conditioning signal was passed separately into each layer. The signal from each layer of
the WaveNet was passed to the output using skip-connections. Finally, the signal was passed
through 2 ReLU layers with 256 units. A Softmax was applied to compute the next sample
probability. We used 256 quantization levels after mu-law companding [45].

All models were trained on minibatches of 64 sequences of length 5120 time-domain samples
(320 ms) sampled uniformly from the training dataset. We used the Adam optimizer [52] with
initial learning rate 4 × 10^−4 which was halved after 400k, 600k, and 800k steps. Polyak
averaging [53] was applied to all checkpoints used for model evaluation.

B. Bottleneck comparison

We train models on LibriSpeech and analyze the information captured in the hidden
representations surrounding the autoencoder bottleneck at each of the four probe points shown
in Figure 1:

p_enc (768 dim) encoder output prior to the bottleneck,
p_proj (64 dim) within the bottleneck after projecting to lower dimension,
p_bn (64 dim) bottleneck output, corresponding to the quantized representation in VQ-VAE, or a
random sample from the variational posterior in VAE, and
p_cond (128 dim) after passing p_bn through a convolution layer which captures a larger
receptive field over the latent encoding.

At each probe point, we train separate MLP networks with 2048 hidden units on each of four
tasks: classifying speaker gender and identity for the whole segment (after average pooling
latent vectors across the full signal), predicting phoneme class at each frame (making several
predictions per latent vector³), and reconstructing log-mel filterbank features in each frame
(again predicting several consecutive frames from each latent vector). A representation which
captures the high level semantic content from the signal, while being invariant to nuisance
low-level signal details, will have a high phoneme prediction accuracy, and high spectrogram
reconstruction error. A disentangled representation should additionally have low speaker
prediction accuracy, since this information is explicitly made available to the decoder
conditioning network, and therefore need not be preserved in the latent encoding. Since we are
primarily interested in discovering what information is present in the constructed
representations we report the training performance and do not tune probing networks for
generalization.

A comparison of models using each of the three bottlenecks with different hyperparameters
(latent dimensionality and degree of regularization) is presented in Figure 3. Each bottleneck
type consistently discards information between the p_enc and p_bn probe locations, as
evidenced by the reduced performance on each task. The bottleneck also impacts information
content in preceding layers. Especially for the vanilla autoencoder (AE), which simply reduces
dimensionality, the speaker prediction accuracy and filterbank reconstruction loss at p_enc
depend on the width of the bottleneck, with narrower widths causing more information to be
discarded in lower layers of the encoder. Likewise, VQ-VAEs and AEs yielded better filterbank
reconstructions and speaker identity prediction at p_enc compared to VAEs with matching
dimensionality and effective bitrate.

As expected, AE discards the least information. At p_cond the representation remains highly
predictive about both speaker and phonemes, and its filterbank reconstructions are the best
among all configurations. However, from an unsupervised learning standpoint, the AE latent
representation is less useful because it mixes all properties of the source signal.

In contrast, VQ-VAE models produce a representation which is highly predictive of the phonetic
content of the signal while it effectively discards speaker identity and gender information.
At higher bitrates, phoneme prediction is about as accurate as for the AE. Filterbank
reconstructions are also less accurate. We observe that the speaker information is discarded
primarily during the quantization step between p_proj and p_bn. Combining several latent
vectors in the p_cond representation results in more accurate phoneme predictions, but the
additional context does not help to recover speaker information. This phenomenon is clearly
visible in Figure 4. VQ-VAE models showed little dependence on the bottleneck dimension, so we
present results at the default setting of 64.

Finally, VAE models separate speaker and phonetic information better than simple
dimensionality reduction, but not as well as VQ-VAE. The VAE discards phonetic and speaker
information more uniformly than VQ-VAE: at p_bn, VAE’s phoneme predictions are less accurate,
while its gender predictions are more accurate. Moreover, combining information across a wider
receptive field at p_cond does not improve phoneme recognition as much as in VQ-VAE models.
The sensitivity to the width is also surprising, with narrower VAE bottlenecks discarding less
information than the wider ones. This may be due to the stochastic operation of the VAE: to
provide the same KL divergence as at low bottleneck dimensions, more noise needs to be added
at high dimensions. This noise may mask information present in the representation.

Based on these results we conclude that the VQ-VAE bottleneck is most appropriate for learning
latent representations which capture phonetic content while being invariant to the underlying
speaker identity.

C. VQ-VAE token interpretability

Unlike the continuous vector representations learned by other bottlenecks, an advantage of the
discrete representation learned by the VQ-VAE lies in the ability to directly interpret the
individual tokens. In this section we evaluate how well VQ-VAE

³ Ground truth phoneme labels and filterbank features have a frame rate of 100 Hz, while the
latent representation is computed at a lower rate.
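The decoder and training figures quoted in the model description can be cross-checked with a few lines of arithmetic. The sketch below assumes a convolution filter length of 2 in the dilated WaveNet layers, which is not stated in the text; the other numbers (two cycles of dilations 1 to 2^9, 16 kHz audio, 320× upsampling, 5120-sample segments) are taken from the description.

```python
SAMPLE_RATE_HZ = 16_000


def receptive_field_samples(dilations, kernel_size=2):
    """Receptive field of a stack of causal dilated convolutions.

    Each layer extends the context by (kernel_size - 1) * dilation samples.
    kernel_size=2 is an assumption; the text does not state the filter length.
    """
    return 1 + (kernel_size - 1) * sum(dilations)


# Two cycles of 10 layers with dilation rates 1, 2, 4, ..., 2^9 (20 layers total).
dilations = [2 ** i for i in range(10)] * 2

rf_samples = receptive_field_samples(dilations)
rf_ms = 1000 * rf_samples / SAMPLE_RATE_HZ

# Upsampling the latents 320 times to reach 16 kHz implies this latent rate.
latent_rate_hz = SAMPLE_RATE_HZ / 320

# Duration of one training segment of 5120 samples.
segment_ms = 1000 * 5120 / SAMPLE_RATE_HZ
```

Under the stated assumption the receptive field comes out at 2047 samples, roughly 128 ms; the segment length evaluates to the 320 ms quoted in the text.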
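The 256-level mu-law quantization of the WaveNet output can be illustrated with the standard companding formula. This is a generic sketch of mu-law encoding for samples in [-1, 1], not code from the paper; the rounding convention and function names are illustrative choices.

```python
import math


def mu_law_encode(x, levels=256):
    """Map a sample in [-1, 1] to an integer class in [0, levels - 1]."""
    mu = levels - 1
    x = max(-1.0, min(1.0, x))  # clip out-of-range samples
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int((compressed + 1.0) / 2.0 * mu + 0.5)  # round to nearest level


def mu_law_decode(index, levels=256):
    """Invert mu_law_encode up to quantization error."""
    mu = levels - 1
    compressed = 2.0 * index / mu - 1.0
    return math.copysign(math.expm1(abs(compressed) * math.log1p(mu)) / mu, compressed)


# Round trip for one sample; the error stays small because of the fine grid near zero.
sample = 0.3
reconstructed = mu_law_decode(mu_law_encode(sample))
```

The logarithmic compression spends more of the 256 classes on small amplitudes, which is why a softmax over so few levels is workable for speech.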
[Figure 2: three panels over a shared time axis, Time (sec), from about 0.4 to 1.8. Top: log-mel spectrogram (mel channel 0 to 80). Middle: waveform annotated with the phoneme sequence "M IH S T ER K W IH L T ER IH Z DH IY AH P AA S" and the extracted token ids. Bottom: MFCC + d + a features (0 to 39).]
Fig. 2. Example token sequence extracted by the proposed VQ-VAE model. Bottom: input MFCC features. Middle: Target waveform samples overlaid with
extracted token ids (bottom), and ground truth phoneme identities (top) with boundaries between plotted in red. Note the transient “T” phoneme at 0.75 and 1.1
seconds is consistently associated with token 11. The corresponding log-mel spectrogram is shown on the top.
[Figure 3: panels for Filterbank, Phoneme Accuracy, Gender Accuracy and Speaker Accuracy; x-axis: VAE free bits / VQ-VAE bits per token (N/A, 12, 16); legend entries include AE and VAE (D = 4, 8, 16), with further values 32 and 64.]
Fig. 3. Accuracy of predicting signal characteristics at various probe locations in the network. Among the three bottlenecks evaluated, VQ-VAE discards the
most speaker-related information at the bottleneck, while preserving the most phonetic information. For all bottlenecks, the representation coming out of
the encoder yields over 70% accurate framewise phoneme predictions. Both the simple AE and VQ-VAE preserve this information in the bottleneck (the
accuracy drops to 50%-60% depending on the bottleneck’s strength). However, the VQ-VAE discards almost all speaker information (speaker classification
accuracy is close to 0% and gender prediction close to 50%). This causes the VQ-VAE representation to perform best on the acoustic unit discovery task – the
representation captures the phonetic content while being invariant to speaker identity.
[Figure residue: axis labels "Probe point" (p_enc) and "Phoneme prediction accuracy".]
come with sufficiently large training datasets, we achieve results
better than the top contestant [57], despite using a speaker
TABLE II
ZeroSpeech 2017 phonetic unit discovery ABX scores reported across- and within-speakers (lower
is better). The VQ-VAE encoder is speaker independent and thus its results do not change with
the amount of test speaker data (1s, 10s, or 2m), while speaker-adaptive models (e.g.
supervised topline) show improvements when there is more target speaker data. We report the
two reference points from the challenge, along with the challenge winner [57] and three other
submissions that used neural networks in an unsupervised setting [58], [59], [60]. All VQ-VAE
models use exactly the same hyperparameter setup (14 bit tokens extracted at 50 Hz with
time-jitter probability 0.5), regardless of the amount of unlabeled training data (45h, 24h or
2.4h). The top VQ-VAE results row (VQ-VAE trained on target language, features extracted at
the p_cond point) gives best results overall. We also include in italics results for different
probe points and for VQ-VAEs jointly trained on all languages. Multilingual training helps
Mandarin. We also observe that the quantization mostly discards speaker and context influence.
The context is however recovered in the conditioning signal which combines information from
latent vectors at neighboring timesteps.
Within-speaker Across-speaker
English (45h) French (24h) Mandarin (2.4h) English (45h) French (24h) Mandarin (2.4h)
Model 1s 10s 2m 1s 10s 2m 1s 10s 2m 1s 10s 2m 1s 10s 2m 1s 10s 2m
Unsupervised baseline 12.0 12.1 12.1 12.5 12.6 12.6 11.5 11.5 11.5 23.4 23.4 23.4 25.2 25.5 25.2 21.3 21.3 21.3
Supervised topline 6.5 5.3 5.1 8.0 6.8 6.8 9.5 4.2 4.0 8.6 6.9 6.7 10.6 9.1 8.9 12.0 5.7 5.1
VQ-VAE (per lang, pcond ) 5.6 5.5 5.5 7.3 7.5 7.5 11.2 10.7 10.8 8.1 8.0 8.0 11.0 10.8 11.1 12.2 11.7 11.9
VQ-VAE (per lang, pbn ) 6.2 6.0 6.0 7.5 7.3 7.6 10.8 10.5 10.6 8.9 8.8 8.9 11.3 11.0 11.2 11.9 11.4 11.6
VQ-VAE (per lang, pproj ) 5.9 5.8 5.9 6.7 6.9 6.9 9.9 9.7 9.7 9.1 9.0 9.0 11.9 11.6 11.7 11.0 10.6 10.7
VQ-VAE (all lang, pcond ) 5.8 5.8 5.8 8.0 7.9 7.8 9.2 9.1 9.2 8.8 8.6 8.7 11.8 11.6 11.6 10.3 10.0 9.9
VQ-VAE (all lang, pbn ) 6.3 6.2 6.3 8.0 8.0 7.9 9.0 8.9 9.1 9.4 9.2 9.3 11.8 11.7 11.8 9.9 9.7 9.7
VQ-VAE (all lang, pproj ) 5.8 5.7 5.8 7.1 7.0 6.9 7.4 7.2 7.1 9.3 9.3 9.3 11.9 11.4 11.6 8.6 8.5 8.5
Heck et al. [57] 6.9 6.2 6.0 9.7 8.7 8.4 8.8 7.9 7.8 10.1 8.7 8.5 13.6 11.7 11.3 8.8 7.4 7.3
Chen et al. [58] 8.5 7.3 7.2 11.2 9.4 9.4 10.5 8.7 8.5 12.7 11.0 10.8 17.0 14.5 14.1 11.9 10.3 10.1
Ansari et al. [59] 7.7 6.8 N/A 10.4 N/A 8.8 10.4 9.3 9.1 13.2 12.0 N/A 17.2 N/A 15.4 13.0 12.2 12.3
Yuan et al. [60] 9.0 7.1 7.0 11.9 9.5 9.5 11.1 8.5 8.2 14.0 11.9 11.7 18.6 15.5 14.9 12.7 10.8 10.7
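The token setup quoted in the Table II caption (14 bit tokens extracted at 50 Hz) fixes the encoder bitrate. As a small worked check, with illustrative function names, the bitrate is the number of bits per token times the token rate:

```python
import math


def bits_per_token(num_tokens):
    """Bits needed to index one token out of an inventory of num_tokens."""
    return math.log2(num_tokens)


def bitrate_bps(num_tokens, token_rate_hz):
    """Encoder bitrate in bits per second for a given inventory size and token rate."""
    return bits_per_token(num_tokens) * token_rate_hz


# 14 bit tokens correspond to an inventory of 2**14 prototypes.
inventory_size = 2 ** 14
zerospeech_bitrate = bitrate_bps(inventory_size, 50)

# For comparison: an inventory of 256 tokens (8 bits) extracted at 25 Hz.
small_bitrate = bitrate_bps(256, 25)
```

The first value evaluates to 700 bits per second and the second to 200 bits per second.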
brought by quantization. In fact, the multilingual prequantized features are comparable to
[57].

We do not consider the need for more unsupervised training data to be a problem. Unlabeled
data is abundant. We believe that a more powerful model that requires and is able to make a
better use of large amounts of unlabeled training data is to be

[Figure (caption not recovered): panels for filterbank reconstruction error, phoneme accuracy and gender accuracy at probe points p_enc, p_proj, p_bn and p_cond, for AE, VAE and VQ-VAE.]

TABLE III
Effects of input representation and regularization on phoneme recognition accuracy on
LibriSpeech, measured after 200k training steps. All models extract 256 tokens.

Input features        Token rate   Regularization                  Accuracy
MFCC                  25 Hz        None                            52.5
MFCC                  25 Hz        Regular dropout p = 0.1         50.7
MFCC                  25 Hz        Regular dropout p = 0.2         49.1
MFCC                  25 Hz        Per-time step dropout p = 0.2   55.3
MFCC                  25 Hz        Per-time step dropout p = 0.3   55.7
MFCC                  25 Hz        Per-time step dropout p = 0.4   55.1
MFCC                  25 Hz        Time-jitter p = 0.08            56.2
MFCC                  25 Hz        Time-jitter p = 0.12            56.2
MFCC                  25 Hz        Time-jitter p = 0.16            56.1
MFCC                  50 Hz        None                            46.5
MFCC                  50 Hz        Time-jitter p = 0.5             56.1
log-mel spectrogram   25 Hz        None                            50.1
log-mel spectrogram   25 Hz        Time-jitter p = 0.12            53.6
raw waveform          30 Hz        None                            37.6
raw waveform          30 Hz        Time-jitter p = 0.12            48.1

for VQ-VAE, VAE, and simple dimensionality reduction AE bottleneck, shown in Figure 5. For all
bottleneck mechanisms, the regularization limits the quality of filterbank reconstructions and
increases the phoneme recognition accuracy in the constrained representation. However this
benefit is smaller after neighboring timesteps are combined in the p_cond probe point.
Moreover, for VQ-VAE and VAE the regularization decreases gender prediction accuracy and makes
the representation slightly less speaker-sensitive.

2) Input representation: In this set of experiments we compared performance using different
input representations: raw waveforms, log-mel spectrograms, or MFCCs. The raw waveform encoder
used 9 strided convolutional layers, which resulted in a token extraction frequency of 30 Hz.
We then replaced the waveform with a customary ASR data pipeline: 80 log-mel filterbank
features extracted every 10 ms from 25 ms-long windows and 13 MFCC features extracted from the
mel-filterbank output, both augmented with their first and second temporal derivatives. Using
two strided convolution layers in the encoder led to a 25 Hz token rate for these models.

The results are reported in the bottom of Table III. High-level features, especially MFCCs,
perform better than waveforms, because by design they discard information about pitch and
provide a degree of speaker invariance. Using such a reduced representation forces the encoder
to transmit less information to the decoder, acting as an inductive bias toward a more speaker
invariant latent encoding.

3) Output representation: We constructed an autoregressive decoder network that reconstructed
filterbank features rather than raw waveform samples. Inspired by recent progress in
text-to-speech systems, we implemented a Tacotron 2-like decoder [61] with a built-in
information bottleneck on the autoregressive information flow, which was found to be critical
in TTS applications. Similarly to Tacotron 2, the filterbank features were first processed by
a small “pre-net”; we also applied generous amounts of dropout and configured the decoder to
predict up to 4 frames in parallel. However, these modifications yielded at best 42% phoneme
recognition accuracy, significantly lower than the other architectures described in this
paper. The model was however an order of magnitude faster to train.

[Figure 6 plot: prediction accuracy (about 0.60 to 0.75) for the prediction targets gender and phonemes, against WaveNet Receptive Field [ms] on a logarithmic axis (1, 10, 100).]
Fig. 6. Impact of decoder WaveNet receptive field on the properties of the VQ-VAE conditioning
signal. The representation is significantly more gender invariant when the receptive field is
larger than 10 ms. Frame-wise phoneme recognition accuracy peaks at about 125 ms. The depth
and width of the WaveNet have a secondary effect (cf. points with the same RF).

Finally, we analyzed the impact of the size of the decoding WaveNet on the representation
extracted by the VQ-VAE. We have found that overall receptive field (RF) has a larger impact
than the depth or width of the WaveNet. In particular, a large change in the properties of the
latent representation happens when the decoder’s receptive field crosses about 10 ms. For
smaller RFs, the conditioning signal contains more speaker information: gender prediction is
close to 80%, while framewise phoneme prediction accuracy is only 55%. For larger RFs, gender
prediction accuracy is about 60%, while phoneme prediction peaks near 65%. Finally, while the
reconstruction log-likelihood improved with WaveNet depth up to 30 layers, the phoneme
recognition accuracy plateaued with 20 layers. Since the WaveNet has the largest computational
cost we decided to keep the 20 layer configuration.

4) Decoder speaker conditioning: The WaveNet decoder generates samples based on three sources
of information: the previously emitted samples (via the autoregressive connection), global
conditioning on speaker or other information which is stationary in time, and on the
time-varying representation extracted from the encoder. We found that disabling global speaker
conditioning reduces phoneme classification accuracy by 3 percentage points. This further
corroborates our findings about disentanglement induced by the VQ-VAE bottleneck, which biases
the model to discard information that is available in a more explicit form. Throughout our
experiments we used a speaker-independent encoder. However, adapting the encoder to the
speaker might further improve the results. In fact, [57] demonstrates improvements on the
ZeroSpeech task using a speaker-adaptive approach.

5) Encoder hyperparameters: We experimented with tuning the number of encoder convolutional
layers, as well as the number of filters, and the filter length. In general, performance
improved with larger encoders, however we established that the encoder’s receptive field must
be carefully controlled, with the best performing encoders seeing about 0.3 seconds of input
signal for each generated token.

The effective receptive field can be controlled using two mechanisms: by carefully tuning the
encoder architecture, or by
designing an encoder with a wide receptive field, but limiting the duration of signal segments
seen during training to the desired receptive field. In this way the model never learns to use
its full capacity. When the model was trained on 2.5 s long segments, an encoder with a
receptive field of 0.3 s had framewise phoneme recognition accuracy of 56.5%, while an encoder
with a receptive field of 0.8 s scored only 54.3%. When trained on segments of 0.3 s, both
models performed similarly.

6) Bottleneck bit rate: The speech VQ-VAE encoder can be seen as encoding a signal using a
very low bit rate. To achieve a predetermined target bit rate, one can control both the token
rate (i.e., by controlling the degree of downsampling in the encoder strided convolutions),
and the number of tokens (or equivalently the number of bits) extracted at every step. We
found that the token rate is a crucial parameter which must be chosen carefully, with the best
results after 200k training steps obtained at 50 Hz (56.0% phoneme recognition accuracy) and
25 Hz (56.3%). Accuracy drops abruptly at higher token rates (49.3% at 100 Hz), while lower
rates miss very short phones (53% accuracy at 12.5 Hz).

In contrast to the number of tokens, the dimensionality of the VQ-VAE embedding has a
secondary effect on the representation quality. We found 64 to be a good setting, with much
smaller dimensions deteriorating performance for models with a small number of tokens and
higher dimensionalities negatively affecting performance for models with a large number of
tokens.

For completeness, we observe that even for the model with the largest inventory of tokens, the
overall encoder bitrate is low: 14 bits at 50 Hz = 700 bps, which is on par with the lowest
bitrate of classical speech codecs [62].

7) Training corpus size: We experimented with training models on subsets of the LibriSpeech
training set, varying the size from 4.6 hours (1%) to 460 hours (100%). Training on 4.6 hours
of data, phoneme recognition accuracy peaked at 50.5% at 100k steps and then deteriorated.
Training on 9 hours led to a peak accuracy of 52.5% at 180k steps. When the size of the
training set was increased past 23 hours the phoneme recognition reached 54% after around 900k
steps. No further improvements were found by training on the full 460 hours of data. We did
not observe any overfitting, and for best results trained models until reaching 900k steps
with no early stopping. An interesting future area for research would be investigating methods
to increase the model capacity to make better use of larger amounts of unlabeled data.

The influence of the size of the dataset is also visible in the ZeroSpeech Challenge results
(Table II): VQ-VAE models obtained good performance on English (45 hours of training data) and
French (24 hours), but performed poorly on Mandarin (2.5 hours). Moreover, on English and
French we obtained the best results with models trained on monolingual data. On Mandarin
slightly better results were obtained using a model trained jointly on data from all
languages.

V. RELATED WORK

VAEs for sequential data were introduced in [49]. The model used LSTM encoder and decoder,
while the latent representation was formed from the last hidden state of the encoder. The
model proved useful for natural language processing tasks. However, it also demonstrated the
problem of latent representation collapse: when a powerful autoregressive decoder is used
simultaneously with a penalty on the latent encoding, such as the KL prior, the VAE has a
tendency to ignore the prior and act as if it were a purely autoregressive sequence model.
This issue can be mitigated by changing the weight of the KL term, and limiting the amount of
information on the autoregressive path by using word dropout [49]. Latent collapse can also be
avoided in deterministic autoencoders, such as [63], which coupled a convolutional encoder to
a powerful autoregressive WaveNet decoder [45] to learn a latent representation of music audio
consisting of isolated notes from a variety of instruments.

When applied to audio, the VQ-VAE uses the WaveNet decoder to free the latent representation
from modeling information that is easily recoverable from the recent past [18]. It avoids the
problem of posterior collapse by using a discrete latent code with a uniform prior which
results in a constant KL penalty. We employ the same strategy to design the latent
representation regularizer: rather than extending the cost function with a penalty term that
can cause the latent space to collapse, we rely on random copies of the latent variables to
prevent their co-adaptation and promote stability over time.

The randomized time-jitter regularization introduced in this paper is inspired by slow
representations of data [48] and by dropout, which randomly removes neurons during training to
prevent their co-adaptation [50]. It is also very similar to Zoneout [51] which relies on
random time copies of selected neurons to regularize recurrent neural networks.

Several authors have recently proposed to model sequences with VAEs that use a hierarchy of
variables. [64] explore a hierarchical latent space which separates sequence-dependent
variables from those which are sequence-independent. Their model was shown to perform speaker
conversion and to improve automatic speech recognition (ASR) performance in the presence of
domain mismatch. [65] introduce a stochastic latent variable model for sequential data which
also yields disentangled representations and allows content swapping between generated
sequences. These other approaches could possibly benefit from regularizing the latent
representation to achieve further information disentanglement.

Acoustic unit discovery systems aim at transducing the acoustic signal into a sequence of
interpretable units akin to phones. They often involve clustering of acoustic frames, MFCC or
neural network bottleneck features, regularized using a probabilistic prior. DP-GMM [66]
imposes a Dirichlet Process prior over a Gaussian Mixture Model. Extending it with an HMM
temporal structure for sub-phonetic units leads to the DP-HMM and the HDP-HMM [67], [68],
[69]. HMM-VAE proposes the use of a deep neural network instead of a GMM [70], [71]. These
approaches enforce top-down constraints via HMM temporal smoothing and temporal modeling.
Linguistic unit discovery models detect recurring speech patterns at a word-like level,
finding commonly repeated segments with a constrained dynamic time warping [72].

In the segmental unsupervised speech recognition framework, neural autoencoders were used to
embed variable length speech
segments into a common vector space where they could be clustered into word
types [73]. [74] replace the segmental autoencoder with a model that instead
predicts a nearby speech segment and demonstrate that the representation
shares many properties with word embeddings. Coupled with an unsupervised
word segmentation algorithm and unsupervised mapping of word embeddings
discovered on separate corpora [75], the approach yielded an ASR system
trained on unpaired speech and text data [76].

Several entries to the ZeroSpeech 2017 challenge relied on neural networks
for phonetic unit discovery. [60] trains an autoencoder on pairs of speech
segments found using an unsupervised term discovery system [77]. [58] first
clustered speech frames, then trained a neural network to predict the
cluster IDs and used its hidden representation as features. [59] extended
this scheme with features discovered by an autoencoder trained on MFCCs.

VI. CONCLUSIONS

We applied sequence autoencoders to speech modeling and compared
autoencoders using different information bottlenecks, including VAEs and
VQ-VAEs. We carefully evaluated the induced latent representation using
interpretability criteria as well as the ability to discriminate between
similar speech sounds. The comparison of bottlenecks revealed that discrete
representations obtained using the VQ-VAE preserved the most phonetic
information while also being the most speaker-invariant. The extracted
representation allowed for accurate mapping of the extracted symbols into
phonemes and obtained competitive performance on the ZeroSpeech 2017
acoustic unit discovery task.

We established that an information bottleneck is required for the model to
learn a representation that separates content from speaker characteristics.
Furthermore, we observe that the latent collapse problem induced by using
bottlenecks which are too strong can be avoided by making the bottleneck
strength a hyperparameter of the model, either removing it completely (as
in the VQ-VAE), or by using the free-information formulation of the VAE
objective.

To further improve representation quality, we introduced a time-jitter
regularization scheme which limits the capacity of the latent code yet does
not result in a collapse of the latent space. We hope that this can
similarly improve performance of latent variable models used with
auto-regressive decoders in other problem domains.

Both the VAE and VQ-VAE constrain the information bandwidth of the latent
representation. However, the VQ-VAE uses a quantization mechanism, which
deterministically forces the encoding to be equal to a prototype, while the
VAE limits the amount of information by injecting noise. In our study, the
VQ-VAE resulted in better information separation than the VAE. However,
further experiments are needed to fully understand this effect. In
particular, is this a consequence of the quantization, or of the
deterministic operation?

We also observe that while the VQ-VAE produces a discrete representation,
for best results it uses a token set so large that it is impractical to
assign a separate meaning to each one. In particular, in our ZeroSpeech
experiments we used the dense embedding representation of each token, which
provided a more nuanced token similarity measure than simply using the
token identity. Perhaps a more structured latent representation is needed,
in which a small set of units can be modulated in a continuous fashion.

Extensive hyperparameter evaluation indicated that optimizing the receptive
field sizes of the encoder and decoder networks is important for good model
performance. A multi-scale modeling approach could furthermore separate the
prosodic information. Our autoencoding approach could also be combined with
penalties that are more specialized to speech processing. Introducing an
HMM prior as in [71] could promote a latent representation which better
mimics the temporal phonetic structure of speech.

ACKNOWLEDGMENTS

The authors thank Tara Sainath, Úlfar Erlingsson, Aren Jansen, Sander
Dieleman, Jesse Engel, Łukasz Kaiser, Tom Walters, Cristina Garbacea, and
the Google Brain team for their helpful discussions and feedback.

REFERENCES

[1] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, 2006.
[2] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proc. International Conference on Machine Learning (ICML), 2008.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[4] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[5] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in Proc. International Conference on Learning Representations, 2015.
[6] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al., “Google’s neural machine translation system: Bridging the gap between human and machine translation,” arXiv preprint arXiv:1609.08144, 2016.
[7] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 6645–6649.
[8] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, K. Gonina, N. Jaitly, B. Li, J. Chorowski, and M. Bacchiani, “State-of-the-art speech recognition with sequence-to-sequence models,” in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
[9] W. Wang, N. Yang, F. Wei, B. Chang, and M. Zhou, “Gated self-matching networks for reading comprehension and question answering,” in Proc. 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, 2017, pp. 189–198.
[10] A. W. Yu, D. Dohan, M.-T. Luong, R. Zhao, K. Chen, M. Norouzi, and Q. V. Le, “QANet: Combining local convolution with global self-attention for reading comprehension,” in Proc. International Conference on Learning Representations, 2018.
[11] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in European Conference on Computer Vision, 2014.
[12] M. Sundararajan, A. Taly, and Q. Yan, “Axiomatic attribution for deep networks,” in Proc. International Conference on Machine Learning (ICML), 2017.
[13] T. Nagamine and N. Mesgarani, “Understanding the representation and computation of multilayer perceptrons: A case study in speech recognition,” in Proc. International Conference on Machine Learning (ICML), 2017.
[14] J. Chorowski, R. J. Weiss, R. A. Saurous, and S. Bengio, “On using backpropagation for speech texture generation and voice conversion,” in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2018.
[15] P. Swietojanski, A. Ghoshal, and S. Renals, “Unsupervised cross-lingual knowledge transfer in DNN-based LVCSR,” in Proc. Spoken Language Technology Workshop (SLT), 2012, pp. 246–251.
[16] S. Thomas, M. L. Seltzer, K. Church, and H. Hermansky, “Deep neural network features and semi-supervised training for low resource speech recognition,” in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 6704–6708.
[17] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “LibriSpeech: an ASR corpus based on public domain audio books,” in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210.
[18] A. van den Oord, O. Vinyals, and K. Kavukcuoglu, “Neural discrete representation learning,” in Advances in Neural Information Processing Systems, 2017, pp. 6309–6318.
[19] E. Dunbar, X. N. Cao, J. Benjumea, J. Karadayi, M. Bernard, L. Besacier, X. Anguera, and E. Dupoux, “The zero resource speech challenge 2017,” in Proc. Automatic Speech Recognition and Understanding Workshop (ASRU), 2017.
[20] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, no. 6088, 1986.
[21] H. Lee, C. Ekanadham, and A. Ng, “Sparse deep belief net model for visual area V2,” in Advances in Neural Information Processing Systems, 2008.
[22] S. Dieleman and B. Schrauwen, “End-to-end learning for music audio,” in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 6964–6968.
[23] N. Jaitly and G. Hinton, “Learning a better representation of speech soundwaves using restricted Boltzmann machines,” in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011.
[24] Z. Tüske, P. Golik, R. Schlüter, and H. Ney, “Acoustic modeling with deep neural networks using raw time signal for LVCSR,” in Proc. Interspeech, 2014.
[25] D. Palaz, M. Magima Doss, and R. Collobert, “Analysis of CNN-based speech recognition system using raw speech as input,” in Proc. Interspeech, Sep. 2015.
[26] T. N. Sainath, R. J. Weiss, A. Senior, K. W. Wilson, and O. Vinyals, “Learning the speech front-end with raw waveform CLDNNs,” in Proc. Interspeech, Sep. 2015.
[27] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[28] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?” in Advances in Neural Information Processing Systems, 2014, pp. 3320–3328.
[29] K. Veselý, M. Karafiát, F. Grézl, M. Janda, and E. Egorova, “The language-independent bottleneck features,” in Proc. Spoken Language Technology Workshop (SLT), 2012, pp. 336–341.
[30] D. Yu and M. L. Seltzer, “Improved bottleneck features using pretrained deep neural networks,” in Proc. Interspeech, 2011.
[31] B. McCann, J. Bradbury, C. Xiong, and R. Socher, “Learned in translation: Contextualized word vectors,” in Advances in Neural Information Processing Systems, 2017, pp. 6294–6305.
[32] S. R. Bowman, G. Angeli, C. Potts, and C. Manning, “A large annotated corpus for learning natural language inference,” in Proc. Conference on Empirical Methods in Natural Language Processing, 2015.
[33] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes, “Supervised learning of universal sentence representations from natural language inference data,” in Proc. Conference on Empirical Methods in Natural Language Processing (EMNLP), September 2017, pp. 670–680.
[34] C. M. Bishop, “Continuous latent variables,” in Pattern Recognition and Machine Learning. Springer, 2006, ch. 12.
[35] D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, no. 6755, p. 788, 1999.
[36] B. A. Olshausen and D. J. Field, “Emergence of simple-cell receptive field properties by learning a sparse code for natural images,” Nature, vol. 381, no. 6583, p. 607, 1996.
[37] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in Proc. International Conference on Learning Representations, 2014.
[38] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner, “Beta-VAE: Learning basic visual concepts with a constrained variational framework,” in Proc. International Conference on Learning Representations, 2017.
[39] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy, “Deep variational information bottleneck,” in Proc. International Conference on Learning Representations, 2017.
[40] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling, “Improved variational inference with inverse autoregressive flow,” in Advances in Neural Information Processing Systems, 2016.
[41] Y. Bengio, N. Léonard, and A. Courville, “Estimating or propagating gradients through stochastic neurons for conditional computation,” arXiv preprint arXiv:1308.3432, 2013.
[42] D. Jurafsky and J. H. Martin, Speech and Language Processing (2nd Edition). Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 2009.
[43] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language modeling with gated convolutional networks,” in Proc. International Conference on Machine Learning (ICML), 2017.
[44] S. Bai, J. Z. Kolter, and V. Koltun, “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling,” arXiv preprint arXiv:1803.01271, 2018.
[45] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016.
[46] W.-N. Hsu, Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Y. Wang, Y. Cao, Y. Jia, Z. Chen, J. Shen, P. Nguyen, and R. Pang, “Hierarchical generative modeling for controllable speech synthesis,” in Proc. International Conference on Learning Representations, 2019.
[47] I. Gulrajani, K. Kumar, F. Ahmed, A. A. Taïga, F. Visin, D. Vázquez, and A. Courville, “PixelVAE: A latent variable model for natural images,” in Proc. International Conference on Learning Representations, 2017.
[48] L. Wiskott and T. J. Sejnowski, “Slow feature analysis: Unsupervised learning of invariances,” Neural Computation, vol. 14, no. 4, 2002.
[49] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio, “Generating sentences from a continuous space,” in SIGNLL Conference on Computational Natural Language Learning, 2016.
[50] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, 2014.
[51] D. Krueger, T. Maharaj, J. Kramár, M. Pezeshki, N. Ballas, N. R. Ke, A. Goyal, Y. Bengio, A. Courville, and C. Pal, “Zoneout: Regularizing RNNs by randomly preserving hidden activations,” in Proc. International Conference on Learning Representations, 2017.
[52] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. International Conference on Learning Representations, 2015.
[53] B. T. Polyak and A. B. Juditsky, “Acceleration of stochastic approximation by averaging,” SIAM Journal on Control and Optimization, vol. 30, no. 4, pp. 838–855, 1992.
[54] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The Kaldi speech recognition toolkit,” in Proc. Automatic Speech Recognition and Understanding Workshop (ASRU), 2011.
[55] T. Schatz, V. Peddinti, F. Bach, A. Jansen, H. Hermansky, and E. Dupoux, “Evaluating speech features with the minimal-pair ABX task: Analysis of the classical MFC/PLP pipeline,” in Proc. Interspeech, 2013, pp. 1–5.
[56] T. Schatz, V. Peddinti, X.-N. Cao, F. Bach, H. Hermansky, and E. Dupoux, “Evaluating speech features with the minimal-pair ABX task (ii): Resistance to noise,” in Proc. Interspeech, 2014.
[57] M. Heck, S. Sakti, and S. Nakamura, “Unsupervised linear discriminant analysis for supporting DPGMM clustering in the zero resource scenario,” Procedia Computer Science, vol. 81, pp. 73–79, 2016.
[58] H. Chen, C.-C. Leung, L. Xie, B. Ma, and H. Li, “Multilingual bottleneck feature learning from untranscribed speech,” in Proc. Automatic Speech Recognition and Understanding Workshop (ASRU), 2017.
[59] T. Ansari, R. Kumar, S. Singh, and S. Ganapathy, “Deep learning methods for unsupervised acoustic modeling - leap submission to zerospeech challenge 2017,” in Proc. Automatic Speech Recognition and Understanding Workshop (ASRU), 2017, pp. 754–761.
[60] Y. Yuan, C. C. Leung, L. Xie, H. Chen, B. Ma, and H. Li, “Extracting bottleneck features and word-like pairs from untranscribed speech for feature representation,” in Proc. Automatic Speech Recognition and Understanding Workshop (ASRU), Dec 2017, pp. 734–739.
[61] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. J. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
[62] X. Wang and C.-C. J. Kuo, “An 800 bps VQ-based LPC voice coder,” Journal of the Acoustical Society of America, vol. 103, no. 5, 1998.