Accepted Manuscript: Speech Communication
PII: S0167-6393(18)30239-5
DOI: https://doi.org/10.1016/j.specom.2019.02.003
Reference: SPECOM 2628
Please cite this article as: Subhojeet Pramanik, Aman Hussain, Text Normalization
using Memory Augmented Neural Networks, Speech Communication (2019), doi:
https://doi.org/10.1016/j.specom.2019.02.003
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service
to our customers we are providing this early version of the manuscript. The manuscript will undergo
copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please
note that during the production process errors may be discovered which could affect the content, and
all legal disclaimers that apply to the journal pertain.
Abstract
We perform text normalization, i.e. the transformation of words from the written to the spoken form, using a memory augmented neural network. With the addition of a dynamic memory access and storage mechanism, we present a neural architecture that can serve as a language-agnostic text normalization system while avoiding the kind of unacceptable errors made by LSTM-based recurrent neural networks. By successfully reducing the frequency of such mistakes, we show that this novel architecture is indeed a better alternative. Our proposed system also requires significantly less data, training time and compute resources. Additionally, we perform data up-sampling to circumvent the data sparsity problem in some semiotic classes, and show that sufficient examples in any particular class can improve the performance of our text normalization system. Although a few occurrences of these errors still remain in certain semiotic classes, we demonstrate that memory augmented networks with meta-learning capabilities can open many doors to a superior text normalization system.
Keywords: text normalization, differentiable neural computer, deep learning
1. Introduction

... and learn algorithms on its own. There have been recent advancements in memory augmented neural network architectures, such as the Differentiable Neural Computer, with dynamic memory access and storage capacity. Such architectures have shown the ability to learn algorithmic tasks such as traversing a graph or finding relations in a family tree. Normalization of the semiotic classes of interest, particularly those containing numbers and measurement units, can be performed using basic algorithmic steps. While neural network architectures such as the LSTM work sufficiently well in machine translation tasks, they have been shown to suffer in those semiotic classes which require basic step-by-step transduction of the input tokens (Sproat and Jaitly, 2017). There is hope that neural network architectures with memory augmentation will be able to learn the algorithmic steps (meta-learning), similar to finite-state filters, but without any human intervention or external knowledge about the language and its grammar. With such memory augmentation, the network should be able to learn to represent and reason about the sequence of characters in the context of the text normalization task.

We begin by defining the challenges of text normalization in order to understand the reason behind these "silly" mistakes. Then, after a brief overview of the prior work done on this topic, we describe the dataset released by Sproat and Jaitly (2017), which has been used in a way that allows for an objective comparative analysis. Subsequently, we delve into the theoretical and ...

... The tokens which require normalization are usually very sparse, which might ... In Text-to-Speech (TTS) systems, text normalization is used to render the textual data as a standard representation that can be converted into the audio form. In Automatic Speech Recognition (ASR) systems, raw textual data is processed into language models using text normalization techniques.

For example, a native English speaker would read the sentence "Please forward this mail to 312, Park Street, Kolkata" as "Please forward this mail to three one two Park Street Kolkata". However, "The new model is priced at $312" would be read as "The new model is priced at three hundred and twelve dollars". This clearly demonstrates that contextual information is of particular importance during the conversion of written text to spoken form.

Further, the instances of actual conversion are few and far between swathes of words which do not undergo any transformation at all. On the English dataset used in this paper, 92.5% of the words remain the same. The inherent data sparsity of this problem makes training machine learning models especially difficult. The quality requirements of a TTS system are rather high given the nature of the task at hand. A model will be heavily penalized for making "silly" errors such as transforming the measurement 100 KG into Hundred Kilobytes, or the date 1/10/2017 into first of January twenty seventeen.

2.1. Prior Work
... predicting completely inaccurate dates or currencies. Such "silly" errors are unacceptable in a TTS system deployed in production. However, a few of these errors were shown to be corrected by an FST (Finite State Transducer) which employs a weak covering grammar to filter and correct the misreadings.

2.2. Dataset

For the purposes of a comparative study and quantitative interpretation, we have used exactly the same English and Russian datasets as used in Sproat and Jaitly (2017). The English dataset consists of 1.1 billion words extracted from Wikipedia and run through the Kestrel text normalization system of Google's TTS pipeline to generate the target verbalizations. The dataset is formatted into 'before' or unprocessed tokens and 'after' or normalized tokens. Each token is labeled with its respective semiotic class(1), such as PUNCT for punctuation and PLAIN for ordinary words. The Russian dataset consists of 290 million words from Wikipedia and is formatted likewise. These datasets are available at https://github.com/rwsproat/text-normalization-data.

Each of these datasets is split into 100 files. The base paper (Sproat and Jaitly, 2017) uses 90 of these files as the training set, 5 files for the validation set and 5 for the test set. However, our proposed system uses only the first two files of the English dataset (2.2%) and the first four files of the Russian dataset (4.4%) for the training set. To keep the results consistent and draw objective conclusions, we have defined the test set to be precisely the same as the one used by the base paper. Hence, the first 100,002 and 100,007 lines are extracted from the 100th file, output-00099-of-00100, of the English and Russian datasets respectively.

(1) ALL = all cases; PLAIN = ordinary word (<self>); PUNCT = punctuation (sil); TRANS = transliteration; LETTERS = letter sequence; CARDINAL = cardinal number; VERBATIM = verbatim reading of character sequence; ORDINAL = ordinal number; DECIMAL = decimal fraction; ELECTRONIC = electronic address; DIGIT = digit sequence; MONEY = currency amount; FRACTION = non-decimal fraction; TIME = time expression.
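For illustration, the 'before'/'after' file format described above can be read with a few lines of Python. The following sketch is ours, not part of the released data or the authors' pipeline; it assumes each line holds three tab-separated fields (semiotic class, 'before' token, 'after' token) and that sentence boundaries are marked with "<eos>" lines, as in the public release.

# Minimal sketch of reading one file of the text normalization data.
import csv
from collections import Counter

def read_tokens(path):
    """Yield (semiotic_class, before, after) triples, skipping sentence breaks."""
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            if not row or row[0] == "<eos>" or len(row) < 3:
                continue                      # sentence boundary or malformed line
            cls, before, after = row[0], row[1], row[2]
            if after == "<self>":             # token is left unchanged
                after = before
            yield cls, before, after

# Example: count, per class, how many tokens actually need rewriting.
changed = Counter(cls for cls, b, a in read_tokens("output-00001-of-00100")
                  if a not in (b, "sil"))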
3. Background: Memory Augmented Neural Networks

Traditional deep neural networks are great at fuzzy pattern matching; however, they do not generalize well on complex data structures such as graphs and trees, and they also perform poorly in learning representations over long sequences. To tackle sequential forms of data, Recurrent Neural Networks were proposed, which have been known to capture temporal patterns in an input sequence and are also known to be Turing complete if wired properly (Siegelmann and Sontag, 1995). However, traditional RNNs suffer from what is known as the vanishing gradients problem (Bengio et al., 1994). The Long Short-Term Memory architecture was proposed in (Hochreiter and Schmidhuber, 1997), capable of learning over long sequences by storing representations of the input data as a cell state vector. LSTMs can be trained on variable length input-output sequences by training two separate LSTMs called the encoder and the decoder (Sutskever et al., 2014). The encoder LSTM is trained to map the input sequence to a fixed length vector, and the decoder LSTM generates output vectors from that fixed length vector. This kind of sequence-to-sequence learning approach has been known to outperform traditional DNN models in machine translation and sequence classification tasks. Extra information can be provided to the decoder by using attention mechanisms (Bahdanau et al., 2014; Luong et al., 2015), which allow the decoder to concentrate on the parts of the input that seem relevant at a particular decoding step. Such models are widely used and have helped to achieve state-of-the-art accuracy in machine translation systems.

However, LSTM-based sequence-to-sequence models are still not good at representing complex data structures or learning to perform algorithmic tasks. They also require a lot of training data to generalize well to long sequences. An interesting approach by Joulin and Mikolov (2015) presents a recurrent architecture with a differentiable stack, able to perform algorithmic tasks such as counting and memorization of input sequences. Similar memory augmented neural networks (Grefenstette et al., 2015) have also been shown to benefit natural language transduction problems by being able to learn the underlying generating algorithms required for the transduction process.
Further, a memory augmented neural network architecture called the Neural Turing Machine was introduced by Graves et al. that uses an external memory matrix with read and write heads. A controller network that works like an RNN is able to read and write information from the memory. The read and write heads use content-based and location-based attention mechanisms to focus the attention on specific parts of the memory. The NTM has also shown promise in meta-learning (Santoro et al., 2016), showing that memory augmented networks are able to generalize well even with fewer training examples.

An improvement to this architecture was proposed by Graves et al., called the Differentiable Neural Computer, with even more memory access mechanisms and dynamic storage capabilities. The DNC, when trained in a supervised manner, was able ...

3.1. Differentiable Neural Computer

The DNC consists of a controller network coupled with an external memory matrix M ∈ R^{N×W}. At each time-step t, the controller network takes as input a controller input vector χ_t = [x_t; r^1_{t−1}; ...; r^R_{t−1}], where x_t ∈ R^X is the input vector for time-step t and r^1_{t−1}, ..., r^R_{t−1} is the set of R read vectors from the previous time-step, and it outputs an output vector v_t and an interface vector ε_t ∈ R^{(W×R)+3W+5R+3}. The controller network is essentially a recurrent neural network such as an LSTM. The recurrent operation of the controller network can be encapsulated as in Eqn. 1:

    (v_t, ε_t) = N([χ_1; ...; χ_t]; θ)    (1)

where N is a non-linear function and θ contains all the trainable parameters of the controller network. The read vectors r are used to perform a read operation at every time-step. A read vector r defines a weighted sum over all locations of the memory matrix M, obtained by applying a read weighting w^r ∈ Δ_N over the memory M, where Δ_N is the non-negative orthant of R^N with the unit simplex as a boundary:

    r = Σ_{i=1}^{N} M[i, ·] w^r[i]    (2)

where '·' denotes all j = 1, ..., W. The interface vector ε_t is used to parameterize the memory interactions for the next time-step. A write operation is also performed at each time-step using a write weighting w^w ∈ Δ_N, which first erases unused information ...

The system uses a combination of different attention mechanisms to determine where to read and write at every time-step. The attention mechanisms are all parameterized by the interface vector ε_t. The write weighting w^w, used to perform the write operation, is defined by a combination of content-based addressing and dynamic memory allocation. The read weighting w^r is defined by a combination of content-based addressing and temporal memory linkage. The entire system is end-to-end differentiable and can be trained through backpropagation. For the purpose of this research, the internal architecture of the Differentiable Neural Computer remains the same as specified in the original paper (Graves et al., 2016). The open-source implementation of the DNC architecture used here is available at https://github.com/deepmind/dnc.
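To make the read and write weightings concrete, the following NumPy sketch (ours, independent of the DeepMind implementation linked above) shows content-based addressing, the read operation of Eqn. 2, and an erase-then-add write update in the style of Graves et al. (2016) for a single time-step; all sizes are illustrative.

# Minimal NumPy sketch of the DNC memory operations described above.
import numpy as np

def content_weighting(M, key, beta):
    """Content-based addressing: softmax over scaled cosine similarity to `key`."""
    M_norm = M / (np.linalg.norm(M, axis=1, keepdims=True) + 1e-8)
    k_norm = key / (np.linalg.norm(key) + 1e-8)
    scores = beta * (M_norm @ k_norm)          # shape (N,)
    e = np.exp(scores - scores.max())
    return e / e.sum()                         # lies in the unit simplex

def read(M, w_r):
    """Eqn. 2: r = sum_i M[i, :] * w_r[i]."""
    return M.T @ w_r                           # shape (W,)

def write(M, w_w, erase, add):
    """Erase-then-add update of the memory matrix."""
    return M * (1 - np.outer(w_w, erase)) + np.outer(w_w, add)

N, W = 16, 8                                   # memory locations x word size
M = np.zeros((N, W))
w_w = content_weighting(M, key=np.ones(W), beta=1.0)
M = write(M, w_w, erase=np.ones(W), add=np.random.randn(W))
r = read(M, content_weighting(M, key=M[0], beta=5.0))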
3.2. Extreme Gradient Boosting

Boosting is an ensemble machine learning technique which attempts to pool the expertise of several learning models to form a better learner. Adaptive Boosting, more commonly known as "AdaBoost", was the first successful boosting algorithm, invented by Freund and Schapire. (Breiman, 1998) and (Breiman, 1999) went on to formulate the boosting algorithm of AdaBoost as a kind of gradient descent with a special loss function. (Friedman et al., 2000) and (Friedman, 2001) further generalized AdaBoost to gradient boosting in order to handle a variety of loss functions. ... It has been reported to run more than ten times faster than existing solutions on a single node. The reason behind using gradient ... new learner from making the previous mistakes again. Similarly, in gradient boosting, the "defects" are defined by the error gradients.

The model is initiated with a weak learner F(x_i), which is a decision stump, i.e. a shallow decision tree. The subsequent steps keep adding a new learner, h(x), which is trained to predict the error residual of the previous learner. The procedure therefore learns a sequence of models which continuously tries to correct the residuals of the earlier model. The sum of the predictions becomes increasingly accurate and the ensemble model increasingly complex.

To elucidate further, we consider a simple regression problem. Initially a regression model F(x_1) is fitted to the original data points (x_1, y_1), (x_2, y_2), ..., (x_n, y_n). The error in the model ...

    F(x_i) := F(x_i) + (y_i − F(x_i))    (6)
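The residual-fitting loop described above can be written in a few lines. This toy regression example is ours and uses shallow decision trees from scikit-learn as the weak learners h(x); it is a sketch of the general technique, not the authors' training code.

# Toy sketch of gradient boosting for squared loss: each new weak learner is
# fitted to the residuals y - F(x) of the current ensemble.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_rounds=50, learning_rate=0.1, depth=1):
    F = np.full(len(y), y.mean())              # initial weak learner F(x)
    learners = []
    for _ in range(n_rounds):
        residual = y - F                       # the "defects" for squared loss
        h = DecisionTreeRegressor(max_depth=depth).fit(X, residual)
        F = F + learning_rate * h.predict(X)   # F(x) := F(x) + nu * h(x)
        learners.append(h)
    return F, learners

X = np.random.rand(200, 3)
y = np.sin(4 * X[:, 0]) + 0.1 * np.random.randn(200)
F, _ = gradient_boost(X, y)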
... is used. We propose a novel sequence-to-sequence architecture based on the Differentiable Neural Computer. This second model uses the Encoder-Decoder architecture (Sutskever et al., 2014) combined with the Bahdanau attention mechanism (Bahdanau et al., 2014). The model tries to maximize the conditional probability P(y | x), where y is the target sentence and x is a sequence of characters formed by the concatenation of the to-be-normalized token w and the context words w_{i−k} to w_{i+k} surrounding the token.

The major intuition behind using two different models is that the instances of actual conversion are few and far between. Training a single deep neural network with such a heavily skewed ...

4.1. XGBoost Classifier

An extreme gradient boosting (XGBoost) model is trained to classify tokens into the following two classes, RemainSame and ToBeNormalized, to be used in the later stage of the pipeline. Tokenization of the training data has already been performed. Additional preprocessing of the tokens or words needs to be done before the XGBoost model can be trained on it. Specifically, we transform the individual tokens into numerical feature vectors to be fed into the model. Each token is ...

... Since the starting position of the target input token can vary, this value serves as the 'start of token' identifier. An example can better elucidate the process. For instance, the input vector for 'genus' in the sentence Brillantaisia is a genus of plant in family Acanthaceae will be [-1, 97, 0, ..., 0, -1, 103, 101, 110, 117, 115, 0, ..., 0, -1, 111, 102, 0, ..., 0, -1], where the runs of zeros are padding.

After the data is preprocessed and ready, we perform a train-validation split to help us tune the model. The performance ... model complexity. Finally, we have an AUC score of 0.999875 on the training set and a score of 0.998830 on the validation set. The XGBoost package by (Chen and Guestrin, 2016) allows us to rank the relative importance of the features for the classification task by looking at the improvement in accuracy brought about by any particular feature. On generating the feature importance plot of the trained English XGBoost model in Figure 1, we find that the first six characters of the target token, i.e. the features at the 32nd, 33rd, 34th, 35th, 36th and 37th positions of the feature vector, had the highest scores.
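A sketch of the token-to-feature-vector encoding and of the two-class XGBoost model is given below. The -1 'start of token' markers, the raw character codes and the zero padding follow the description above; the per-token window width, the helper names and the hyper-parameters are placeholders of our own, not the authors' exact settings.

# Sketch of encoding a token plus its neighbours into a fixed-length vector of
# character codes (with -1 marking the start of each token) and training the
# RemainSame/ToBeNormalized classifier with XGBoost.
import numpy as np
import xgboost as xgb

SLOT = 30   # assumed per-token character budget; remaining positions are zero

def encode(prev_tok, tok, next_tok):
    vec = []
    for t in (prev_tok, tok, next_tok):
        vec.append(-1)                                   # start-of-token marker
        codes = [ord(c) for c in t[:SLOT]]
        vec.extend(codes + [0] * (SLOT - len(codes)))    # zero padding
    vec.append(-1)                                       # closing marker
    return np.array(vec, dtype=np.int32)

# X: encoded tokens, y: 1 = ToBeNormalized, 0 = RemainSame
X = np.stack([encode("a", "genus", "of"), encode("is", "15kms", "away")])
y = np.array([0, 1])
clf = xgb.XGBClassifier(n_estimators=200, max_depth=6, eval_metric="auc")
clf.fit(X, y)

With a trained classifier, xgb.plot_importance(clf) produces a feature importance plot of the kind shown in Figure 1 (it requires matplotlib).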
Figure 1: Top 10 important features for the English XGBoost model.

For a human classifying these tokens, the first few characters of the word or token are indeed the best indicators for deciding whether it needs to be normalized or not. This assures us that the trained model has in fact learned the right features for the task at hand. After these first few features, the model also places high importance on other characters belonging to the preceding and succeeding tokens. The high F1-score of the model, as reported in Table 3, confirms the overall effectiveness of the model.

The model could also be trained to classify the tokens into semiotic classes. The semiotic classes which most confuse the DNC translator could then be fed to a separate sequence-to-sequence ... feed even more contextual information into the model.

4.2. Sequence to Sequence DNC

The ToBeNormalized tokens, as classified by the XGBoost model, are then fed to a recurrent model. To this end, we present an architecture called sequence-to-sequence DNC that allows the DNC model to be adapted for sequence-to-sequence translation purposes. Our underlying framework uses the RNN Encoder-Decoder architecture (Sutskever et al., 2014). We have also used attention mechanisms to allow the decoder to concentrate on the various output states generated during the encoding ...

    h_t = g_e(x_t, h_{t−1}, s_t)    (9)

where the function g_e gives the output of the DNC network during the encoding phase, and s_t is the hidden state of the DNC. During the decoding phase the DNC is trained to generate an output word y_t ∈ R^{K_y}, given a context vector c_t ∈ R^n, where K_y is the output vocabulary size. The decoding phase uses the Bahdanau attention mechanism (Bahdanau et al., 2014) to generate the context vector c_t by performing soft attention over the annotation vectors h.

The decoder defines a conditional probability P of an output word y_t at time-step t given the sequence of input vectors x and the previous predictions y_1, ..., y_{t−1}:

    P(y_t | y_1, ..., y_{t−1}, x) = g_d(y_{t−1}, s_t, c_t)    (10)

where g_d gives the output of the DNC network during the decoding phase and s_t is the hidden state of the DNC at ...

... where f calculates the new state of the DNC network based on the previous controller and memory states. During the decoding phase, the output of the DNC is fed into a dense layer followed by a soft-max layer to generate word-by-word predictions. We also used embedding layers to encode the input and output tokens into fixed dimensional vectors during the encoding and decoding phases.
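Equations (9) and (10) can be summarized with the following NumPy sketch of one attention-based decoding step. Here enc_h stands in for the annotation vectors produced during encoding and s_t for the DNC hidden state; all weight matrices and sizes are illustrative assumptions of ours, not the trained parameters.

# Sketch of one Bahdanau-attention decoding step (illustrative shapes only).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def bahdanau_context(enc_h, dec_state, Wa, Ua, va):
    """c_t = sum_i alpha_i * h_i with alpha = softmax(v^T tanh(W s + U h_i))."""
    scores = np.array([va @ np.tanh(Wa @ dec_state + Ua @ h) for h in enc_h])
    return softmax(scores) @ enc_h             # context vector c_t

def decode_step(y_prev_embed, dec_state, context, Wo):
    """P(y_t | y_<t, x) = softmax(W_o [s_t; c_t; y_{t-1}]), cf. Eqn. (10)."""
    logits = Wo @ np.concatenate([dec_state, context, y_prev_embed])
    return softmax(logits)

T, H, E, V = 12, 64, 32, 1000                  # enc. steps, state, embedding, vocab
enc_h = np.random.randn(T, H)                  # annotation vectors from the encoder
s_t = np.random.randn(H)                       # DNC hidden state at step t
c_t = bahdanau_context(enc_h, s_t,
                       np.random.randn(H, H), np.random.randn(H, H),
                       np.random.randn(H))
p_t = decode_step(np.random.randn(E), s_t, c_t, np.random.randn(V, 2 * H + E))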
A DNC uses an N × W dimensional memory matrix for storing state information, compared to a single cell state in an LSTM. The presence of an external memory allows the DNC to store representations of the input data in its memory matrix using write heads and then read those representations from the memory using read heads ... structure in those sequences. The content and location based attention mechanisms give the network more information about the input data during decoding. The ability to read and write ...

... is reported in Tables 2 and 3. The DNC model in the second layer was trained for 200k steps on a single GPU system with a ...

Figure 2: Sequence to sequence DNC, encoding phase.

The input consists of the to-be-normalized token between 3 context words to the left and right, with a distinctive tag marking the to-be-normalized word. This is then fed as a sequence of characters into the input embedding layer during the encoding stage. For the sentence The city is 15kms away from here, in order to normalize the token 15kms the input becomes

    The city is <norm> 15kms </norm> away from here

where <norm> and </norm> are tags that mark the beginning and end of the to-be-normalized token. The output is always a sequence of words. During the decoding phase, the output tokens are first fed into an output embedding layer before being fed to the decoder. For the above example, the output becomes

    fifteen kilometers
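The input/output convention just described can be made concrete with a small helper of our own; it wraps the target token in <norm> tags within a window of three context words on each side before the character-level embedding. The function name and window parameter are ours.

# Sketch of the character sequence fed to the sequence-to-sequence DNC for one
# to-be-normalized token (window size and tag names as described above).
def make_encoder_input(words, idx, window=3):
    left = words[max(0, idx - window):idx]
    right = words[idx + 1:idx + 1 + window]
    tagged = left + ["<norm>", words[idx], "</norm>"] + right
    return list(" ".join(tagged))              # sequence of characters

words = "The city is 15kms away from here".split()
chars = make_encoder_input(words, words.index("15kms"))
# expected decoder output for this example: ["fifteen", "kilometers"]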
... to look into the kind of errors it makes. A single metric such as accuracy or the BLEU (Bilingual Evaluation Understudy) score is not sufficient for comparison. It is not much of a problem if a token from the DATE class such as 2012 is normalized as two thousand twelve instead of twenty twelve. However, we certainly would not want it to be translated to something like twenty thirteen. These 'silly' errors are subjective by their very nature and thus rely on a human reader. This makes the analysis of these kinds of errors difficult but important. One has to take a look at all the cases where the model produces completely unacceptable predictions.

It can be seen that the class-wise accuracies of the model are quite similar to those of the base LSTM model. Upon analyzing the nature of the errors made in each class, it was identified that the ... For example, the token 1968 in a DATE context is predicted as if in a CARDINAL context. However, the DNC never makes a completely unacceptable prediction in these classes, for both the English and Russian data-sets, as can be observed in Table 6. For readers unfamiliar with the Russian language, we look at the issue with 22 июля. This error stems from confusion in grammatical cases, which do not exist in the English language and are replaced with prepositions. In the Russian language, however, prepositions are used along with grammatical cases, but may also be omitted in many situations. Now, the 22nd is transformed to двадцать второго (transliterated into Latin script as "dvadtsat vtorOGO"), whereas when used with the preposition of, as in of the 22nd, it is transformed to двадцать второе (transliterated into Latin script as "dvadtsat vtorOE"). The baseline LSTM based sequence-to-sequence architecture proposed in Experiment 2 of Sproat and Jaitly (2017) showed completely unacceptable errors, such as the DATE 11/10/2008 normalized as the tenth of october two thousand eight. On the other hand, the DNC network never makes these kinds of 'silly' errors in these classes. This suggests that the DNC network can, in fact, be used as a text normalization solution for these classes. This is an improvement over the baseline LSTM model in terms of the quality of prediction.

The DNC network, however, suffers in some classes: MEASURE, FRACTION, MONEY and CARDINAL, similar to the baseline LSTM network. The errors reported in these classes are shown in Table 7. All the unacceptable mistakes in cardinals occur in large numbers greater than a million. The DNC also sometimes struggles with getting the measurement units and denominations right. For non-Russian readers, we illustrate this with the MONEY token $1m, where the prediction is completely off; одиннадцать долларов сэ ш а means "eleven US dollars" but один миллион долларов сэ шэ а means "1 million US dollars". In terms of overall accuracy, ... predicting completely inaccurate digits and units, which is not enough to make it a trustworthy system for these classes.

In order to understand why the model performs so well in some classes but suffers in others, we proceeded to find the frequency of these specific tokens in the English training dataset. The training set has 17,712 instances of dates of the form xx/yy/zzzz. As reported in the earlier section, the model made zero unacceptable mistakes in these DATE tokens. The baseline LSTM, however, still reported unacceptable errors for dates of a similar form. On the other hand, measurement units such as mA, g/cm3 and ch occur less than ten times in the training set. Compared to other measurement units, kg and cm are present more than 200 times in the training set. CARDINAL has 273,111 tokens, out of which only 1,941 are numbers larger than a million. Besides, the error in MONEY for the English data-set was for a denomination that occurred only once in the training set. The results in Table 7 clearly demonstrate that the model suffers only on tokens for which a sufficient number of examples is not available in the training set. The DNC network never made any unacceptable prediction for examples ...
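The frequency analysis above amounts to counting, per semiotic class, how often particular measurement units and large cardinals appear among the 'before' tokens of the training files. The sketch below is ours; it expects (class, before, after) triples such as those produced by the reader sketched in Section 2.2, and the value/unit split is deliberately crude.

# Sketch of the frequency analysis behind Table 7.
from collections import Counter

def frequency_report(tokens):
    unit_counts, large_cardinals = Counter(), 0
    for cls, before, _after in tokens:
        if cls == "MEASURE":
            unit = before.lstrip("0123456789.,- ")   # crude value/unit split
            unit_counts[unit] += 1
        elif cls == "CARDINAL":
            digits = "".join(ch for ch in before if ch.isdigit())
            if digits and int(digits) > 1_000_000:
                large_cardinals += 1
    return unit_counts, large_cardinals

units, big = frequency_report([("MEASURE", "15kms", "fifteen kilometers"),
                               ("CARDINAL", "14356007", "fourteen million ...")])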
Table 1: Sequence-to-sequence DNC experimental settings. K_x: input vocabulary size, K_y: output vocabulary size, R: number of read heads.

Table 2: Classification report for the XGBoost Russian model.

                 Precision  Recall  F1 score
  RemainSelf     1.00       1.00    1.00
  ToBeNormalized 0.99       1.00    1.00

Table 3: Classification report for the XGBoost English model.

                 Precision  Recall  F1 score
  RemainSelf     1.00       1.00    1.00
  ToBeNormalized 0.94       0.99    0.96

To verify that the DNC's external memory actually works to perform text normalization, we conducted an ablation experiment to factor out its contribution, if any. We know that the DNC model consists of a controller network equipped with various memory access mechanisms to read from and write to a memory matrix. During the process of training, the controller network is intended to learn to use the provided memory access mechanisms instead of just relying on its internal LSTM state. This is important to make the most out of the benefits that come from memory augmentation. The DNC controller network at each time-step receives a set of R read vectors as input. These read vectors, or memory activations, are obtained by performing a read operation on the memory matrix. Our ablation experiment intends to verify the contribution of these memory activations to prediction. We use an existing pre-trained model and evaluate it under two conditions. For the first condition, the model is run without any modification to the DNC activations. For the second condition, we zero out the read vectors from the memory at each time-step before providing them as input to the controller network. If the DNC learns to use the provided memory access mechanisms during the training process (i.e., the prediction is largely dependent on the value of the read and write vectors), zeroing them out should cause a significant reduction in performance.

Upon analyzing the kind of errors that the DNC network makes when the memory structures are removed, it was found that the DNC network gets the translation context correct in most cases. For example, for the token 1984, the prediction of the DNC network without memory is nineteen thousand two hundred eighty one. The model starts the translation correctly but fails mid-way by predicting a completely incorrect digit. This is indicative of our prior assumption that the model learns to write the input tokens into its memory matrix during the encoding stage and later reads from the memory during the decoding stage. If the memory structures were not being used by the network for translation, we should have seen essentially no drop in performance on removing them. Apparently, the read vectors have high feature importance in performing a successful prediction during inference.
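The second ablation condition can be implemented by masking the read vectors before they re-enter the controller. The sketch below is ours and independent of the DeepMind code; the sizes are illustrative.

# Sketch of the ablation: at inference time the R read vectors returned from
# the memory are zeroed before being concatenated into the next controller
# input chi_t = [x_t; r_{t-1}^1; ...; r_{t-1}^R].
import numpy as np

def controller_input(x_t, read_vectors, ablate_memory=False):
    r = [np.zeros_like(r_i) if ablate_memory else r_i for r_i in read_vectors]
    return np.concatenate([x_t] + r)

x_t = np.random.randn(128)                       # embedded input character
reads = [np.random.randn(64) for _ in range(4)]  # R = 4 read vectors
chi_normal = controller_input(x_t, reads)
chi_ablated = controller_input(x_t, reads, ablate_memory=True)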
Table 4: Comparison of accuracies over the various semiotic classes of interest on the English data-set. base accuracy: accuracy of the LSTM based sequence-to-sequence model proposed in (Sproat and Jaitly, 2017); accuracy: accuracy of the proposed XGBoost + sequence-to-sequence DNC model.

     semiotic-class  base count  count  base accuracy  accuracy
  4  LETTERS         1404        1409   0.971          0.971
  5  CARDINAL        1067        1037   0.989          0.994
  6  VERBATIM        894         1001   0.980          0.994
  7  MEASURE         142         142    0.986          0.971
  8  ORDINAL         103         103    0.971          0.980
  9  DECIMAL         89          92     1.000          0.989
  10 DIGIT           37          44     0.865          0.795
  11 MONEY           36          37     0.972          0.973
  12 FRACTION        13          16     0.923          0.688
  13 TIME            8           8      0.750          0.750

Table 5: Comparison of accuracies over the various semiotic classes of interest on the Russian data-set. The headings are the same as in Table 4.
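Class-wise accuracies of the kind reported in Tables 4 and 5 reduce to a grouped exact-match comparison between predictions and references. The following pandas sketch is ours; the column names and the toy rows are illustrative, not the evaluation code used for the tables.

# Sketch of computing per-class accuracy from a frame of test predictions.
import pandas as pd

df = pd.DataFrame({
    "semiotic_class": ["DATE", "DATE", "MEASURE"],
    "truth":      ["nineteen sixty eight", "twenty twelve", "five kilometers"],
    "prediction": ["nineteen sixty eight", "twenty twelve", "five kilobytes"],
})
per_class = (df.assign(correct=df["truth"] == df["prediction"])
               .groupby("semiotic_class")["correct"].mean())
print(per_class)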
Table 6: Errors in which the DNC network is confused with the context of the token.

    input      semiotic-class  prediction                               truth
  0 2007       DIGIT           two thousand seven                       two o o seven
  1 1968       DATE            one thousand nine hundred sixty eight    nineteen sixty eight
  2 0:02:01    TIME            zero hours two minutes and one seconds   zero hours two minutes and one second
  3 22 июля    DATE            двадцать второго июля                    двадцать второе июля
  4 II         ORDINAL         два                                      второй

Table 7: Errors in which the DNC network makes completely unacceptable predictions.

    input             semiotic-class  prediction                                                        truth
  0 14356007          CARDINAL        one million four hundred thirty five thousand six hundred seven  fourteen million three hundred fifty six thousand seven
  1 0.001251 g/cm3    MEASURE         zero point o o one two five one sil g per hour                   zero point o o one two five one grams per c c
  2 88.5 million HRK  MONEY           eighty eight point five million yen                              ...
  3 10/618,543        FRACTION        ...                                                              ...
5.2. Results on up-sampled training set

The initial results lead us to a follow-up question: can our system perform better given better quality data? Will a simple up-sampling procedure on the rare kinds of tokens improve the model? To test our hypothesis that sufficient examples can improve the performance in certain semiotic classes, we up-sampled the distribution of those specific tokens which occurred less than a particular threshold frequency. The up-sampling was done by duplication, only for MEASURE and CARDINAL on the English dataset. Sentences containing measurement units which occurred less than 100 times were up-sampled to have 100 instances each in the training set. Out of 253 measurement units, 229 occurred less than 100 times in the entire training set. Similarly, sentences with cardinals of value larger than a million were up-sampled to 10,000 instances. The final distribution of the training set was 59,439 tokens for MEASURE and 299,694 for CARDINAL. The model was then retrained for the same number of training iterations with the up-sampled data. The overall accuracies and the number of unacceptable errors for MEASURE and CARDINAL after up-sampling are shown in Table 10. The comparison of the predictions is shown in Table 9. Overall, it is very interesting to see that using simple data augmentation techniques like up-sampling helped remove all the unacceptable errors in MEASURE and reduced the number of unacceptable errors from three to two in CARDINAL. Such an elementary technique even removed errors in rare measurement units such as ch and g/cc. However, the improvement observed in CARDINAL was rather modest. And, as expected, the number of unacceptable errors in the other classes was unaffected. This clearly provides evidence for the initial assumption that our system improves, even if marginally, when a sufficient number of examples is provided for any particular instance type. Nonetheless, we can safely say that this system looks promising and worthy of widespread adoption.
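The up-sampling itself is plain duplication of whole sentences containing rare tokens until a target count is reached. The sketch below is ours; the threshold mirrors the value described above, while the helper names and the unit_of callback are assumptions.

# Sketch of duplication-based up-sampling of rare MEASURE sentences: every
# sentence whose measurement unit is seen fewer than `threshold` times is
# replicated until that unit reaches roughly `threshold` occurrences.
import math, random
from collections import Counter

def upsample(sentences, unit_of, threshold=100):
    """sentences: list of token lists; unit_of: maps a sentence to its unit or None."""
    counts = Counter(u for s in sentences if (u := unit_of(s)) is not None)
    out = list(sentences)
    for s in sentences:
        u = unit_of(s)
        if u is not None and counts[u] < threshold:
            out.extend([s] * (math.ceil(threshold / counts[u]) - 1))
    random.shuffle(out)
    return out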
augmentation techniques like up-sampling helped remove all Apart from the domain of text normalization, we also provide
CE
the unacceptable errors in MEASURE and reduced the number evidence that a sequence-to-sequence architecture made with
of unacceptable errors from three to two in CARDINAL. Such DNC can be successfully trained for tasks similar to machine
AC
an elementary technique even removed errors in rare measure- translation systems. Until now DNC has only been used for
ment units such as ch and g/cc. However, the improvement solving simple algorithmic tasks and have not been applied to
observed in CARDINAL was rather modest. And as expected, real-time production environments. The quality of the results
the number of unacceptable errors in other classes were unaf- produced by DNC in text normalization demonstrates it is, in
fected. This clearly provides evidence to the initial assumption fact, a viable alternative to LSTM based models. LSTM based
that our system improves, even if marginally, when a sufficient architectures usually require large amounts of training data. The
number of examples are produced for any particular instance results in Sproat and Jaitly (2017) show that the LSTM based seq-
type. Nonetheless, we can safely say that this system looks to-seq models can sometimes produce a weird output even when
promising and worthy of widespread adoption. sufficient examples are present. For instance, LSTM’s did not
13
ACCEPTED MANUSCRIPT
Table 10: Accuracies and number of unacceptable errors before and after up-sampling for the English data-set. a1: accuracy before up-sampling, a2: accuracy after up-sampling, e1: number of unacceptable errors before up-sampling, e2: number of unacceptable errors after up-sampling.

    semiotic-class  a1     a2     e1  e2
  0 MEASURE         0.971  0.986  4   0
  1 CARDINAL        0.994  0.991  3   2

7. Conclusion

Therefore, we can safely arrive at the conclusion that memory augmented neural networks such as the DNC are in fact a promising alternative to LSTM based models for a language agnostic text normalization system. Additionally, the proposed system requires significantly less data, training time and compute resources. Our DNC model has reduced ... generalization compared to the stacked bidirectional LSTM used by Sproat and Jaitly, showing that memory augmented neural networks can provide much better results with significantly reduced training times and fewer data points. The LSTM model reported in their paper was trained on 8 parallel GPUs for about five and a half days (460k steps). In contrast, our model was trained on a single GPU system for two days (200k steps). Furthermore, our model used only 2.2% of the English data and 4.4% of the Russian data for training.

Acknowledgments

We would like to show our gratitude to Richard Sproat (Senior Research Scientist at Research & Machine Intelligence, Google, New York) for his insights and comments that greatly improved the manuscript. We thank Kaggle for hosting the Text Normalization Challenge by Richard Sproat and Kyle Gorman, which got us interested in this problem in the first place. We are also very grateful to Google DeepMind for open sourcing their implementation of the Differentiable Neural Computer, which was a requirement for this research.
References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., Zheng, X., 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. URL: https://www.tensorflow.org/. Software available from tensorflow.org.

Allen, J., Hunnicutt, S.M., Klatt, D., 1987. From Text to Speech: The MITalk System. Cambridge University Press.

Bahdanau, D., Cho, K., Bengio, Y., 2014. Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473. URL: http://arxiv.org/abs/1409.0473, arXiv:1409.0473.

Bengio, Y., Simard, P., Frasconi, P., 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5, 157-166. doi:10.1109/72.279181.

Breiman, L., 1998. Arcing classifier (with discussion and a rejoinder by the author). Ann. Statist. 26, 801-849. doi:10.1214/aos/1024691079.

Breiman, L., 1999. Prediction games and arcing algorithms. Neural Computation 11, 1493-1517.

Chen, T., Guestrin, C., 2016. XGBoost: A scalable tree boosting system. arXiv:1603.02754.

Freund, Y., Schapire, R.E., 1999. A short introduction to boosting.

Friedman, J., Hastie, T., Tibshirani, R., 2000. Additive logistic regression: A statistical view of boosting. The Annals of Statistics 28, 337-407. doi:10.1214/aos/1013203451.

Graves, A., Wayne, G., Danihelka, I., 2014. Neural Turing machines. URL: http://arxiv.org/abs/1410.5401, arXiv:1410.5401.

Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwinska, A., et al., 2016. Hybrid computing using a neural network with dynamic external memory. Nature 538, 471-476.

Hochreiter, S., Schmidhuber, J., 1997. Long Short-Term Memory. Neural Comput. 9, 1735-1780. doi:10.1162/neco.1997.9.8.1735.

Joulin, A., Mikolov, T., 2015. Inferring algorithmic patterns with Stack-Augmented recurrent nets. CoRR abs/1503.01007. URL: http://arxiv.org/abs/1503.01007, arXiv:1503.01007.

Luong, M.T., Pham, H., Manning, C.D., 2015. Effective approaches to attention-based neural machine translation. URL: http://arxiv.org/abs/1508.04025, arXiv:1508.04025.

Pusateri, E., Ambati, B.R., Brooks, E., Platek, O., McAllaster, D., Nagesha, V., 2017. A mostly data-driven approach to inverse text normalization, in: Proc. Interspeech 2017, pp. 2784-2788. doi:10.21437/Interspeech.2017-1274.

Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., Lillicrap, T., 2016. One-shot learning with memory-augmented neural networks. URL: http://arxiv.org/abs/1605.06065, arXiv:1605.06065.

Siegelmann, H.T., Sontag, E.D., 1995. On the computational power of neural nets. J. Comput. Syst. Sci. 50, 132-150. doi:10.1006/jcss.1995.1013.

Sproat, R., 1996. Multilingual text analysis for text-to-speech synthesis, in: Proceedings of the Fourth International Conference on Spoken Language Processing (ICSLP 96). doi:10.1109/ICSLP.1996.607867.

Sproat, R., Black, A.W., Chen, S.F., Kumar, S., Ostendorf, M., Richards, C., 2001. Normalization of non-standard words. Computer Speech & Language 15, 287-333.

Sproat, R., Jaitly, N., 2017. RNN approaches to text normalization: A challenge. arXiv:1611.00068.

Sutskever, I., Vinyals, O., Le, Q.V., 2014. Sequence to sequence learning with neural networks, in: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (Eds.), Advances in Neural Information Processing Systems 27. Curran Associates, Inc., pp. 3104-3112.

Weston, J., Bordes, A., Chopra, S., Rush, A.M., van Merriënboer, B., Joulin, A., Mikolov, T., 2015. Towards AI-complete question answering: A set of prerequisite toy tasks. arXiv:1502.05698.