Accepted Manuscript: Speech Communication

Accepted Manuscript
Text Normalization using Memory Augmented Neural Networks
Subhojeet Pramanik, Aman Hussain
PII: S0167-6393(18)30239-5
DOI: https://doi.org/10.1016/j.specom.2019.02.003
Reference: SPECOM 2628
To appear in: Speech Communication
Received date: 9 July 2018

Revised date: 28 December 2018
Accepted date: 27 February 2019
Please cite this article as: Subhojeet Pramanik, Aman Hussain, Text Normalization
using Memory Augmented Neural Networks, Speech Communication (2019), doi:
https://doi.org/10.1016/j.specom.2019.02.003
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service
to our customers we are providing this early version of the manuscript. The manuscript will undergo
copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please
note that during the production process errors may be discovered which could affect the content, and
all legal disclaimers that apply to the journal pertain.
ACCEPTED MANUSCRIPT
Text Normalization using Memory Augmented Neural Networks
Subhojeet Pramanika,∗, Aman Hussaina

a VIT University, Vandaloor-Kelambakkam Road, Chennai, Tamil Nadu, India
Abstract
We perform text normalization, i.e. the transformation of words from the written to the spoken form, using a memory augmented
T
neural network. With the addition of dynamic memory access and storage mechanism, we present a neural architecture that will serve
IP
as a language-agnostic text normalization system while avoiding the kind of unacceptable errors made by the LSTM-based recurrent
neural networks. By successfully reducing the frequency of such mistakes, we show that this novel architecture is indeed a better
CR
alternative. Our proposed system requires significantly lesser amounts of data, training time and compute resources. Additionally,
we perform data up-sampling, circumventing the data sparsity problem in some semiotic classes, to show that sufficient examples in
US
any particular class can improve the performance of our text normalization system. Although a few occurrences of these errors still
remain in certain semiotic classes, we demonstrate that memory augmented networks with meta-learning capabilities can open many
doors to a superior text normalization system.
AN
Keywords: text normalization, differentiable neural computer, deep learning
1. Introduction which require normalization are usually very sparse which might
M
result in high overall accuracy. Hence, these specific cases must

The field of natural language processing has seen significant
get the highest attention when evaluating the performance of any
improvements with the application of deep learning. However,
ED
text normalization system. Yet the existing models are prone

there are many unsolved challenges in NLP yet to be solved by
to making "silly" mistakes which are extremely non-trivial and
the prevalent deep neural networks. One of the simple but inter-
mission-critical to TTS (Text-to-Speech) and ASR (Automatic
PT
esting challenges lies in designing a flawless text normalization

Speech Recognition) systems. Finite-state filters, which perform
solution for Text to Speech and Automatic Speech Recognition
simple algorithmic steps on the normalized tokens, are then used
systems. Unlike many other problems being solved by neural
CE
to correct such errors and "guide" the model. Although develop-

networks, the tolerance for unacceptable or "silly" errors in text
ing such FST grammars is a lot simpler than constructing a fully
normalization systems is very low. Prevailing neural architec-
fledged finite-state text normalization system, they do require
AC
tures in Sproat and Jaitly (2017) produce near perfect overall

some human expertise and domain knowledge of the language
accuracy on such a problem. But there is a caveat. Whereas
involved.
deciding whether a word needs to be normalized or not turns out
to be an easier problem since it mostly falls back on classifying We, therefore, ask if there is a way to circumvent the require-
the semiotic class of the token, the actual challenge lies in gener- ment of human involvement and language expertise, and instead
ating the normalized form of the token. However, such instances design a system that is language agnostic and learns on its own
given enough data. Although the most commonly used neural
∗ Correspondingauthor
Email addresses: [email protected] (Subhojeet networks are adept at sequence learning and sensory processing,
Pramanik), [email protected] ( Aman Hussain ) they are very limited in their ability to represent data structures
Preprint submitted to Speech Communication February 28, 2019
ACCEPTED MANUSCRIPT
and learn algorithms on its own. There have been recent ad- in Text-to-Speech (TTS) systems to render the textual data as a
vancements in memory augmented neural network architectures standard representation that can be converted into the audio form.
such as the Differentiable Neural Computer, with dynamic mem- In Automatic Speech Recognition (ASR) systems, raw textual
ory access and storage capacity. Such architectures have shown data is processed into language models using text normalization
the ability to learn algorithmic tasks such as traversing a graph techniques.
or finding relations in a family tree. Normalization of semiotic For example, a native English speaker would read the fol-
classes of interest, particularly those containing numbers and lowing sentence Please forward this mail to 312, Park Street,
measurement units, can be performed using basic algorithmic Kolkata as Please forward this mail to three one two Park Street
T
steps. While neural network architectures such as LSTM work Kolkata. However, The new model is priced at $ 312 would be
sufficiently well in machine translation tasks, they have shown to read as The new model is priced at three hundred and twelve
IP
suffer in these semiotic classes which require basic step-by-step dollars. This clearly demonstrates that contextual information is
CR
transduction of the input tokens (Sproat and Jaitly, 2017). There of particular importance during the conversion of written text to
is hope that neural network architectures with memory augmen- spoken form.
tation will be able to learn the algorithmic steps (meta-learning), Further, the instances of actual conversion are few and far
similar to the Finite-state filters, but without any human inter-
vention or the external knowledge about the language and its
grammar. With such memory augmentation, the network should
US between swathes of words which do not undergo any transforma-
tion at all. On the English dataset used in this paper, 92.5% of
words remain the same. The inherent data sparsity of this prob-
AN
be able to learn to represent and reason about the sequence of lem makes training machine learning models especially difficult.
characters in the context of the text normalization task. The quality requirements of a TTS system is rather high given
We begin by defining the challenges of text normalization to the nature of the task at hand. A model will be highly penalized
M
try to understand the reason behind these "silly" mistakes. Then, for making "silly" errors such as when transforming measure-
after a brief overview of the prior work done on this topic, we ments from 100 KG to Hundred Kilobytes or when transforming
ED
describe the dataset released by Sproat and Jaitly (2017) which dates from 1/10/2017 to first of January twenty seventeen.
has been used in a way so as to allow for an objective compar-
ative analysis. Subsequently, we delve into the theoretical and 2.1. Prior Work
PT
implementation details of our proposed system. Thereafter, the

One of the earliest work done on the problem of text nor-
results of our experiments are discussed. Finally, we find that
malization was used in the MITalk TTS system (Allen et al.,
CE
memory augmented neural networks are indeed able to do a

1987). Further, a unified weighted finite-state transducers based
good job with far lesser amounts of data, time and resources. To
approach was proposed by Sproat (1996). The model serves
make our work reproducible, open to scrutiny and further devel-
AC
as the text-analysis module of the multilingual Bell Labs TTS

opment, we have open sourced a demonstration of our system
system. An advancement in this field was made by looking at
implemented using Tensorflow (Abadi et al., 2015) at https:
it as a language modeling problem by Sproat et al. (2001). For
//github.com/cognibit/Text-Normalization-Demo.
ASR systems requiring inverse text normalization, data-driven
approaches have been proposed by Pusateri et al. (2017). The
2. Text Normalization
latest development in this problem space has been by Sproat and
Text normalization is the canonicalization of text from one or Jaitly (2017). Even though a recurrent neural network model
more possible forms of representation to a ’standard’ or ’canon- trained on the corpus results in very high overall accuracies, it
ical’ form. This transformation is used as a preliminary step remains prone to making misleading predictions such as pre-
2
ACCEPTED MANUSCRIPT
dicting completely inaccurate dates or currencies. Such "silly" 3. Background: Memory Augmented Neural Networks
errors are unacceptable in a TTS system deployed in production.
Traditional deep neural networks are great at fuzzy pattern
However, a few of these errors were shown to be corrected by a
matching, however they do not generalize well on complex data
FST (Finite State Transducer) which employs a weak covering
structures such as graphs and trees, and also perform poorly in
grammar to filter and correct the misreadings.
learning representations over long sequences. To tackle sequen-
tial forms of data, Recurrent Neural Networks were proposed
2.2. Dataset which have been known to capture temporal patterns in an input
sequence and also known to be Turing complete if wired prop-
T
For the purposes of a comparative study and quantitative erly (Siegelmann and Sontag, 1995). However, traditional RNNs
IP
interpretation, we have used the exact same English and Rus- suffer from what is known as the vanishing gradients problem
sian dataset as used in Sproat and Jaitly (2017). The English (Bengio et al., 1994). A Long Short-Term Memory architecture
CR
dataset consists of 1.1 billion words extracted from Wikipedia was proposed in (Hochreiter and Schmidhuber, 1997) capable
and run through Google’s TTS system’s Kestrel text normaliza- of learning over long sequences by storing representations of
tion system to generate the target verbalizations. The dataset the input data as a cell state vector. LSTM can be trained on
is formatted into ’before’ or unprocessed tokens and ’after’ or
normalized tokens. Each token is labeled with its respective
1
US variable length input-output sequences by training two separate
LSTM’s called the encoder and decoder (Sutskever et al., 2014).
AN
semiotic class such as PUNCT for punctuations and PLAIN The Encoder LSTM is trained to map the input sequence to a
for ordinary words. The Russian dataset consists of 290 mil- fixed length vector and Decoder LSTM generates output vectors
lion words from Wikipedia and is formatted likewise. These from the fixed length vector. This kind of sequence to sequence
M
datasets are available at https://github.com/rwsproat/ learning approach have been known to outperform traditional
text-normalization-data. DNN models in machine translation and sequence classification
Each of these datasets are split into 100 files. The base paper tasks. Extra information can be provided to the decoder by using
ED
(Sproat and Jaitly (2017)) uses 90 of these files as the training set, attention mechanisms (Bahdanau et al., 2014) and (Luong et al.,
5 files for the validation set and 5 for the testing set. However, 2015), which allows the decoder to concentrate on the parts of
PT
our proposed system only uses the first two files of the English the input that seem relevant at a particular decoding step. Such
dataset (2.2%) and the first four files of the Russian dataset models are widely used and have helped to achieve state of the
(4.4%) for the training set. To keep the results consistent and art accuracy in machine translation systems.
CE
draw objective conclusions, we have defined the test set to be However, LSTM based sequence-to-sequence models are still
precisely the same as the one used by the base paper. Hence, not good at representing complex data structures or learning to
AC
the first 100,002 and 100,007 lines are extracted from the 100th perform algorithmic tasks. They also require a lot of training
file output-00099-of-00100 of the English and Russian dataset data to generalize well to long sequences. An interesting ap-
respectively. proach by Joulin and Mikolov presents a recurrent architecture
with a differentiable stack, able to perform algorithmic tasks
1 ALL = all cases; PLAIN = ordinary word (<self>); PUNCT = punctuation such as counting and memorization of input sequences. Similar
(sil); TRANS = transliteration; LETTERS = letter sequence; CARDINAL memory augmented neural network (Grefenstette et al., 2015)
= cardinal number; VERBATIM = verbatim reading of character sequence;
have also shown to benefit in natural language transduction prob-
ORDINAL = ordinal number; DECIMAL = decimal fraction; ELECTRONIC
= electronic address; DIGIT = digit sequence; MONEY = currency amount; lems by being able to learn the underlying generating algorithms
FRACTION = non-decimal fraction; TIME = time expression required for the transduction process. Further, a memory aug-
3
ACCEPTED MANUSCRIPT
mented neural network architecture called the Neural Turing where N is a non-linear function, θ contains the all trainable
Machine was introduced by Graves et al. that uses an external parameters in the controller network. The read vector r is used
memory matrix with read and write heads. A Controller net- to perform read operation at every time step. The read vector r
work that works like an RNN is able read and write information defines a weighted sum over all memory locations for a memory
from the memory. The read and write heads use content and matrix M by applying a read weighting w r ∈ ∆ N over memory
location-based attention mechanisms to focus the attention on M. ∆ N is the non negative orthant of R N with the unit simplex
specific parts of the memory. NTM has also shown promise as a boundary.
in meta-learning (Santoro et al., 2016) showing that memory
X
N
T
augmented networks are able to generalize well to even lesser r= M[i, ·]w r [i] (2)
training examples. i=1
IP
An improvement to this architecture was proposed by Graves where the ‘·’ denotes all j = 1, . . . , W . The interface vector ε t is
used to parameterize memory interactions for the next time step.
CR
et al. called the Differentiable Neural Computer having even
more memory access mechanisms and dynamic storage capa- A write operation is also performed at each time-step using a
bilities. DNC, when trained in a supervised manner, was able write weighting w w ∈ ∆ N which first erases unused information
to store representations of input data as “variables” and then

read those representations from the memory to answer synthetic
questions from the BaBI dataset (Weston et al., 2015). DNC
US from the memory using an erase vector e and writes relevant
information to the memory using the write vector vt . The overall
write operation can be formalized as in Eqn 3.
AN
was also able to solve algorithmic tasks such as traversing a
graph or inferring from a family tree, showing that it is able Mt = Mt−1 ◦ (E − wtw e> w >
t ) + w t vt (3)
to process structured data in a manner that is not possible in
M
where ◦ denotes element-wise multiplication and E is an N × W

traditional neural networks. Dynamic memory access allows
matrix of ones. The final output of the controller network yt ∈
DNC to process longer sentences and moreover, extra memory
ED
RY is obtained by multiplying the concatenation of the current

can be added anytime without retraining the whole network.
read vector r t and output vector vt with a RW × Y dimensional
3.1. Differentiable Neural Computer weight matrix Wr .

PT
A basic DNC architecture consists of a controller network,

yt = vt + Wr [r t1 ; . . . ; r tR ] (4)
which is usually a recurrent network coupled with an exter-
CE
nal memory matrix M ∈ R N ×W . At each timestep t, the The system uses a combination of different attention mecha-
controller network takes as input a controller input vector nisms to determine where to read and write at every time-step.
1 , . . . , r R ], where x ∈ R X is the
AC
χ t = [x t ; r t−1 t−1 t input vector The attention mechanisms are all parameterized by the inter-
1 R
for the time-step t and r t−1, . . . , r t−1 is a set of R read vectors face vector ε t . The write weighting w w , used to perform the
from the previous time step and outputs an output vector vt and write operation, is defined by a combination of content-based
interface vector ε t ∈ R(W ×R)+3W +5R+3 . The controller network addressing and dynamic memory allocation. The read weight-
is essentially a recurrent neural network such as the LSTM. The ing w r is defined by a combination of content-based addressing
recurrent operation of the controller network can be encapsulated and temporal memory linkage. The entire system is end-to-end
as in Eqn 1: differentiable and can be trained through backpropagation. For
the purpose of this research, the internal architecture of the Dif-
(vt , ε t ) = N([ χ1 ; . . . ; χ t ; θ]) (1) ferentiable Neural Computer remains the same as specified in
4
ACCEPTED MANUSCRIPT
the original paper (Graves et al., 2016). The open-source im- new learner from making the previous mistakes again. Simi-
plementation of the DNC architecture used here is available at larly, in gradient boosting, the "defects" are defined by the error
https://github.com/deepmind/dnc. gradients.
The model is initiated with a weak learner F (x i ), which is
3.2. Extreme Gradient Boosting a decision stump i.e. a shallow decision tree. The subsequent
Boosting is an ensemble machine learning technique which at- steps keep adding a new learner, h(x), which is trained to pre-
tempts to pool the expertise of several learning models to form a dict the error residual of the previous learner. Therefore, it aims
better learner. Adaptive Boosting, or more commonly known as to learn a sequence of models which continuously tries to cor-
T
"AdaBoost", was the first successful boosting algorithm invented rect the residuals of the earlier model. The sum of predictions
is increasingly accurate and the ensemble model increasingly
IP
by Freund and Schapire. (Breiman, 1998) and (Breiman, 1999)
went on to formulate the boosting algorithm of AdaBoost as a complex.
CR
kind of gradient descent with a special loss function. (Friedman To elucidate further, we consider a simple regression problem.
et al., 2000), (Friedman, 2001) further generalized AdaBoost to Initially a regression model F (x 1 ) is fitted to the original data
gradient boosting in order to handle a variety of loss functions. points: (x 1, y1 ), (x 2, y2 ), · · · , (x n, yn ). The error in the model
Gradient Boosting has proven to be a practical, powerful

and effective machine learning algorithm. Gradient boosting and
US prediction yi − F (x i ) is called the error residual. Now, we fit a
new regression model h(x) to data : (x 1, y1 − F (x 1 )), (x 2, y2 −
F (x 2 )), ..., (x n, yn − F (x n )) where F (x) is the earlier model and
AN
deep neural networks are the two learning models that are widely
recognized and used by the competitive machine learning com- h is the new model to be added to F (x) such that it corrects the
munity at Kaggle and elsewhere. Tree boosting as implemented error residuals yi − F (x i ).

M
by XGBoost (Chen and Guestrin, 2016) gives state-of-the-art

results on a wide range of problems from different domains. The F (x i ) := F (x i ) + h(x i ) (5)
most defining factor for the success of XGBoost is its scalability.
ED
It has been reported to run more than ten times faster than exist-
F (x i ) := F (x i ) + yi − F (x i ) (6)
ing solutions on a single node. The reason behind using gradient
PT
boosting for classification first and a normalization model later

δJ
is two-fold. First, we solve the data sparsity problem by having F (x i ) := F (x i ) − 1 (7)
δF (x i )
the deep neural network train only on the tokens which need
CE
This is quite similar to the method of gradient descent which

normalization. Second, we turn to the incredible scalability and
tries to minimize a function by moving in the opposite direction
efficiency of the XGBoost model to condense huge amounts
of the gradient:
AC
of data which makes it more amenable to quick training and

experimental iterations. δJ
θ i := θ i − ρ (8)
Now, we give a brief overview of the gradient boosting al- δθ i
gorithm and establish the intuition necessary to understand the where ρ is the learning rate. However from a more applied per-
proposed solution for text normalization. Essentially, gradient spective, the model of choice for implementing the weak learners
boosting adds and fits weak learners in a sequential manner to in the XGBoost library (Chen and Guestrin, 2016) are decision
rectify the defects of the existing weak learners. In adaptive tree ensembles which consist of a set of classification and regres-
boosting, these "defects" are defined by assigning higher penalty sion trees. Further implementation details are discussed in the
weights to the misclassified data points in order to restrain the following section.
5
ACCEPTED MANUSCRIPT
4. Proposed Architecture encoded as a vector comprised of the Unicode (UTF-8) values

of its individual characters. We limit the length of the vector
We propose a two-step architecture for text normalization.
at 30; all characters beyond this are left out. In case of shorter
For a given token wi which is to be normalized, the token wi
words, the remaining vector is filled or padded with 0. Contex-
and some context words wi−k to wi+k are first fed as characters
tual information is incorporated into the input feature vector by
into an XGBoost classification model, where k is the number of
prepending the preceding token and appending the succeeding
context words. The XGBoost classification model is trained to
token to the target input token. To demarcate the boundaries
predict whether a particular word is to be normalized or not.
between the consecutive tokens, we use the −1 integer value.
For those words which require normalization, a second model
T
Since the starting position of the target input token can vary,
is used. We propose a novel sequence-to-sequence architecture
this value serves as the ’start of token’ identifier. An example
IP
based on Differentiable Neural Computer. This second model
can better elucidate the process. For instance, the input vector
uses the Encoder Decoder architecture (Sutskever et al., 2014)
CR
for ’genus’ in the sentence : Brillantaisia is a genus of plant in
combined with Badhanau attention mechanism (Bahdanau et al.,
family Acanthaceae will be: [ -1, 97, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
2014). The model tries to maximize the conditional probability
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 103, 101, 110,
P(y | x) where y is the target sentence and x is a sequence of
characters formed by the concatenation of the to-be-normalized
token w and context words wi−k to wi+k surrounding the token.
US 117, 115, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, -1, 111, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1].
AN
The major intuition behind using two different models is that
After the data is preprocessed and ready, we perform a train-
the instances of actual conversion are few and far between. Train-
validation split to help us tune the model. The performance
ing a single deep neural network with such a heavily skewed
M
metric we have used is AUC or area under the curve. A top-down

dataset is incredibly difficult. Separating the task of predicting
approach to hyperparameter tuning is employed. We begin with
whether a word needs to be normalized or not and predicting
a high learning rate and determine the best number of estimators
ED
the normalized sequence of words helps us increase the overall

or trees which is the most important hyperparameter for the
accuracy of the normalization pipeline. Both models being inde-
model along with the learning rate. We find the best number of
pendent of each other are trained separately. The first model tries
estimators to be 361. Then we go on to tune the tree-specific
PT
to maximize the classification accuracy of predicting whether

parameters such as the maximum depth of the decision stumps,
a word requires normalization. The second model tries to min-
the minimum child weight and the gamma value. Once we
imize the softmax cross-entropy of the sequence of predicted
CE
have a decent model at hand, we start tuning the regularization

words averaged over all the time-steps.
parameters to get better or similar performance at a reduced
AC
4.1. XGBoost Classifier model complexity. Finally, we have an AUC score of 0.999875
An extreme gradient boosting machine (XGBoost) model is in the training set and a score of 0.998830 on the validation set.
trained to classify tokens into the following two classes: Re- The XGBoost package by (Chen and Guestrin, 2016) allows
mainSame and ToBeNormalized, to be used in the later stage of us to rank the relative importance of the features for the clas-
the pipeline. Tokenization of the training data has already been sification task by looking at the improvement in the accuracy
performed. Additional preprocessing of the tokens or words brought about by any particular feature. On generating the fea-
needs to be done before the XGBoost model can be trained on ture importance plot of the trained English XGBoost model in
it. Specifically, we transform the individual tokens into numer- Figure 1, we find out that the first six characters of the target
ical feature vectors to be fed into the model. Each token is token, i.e. feature at the 32nd, 33rd, 34th, 35th, 36th and 37th
6
ACCEPTED MANUSCRIPT
phase. One major contribution of this paper is to replace bidirec-

tional LSTM used in a Neural Machine Translation system with a
single unidirectional DNC. During the encoding phase, the DNC
reads an input sequence of vectors x = (x 1, . . . , x T x ); x t ∈ R K x ,
and outputs a sequence of annotation vectors h = (h1, . . . , hT x );
ht ∈ Rn , where K x is the input vocabulary size and Tx is the
number of input tokens.
T
ht = ge (x t , ht−1, s t ) (9)
IP
where function ge gives the output of the DNC network during
the encoding phase, and s t is the hidden state of the DNC. Dur-
Figure 1: Top 10 important features for the English XGBoost model
CR
ing the decoding phase the DNC is trained to generate an output
position of the feature vector, had the highest score. For a human word yt ∈ R K y , given a context vector ct ∈ Rn , where K y is
classifying these tokens, the first few characters of the word or the output vocabulary size. The decoding phase uses Bahdanau
token are indeed the best indicators to decide whether it needs
to be normalized or not. This assures us that the trained model
USattention mechanism (Bahdanau et al., 2014) to generate a con-
text vector ct by performing soft attention over the annotation
AN
has in fact learned the right features for the task at hand. After vectors h.
these first few features, the model also places high importance The Decoder defines a conditional probability P of an output
on other characters belonging to the preceding and succeeding word yt at time step t given sequence of input vectors x and
M
token. The high F1-score of the model as reported in Table 3 previous predictions y1, . . . , yt−1 .
confirms the overall effectiveness of the model.
The model could also be trained to classify the tokens into P(yt | y1, . . . , yt−1, x) = gd (yt−1, s t , ct ) (10)
ED
semiotic classes. The semiotic classes which most confuses the where gd gives the output of the DNC network during the de-
DNC translator could be fed to a separate sequence-to-sequence coding phase, and s t is the hidden state of the of the DNC at
PT
model which is exclusively trained on those error-prone classes. timestep t computed by

Another direction to go from here would be to increase the size
s t = f (s t−1, yt−1, ct ) (11)
of the context window during the data preprocessing stage to
CE
feed even more contextual information into the model. where f calculates the new state of the DNC network based on
the previous controller and memory states. During the decoding
4.2. Sequence to Sequence DNC
AC
phase, the output of the DNC is fed into a dense layer followed
The ToBeNormalized tokens, as classified by the XGBoost by a soft-max layer to generate word-by-word predictions. We
model, are then fed to a recurrent model. For this end, we present also used embedding layers to encode the input and output tokens
an architecture called sequence-to-sequence DNC that allows the into fixed dimensional vectors during the encoding and decoding
DNC model to be adapted for sequence-to-sequence translation phases.
purposes. Our underlying framework uses the RNN Encoder- A DNC uses a N × W dimensional memory matrix for storing
Decoder architecture (Sutskever et al., 2014). We have also used state information compared to a single cell state in an LSTM.
attention mechanisms to allow the decoder to concentrate on The presence of an external memory allows the DNC to store
the various different output states generated during the encoding representations of the input data in its memory matrix using write
7
ACCEPTED MANUSCRIPT
decoding stage was necessary for convergence. Probably, the

context vector provides the network with more information about
which locations to focus during each decoding step. The annota-
tions used for generating the context vector stores information
about the states of the DNC during the entire encoding phase.
For the purpose of text normalization, we feed the ToBeNor-
malized tokens in a manner specified in Experiment 2 of (Sproat
and Jaitly, 2017). The ToBeNormalized token is placed in be-
T
tween 3 context words to the left and right, with a distinctive
Figure 2: Sequence to sequence DNC, encoding phase.
tag marking the to-be-normalized word. This is then fed as a
IP
sequence of characters into the input embedding layer during the
CR
Encoding stage. For the sentence The city is 15kms away from
here, in order to normalize the token 15km the input becomes
city is <norm> 15km </norm> away from
US where <norm> and </norm> are tags that mark the beginning
and end of the to-be-normalized token. The output is always a
AN
sequence of words. During decoding phase output tokens are
first fed into an output embedding layer before feeding it to the
decoder. For the above example, the output becomes
M
fifteen kilometers
ED
Figure 3: Sequence to sequence DNC, decoding phase.

5. Experimental Results
heads and then read the representations from the memory using
PT
The initial XGBoost classification layer gives an F1-score

read heads. Dynamic memory allocation helps the network of 0.96 in for English and 1.00 for Russian. The classifiers’
to encode large input sequences while retaining the inherent performance in terms of precision and recall of the two classes
CE
structure in those sequences. The content and location based is reported in Table 2 and 3. The DNC model in the second
attention mechanisms give the network more information about layer was trained for 200k steps on a single GPU system with a
the input data during decoding. The ability to read and write
AC
batch-size of 16 until the perplexity reached 1.02. The parameter

from memory helps the network in meta-learning to store richer values used for training the second layer are given in Table 1 The
representations of the input data. We found out that network was overall results of entire system are reported in Table 4 & 5. It can
able to generalize faster compared to LSTM with a low number be seen that the overall performance of the model in terms of the
of training examples. This is probably due to the fact that DNC F1-score is quite good. The Russian model reported an accuracy
is able to store complex quasi-regular structure embedded in the of 99.3% whereas the English reported an accuracy of 99.4%.
input data sequences in its memory and then later, it is able to As mentioned before the instances of actual conversion are quite
infer from these representations. a few, which is the reason for the high overall accuracy. However,
We also found that feeding the context vector c during the when analyzing a text normalization system, it is more important
8
ACCEPTED MANUSCRIPT
to look into the kind of errors it makes. A single metric such as used as a text normalization solution for these classes. This is
accuracy or BLEU (Bilingual Evaluation Understudy) score is an improvement over the baseline LSTM model in terms of the
not sufficient for comparison. It is not much of a problem if a quality of prediction.
token from the DATE class: 2012 is normalized as two thousand The DNC network, however, suffers in some classes: MEA-
twelve, instead of twenty twelve. However, we certainly would SURE, FRACTION, MONEY and CARDINAL, similar to the
not want it to be translated to something like twenty thirteen. baseline LSTM network. The errors reported in these classes
These ’silly’ errors are subjective by their very nature and thus, are shown in Table 7. All the unacceptable mistakes in cardi-
rely on a human reader. This makes the analysis of these kinds nals occur in large numbers greater than a million. The DNC
T
of errors difficult but important. One has to take a look at all sometimes also struggles with getting the measurement units
the cases where the model produces completely unacceptable
IP
and denominations right. For non-Russian readers, we illus-
predictions. trate this with the MONEY token $ 1m where the prediction is
CR
It can be seen that the class-wise accuracies of the model completely off; since одиннадцать долларов сэ ш а means
are quite similar to the base LSTM model. Upon analyzing "eleven US dollars" but один миллион долларов сэ шэ а
the nature of error made in each class, it was identified that the means "1 million US dollars". In terms of overall accuracy,
model performs quite well in classes: DIGIT, DATE, ORDINAL

and TIME. The errors reported in these classes are shown in
US the DNC performs slightly better in MEASURE, MONEY and
CARDINAL. The model, however, performs much worse in
FRACTION compared to the baseline model. The DNC network
AN
Table 6. Most of the errors in these classes are due to the fact
that the DNC is confused with the true context of the token. still makes unacceptable ’silly’ mistakes in these classes such as
For example, the token 1968 in DATE context is predicted as predicting completely inaccurate digits and units, which is not
M
if in CARDINAL context. However, the DNC never makes a enough to make it a trustworthy system for these classes.
completely unacceptable prediction in these classes for both In order to understand why the model performs so well in
English and Russian data-sets as can be observed in Table 6. some classes but suffers in others, we proceeded to find the fre-
ED
For readers unfamiliar with the Russian language, we look at quency of these specific tokens in the English training dataset.
the issue with 22 июля. This error stems from confusion in The training set has 17,712 instances of dates of the form
PT
grammatical cases which do not exist in the English language x x/yy/zzzz. As reported in the earlier section, the model made
and are replaced with prepositions. However in the Russian lan- zero unacceptable mistakes in these DATE tokens. The baseline
guage prepositions are used along with grammatical cases, but LSTM, however, still reported unacceptable errors for dates of
CE
may also be omitted in many situations. Now, the 22nd is trans- the similar form. On the other hand, measurement units such
formed to двадцать второго (transliterated into Latin script as as m A, g/cm3 and ch occur less than ten times in the train-
AC
"dvadtsat vtorOGO") whereas when used with the preposition ing set. Compared to other measurement units, kg and cm are
of, such as of the 22nd, it is transformed to двадцать второе present more than 200 times in the training set. CARDINAL has
(transliterated into Latin script as "dvadtsat vtorOE"). The base- 273, 111 tokens out of which only 1,941 are numbers which are
line LSTM based sequence-to-sequence architecture proposed larger than a million. Besides, the error in MONEY for the En-
in Experiment 2 of Sproat and Jaitly (2017) showed completely glish data-set was for the denomination that occurred only once
unacceptable errors such as a DATE 11/10/2008 normalized as in the training set. The results in Table 7 clearly demonstrate
the tenth of october two thousand eight. On the other hand the that the model suffers only in the tokens for which a sufficient
DNC network never makes these kind of ’silly’ errors in these number of examples are not available in the training set. The
classes. This suggests that the DNC network can, in fact, be DNC network never made any unacceptable prediction for ex-
9
ACCEPTED MANUSCRIPT
Table 1: sequence-to-sequence DNC experimental settings. K x : input vocabulary size, K y : output vocabulary size, R: number of read heads
Kx Ky Memory size (N × W ) R Controller hidden units Embedding size

English 168 1781 256 × 64 5 1024 32
Russian 222 2578 256 × 64 5 1024 32
in the previous section and perform inference on the test data on

Table 2: Classification Report for XGBoost Russian model
two conditions, with memory activations and without memory
Precision Recall F1 score
activations. The first condition is just inference on the test data
T
RemainSelf 1.00 1.00 1.00
without any modification to the DNC activations. For the second
ToBeNormalized 0.99 1.00 1.00
IP
condition, we zero out the read vectors from the memory at each
time-step before providing it as input to the controller network.
CR
Table 3: Classification Report for XGBoost English model
If the DNC learns to use the provided memory access mecha-
Precision Recall F1 score
nisms during the training process (i,e, the prediction is largely
RemainSelf 1.00 1.00 1.00
dependent on the value of the read and write vectors), zeroing
ToBeNormalized 0.94 0.99 0.96
amples that were sufficiently present. This hints at the notion

US out the read vectors during inference should have a significant
impact in the performance of the model. However, if the model
AN
is just learning to predict based on its internal LSTM controller
that the model performs poorly particularly for units, cardinals
state, zeroing out the read vectors during inference should have
and denominations which occur a lesser number of times in the
a minute impact in performance. As can be seen from the com-
training set. Unlike the baseline LSTM, the model is reasonable
M
parison in Table 8, there is a significant drop in accuracy for

and durable to examples which are sufficiently present.
most semiotic classes when memory structures are inaccessible
ED
to the controller network. Particularly, semiotic classes which

5.1. Ablation Study
require structured processing of the input tokens such as, DATE,
To make the case for memory augmentation in neural net- TIME, DIGIT, MEASURE, MONEY, and TELEPHONE, show
PT
works to perform text normalization, we conducted an ablation a significant reduction in performance. Upon analyzing the kind
experiment to factor out its contribution if any. We know that of errors that the DNC network makes when memory structures
the DNC model consists of a controller network, equipped with are removed, it was found that the DNC network gets the trans-
CE
various memory access mechanisms to read and write from a lation context correct in most cases. For example, for the token
memory matrix. During the process of training, the controller 1984, the prediction of the DNC network without memory is
AC
network is intended to learn to use the provided memory ac- nineteen thousand two hundred eighty one. The model starts
cess mechanisms instead of just relying on its internal LSTM the translation correctly, but however, it fails mid-way by pre-
state. This is important to make the most out of the benefits dicting a completely incorrect digit. This is indicative of our
that come from memory augmentation. The DNC controller prior assumption that the model learns to write the input tokens
network at each time-step receives a set of R read vectors as in its memory matrix during the encoding stage and later reads
input. These read vectors or memory activations are obtained by from the memory during the decoding stage. If the memory
performing a read operation on the memory matrix. Our ablation structures would not have been used by the network for transla-
experiment intends to verify the contributions of these memory tion, we should have seen, essentially, no drop in performance
activations for prediction. We use an existing model pre-trained
10
ACCEPTED MANUSCRIPT
Table 4: Comparison of accuracies over the various semiotic classes of interest on the English data-set. base accuracy: accuracy of the LSTM based sequence-to-
sequence model proposed in (Sproat and Jaitly, 2017). accuracy: accuracy of the proposed, XGBoost + sequence-to-sequence DNC model.
semiotic-class base count count base accuracy accuracy
0 ALL 92416 92451 0.997 0.994

1 PLAIN 68029 67894 0.998 0.994
2 PUNCT 17726 17746 1.000 0.999
3 DATE 2808 2832 0.999 0.997
T
4 LETTERS 1404 1409 0.971 0.971
IP
5 CARDINAL 1067 1037 0.989 0.994
6 VERBATIM 894 1001 0.980 0.994
CR
7 MEASURE 142 142 0.986 0.971
8 ORDINAL 103 103 0.971 0.980
9 DECIMAL 89 92 1.000 0.989
10
11
DIGIT
MONEY
37
36
US 44
37
0.865
0.972
0.795
0.973
AN
12 FRACTION 13 16 0.923 0.688
13 TIME 8 8 0.750 0.750
M
Table 5: Comparison of accuracies over the various semiotic classes of interest on the Russian data-set. The headings are same as in Table 4.
ED
semiotic-class base count count base accuracy accuracy
0 ALL 93184 93196 0.993 0.993

1 PLAIN 60747 64764 0.999 0.995
PT
2 PUNCT 20263 20264 1.000 0.999

3 DATE 1495 1495 0.976 0.973
CE
4 LETTERS 1839 1840 0.991 0.991

5 CARDINAL 2387 2388 0.940 0.942
6 VERBATIM 1298 1344 1.000 0.999
AC
7 MEASURE 409 411 0.883 0.898

8 ORDINAL 427 427 0.956 0.946
9 DECIMAL 60 60 0.867 0.900
10 DIGIT 16 16 1.000 1.000
11 MONEY 19 19 0.842 0.894
12 FRACTION 23 23 0.826 0.609
13 TIME 8 8 0.750 0.750
11
ACCEPTED MANUSCRIPT
Table 6: Errors in which the DNC network is confused with the context of the token.
input semiotic-class prediction truth
0 2007 DIGIT two thousand seven two o o seven
1 1968 DATE one thousand nine hundred sixty eight nineteen sixty eight
2 0:02:01 TIME zero hours two minutes and one seconds zero hours two minutes and one second
3 22 июля DATE двадцать второго июля двадцать второе июля
4 II ORDINAL два второй
T
Table 7: Errors in which the DNC network makes completely unacceptable predictions.
input semiotic-class prediction truth
IP
0 14356007 CARNINAL one million four hundred thirty five fourteen million three hundred fifty six
thousand six hundred seven thousand seven
CR
1 0.001251 g/cm3 MEASURE zero point o o one two five one sil g per zero point o o one two five one grams
hour per c c
2
3
88.5 million HRK
10/618,543
MONEY
FRACTION
US
eighty eight point five million yen
ten sixteenth sixty one thousand five

eighty eight point five million croatian
kunas
ten six hundred eighteen thousand five
AN
hundred forty three hundred forty thirds
4 15 м/с MEASURE пятнадцать сантиметров в секунду пятнадцати метров в секунду
5 $1m MONEY одиннадцать долларов сэ ш а один миллион долларов сэ ш а
M
Table 8: Class-wise accuracy comparison with and without DNC memory

ED
semiotic-class accuracy without memory accuracy with memory
0 ALL 0.940704 0.994181

1 CARDINAL 0.345227 0.991321
PT
2 DATE 0.186794 0.996469

3 DECIMAL 0.021739 1.000000
CE
4 DIGIT 0.068182 0.818182

5 FRACTION 0.000000 0.687500
6 LETTERS 0.101490 0.971611
AC
7 MEASURE 0.021127 0.985915

8 MONEY 0.054054 0.972973
9 ORDINAL 0.077670 0.980583
10 PLAIN 0.992444 0.993858
11 PUNCT 0.998873 0.998873
12 TIME 0.125000 0.750000
13 VERBATIM 0.812188 0.995005
12
ACCEPTED MANUSCRIPT
on removing them. Apparently, the read vectors have high fea- 6. Discussion
ture importance in performing a successful prediction during
Given the reduced number of unacceptable predictions in
inference.
most semiotic classes we can say that the quality of the pre-
dictions produced by DNC are much better than the baseline
5.2. Results on up-sampled training set
LSTM model. The reason why DNC works better than LSTM
The initial results lead us to a follow-up question. Can our might be due to the presence of a memory matrix and read-write
system perform better given better quality data? Will a simple heads. The read-write heads allows the DNC to store richer
up-sampling procedure on the rare kinds of tokens improve the representations of the data in its dynamic memory matrix. Re-
T
model? To test our hypothesis of whether sufficient examples search done by Santoro et al., shows memory augmented neural
IP
can improve the performance in certain semiotic classes, we up- networks have the ability to generalize well to less number of
sampled the distribution of those specific tokens which occurred training examples. The model never made any unacceptable pre-
CR
less than a particular threshold frequency. The up-sampling was diction in some classes (DIGIT, DATE, ORDINAL and TIME)
done with duplication only for MEASURE and CARDINAL on for the same test set used by Sproat and Jaitly. With a basic
the English dataset. Sentences which had measurement units augmentation technique and minimal human requirement the
which occurred less than 100 times were up-sampled to have
100 instances each in the training set. Out of 253 measurement
US number of unacceptable errors in MEASURE was reduced to
zero. The LSTM model reported in (Sproat and Jaitly, 2017)
AN
units, 229 of them occurred less than a 100 times in the en- reported unacceptable errors even when sufficient examples are
tire training set. Similarly, sentences with Cardinals with value present. On the other hand, DNC is quite resistant to errors
larger than a million were up-sampled to 10,000 instances. The when sufficient examples are present. However, the DNC is
M
final distribution of the training set was 59,439 for MEASURE still prone to making unacceptable predictions in some classes
and 299,694 for CARDINAL. The model was then retrained (FRACTION, MONEY and CARDINAL) which makes it still
for the same number of training iterations with the up-sampled risky as a standalone text normalization system. There is still
ED
data. The overall accuracies and number of unacceptable errors a lot of work to be done before a purely deep learning based
for MEASURE & CARDINAL after up-sampling are shown in algorithm can be used a standalone component of a TTS system.
PT
Table 10. The comparison of the predictions is shown in Table We believe the performance the model can be further improved
9. Overall, it is very interesting to see that using simple data by designing a more balanced training set.
augmentation techniques like up-sampling helped remove all Apart from the domain of text normalization, we also provide
CE
the unacceptable errors in MEASURE and reduced the number evidence that a sequence-to-sequence architecture made with
of unacceptable errors from three to two in CARDINAL. Such DNC can be successfully trained for tasks similar to machine
AC
an elementary technique even removed errors in rare measure- translation systems. Until now DNC has only been used for
ment units such as ch and g/cc. However, the improvement solving simple algorithmic tasks and have not been applied to
observed in CARDINAL was rather modest. And as expected, real-time production environments. The quality of the results
the number of unacceptable errors in other classes were unaf- produced by DNC in text normalization demonstrates it is, in
fected. This clearly provides evidence to the initial assumption fact, a viable alternative to LSTM based models. LSTM based
that our system improves, even if marginally, when a sufficient architectures usually require large amounts of training data. The
number of examples are produced for any particular instance results in Sproat and Jaitly (2017) show that the LSTM based seq-
type. Nonetheless, we can safely say that this system looks to-seq models can sometimes produce a weird output even when
promising and worthy of widespread adoption. sufficient examples are present. For instance, LSTM’s did not
13
ACCEPTED MANUSCRIPT
Table 9: Predictions corrected after up-sampling for the English data-set.

token semiotic-class prediction before up-sampling prediction after up-sampling
0 295 ch MEASURE two hundred ninety five hours two hundred ninety five chains
1 2 mA MEASURE two a m two milli amperes
2 0.001251g/cm3 MEASURE zero point o o one two five one sil g per zero point o o one two five one sil g per c
hour c
3 1/2 cc MEASURE one _letter d_letter a_letter s_letter one half c c
h_letter _letter t_letter w_letter o_letter
T
_letter o_letter
IP
4 14356007 CARDINAL one million four hundred thirty five fourteen million three hundred fifty six
thousand six hundred seven thousand seven
CR
7. Conclusion
Table 10: Accuracies and no. of unacceptable errors before and after up-sampling
for the English data-set. a1: accuracy before up-sampling, a2: accuracy after
up-sampling, e1: no. of unacceptable errors before up-sampling, e2: no. of
unacceptable errors after up-sampling.
semiotic-class a1 a2 e1 e2
US Therefore, we can safely arrive at the conclusion that mem-
ory augmented neural networks such as the DNC are in fact a
promising alternative to LSTM based models for a language
AN
agnostic text normalization system. Additionally, the proposed
0 MEASURE 0.971 0.986 4 0
system requires significantly lesser amounts of data, training
1 CARDINAL 0.994 0.991 3 2
duration and compute resources. Our DNC model has reduced
M
the number of unacceptable errors to zero for some classes with

basic up-sampling of rare data points. However, there are still
ED
classes where the performance needs to be improved before

work well in predicting DATE, TIME and DIGIT, even though
an exclusively deep learning based model can become the text
the training set had a lot of examples from the category. DNC,
normalization component of a TTS system. Besides, we have
PT
on the other hand, is able to generalize well and avoids these

also demonstrated a system that can be used to train sequence-
kinds of errors. We believe that the DNC architecture should
to-sequence models using a DNC cell as the recurrent unit.
give good results in designing NMT models for languages which
CE
do not have a lot of training data available. It is also important

Acknowledgements
to note that a single unidirectional DNC provides much better
AC
generalization compared to the stacked bidirectional LSTM used We would like to show our gratitude to Richard Sproat (Se-
by Sproat and Jaitly, proving that memory augmented neural net- nior Research Scientist at Research & Machine Intelligence,
works can provide much better results with significantly reduced Google, New York) for his insights and comments that greatly
training times and fewer data points. The LSTM model reported improved the manuscript. We thank Kaggle for hosting the Text
in their paper was trained on 8 parallel GPUs for about five and Normalization Challenge by Richard Sproat and Kyle Gorman
half days (460k steps). On the contrary, our model was trained which got us interested in this problem in the first place. We are
on a single GPU system for two days (200k steps). Furthermore, also very grateful to Google DeepMind for open sourcing their
our model used only 2.2% of the English data and 4.4% of the implementation of the Differentiable Neural Computer which
Russian data for training. was a requirement for this research.
14
ACCEPTED MANUSCRIPT
References to transduce with unbounded memory. CoRR abs/1506.02516. URL: http:

//arxiv.org/abs/1506.02516, arXiv:1506.02516.
References Hochreiter, S., Schmidhuber, J., 1997. Long Short-Term memory. Neural Com-
put. 9, 1735–1780. URL: http://dx.doi.org/10.1162/neco.1997.9.
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, 8.1735, doi:10.1162/neco.1997.9.8.1735.
G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, Joulin, A., Mikolov, T., 2015. Inferring algorithmic patterns with Stack-
A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Augmented recurrent nets. CoRR abs/1503.01007. URL: http://arxiv.
Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., org/abs/1503.01007, arXiv:1503.01007.
Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Luong, M.T., Pham, H., Manning, C.D., 2015. Effective approaches to attention-
Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, based neural machine translation. URL: http://arxiv.org/abs/1508.
T
M., Wicke, M., Yu, Y., Zheng, X., 2015. TensorFlow: Large-scale machine 04025, arXiv:1508.04025.
learning on heterogeneous systems. URL: https://www.tensorflow. Pusateri, E., Ambati, B.R., Brooks, E., Platek, O., McAllaster, D., Nagesha, V.,
IP
org/. software available from tensorflow.org. 2017. A mostly Data-Driven approach to inverse text normalization, in: Proc.
Allen, J., Hunnicutt, S.M., Klatt, D., 1987. From Text to Speech: The MITalk Interspeech 2017, pp. 2784–2788. URL: http://dx.doi.org/10.21437/
CR
System. Cambridge University Press. Interspeech.2017-1274, doi:10.21437/Interspeech.2017-1274.
Bahdanau, D., Cho, K., Bengio, Y., 2014. Neural machine translation by Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., Lillicrap, T., 2016.
jointly learning to align and translate. CoRR abs/1409.0473. URL: http: One-shot learning with Memory-Augmented neural networks. URL: http:
Bengio, Y., Simard, P., Frasconi, P., 1994. Learning long-term dependencies
with gradient descent is difficult. IEEE Transactions on Neural Networks
5, 157–166. URL: http://dx.doi.org/10.1109/72.279181, doi:10.
US Siegelmann, H.T., Sontag, E.D., 1995. On the computational power of neural
nets. J. Comput. Syst. Sci. 50, 132–150. URL: http://dx.doi.org/10.
1006/jcss.1995.1013, doi:10.1006/jcss.1995.1013.
AN
1109/72.279181. Sproat, R., 1996. Multilingual text analysis for text-to-speech synthesis, in:
Breiman, L., 1998. Arcing classifier (with discussion and a rejoinder by the Spoken Language, 1996. ICSLP 96. Proceedings., Fourth International Con-
author). Ann. Statist. 26, 801–849. URL: https://doi.org/10.1214/ ference on. URL: http://dx.doi.org/10.1109/ICSLP.1996.607867,
doi:10.1109/ICSLP.1996.607867.
M
aos/1024691079, doi:10.1214/aos/1024691079.
Breiman, L., 1999. Prediction games and arcing algorithms. Neural Sproat, R., Black, A.W., Chen, S.F., Kumar, S., Ostendorf, M., Richards, C.,
Computation 11, 1493–1517. URL: http://dx.doi.org/10.1162/ 2001. Normalization of non-standard words. Computer Speech & Language
ED
089976699300016106, doi:10.1162/089976699300016106. 15, 287–333.
Chen, T., Guestrin, C., 2016. Xgboost: A scalable tree boosting system. Sproat, R., Jaitly, N., 2017. RNN approaches to text normalization: A challenge.
CoRR abs/1603.02754. URL: http://arxiv.org/abs/1603.02754, URL: http://arxiv.org/abs/1611.00068, arXiv:1611.00068.
arXiv:1603.02754. Sutskever, I., Vinyals, O., Le, Q.V., 2014. Sequence to sequence
PT
Freund, Y., Schapire, R.E., 1999. A short introduction to boosting. learning with neural networks, in: Ghahramani, Z., Welling, M.,
Friedman, J., Hastie, T., Tibshirani, R., 2000. Special invited paper. additive Cortes, C., Lawrence, N.D., Weinberger, K.Q. (Eds.), Advances
logistic regression: A statistical view of boosting. The Annals of Statistics in Neural Information Processing Systems 27. Curran Associates,
CE
28, 337–374. URL: http://www.jstor.org/stable/2674028. Inc., pp. 3104–3112. URL: http://papers.nips.cc/paper/
Friedman, J.H., 2001. Greedy function approximation: A gradient boosting 5346-sequence-to-sequence-learning-with-neural-networks.
machine. Ann. Statist. 29, 1189–1232. URL: https://doi.org/10.1214/ pdf.

AC
aos/1013203451, doi:10.1214/aos/1013203451. Weston, J., Bordes, A., Chopra, S., Rush, A.M., van Merriënboer, B., Joulin,
Graves, A., Wayne, G., Danihelka, I., 2014. Neural turing machines. URL: A., Mikolov, T., 2015. Towards AI-complete question answering: A set
http://arxiv.org/abs/1410.5401, arXiv:1410.5401. of prerequisite toy tasks. URL: http://arxiv.org/abs/1502.05698,
Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska- arXiv:1502.05698.
Barwińska, A., Colmenarejo, S.G., Grefenstette, E., Ramalho, T., Agapiou,

J., Badia, A.P., Hermann, K.M., Zwols, Y., Ostrovski, G., Cain, A., King,
H., Summerfield, C., Blunsom, P., Kavukcuoglu, K., Hassabis, D., 2016.
Hybrid computing using a neural network with dynamic external memory.
Nature 538, 471+. URL: http://dx.doi.org/10.1038/nature20101,
doi:10.1038/nature20101.
Grefenstette, E., Hermann, K.M., Suleyman, M., Blunsom, P., 2015. Learning
15

Accepted Manuscript: Speech Communication

Uploaded by

Copyright:

Available Formats

Accepted Manuscript: Speech Communication

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Accepted Manuscript: Speech Communication

Uploaded by

Copyright:

Available Formats

Accepted Manuscript

Text Normalization using Memory Augmented Neural Networks

Subhojeet Pramanik, Aman Hussain

To appear in: Speech Communication

Received date: 9 July 2018

Text Normalization using Memory Augmented Neural Networks

Subhojeet Pramanika,∗, Aman Hussaina

result in high overall accuracy. Hence, these specific cases must

text normalization system. Yet the existing models are prone

esting challenges lies in designing a flawless text normalization

to correct such errors and "guide" the model. Although develop-

tures in Sproat and Jaitly (2017) produce near perfect overall

implementation details of our proposed system. Thereafter, the

memory augmented neural networks are indeed able to do a

as the text-analysis module of the multilingual Bell Labs TTS

to store representations of input data as “variables” and then

where ◦ denotes element-wise multiplication and E is an N × W

RY is obtained by multiplying the concatenation of the current

3.1. Differentiable Neural Computer weight matrix Wr .

A basic DNC architecture consists of a controller network,

Gradient Boosting has proven to be a practical, powerful

munity at Kaggle and elsewhere. Tree boosting as implemented error residuals yi − F (x i ).

by XGBoost (Chen and Guestrin, 2016) gives state-of-the-art

boosting for classification first and a normalization model later

This is quite similar to the method of gradient descent which

of data which makes it more amenable to quick training and

4. Proposed Architecture encoded as a vector comprised of the Unicode (UTF-8) values

metric we have used is AUC or area under the curve. A top-down

the normalized sequence of words helps us increase the overall

to maximize the classification accuracy of predicting whether

have a decent model at hand, we start tuning the regularization

phase. One major contribution of this paper is to replace bidirec-

model which is exclusively trained on those error-prone classes. timestep t computed by

decoding stage was necessary for convergence. Probably, the

city is <norm> 15km </norm> away from

Figure 3: Sequence to sequence DNC, decoding phase.

The initial XGBoost classification layer gives an F1-score

batch-size of 16 until the perplexity reached 1.02. The parameter

model performs quite well in classes: DIGIT, DATE, ORDINAL

Kx Ky Memory size (N × W ) R Controller hidden units Embedding size

in the previous section and perform inference on the test data on

amples that were sufficiently present. This hints at the notion

parison in Table 8, there is a significant drop in accuracy for

to the controller network. Particularly, semiotic classes which

semiotic-class base count count base accuracy accuracy

0 ALL 92416 92451 0.997 0.994

semiotic-class base count count base accuracy accuracy

0 ALL 93184 93196 0.993 0.993

2 PUNCT 20263 20264 1.000 0.999

4 LETTERS 1839 1840 0.991 0.991

7 MEASURE 409 411 0.883 0.898

ten sixteenth sixty one thousand five

Table 8: Class-wise accuracy comparison with and without DNC memory

semiotic-class accuracy without memory accuracy with memory

0 ALL 0.940704 0.994181

2 DATE 0.186794 0.996469

4 DIGIT 0.068182 0.818182

7 MEASURE 0.021127 0.985915

Table 9: Predictions corrected after up-sampling for the English data-set.