CTCModel: a Keras Model for Connectionist Temporal Classification
Yann Soullard, Cyprien Ruffino, Thierry Paquet
Abstract
We report an extension of a Keras Model, called CTCModel, to perform the Connectionist Temporal Classification (CTC) in a transparent way. Combined with Recurrent Neural Networks, the Connectionist Temporal Classification is the reference method for dealing with unsegmented input sequences, i.e. with data consisting of couples of observation and label sequences where each label relates to a subset of observation frames. CTCModel makes use of the CTC implementation of the Tensorflow backend for training and predicting sequences of labels with Keras. It consists of three branches made of Keras models: one for training, which computes the CTC loss function; one for predicting, which provides sequences of labels; and one for evaluating, which returns standard metrics for analyzing sequences of predictions.
1 Introduction
Recurrent Neural Networks (RNN) are commonly used for dealing with sequential data. In recent years, many developments have been proposed to overcome some of their limitations, which has allowed RNNs to reach remarkable performance. For instance, Long Short-Term Memory (LSTM) [7] and Gated Recurrent Units (GRU) [2] mitigate the vanishing (and exploding) gradient problem encountered with Back-Propagation-Through-Time [11]. They also provide a solution for modeling long-range dependencies. Bidirectional systems [4, 9] take both the left and right context into account by introducing a forward and a backward pass. Multi-Dimensional Recurrent Neural Networks [5] apply recurrent connections along each dimension, which gives access to wider contextual information and makes it possible to deal with input images of variable size (both in pixel width and height). In addition, convolutional layers are commonly used to extract features (an encoder part) that are given as input to a recurrent network [12] (a decoder part). More recently, attention models have been proposed with success [1, 8, 10].
In many applications on sequential data, such as speech and handwriting recognition or activity recognition in videos, a data sample is an observation sequence x = (x1, ..., xT) of any length T to which one wants to associate a sequence of labels y of length L, with L ≤ T. During training, the system is trained on examples that are couples (x, y) of an input observation sequence and an output label sequence. The training dataset is incomplete in the sense that the label related to each observation xt, for 1 ≤ t ≤ T, is not known. In other words, the observation sequences are unsegmented, as the subset of observation frames related to each label is unknown. For training a recurrent neural network on such data, Alex Graves et al. introduced the Connectionist Temporal Classification (CTC) [3]. The CTC approach relies on dynamic programming, namely a Forward-Backward algorithm, to compute a specific loss function based on the probabilities of all the paths that can produce the output label sequence.
In the Keras functional API, one can define, train and use a neural network using the class Model. The loss functions that can be used in a Model have only two arguments, the ground truth y_true and the prediction y_pred given in output of the neural network. At present, there is no CTC loss proposed in a Keras Model and, to our knowledge, Keras does not currently support loss functions with extra parameters, which a CTC loss requires in the form of sequence lengths for batch training. Indeed, as observation and label sequences are of variable length, one has to provide the length of each sequence (observation and label), so that padding is not taken into account in the loss computation.
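To make this constraint concrete, the sketch below contrasts the two-argument signature of a standard Keras loss with the CTC cost exposed by the Keras backend (K.ctc_batch_cost), which additionally needs the observation and label lengths of every sequence in the batch; the function names here are illustrative.

from keras import backend as K

# A standard Keras loss only receives the ground truth and the prediction.
def standard_loss(y_true, y_pred):
    return K.mean(K.square(y_true - y_pred))

# The CTC cost additionally needs the length of each observation sequence and
# of each label sequence, so that padded time frames and labels are ignored.
# K.ctc_batch_cost wraps the CTC implementation of the Tensorflow backend.
def ctc_cost(labels, y_pred, input_length, label_length):
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)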
In this paper, we present a way to train on observation sequences in Keras using the Connectionist Temporal Classification approach. Our proposal extends the Keras Model class to perform
CTC training and decoding. It relies on several Keras Models and on the CTC functions of Tensorflow. In the following, we recall the CTC approach (section 2) and describe the model architecture that has been defined (section 3). Then, we show how to use CTCModel with Python and present some results on a public dataset (section 4).
2 The Connectionist Temporal Classification
Given the frame-wise class probabilities output by the network, the probability p(y|x) of a label sequence y is obtained by summing the probabilities of all the paths that produce y once repeated labels and blanks are removed, and the CTC loss is the negative log-likelihood of the correct label sequences. This is computed in an efficient way using a Forward-Backward algorithm. The decoding task consists in finding the most probable label sequence given an observation sequence:

y* = argmax_y p(y | x)    (2)
In practice, solving equation 2 may be time-consuming, and it can be approximated using a faster method. For instance, A. Graves et al. [3] proposed the best path decoding and prefix search decoding methods to get an approximate solution. In best path decoding, finding the most probable path consists in selecting the most probable label at every time frame. On the other hand, prefix search decoding consists in dividing the output sequence at time steps where the blank label is highly probable and then applying the standard Forward-Backward algorithm on every resulting section.
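As a minimal illustration of best path decoding, the sketch below (using NumPy, with the blank index passed as a parameter) selects the most probable label at every time frame, then collapses consecutive repetitions and removes blanks to obtain the final label sequence.

import numpy as np

def best_path_decode(posteriors, blank):
    """posteriors: array of shape (T, number of classes) with p(c|x_t) per time frame."""
    # 1. Take the most probable class at every time frame (the best path).
    best_path = np.argmax(posteriors, axis=1)
    # 2. Collapse consecutive repetitions, then remove the blank label.
    decoded = []
    previous = None
    for label in best_path:
        if label != previous and label != blank:
            decoded.append(int(label))
        previous = label
    return decoded

# Toy example: 5 time frames, 3 classes where index 2 is the blank.
probs = np.array([[0.8, 0.1, 0.1],
                  [0.6, 0.2, 0.2],
                  [0.1, 0.1, 0.8],
                  [0.2, 0.7, 0.1],
                  [0.2, 0.7, 0.1]])
print(best_path_decode(probs, blank=2))  # -> [0, 1]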
3 CTCModel
3.1 CTCModel architecture
CTCModel can be seen as a Keras Model defined to perform the CTC training and decoding steps.
At initialization step, it requires the input and output of a recurrent neural network architecture
defined in Keras. Let x be an observation sequence of length T and y its label sequence of length
L where each term yl is an element c of a set of classes C. Let C˜ denotes the set of possible classes
˜ S
(including the blank label), in other words C = C {b} where b denotes the blank label. The output
of the network is a tensor containing the conditional probabilities p(c|xt ) for each class c ∈ C and
1 6 t 6 T . This is commonly the output of a softmax activation layer on a fully connected layer
applied at every time frame (e.g. a TimeDistributed(Dense) in Keras). This relates to a tensor of
dimension batch_size × T × |C|, ˜ where |.| denotes the cardinal of a set. If sequences are of variable
length, T relates to the length of the longer observation sequence in the batch as it is required to
deal with tensors of fixed-sizes.
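A recurrent network matching this description can, for instance, be defined in Keras as follows; the layer sizes, the use of bidirectional LSTMs and the values of nb_features and nb_labels are illustrative choices, not requirements of CTCModel.

from keras.layers import Input, LSTM, Bidirectional, TimeDistributed, Dense, Activation

nb_features = 100   # dimension of an observation frame x_t (illustrative)
nb_labels = 50      # |C|, the number of "real" classes (illustrative)

# None as the time dimension allows observation sequences of variable length.
inputs = Input(shape=(None, nb_features))
x = Bidirectional(LSTM(128, return_sequences=True))(inputs)
x = Bidirectional(LSTM(128, return_sequences=True))(x)
# One dense layer per time frame, over |C~| = |C| + 1 classes (blank included).
x = TimeDistributed(Dense(nb_labels + 1))(x)
outputs = Activation('softmax')(x)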
CTCModel is composed of three Keras models: one for training (Model_train), one for predicting (Model_predict) and one for evaluating sequences of predictions (Model_evaluate), as illustrated in Figure 1. These models are automatically defined when CTCModel is compiled. Each one has a loss function related to its specific task, so only the optimizer used for training is required at compile time.
Figure 1: Illustration of the CTCModel architecture containing the three Models defined in Keras. Model_train and Model_evaluate have four inputs while Model_predict has only the two inputs related to the observation sequences.
In the following, we present the three models shown in Figure 1 in more detail. We will see that CTCModel provides the main functions defined in a Keras model. Some new methods have been defined in the CTCModel implementation, while others did not need to be reimplemented because they remain accessible through one of the three underlying Keras Models.
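As a minimal sketch of the kind of training branch shown in Figure 1, the code below follows the common Keras pattern of wrapping K.ctc_batch_cost in a Lambda layer to build a Model with four inputs; the function name make_train_branch and the layer names are illustrative, and this is not the exact CTCModel implementation.

from keras import backend as K
from keras.layers import Input, Lambda
from keras.models import Model

def make_train_branch(net_input, net_output, max_label_length):
    """net_input/net_output: input and output tensors of the recurrent network."""
    labels = Input(shape=(max_label_length,), name='labels')
    input_length = Input(shape=(1,), name='input_length')
    label_length = Input(shape=(1,), name='label_length')

    # The CTC cost is computed inside the graph by a Lambda layer.
    loss_out = Lambda(
        lambda args: K.ctc_batch_cost(args[0], args[1], args[2], args[3]),
        output_shape=(1,), name='ctc')([labels, net_output, input_length, label_length])

    model_train = Model(inputs=[net_input, labels, input_length, label_length],
                        outputs=loss_out)
    # The CTC cost is already computed by the Lambda layer, so the Keras loss
    # simply passes the value through.
    model_train.compile(loss={'ctc': lambda y_true, y_pred: y_pred},
                        optimizer='adam')
    return model_train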
The evaluation metrics that can be computed by Model_evaluate are the loss, the label error rate (’ler’) and the sequence error rate (’ser’). They are defined in the metrics argument. Whatever the metric, the four inputs defined above are required. The loss and the sequence error rate are computed on the entire input dataset, while the label error rate is returned for every sequence (this is a list of label error rate values).
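For reference, these two error rates can be reproduced outside of CTCModel as follows: the label error rate is the edit (Levenshtein) distance between a prediction and its ground truth, normalized by the ground-truth length, and the sequence error rate is the fraction of sequences that are not entirely correct; the helper names below are illustrative.

def edit_distance(pred, truth):
    # Levenshtein distance computed by dynamic programming.
    d = [[i + j if i * j == 0 else 0 for j in range(len(truth) + 1)]
         for i in range(len(pred) + 1)]
    for i in range(1, len(pred) + 1):
        for j in range(1, len(truth) + 1):
            cost = 0 if pred[i - 1] == truth[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(pred)][len(truth)]

def label_error_rates(predictions, ground_truths):
    # One value per sequence, as returned by the 'ler' metric.
    return [edit_distance(p, t) / len(t) for p, t in zip(predictions, ground_truths)]

def sequence_error_rate(predictions, ground_truths):
    # Fraction of sequences containing at least one error ('ser').
    errors = sum(1 for p, t in zip(predictions, ground_truths) if list(p) != list(t))
    return errors / len(ground_truths)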
4 Experiments
We now present how to use CTCModel in Python and then show some results obtained on the public French dataset RIMES [6].
The recurrent network being defined in Keras (with an Input layer and a softmax activation layer as output), we instantiate a CTCModel whose inputs and outputs are, respectively, the tensor given by the Input layer and the one given by the softmax activation layer. The network is then compiled: notice that the only argument to provide is the optimization method (here Adam with a learning rate of 1e-4), as the only possible loss is the CTC loss function. In contrast to a standard Keras Model, evaluation metrics are specified as an argument of the evaluate method, not at compilation time.
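Based on this description, the corresponding code is along the following lines, where inputs and outputs denote the tensors given by the Input layer and by the softmax activation layer (e.g. as in the sketch of section 3.1); the import path of CTCModel depends on where its implementation is placed.

from keras.optimizers import Adam
from CTCModel import CTCModel  # import path depends on the local installation

# inputs and outputs are the tensors of the Input layer and of the softmax
# activation layer of the recurrent network.
model = CTCModel([inputs], [outputs])

# Only the optimizer is provided at compile time: the loss is necessarily the CTC loss.
model.compile(optimizer=Adam(lr=1e-4))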
Regarding the use of CTCModel methods, recall that the inputs x and y are defined in a particular way: x contains the input observations, the labels, the input lengths and the label lengths, while y is a dummy structure. Thus, the fit and evaluate methods require the specific input x, while the predict method only requires the observation sequences and the observation lengths as input, as illustrated in Figures 1 and 3. Note that input observations and label sequences have to be padded in order to obtain tensors of fixed size as inputs of the Keras and Tensorflow functions.
Figure 3: Example of the use of the fit, evaluate and predict methods proposed in CTCModel.
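In the spirit of the example in Figure 3, the calls could look as follows on a tiny synthetic batch; model, nb_features and nb_labels refer to the previous sketches, the metric identifiers 'ler' and 'ser' follow section 3, and 'loss' is assumed to be the identifier of the CTC loss metric.

import numpy as np

# Tiny synthetic batch for illustration: 4 observation sequences of 20 frames of
# dimension nb_features, each labelled with 5 labels taken among nb_labels classes.
nb_seq, max_T, max_L = 4, 20, 5
obs = np.random.rand(nb_seq, max_T, nb_features)
labels = np.random.randint(0, nb_labels, size=(nb_seq, max_L))
obs_len = np.full((nb_seq, 1), max_T)
lab_len = np.full((nb_seq, 1), max_L)

# fit and evaluate take the four padded arrays as x; y is a dummy structure.
model.fit(x=[obs, labels, obs_len, lab_len], y=np.zeros(nb_seq),
          batch_size=2, epochs=1)

# Evaluation metrics are given to the evaluate method, not at compile time.
model.evaluate(x=[obs, labels, obs_len, lab_len], batch_size=2,
               metrics=['loss', 'ler', 'ser'])

# Prediction only needs the observation sequences and the observation lengths.
predictions = model.predict([obs, obs_len], batch_size=2)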
5 Conclusion
We presented CTCModel, an extension of a Keras Model to perform the Connectionist Temporal Classification approach. It relies on efficient methods defined in Tensorflow for training, by computing the CTC loss, and for predicting, by performing a CTC decoding. The main Keras Model methods are available in CTCModel and can be used in a standard way. The main difference with a standard Keras Model is the specific input structure containing the observation sequences, the observation lengths, the label sequences and the label lengths. Two evaluation metrics, the label error rate and the sequence error rate, can also be computed in a transparent way from the CTC decoding.
Acknowledgment
This work has been supported by the French National grant ANR 16-LCV2-0004-01 “Labcom INKS”. This work is funded by the French region Normandy and the European Union. Europe acts in Normandy with the European Regional Development Fund (ERDF).
References
[1] Théodore Bluche, Jérôme Louradour, and Ronaldo Messina. Scan, attend and read: End-to-end handwritten paragraph recognition with mdlstm attention. In Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, volume 1, pages 1050–1055. IEEE, 2017.
[2] Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On
the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint
arXiv:1409.1259, 2014.
[3] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist
temporal classification: labelling unsegmented sequence data with recurrent neural networks.
In Proceedings of the 23rd international conference on Machine learning, pages 369–376. ACM,
2006.
[4] Alex Graves and Jürgen Schmidhuber. Framewise phoneme classification with bidirectional
lstm and other neural network architectures. Neural Networks, 18(5):602–610, 2005.
[5] Alex Graves and Jürgen Schmidhuber. Offline handwriting recognition with multidimensional
recurrent neural networks. In Advances in neural information processing systems, pages 545–
552, 2009.
[6] Emmanuèle Grosicki, Matthieu Carre, Jean-Marie Brodin, and Edouard Geoffrois. Rimes
evaluation campaign for handwritten mail processing. In ICFHR 2008: 11th International
Conference on Frontiers in Handwriting Recognition, pages 1–6. Concordia University, 2008.
[7] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation,
9(8):1735–1780, 1997.
[8] Suyoun Kim, Takaaki Hori, and Shinji Watanabe. Joint ctc-attention based end-to-end speech
recognition using multi-task learning. In Acoustics, Speech and Signal Processing (ICASSP),
2017 IEEE International Conference on, pages 4835–4839. IEEE, 2017.
[9] Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE Trans-
actions on Signal Processing, 45(11):2673–2681, 1997.
[10] Sijie Song, Cuiling Lan, Junliang Xing, Wenjun Zeng, and Jiaying Liu. An end-to-end spatio-
temporal attention model for human action recognition from skeleton data. In AAAI, volume 1,
page 7, 2017.
[11] Paul J Werbos. Backpropagation through time: what it does and how to do it. Proceedings
of the IEEE, 78(10):1550–1560, 1990.
[12] Zhen Zuo, Bing Shuai, Gang Wang, Xiao Liu, Xingxing Wang, Bing Wang, and Yushi Chen.
Convolutional recurrent neural networks: Learning spatial dependencies for image represen-
tation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
Workshops, pages 18–26, 2015.