Third International Conference on Computing and Network Communications (CoCoNet'19)

Comparative Analysis of Convolution Neural Network Models for Continuous Indian Sign Language Classification

Rinki Gupta a,*, Sreeraman Rajan b

a Electronics and Communication Engineering Department, Amity University Uttar Pradesh, Sector 125, Noida, UP-201313, India
b Department of Systems and Computer Engineering, Carleton University, Ottawa, Canada

Abstract

Classification of continuous sign language is essential for the development of a sign language to spoken language translator. In this paper, classification of continuously signed sentences from the Indian Sign Language is considered using data from one inertial measurement unit placed on each hand of the signer. The recorded accelerometer and gyroscope data are used to track the position of each hand in three dimensions, which is used as input to the classifier. The time-LeNet and multi-channel deep convolutional neural network (MC-DCNN) are employed for classification of sentences from the raw position data of both hands. Moreover, a modified time-LeNet architecture is proposed to address the issue of over-fitting observed in the time-LeNet. The three models are compared for performance in terms of model complexity, loss and classification accuracy. MC-DCNN has a large number of trainable parameters and provides an overall accuracy of 83.94%, while time-LeNet yields an average accuracy of 79.70%. The modified time-LeNet yields a classification accuracy of 81.62% with just a sixteenth of the trainable parameters as compared to MC-DCNN.

© 2020 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the scientific committee of the Third International Conference on Computing and Network Communications (CoCoNet'19)

Keywords: Indian sign language; inertial measurement unit; time-LeNet; MC-DCNN; convolution neural network
1. Introduction
The deaf community all around the world uses sign language for communication. Sign language involves various postures and motions of the hands according to a specific vocabulary and lexicon, along with the use of facial expressions and body language. Although sign language is commonly used among the deaf, it is not understood by the hearing community, which creates a communication barrier between those who understand signing and those who do not. To address this issue, automatic sign language recognition systems have been proposed and remain an active area of research. Literature on automatic sign language recognition can broadly be categorized into classification of isolated signs or of continuous signing [1, 2]. Moreover, the isolated signs that are classified may be just static postures, such as those involved in fingerspelling, or may also include dynamic motion. Also, either one hand or both hands may be monitored while carrying out sign classification. As evident from the review of sign language recognition presented in [1], while the majority of papers in the literature report classification of static hand postures with one-hand monitoring, techniques for classification of continuously signed sentences have also been proposed. For instance, classification of isolated signs is described in [3] and [4, 5] for British and American sign languages, respectively, whereas classification of continuous signing is presented in [6] and [7] for German and Indian sign languages, respectively. It may be noted that, just like verbal languages, sign language differs significantly across geographic regions, such as different countries.
Sign language recognition has been carried out using vision-based as well as wearable sensing modalities [2]. For
instance, in video-based sensing, skin color is used for segmentation of hands and head region, which are then
processed for detection of posture and hand trajectory to perform sign classification [3]. In [4], depth information
captured using Kinect sensor is used to classify the alphabets in American sign language since depth information is
less sensitive to background variation and illumination conditions. Alternatively, wearable systems may be designed for sign language recognition using various sensors such as flex sensors, motion sensors and electromyogram sensors. The use of sensory gloves with flex sensors, optical sensors and motion sensors for classification of the American, Australian, German, Spanish, Arabic, Malay, Indian, Pakistani, Vietnamese and Taiwanese sign languages is reviewed in [1]. Sensors may be worn on the forearm so that the interference with natural signing is minimal. For instance, classification of 10 continuously signed sentences using signals from a single inertial measurement unit (IMU) on the forearm is presented in [7]. In [5], the authors propose classification of 40 isolated signs using a wrist-worn IMU and four surface electromyogram (sEMG) sensors on the forearm. In both papers [5, 7], sensors are present only on one hand. However, signing involves the use of both hands. Relatively little work has been reported in the literature where wearable sensors are placed on both hands.
Various machine learning and deep learning techniques have been proposed in the literature for classification of sign language and of human activity in general [1, 2]. A significant step in machine learning methods is the representation of the input image or signal in terms of appropriate features. Statistical features evaluated in the time and frequency domains, as well as autoregressive parameters, have been used in classifiers such as the support vector machine (SVM) and artificial neural networks to yield classification accuracies as high as 95.94% [5]. Features for classification may also be extracted from a deep learning architecture, such as the Principal Component Analysis Network (PCANet), and used in an SVM for classification, as proposed in [4]. Deep learning architectures such as convolutional neural networks (CNN), recurrent CNNs and Capsule Networks have also been applied for isolated as well as continuous sign recognition [6, 7], the details of which are discussed in the following section.
In this paper, classification of continuously signed sentences from the Indian sign language is presented using two IMUs, one on each forearm of the signer. The sentences consist of three to four words and are derived from a vocabulary of 15 words. Two convolutional neural network (CNN) models available for time-series classification, namely the multi-channel deep CNN (MC-DCNN) and time-LeNet (t-LeNet), are applied to the position data derived from the acceleration and rotation signals recorded by the IMUs. Also, a novel modified t-LeNet architecture is proposed to take advantage of both MC-DCNN and t-LeNet. All three networks are compared in terms of architecture complexity, loss and classification accuracy. The remainder of the paper is organized as follows. Section 2 contains a review of the use of deep learning in sign language recognition and of the CNN models available in the literature for time-series classification. The details of the experiment conducted for collection of signals from wearable sensors for continuous sign language recognition, their processing and the proposed CNN architecture are presented in Section 3. A comparative analysis of the complexity and performance of the considered CNN
architectures for the collected signing data is given in Section 4. Finally, in Section 5, conclusions and future scope
are summarized.
2. Related Work
Sign language recognition has been extensively studied using conventional machine learning techniques such as
SVM, hidden Markov model (HMM) and artificial neural networks (ANN) [2]. In these methods, the features have to be hand-crafted, and only shallow representations are learned. In recent years, several deep
learning approaches have been applied for sign language and activity recognition [8]. For instance, in [9], the
authors compared deep neural networks (DNN), CNN and recurrent neural network (RNN) for classification of
human activity using raw accelerometer and IMU data. DNN architectures have several hidden layers that are fully connected and are often used as the final layers of a CNN. Because of their deeper architecture, DNNs are capable of learning more complex features. CNN architectures consist of convolution and pooling layers. Convolution
layers have sparse connectivity, which reduces the computational complexity. The pooling layers, implemented
using Max or Average pooling, help avoid overfitting. The CNN model shows characteristics of local dependency
and scale invariance, which are particularly suited for activity recognition [8]. A CNN has been used on images of hand trajectories derived from IMU measurements to classify air-written alphabets in [10]. CNNs and Capsule Networks, which consist of nested groups of neurons or capsules, have been employed for classification of continuously signed sentences using one IMU [7]. RNN models include memory units that enable them to learn the temporal
dynamic behavior. In [6], the authors use CNN and RNN together to classify signs from a continuous stream of
images. In this paper, three CNN architectures are applied to data recorded from IMUs on both hands to classify the signed sentence. The CNN models for classification of time series are described below.
Several deep learning architectures are popular for image analysis, such as LeNet, VGG, AlexNet and
GoogLeNet. Signals recorded using wearable sensors, such as IMUs, are time-series that are multi-dimensional. The
signals may be recorded using multiple sensors and sensors of different modalities. The time-series signals may either be represented as an image, such as in [10], or stacked together into a virtual image as
explained in [8]. Another approach is to apply deep learning layers on each time-series in the multi-dimensional data
and then combine the features in a fully connected layer before final classification. This approach is termed the data-driven approach [8].
The success of LeNet architectures for image classification led to the development of a time-series counterpart
known as time-LeNet (t-LeNet) [11]. t-LeNet consists of two convolution layers: the first convolution uses 5 filters followed by max-pooling of length 2, and the second convolution uses 20 filters followed by max-pooling of length 4. One-dimensional (1D) convolutions and Rectified Linear Unit (ReLU) activation functions are used in the convolution layers. The convolution layers are followed by a fully-connected (FC) layer with 500 neurons using the ReLU activation function and, finally, a softmax classifier. Also, data augmentation using window warping and window slicing is implemented to increase the number of training instances. Another successful deep learning architecture
proposed for multi-variate time-series classification is the multi-channel deep convolutional neural network (MC-
DCNN) [12]. In this network, the convolutions are applied independently, in parallel, on each dimension or channel
of the input multi-variate time-series. Each channel passes through two convolutional stages with 8 filters of length
5, ReLU activation function and max pooling of length 2. Then, the output of the second convolutional stage for all
dimensions is concatenated over the channels axis and fed to an FC layer with 732 neurons and ReLU activation
function. Finally, the softmax classifier is used with a number of neurons equal to the number of classes in the
dataset. In this paper, t-LeNet and MC-DCNN models are applied for sign language recognition. Moreover, a
modification of t-LeNet is proposed, as explained in the following section.
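For concreteness, a minimal sketch of these two reference architectures is given below, assuming a Keras/TensorFlow implementation. The layer widths follow the descriptions of [11] and [12] above; the number of channels (three position axes per hand) and the 11 sentence classes follow this paper, while the sequence length, convolution kernel length and padding are illustrative assumptions.

```python
# Hypothetical Keras sketches of t-LeNet [11] and MC-DCNN [12].
# SEQ_LEN and the kernel length of 5 are assumptions; 6 channels (3 position axes
# per hand) and 11 sentence classes follow the setup described in this paper.
from tensorflow.keras import layers, models

SEQ_LEN, N_CHANNELS, N_CLASSES = 500, 6, 11  # SEQ_LEN is a placeholder value

def build_t_lenet(seq_len=SEQ_LEN, n_channels=N_CHANNELS, n_classes=N_CLASSES):
    """t-LeNet: 5 and 20 filters, max-pooling of 2 and 4, a 500-neuron FC layer."""
    inp = layers.Input(shape=(seq_len, n_channels))
    x = layers.Conv1D(5, kernel_size=5, padding="same", activation="relu")(inp)
    x = layers.MaxPooling1D(pool_size=2)(x)
    x = layers.Conv1D(20, kernel_size=5, padding="same", activation="relu")(x)
    x = layers.MaxPooling1D(pool_size=4)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(500, activation="relu")(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inp, out, name="t_lenet")

def build_mc_dcnn(seq_len=SEQ_LEN, n_channels=N_CHANNELS, n_classes=N_CLASSES):
    """MC-DCNN: each channel convolved independently (two stages of 8 filters of
    length 5 with max-pooling of 2), outputs concatenated, 732-neuron FC layer."""
    inp = layers.Input(shape=(seq_len, n_channels))
    channel_features = []
    for c in range(n_channels):
        # slice out one univariate series and give it its own convolution stages
        xc = layers.Lambda(lambda t, i=c: t[:, :, i:i + 1])(inp)
        xc = layers.Conv1D(8, kernel_size=5, padding="same", activation="relu")(xc)
        xc = layers.MaxPooling1D(pool_size=2)(xc)
        xc = layers.Conv1D(8, kernel_size=5, padding="same", activation="relu")(xc)
        xc = layers.MaxPooling1D(pool_size=2)(xc)
        channel_features.append(xc)
    x = layers.Concatenate(axis=-1)(channel_features)  # concatenate over channel axis
    x = layers.Flatten()(x)
    x = layers.Dense(732, activation="relu")(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inp, out, name="mc_dcnn")
```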
3. Data Collection and Proposed Methodology

In this work, we have recorded our own dataset of wearable sensor signals for sentences signed in the Indian Sign
Language. The details of the experiment conducted and the proposed classification algorithm are as described below.
Eleven sentences constructed from 15 commonly used words in the Indian Sign Language are considered. For
instance, one of the sentences is “You are right”, which is signed by signing You and then Right. Another sentence
is “I need help”, which is signed by signing each of these words, and “I don’t need help”, which is signed as “I need
help not”. The signs were performed according to the Indian Sign Language dictionary developed by Ramakrishna
Mission Vivekananda University [13]. Each sentence was signed 10 times by 10 different volunteers, 7 females and 3 males. Each subject provided written consent to participate in the study. Two wireless IMUs from the Delsys Trigno wireless digital system were placed on the subject's forearms, one on each arm, and held in place using
elastic bands, as shown in Fig. 1a. The IMUs consist of triaxial accelerometers and gyroscopes that record signals at
148.15 Hz sample rate and 16-bit resolution. The sensors send the recorded signals wirelessly to a base-station,
shown in Fig. 1b, which is connected via USB to the computer, where the data is recorded for further processing. A
total of 1100 samples of data were recorded.
Next, the recorded signals are pre-processed to replace missing values using interpolation. The sensors are
calibrated by removing bias values estimated using standard orientation and rotation tests. A combination of a
triaxial accelerometer and a triaxial gyroscope may be used to estimate the position along the three axes. At any given time instant, the orientation of the sensor with respect to gravity is determined as described in [14]. The gyroscopes
record turn rate ω (in rad/s) along the x-, y- and z-axis of the sensor (denoted by S) with respect to the earth-frame.
The vector denoting the measured turn rate is
$$ {}^{S}\boldsymbol{\omega} = \begin{bmatrix} 0 & \omega_x & \omega_y & \omega_z \end{bmatrix}. \quad (1) $$
Orientation of the sensor with respect to the earth-frame (denoted by E) may be described using the quaternion
vector
$$ {}^{S}_{E}\hat{\mathbf{q}} = \begin{bmatrix} q_1 & q_2 & q_3 & q_4 \end{bmatrix}, \quad (2) $$
where ^ indicates that the vector has been normalized to unit length. Turn rate is related to the derivative of the
quaternion describing the rate of change of earth frame with respect to the sensor frame as
$$ {}^{S}_{E}\dot{\mathbf{q}} = \frac{1}{2}\, {}^{S}_{E}\hat{\mathbf{q}} \otimes {}^{S}\boldsymbol{\omega}, \quad (3) $$
where ⊗ denotes the quaternion product. Provided the initial conditions are known, the quaternion at the current time
instant may be estimated by taking an integral of the quaternion derivative given in (3). However, the resulting
orientation may drift with time due to the accumulation of errors over successive integrations. Hence, the orientation is also estimated using the accelerometer signals. Let the vector containing the normalized accelerometer measurements be
$$ {}^{S}\hat{\mathbf{a}} = \begin{bmatrix} 0 & a_x & a_y & a_z \end{bmatrix}. \quad (4) $$
The acceleration due to gravity expressed in the earth-frame is

$$ {}^{E}\hat{\mathbf{g}} = \begin{bmatrix} 0 & 0 & 0 & 1 \end{bmatrix}. \quad (5) $$
The estimate of orientation is obtained such that when the acceleration due to gravity vector is rotated using the
estimated quaternion vector, the resulting vector is the accelerometer measurements vector. This is achieved by
defining an objective function f,
$$ f\!\left({}^{S}_{E}\hat{\mathbf{q}},\, {}^{E}\hat{\mathbf{g}},\, {}^{S}\hat{\mathbf{a}}\right) = {}^{S}_{E}\hat{\mathbf{q}}^{*} \otimes {}^{E}\hat{\mathbf{g}} \otimes {}^{S}_{E}\hat{\mathbf{q}} - {}^{S}\hat{\mathbf{a}}, \quad (6) $$
which is minimized using gradient descent. The orientation estimate obtained from the gyroscope and accelerometer signals is used to rotate the measured acceleration vector into the earth-frame. In the earth-frame, the acceleration due to gravity is contained in the z-axis and is simply subtracted. The resulting linear acceleration is double integrated to obtain the position data along the three axes, which is used as input to the deep learning models. Stationary regions in time are detected as those where the 2-norm of the gyroscope signal is less than 50π/180 rad/s; there, the velocity is set to zero to avoid incrementing the position.
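For illustration, a minimal NumPy sketch of this position-estimation pipeline is given below. Only the processing order (gradient-descent orientation estimation as in [14], rotation into the earth-frame, gravity removal, zero-velocity updates in stationary regions and double integration) follows the description above; the filter gain BETA, the quaternion sign conventions and the function names are illustrative assumptions.

```python
# Illustrative sketch of the position-estimation pipeline, assuming NumPy.
# acc (N x 3, in g) and gyr (N x 3, in rad/s) are the calibrated IMU signals;
# BETA is an assumed filter gain, not a value reported in the paper.
import numpy as np

FS = 148.15                      # IMU sampling rate (Hz)
DT = 1.0 / FS
BETA = 0.1                       # gradient-descent gain (assumed)
GYRO_THRESH = 50 * np.pi / 180   # stationary-detection threshold (rad/s)

def quat_mult(p, q):
    """Hamilton product of two quaternions [w, x, y, z]."""
    w1, x1, y1, z1 = p
    w2, x2, y2, z2 = q
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def madgwick_update(q, gyr, acc):
    """One step of the gradient-descent orientation filter of [14]."""
    a = acc / np.linalg.norm(acc)
    w, x, y, z = q
    # objective function f of Eq. (6) and its Jacobian for gravity alignment
    f = np.array([2*(x*z - w*y) - a[0],
                  2*(w*x + y*z) - a[1],
                  2*(0.5 - x*x - y*y) - a[2]])
    J = np.array([[-2*y,  2*z, -2*w, 2*x],
                  [ 2*x,  2*w,  2*z, 2*y],
                  [ 0,   -4*x, -4*y, 0  ]])
    step = J.T @ f
    step /= np.linalg.norm(step)
    # quaternion rate from the gyroscope (Eq. 3) corrected by the gradient step
    q_dot = 0.5 * quat_mult(q, np.array([0.0, *gyr])) - BETA * step
    q = q + q_dot * DT
    return q / np.linalg.norm(q)

def estimate_position(acc, gyr, g=9.81):
    """Orientation -> earth-frame linear acceleration -> velocity -> position."""
    n = len(acc)
    q = np.array([1.0, 0.0, 0.0, 0.0])
    vel = np.zeros(3)
    pos = np.zeros((n, 3))
    for k in range(1, n):
        q = madgwick_update(q, gyr[k], acc[k])
        # rotate the measured acceleration (converted to m/s^2) into the earth frame
        a_earth = quat_mult(quat_mult(q, np.array([0.0, *acc[k] * g])),
                            np.array([q[0], -q[1], -q[2], -q[3]]))[1:]
        a_lin = a_earth - np.array([0.0, 0.0, g])   # remove gravity along z
        if np.linalg.norm(gyr[k]) < GYRO_THRESH:    # zero-velocity update when stationary
            vel = np.zeros(3)
        else:
            vel = vel + a_lin * DT
        pos[k] = pos[k - 1] + vel * DT
    return pos
```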
There is no one deep learning architecture that is most suitable for sign language classification, and the general
approach is to test the existing architectures in the given scenario and determine the suitable model parameters. For
this purpose, the trained model may be tested for underfitting and overfitting. When a model is underfitted, the
training loss is high. On the other hand, when the training loss is low but the validation loss increases, the model does not generalize well and may be overfitted. Data augmentation has been proposed by the authors in [11] to avoid overfitting: time-scaled and shifted versions of the time-series data available in the training set are used to increase the number of training observations. When t-LeNet and MC-DCNN were applied to the position data described above, both gave a low training loss. However, the validation loss of t-LeNet increased, indicating overfitting. Hence, a modification in the parameters of the t-LeNet
architecture is proposed.
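For reference, the window-slicing and window-warping augmentations of [11] mentioned above may be sketched as follows, assuming NumPy; the slice ratio, warped-window size and warping factors are illustrative choices and not values reported in [11] or in this work.

```python
# Hypothetical sketch of the window-slicing and window-warping augmentations of [11];
# the slice ratio and warping factors below are illustrative, not reported values.
import numpy as np

def window_slice(x, ratio=0.9):
    """Return a random contiguous slice of a (time x channels) series."""
    n = x.shape[0]
    win = int(n * ratio)
    start = np.random.randint(0, n - win + 1)
    return x[start:start + win]

def window_warp(x, window_ratio=0.1, factors=(0.5, 2.0)):
    """Stretch or compress one randomly chosen window of a (time x channels) series,
    then resample the whole series back to its original length."""
    n, n_ch = x.shape
    win = max(2, int(n * window_ratio))
    start = np.random.randint(0, n - win)
    scale = np.random.choice(factors)
    new_win = max(2, int(win * scale))
    t_win = np.linspace(0, win - 1, new_win)
    warped_seg = np.stack([np.interp(t_win, np.arange(win), x[start:start + win, c])
                           for c in range(n_ch)], axis=1)
    warped = np.concatenate([x[:start], warped_seg, x[start + win:]], axis=0)
    t_full = np.linspace(0, warped.shape[0] - 1, n)
    return np.stack([np.interp(t_full, np.arange(warped.shape[0]), warped[:, c])
                     for c in range(n_ch)], axis=1)
```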
The t-LeNet architecture has a much smaller number of parameters as compared to MC-DCNN. t-LeNet is a shallow CNN model and hence requires more parameters in the FC layer to improve performance. In [15], the authors performed a systematic study describing the effect of the FC layer on the CNN model. They concluded that deeper architectures require fewer FC layers as compared to shallow architectures. As the FC layer has a large number of parameters, it may lead to overfitting. This observation has been reported in [16], where the authors replace the FC layer with a SparseNet layer to avoid overfitting. In the proposed model, just like in t-LeNet, two convolution
layers and a dense layer followed by a softmax classifier are used. However, the number of neurons in the dense layer is reduced to 64. To avoid underfitting, the number of filters is increased in each convolution layer. For the
first convolution layer, the number of filters is increased from 5 to 32 and for the second convolution layer, the
number of filters is increased from 20 to 64. The proposed modified t-LeNet architecture is depicted in Fig. 2. In the following section, the results for the proposed modified t-LeNet architecture are compared with those obtained with the t-LeNet and MC-DCNN models available in the literature.
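A sketch of the proposed modified t-LeNet, under the same assumptions as the earlier architecture sketch (a Keras implementation, an assumed kernel length of 5 and a placeholder sequence length), is given below.

```python
# Sketch of the modified t-LeNet described above; the framework, kernel length and
# sequence length are assumptions carried over from the earlier architecture sketch.
from tensorflow.keras import layers, models

def build_modified_t_lenet(seq_len=500, n_channels=6, n_classes=11):
    inp = layers.Input(shape=(seq_len, n_channels))
    x = layers.Conv1D(32, kernel_size=5, padding="same", activation="relu")(inp)  # 5 -> 32 filters
    x = layers.MaxPooling1D(pool_size=2)(x)
    x = layers.Conv1D(64, kernel_size=5, padding="same", activation="relu")(x)    # 20 -> 64 filters
    x = layers.MaxPooling1D(pool_size=4)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(64, activation="relu")(x)   # FC layer reduced from 500 to 64 neurons
    out = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inp, out, name="modified_t_lenet")
```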
4. Results
Fig. 3a shows the calibrated accelerometer and gyroscope signals for the IMU placed on the right hand for the
sign "Right" in the sentence "You are right". Acceleration is plotted in g, where 1 g = 9.81 m/s², and angular rate is plotted in degrees/s. Fig. 3b shows the corresponding three-axis position estimated as described in Section 3. It can
be observed that the estimated trajectory matches well with the expected trajectory of the hand, which traces the symbol for "Right" in the air.

Fig. 3. (a) Accelerometer and gyroscope signals; (b) three-axis position for the sign "Right".
The three-axis position data of both hands are given as input to the three CNN models described in Sections 2 and 3. The architecture details and the number of parameters to be learnt are summarized in Table 1. Amongst the three networks, the MC-DCNN model requires the highest number of parameters while t-LeNet requires the least. The proposed modified t-LeNet has a higher number of trainable parameters than t-LeNet; however, it is still significantly lower than that of the MC-DCNN model.
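For the hypothetical Keras sketches given earlier, the relative model sizes can be inspected as follows; the absolute counts depend on the assumed input length and kernel sizes, so they indicate only the ordering reported in Table 1.

```python
# Parameter counts for the hypothetical Keras sketches defined earlier; the values
# depend on the assumed input length and kernel sizes, so they only indicate the
# relative ordering of model sizes discussed in the text.
for build in (build_mc_dcnn, build_t_lenet, build_modified_t_lenet):
    model = build()
    print(f"{model.name}: {model.count_params():,} parameters")
```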
The models are trained for 50 epochs with a batch size of 16. MC-DCNN uses the stochastic gradient descent (SGD) optimization method to minimize the loss function, whereas the Adam optimizer is used for the t-LeNet architectures. The entire dataset is randomly split in a 70:30 ratio into training and testing data, respectively. The training loss for the three CNN models is shown in Fig. 4a. The loss function is also evaluated on the test data at each epoch, and the resulting curves are shown in Fig. 4b. The training losses of MC-DCNN and t-LeNet are comparable, while that of the modified t-LeNet is higher. However, on the test data, the loss for t-LeNet increases as the training loss decreases, indicating over-fitting. The modified t-LeNet, in contrast, generalizes better on the test dataset than the original t-LeNet, as its test loss is reduced. The test loss of the modified t-LeNet is comparable with that of MC-DCNN.
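A minimal training sketch matching these settings is given below, reusing the builder functions from the earlier sketches; the categorical cross-entropy loss, the default optimizer learning rates and the use of scikit-learn for the split are assumptions, and X and y stand for the position data and one-hot sentence labels.

```python
# Hypothetical training sketch using the builder functions from the earlier sketches.
# X: (1100, SEQ_LEN, 6) position data of both hands, y: (1100, 11) one-hot labels.
# Only the epoch count, batch size, optimizers and 70:30 split are taken from the
# text above; the loss function and learning rates are assumptions.
from sklearn.model_selection import train_test_split
from tensorflow.keras.optimizers import SGD, Adam

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True)

for build, opt in [(build_mc_dcnn, SGD()), (build_t_lenet, Adam()),
                   (build_modified_t_lenet, Adam())]:
    model = build()
    model.compile(optimizer=opt, loss="categorical_crossentropy", metrics=["accuracy"])
    history = model.fit(X_train, y_train, epochs=50, batch_size=16,
                        validation_data=(X_test, y_test), verbose=0)
    print(model.name, "test accuracy:", history.history["val_accuracy"][-1])
```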
Fig. 4. (a) Training loss and (b) test loss for the three CNN models.
Finally, the classification accuracy on the test data is compared for the three models in Fig. 5. The modified t-LeNet performs better than the original t-LeNet, with classification accuracies of 81.62% and 79.70%, respectively. MC-DCNN yields the best overall accuracy of 83.94%. The modified t-LeNet trains only about a sixteenth of the number of parameters required by MC-DCNN, yet provides an accuracy close to that of a deep CNN architecture such as MC-DCNN.
5. Conclusions
Classification of continuously signed sentences using data from one IMU placed on each hand of the signer is considered. The recorded accelerometer and gyroscope signals are processed to generate position data, which are used as input to three deep-learning models for classification. Two CNN models available in the literature for time-series classification, namely MC-DCNN and t-LeNet, are applied. MC-DCNN is the deepest architecture and yields the best overall accuracy of 83.94%. The t-LeNet model is relatively shallow and yields the lowest classification accuracy of 79.70%. Even though data augmentation is carried out for t-LeNet, the model exhibits over-fitting. Hence, a modified t-LeNet architecture is proposed with a reduced number of neurons in the fully-connected layer and an increased number of filters in the convolution layers. The modified t-LeNet architecture shows improved generalization and yields a better classification accuracy of 81.62% when compared with t-LeNet.
Acknowledgements
The author would like to thank the volunteers who helped in recording the data. The author also recognizes the
funding support provided by the Science & Engineering Research Board, a statutory body of the Department of
Science & Technology (DST), Government of India (ECR/2016/000637).
References
[1] Ahmed, Mohamed Aktham, Bilal Bahaa Zaidan, Aws Alaa Zaidan, Mahmood Maher Salih, and Muhammad Modi bin Lakulu. (2018) "A
review on systems-based sensory gloves for sign language recognition state of the art between 2007 and 2017." Sensors 18, no. 7: 2018.
[2] Cheok, Ming Jin, Zaid Omar, and Mohamed Hisham Jaward. (2019) “A review of hand gesture and sign language recognition techniques.”
International Journal of Machine Learning and Cybernetics 10, no. 1: 131-153.
[3] Fakhfakh, Sana, and Yousra Ben Jemaa. (2018) “Gesture Recognition System for Isolated Word Sign Language Based on Key-Point
Trajectory Matrix.” Computación y Sistemas 22, no. 4: 1415–1430.
[4] Aly, Walaa, Aly Saleh, and Almotairi Sultan. (2019) “User-Independent American Sign Language Alphabet Recognition Based on Depth
Image and PCANet Features.” IEEE Access, 7: 123138 – 123150.
[5] Wu, Jian, Zhongjun Tian, Lu Sun, Leonardo Estevez, and Roozbeh Jafari. (2015) "Real-time American sign language recognition using
wrist-worn motion and surface EMG sensors." In IEEE 12th Int. Conference on Wearable and Implantable Body Sensor Networks (BSN):1-6.
[6] Cui, Runpeng, Hu Liu, and Changshui Zhang. (2017) "Recurrent convolutional neural networks for continuous sign language recognition by
staged optimization." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition: 7361-7369.
[7] Suri, Karush, and Rinki Gupta. (2019) "Continuous sign language recognition from wearable IMUs using deep capsule networks and game
theory." Computers & Electrical Engineering, 78: 493-503.
[8] Wang, Jindong, Yiqiang Chen, Shuji Hao, Xiaohui Peng, and Lisha Hu. (2019) "Deep learning for sensor-based activity recognition: A
survey." Pattern Recognition Letters 119: 3-11.
[9] Hammerla, Nils Y., Shane Halloran, and Thomas Plötz. (2016) "Deep, convolutional, and recurrent models for human activity recognition
using wearables.", in: Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16).
[10] Pan, Tse-Yu, Chih-Hsuan Kuo, Hou-Tim Liu, and Min-Chun Hu. (2018) "Handwriting Trajectory Reconstruction Using Low-Cost IMU."
IEEE Transactions on Emerging Topics in Computational Intelligence 3, no. 3: 261-270.
[11] Le Guennec, Arthur, Simon Malinowski, and Romain Tavenard. (2016) "Data augmentation for time series classification using convolutional
neural networks." In: ECML/PKDD Workshop on Advanced Analytics and Learning on Temporal Data: 1-9.
[12] Zheng, Yi, Qi Liu, Enhong Chen, Yong Ge, and J. Leon Zhao. (2016) "Exploiting multi-channels deep convolutional neural networks for multivariate time series classification." Frontiers of Computer Science 10, no. 1: 96-112.
[13] Ramakrishna Mission Vivekananda University, Coimbatore Campus, Indian Sign Language Dictionary, http://indiansignlanguage.org.
[14] Madgwick, Sebastian OH, Andrew JL Harrison, and Ravi Vaidyanathan. (2011) "Estimation of IMU and MARG orientation using a gradient
descent algorithm." In 2011 IEEE International Conference on Rehabilitation Robotics: 1-7.
[15] Basha, S. H., Shiv Ram Dubey, Viswanath Pulabaigari, and Snehasis Mukherjee. (2019) "Impact of Fully Connected Layers on Performance
of Convolutional Neural Networks for Image Classification." arXiv preprint:1902.02771.
[16] Xu, Qi, Ming Zhang, Zonghua Gu, and Gang Pan. (2019) "Overfitting remedy by sparsifying regularization on fully-connected layers of CNNs." Neurocomputing 328: 69-74.