
2-way Arabic Sign Language Translator using CNNLSTM Architecture and NLP


Tushar Agrawal
Department of Computer Science
Birla Institute of Technology and Science Pilani, Dubai Campus, Dubai, UAE
+971544007831
[email protected]

Siddhaling Urolagin
Department of Computer Science
Birla Institute of Technology and Science Pilani, Dubai Campus, Dubai, UAE
+971566593138
[email protected]

ABSTRACT
Over 466 million people (5% of the world's population) suffer from hearing impairment, according to the World Health Organization. There is a great need to bridge the communication gap between the deaf and the general population. In our research work, recent developments such as Natural Language Processing (NLP) and Deep Learning Neural Networks (DLNN) are utilized to bridge this gap. We developed a 2-way sign language translator for the Arabic language, which translates text to sign and vice versa. NLP techniques such as parsing, part of speech tagging, tokenization, and translation are used to achieve text to sign translation. A Convolutional Neural Network (CNN) combined with Long Short-Term Memory (LSTM) is used to perform sign to text translation.

CCS Concepts
• Computing methodologies ➝ Artificial intelligence ➝ Natural language processing ➝ Machine translation • Computing methodologies ➝ Artificial intelligence ➝ Computer vision ➝ Computer vision problems ➝ Matching • Computing methodologies ➝ Artificial intelligence ➝ Computer vision ➝ Computer vision problems ➝ Video segmentation

Keywords
Language translation; Machine translation; Gesture recognition; Natural Language Processing; Deep Learning

1. INTRODUCTION
There are over 466 million people with disabling hearing loss in the world, and of these 72 million are deaf, according to the World Health Organization. Before the origins of sign language, communication was a major problem for the deaf-mute community and acted as a major roadblock to their progress. Sign language is the fundamental form and the primary method of communication for people in the deaf community. It can be defined as a language that uses visual gestures made with the hands together with facial expressions and body posture. Owing to its importance, the United Nations recognizes September 23 as the International Day of Sign Languages, which advocates sign language as a human right, equal in status to spoken languages [1].

Although sign language is of utmost importance and is the primary method of communication in the deaf-mute community, very few people outside that community are familiar with it. This is a major barrier to communication between the speaking and deaf-mute communities and a major roadblock to their collective progress.

The existing sign language translators fall into two categories: text to sign language translators and sign language to text translators. The sign to text approaches are usually sensor-based or image-based, and sometimes a combination of both. In [2], the authors showcase a concept that uses sensor-embedded gloves to recognize the ArSL signs made by a deaf person. In recent years, there has been increasing interest in using deep neural networks to analyze spatio-temporal features for sign language recognition. For instance, [3], [4] and [5] use the 3DCNN architecture for classification of gestures. Although highly accurate, the 3DCNN architecture requires inputs with a depth dimension, i.e. RGB-D, which are generated by devices with dedicated depth sensors. Furthermore, in [6], Necati Cihan Camgoz et al. use CNNs, bi-directional LSTMs, and sequence to sequence learning techniques for continuous sign language recognition on RGB image inputs.

Conversely, there have been several attempts to translate Arabic speech and text to sign language. The authors in [7] present an intelligent conversion system for Arabic speech to ArSL, based on a knowledge base, with an accuracy of 96%. However, the model is a desktop application with still images as output, which limits its utility in real-time scenarios. Halawani et al. [8] introduce an application that translates Arabic speech to text and, subsequently, Arabic text to Arabic Sign Language represented by a 2D avatar, an image of the sign. Additionally, in [9], Al-Khalifa introduces a mobile phone translator that converts typed Arabic text input to ArSL depicted by a 3D sign avatar; however, this work does not account for the different sentence structures in ArSL. Moreover, a recent approach in [10] uses a cloud computing framework to translate Arabic text to the Egyptian Arabic Sign Language dialect.

This work aims to bridge the gap between the communities and remove these roadblocks by developing a two-way Arabic sign language translator on a smartphone that allows both text-to-sign and sign-to-text conversion, using a Deep Learning Neural Network for gesture recognition and NLP for text processing.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
BDET2020, January 3–5, 2020, Singapore.
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-7683-9/20/01…$15.00
DOI: https://doi.org/10.1145/3378904.3378915

Sociolinguistic factors affecting change in sign language, along with the assistance a 2-way translator provides in keeping up with that change, are discussed in Section 2. The application of NLP for text to sign conversion and the overall methodology are discussed in Section 3. The results are discussed in Section 4. The conclusion, along with the impact of a 2-way communicator on society, is presented in Section 5.

2. SOCIOLINGUISTIC FACTORS AFFECTING SIGN LANGUAGE
2.1 Political Correctness
Various social factors have shaped modern sign language, and one of the main influences is political correctness. A survey of how British Sign Language (BSL) is used by deaf people of different ages across the UK revealed a significant shift in the signs used by different generations. Today, signing a slanted eye to refer to Chinese people is strongly criticized by the public. Hence, for deaf people aged between 16 and 30, the culturally sensitive way to sign 'China' is to draw the right hand from the signer's heart horizontally across the chest, and then down towards the hip, tracing the shape of a Mao jacket [11].

It is also considered offensive to mime a hooked nose when referring to Jewish people; instead, the sign for a Jewish man or woman is a hand resting against the chin making a short downward movement, in the shape of a beard. A finger pointing to an imaginary spot in the middle of the forehead is no longer appropriate as the sign for India; the modern sign for India mimes the triangular shape of the subcontinent [11].

2.2 Age
The use of traditional signs has declined from older to younger signers across all regions. Technological change means that the older generation's sign for something may differ from the younger generation's; the sign for 'telephone' is one example. Another major difference is the greater use of fingerspelling by older signers compared with younger signers.

2.3 Language Background
Research on sign languages has shown many significant differences between the language used by native signers and that used by non-native signers. Having English as a first language influences signers to use a variety of signing shaped by spoken English. In contrast, signers with deaf parents are inclined to use traditional signs; studies on American Sign Language concluded that signers with deaf parents favor the use of conservative signs [12]. Language background is important because older signers have also been observed to use more traditional signs.

2.4 Education and Region
The location of the educational institution affects the sign language used in several respects. Locally educated individuals use a higher proportion of regional signs than individuals who attended school in other regions. Even when people have lived in a given area for the previous 10 years, they do not completely adjust their vocabulary to the local variety. This suggests that the region of a signer's school may be a better predictor of vocabulary in adulthood than the current region of residence.

2.5 Gender
There is also some variation according to gender. Some men and women use different signs for 'hello', 'terrible' and 'welcome' (Figure 1).

Figure 1. Variation in 'welcome' sign for different genders (male – top, female – bottom)

A wide variety of social factors influence sign language, and the rate at which it changes is tied to the rapidly changing sociocultural evolution of society. To keep up with these changes, a mobile phone-based two-way translator is ideal. In particular, a mobile-based two-way translator connected to a cloud database, with functionality to upload gestures, would ensure that the database keeps pace with sociolinguistic changes. Moreover, two-way communication from sign to text and vice versa gives a deaf person greater independence and self-confidence.

3. METHODOLOGY
3.1 Data Acquisition
We downloaded 1,600 basic ArSL signs from an online standard dictionary [13]. Additionally, we collected and compiled data from the official Instagram video library of Arabic Sign Language in the UAE [14], last updated on October 19, 2016. Next, we identified and recorded signs that are essential to sentence structure, such as 'what', 'where', and 'who', but were missing from the library of amassed gestures. This was done with the help of five native ArSL signers and an expert interpreter from the Al-Amal School for the Deaf in Sharjah City of Humanitarian Services. They also helped us build a corpus of 145 Arabic-language sentences and their sign-language equivalents.

3.2 Text to Sign Translation
We utilized natural language processing for text to sign language conversion. Natural language processing, often referred to as NLP, is a branch of Artificial Intelligence that aims at reading, deciphering, and understanding human natural languages. Text to sign conversion is vital for creating a platform in which communication is two-way and the deaf community can converse in real time, making the experience more personal and comfortable.

Since written language does not translate directly to sign language, and translation differs between written languages, grammar preservation rules are applied to ensure that meaning is not lost in translation. To do this, we first preprocessed the text data. A flow chart with all the steps is shown in Figure 2.
3.2.1 Preprocessing the data
3.2.1.1 Parsing
We used the Stanford Arabic Parser, which performs the segmentation of a sentence and outputs the set of segmented words with their grammatical categories [15]. The Stanford Parser is a probabilistic parser that returns the most probable analysis of new sentences based on knowledge acquired from hand-parsed sentences. The parser assumes the tokenization of Arabic used in the Penn Arabic Treebank (ATB) and uses a whitespace tokenizer.
3.2.1.2 Parts of Speech Tagging
We applied the Stanford Arabic Parts of Speech (POS) Tagger, software that reads Arabic text and assigns each token a part of speech such as noun, verb, or adjective. The tagger is pre-trained on the Penn Arabic Treebank (ATB).
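The paper uses the Java-based Stanford Arabic Parser and POS Tagger. Purely as an illustration, the same segmentation-and-tagging step can be sketched with Stanford's Python package stanza; this is an assumption about tooling, not the authors' implementation.

```python
import stanza

# stanza.download('ar')  # one-time download of the Arabic models
nlp = stanza.Pipeline(lang='ar', processors='tokenize,mwt,pos')

doc = nlp('أين المدرسة')  # example input: "Where is the school?"
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.upos)  # each segmented token with its part-of-speech tag
```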
3.2.1.3 Cleaning the Text
The following rules are applied to the input text:
I. Delete Special Characters: Some special characters (e.g. &, *, ", %, #) carry no significance for the meaning of an Arabic sentence, so such characters are removed.
II. Spell Checking: We also check for spelling errors to ensure that each entered word is correct; this makes the system more error-tolerant and robust.
III. Delete Stop Words: Stop words hold syntactic significance but carry very little meaning and hence are almost unrelated to the subject matter. The Arabic stop words were defined using a common library available online [16]. The input text was then scanned, and all detected stop words were filtered out.
IV. Retain Exception Words: In Arabic Sign Language, the signs for organizations, people, and locations generally do not have a specific translation. To recognize such named entities, we use a Named Entity Recognition (NER) module, which identifies the entities and categorizes them into three classes: name, person, or location. These words are translated as separate letters [17].
V. Morphological Analysis: We used the SARF Morphological Analyzer to analyze the text. It extracts the root, pattern, stem, part of speech, prefixes, and suffixes of the text.

Figure 2. Preprocessing the data

3.2.2 Tokenization and Translation
The last step before obtaining the sign videos is the tokenization of the preprocessed text. We first perform a sliding window search to identify patterns for compound sentences in the preprocessed text. If a pattern is matched, it is added to the queue of tokens as a single token. The entities recognized by the NER module are broken down into characters and inserted as separate tokens. All remaining words are inserted into the queue in the appropriate order. The videos corresponding to the tokens are retrieved from the sign language database. The video queue, which follows Arabic syntax, is rearranged into Arabic Sign Language syntax based on the grammar preservation rules in Table 1 (see the sketch after the table). The videos are then concatenated into a single video, and the final video is streamed back to the mobile application.

Table 1. Sample of rules for Arabic syntax to Arabic Sign Language syntax conversion

No  Arabic syntax        ArSL syntax
1 S+V S+V
2 V+S S+V
3 S+P S+P
4 S+V+O S+O+V
5 S+V+O (Adj, Adv) S+O+V (Adj, Adv)
6 S+P+(Adj, Adv) S+P+(Adj, Adv)
7 S+V+Pr S+V
8 V+O O+V
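As a rough illustration of the tokenization and reordering described above, the sketch below matches a compound pattern with a two-word sliding window and applies a Table 1 style rule (rule 4, S+V+O to S+O+V). The pattern list, role labels, and example sentence are illustrative assumptions rather than the paper's actual data.

```python
COMPOUND_PATTERNS = {('كيف', 'حالك'): 'كيف حالك'}   # e.g. "how are you" kept as one compound token
REORDER_RULES = {('S', 'V', 'O'): ('S', 'O', 'V')}   # rule 4 of Table 1: S+V+O -> S+O+V

def tokenize(words):
    """Sliding-window search: matched compound patterns become single tokens."""
    tokens, i = [], 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in COMPOUND_PATTERNS:
            tokens.append(COMPOUND_PATTERNS[pair])
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens

def reorder(tokens, roles):
    """Rearrange an Arabic-syntax token queue into ArSL syntax (roles assumed distinct)."""
    target = REORDER_RULES.get(tuple(roles), tuple(roles))
    return [tokens[roles.index(r)] for r in target]

# reorder(['الولد', 'أكل', 'التفاحة'], ['S', 'V', 'O']) -> ['الولد', 'التفاحة', 'أكل']
# ("the boy ate the apple" rearranged into subject-object-verb order for ArSL)
```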

3.3 Sign to Text Translation
3.3.1 Preprocessing the Gesture Data
To build the dataset for sign to text translation, we picked out 200 signs and had 4 subjects enact each gesture 10 times. A Python script was written to store and process the live input video frame by frame at a frame rate of 30 FPS, producing 30 frames for every second of video. As the average duration of each gesture in the amassed gesture video dictionary is 3 seconds, this provided us with 3,600 frames for each gesture (90 frames per recording across the 40 recordings of each sign). In total, the dataset comprises 720,000 frames. The dataset was split into 70% for training and 30% for testing. Each frame in the training dataset was labeled with its correct gesture class label, whereas each gesture sequence in the testing data was annotated as a whole.
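The frame-capture script itself is not listed in the paper; the following is a minimal sketch of the step just described, assuming OpenCV for reading the recordings and scikit-learn for the 70%/30% split (both are assumptions about tooling).

```python
import cv2
from sklearn.model_selection import train_test_split

def video_to_frames(video_path):
    """Read a recorded gesture video frame by frame (the recordings are ~30 FPS)."""
    cap, frames = cv2.VideoCapture(video_path), []
    ok, frame = cap.read()
    while ok:
        frames.append(frame)
        ok, frame = cap.read()
    cap.release()
    return frames

# dataset: list of (frame, gesture_label) pairs accumulated over all recordings
# train, test = train_test_split(dataset, test_size=0.3)  # the 70%/30% partition
```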
3.3.2 Dynamic Gesture Recognition using CNNLSTM

Figure 3. CNNLSTM architecture

For dynamic gesture recognition, we use the CNNLSTM [18] deep learning architecture. The Convolutional Neural Network Long Short-Term Memory network, often abbreviated as CNNLSTM, is an LSTM architecture designed for temporal prediction problems with spatial inputs such as images or videos. It combines Convolutional Neural Network (CNN) layers for feature extraction on the input data with LSTMs for temporal prediction. More precisely, the architecture consists of two convolutional layers, a flattening layer, and a Long Short-Term Memory recurrent layer followed by a SoftMax output layer, as depicted in Figure 3. Each convolutional layer comprises convolution and max-pooling operations in addition to the SoftMax function. Moreover, CNNLSTM is a type of Elman recurrent neural network and can consequently be trained with Backpropagation Through Time (BPTT).
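Figure 3 is an image, so as a rough illustration of the layer stack it depicts (two convolution and max-pooling blocks, flattening, a hidden layer, an LSTM recurrent layer, and a SoftMax output), here is a minimal Keras sketch. The filter counts, kernel sizes, and layer widths are illustrative assumptions; only the overall structure follows the text.

```python
from tensorflow.keras import layers, models

SEQ_LEN, H, W = 90, 48, 64   # frames per gesture and differential-image size (64 x 48)
N_CLASSES = 200              # number of gesture classes in the dataset

model = models.Sequential([
    layers.Input(shape=(SEQ_LEN, H, W, 1)),                          # sequence of differential images
    layers.TimeDistributed(layers.Conv2D(16, 5, activation='relu')),
    layers.TimeDistributed(layers.MaxPooling2D(2)),                  # first convolutional layer
    layers.TimeDistributed(layers.Conv2D(32, 5, activation='relu')),
    layers.TimeDistributed(layers.MaxPooling2D(2)),                  # second convolutional layer
    layers.TimeDistributed(layers.Flatten()),                        # flattening layer
    layers.TimeDistributed(layers.Dense(128, activation='relu')),    # hidden layer
    layers.LSTM(256, return_sequences=True),                         # recurrent layer (temporal context)
    layers.TimeDistributed(layers.Dense(N_CLASSES, activation='softmax')),  # label per differential image
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```

Because the LSTM is unrolled over the input sequence, training this model with Keras's standard backpropagation corresponds to the Backpropagation Through Time mentioned above.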

First, in order to detect the person in each individual frame, we used the single-shot detector model pre-trained by the TensorFlow API on Microsoft's COCO (Common Objects in Context) dataset [19]. This dataset comprises nearly 300,000 images with 91 distinct object types labeled correctly; of those, over 65,000 images have people labeled specifically. The area captured by the bounding box is then scaled to a size of 64 x 48 and converted to a differential image. The differential image is the output of the segmentation process and represents the body motion. It is calculated as described in the equation shown in Figure 4, where I_t, I_{t-1}, and I_{t+1} are the frames at the current, previous, and next time steps respectively, '-' corresponds to the segmentation (frame differencing) operation, and '∧' is the bitwise AND operation.

Figure 4. Equation for generating the differential image.
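The equation itself is an image and did not survive extraction. Based on the description above (three consecutive frames, differencing as 'segmentation', and a bitwise AND) and the CNNLSTM reference [18], a plausible reconstruction is ΔI_t = (I_t − I_{t−1}) ∧ (I_{t+1} − I_t); the NumPy sketch below uses this assumed form, with an arbitrary threshold.

```python
import numpy as np

def differential_image(prev_f, cur_f, next_f, thresh=30):
    """Assumed reconstruction of Figure 4: threshold two consecutive frame
    differences and combine them with a bitwise AND, so that only pixels
    moving across both intervals survive (the body motion)."""
    d1 = np.abs(cur_f.astype(np.int16) - prev_f.astype(np.int16)) > thresh
    d2 = np.abs(next_f.astype(np.int16) - cur_f.astype(np.int16)) > thresh
    return np.logical_and(d1, d2).astype(np.uint8) * 255
```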
The differential image is then given as input to the first convolutional layer at each time step. After processing the differential image, the first convolutional layer produces a set of feature maps, which is consequently processed by the second convolutional layer. The output from the second layer is then flattened, passed as input into the hidden layer, and finally into the LSTM blocks of the recurrent layer. The recurrent layer, responsible for mapping the temporal context, ultimately assigns a gesture label to the differential image.

4. RESULTS
4.1 Text to Sign
To check the accuracy of the complete system, an interpreter tested it in the deaf domain. The interpreter tested a sample of 58 sentences and ranked the system on factors such as grammar translation, appropriate sign representation, and semantic transfer. The performance of the system was evaluated using accuracy, precision, recall, and F1-measure, as shown in Table 2. The results indicate that most of the sentences are correctly translated; the minority of sentences that were translated incorrectly contain words not found in our library.

Table 2. Metrics for Text to Sign translation using NLP

Accuracy    Precision    Recall    F1-measure
92.55%      88.05%       83.45%    85.68%

For instance, Figure 5 shows an example of an Arabic sentence being processed correctly to produce its sign language equivalent.

Figure 5. Arabic sentence processed to output its sign language equivalent.

The processed text is then tokenized, and three different tokens are generated; the videos corresponding to the tokens are retrieved and rearranged to be displayed in order, as shown in Figure 6.

Figure 6. Videos corresponding to the tokens are retrieved and displayed in order.

4.2 Sign to Text

Table 3. Metrics for Sign to Text translation using the CNNLSTM model

Accuracy    Precision    Recall    F1-measure
88.67%      87.52%       85.75%    86.62%

The CNNLSTM model was evaluated using hold-out cross-validation. For each gesture sequence in the testing dataset, the differential images corresponding to the frames in the sequence were processed by the CNNLSTM model, and the label assigned most frequently to those differential images was considered representative of the gesture. As with the text to sign evaluation, the CNNLSTM model was evaluated on accuracy, precision, recall, and F1-measure, as shown in Table 3.
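A minimal sketch of the label-assignment rule just described: the per-frame predictions for a test sequence are reduced to a single gesture label by majority vote (the function name and inputs are illustrative).

```python
from collections import Counter

def sequence_label(frame_labels):
    """Return the label assigned most often across a sequence's differential images."""
    return Counter(frame_labels).most_common(1)[0][0]

# sequence_label(['hello', 'hello', 'rest', 'hello']) -> 'hello'
```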
Furthermore, Figure 7 illustrates the motion tracked frame by frame for some of the accurately classified signs.

Figure 7. Motion of ArSL gesture for the phrase 'How are you?' (above) and 'meeting' (below).

Overall, the individual frames were mostly classified correctly, with a minority of the initial frames in a gesture sequence being classified incorrectly. This may be because the initial frames represent the rest position, which is common to most gestures.

5. CONCLUSION
To bridge the communication gap between the deaf and the hearing community, we have built a 2-way sign language translator that could be implemented on a smartphone. The CNNLSTM architecture used for sign to text translation is especially well suited to this task because it works with RGB input from a regular smartphone camera. For sign language to text translation, however, the proposed work is limited to translating single dynamic words and phrases, and could be improved to translate complete sentences. Additionally, the model could be extended to accommodate 2-way translation across multiple languages. Moreover, connecting the model to a cloud database holding a crowdsourced gesture library would keep the model robust to the sociolinguistic changes affecting sign language. Overall, a two-way translator will have a profound impact on the educational sector and on society as a whole.

Almost 15% of school-age children (ages 6-19) in the United States alone have some degree of hearing loss [20]. The 2-way translator could be adapted to fit into current schools and colleges. This would greatly help hearing-impaired students integrate into mainstream schools rather than being sent to specialized schools, and would help reduce the cost of expensive specialized institutions and accompanying interpreters. In addition, it would enable deaf students to freely and fully voice their thoughts and questions in class. Furthermore, it would give the hearing-impaired community equitable educational and employment opportunities, which are often lost due to communication gaps or errors in translation.

The translator will give the deaf a choice between 'Deaf Culture' and 'Normal' culture [21]. Communication via the mobile device would allow the deaf to explore and interact with more places and people, giving them more social experiences.

A precise two-way communicator would also vastly improve the emotional health of a deaf person. 40 percent of displaced refugees from war-torn areas suffer from hearing loss due to high-pressure waves from bombing. In addition to the sudden pressure of adjusting to a new language and culture, loss of hearing is an added disadvantage. Overcoming the communication barrier is of great need and importance to deaf refugees, and a two-way communicator would immensely help them secure a sustainable life.

6. REFERENCES
[1] United Nations. "Sign Language, Deaf, Advocacy, Human Rights, Disability." United Nations, https://www.un.org/en/events/signlanguagesday.
[2] Sarji, David K. 2008. "HandTalk: Assistive Technology for the Deaf." Computer, vol. 41, no. 7, pp. 84–86, doi:10.1109/mc.2008.226.
[3] P. Molchanov, X. Yang, S. Gupta, K. Kim, S. Tyree, and J. Kautz. 2016. Online detection and classification of dynamic hand gestures with recurrent 3D convolutional neural network. In Proc. CVPR.
[4] N. Neverova, C. Wolf, G. Taylor, and F. Nebout. 2014. Multi-scale deep learning for gesture detection and localization. In ECCV Workshops.
[5] D. Wu, L. Pigou, P.-J. Kindermans, N. Le, L. Shao, J. Dambre, and J.-M. Odobez. 2016. Deep dynamic neural networks for multimodal gesture segmentation and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(8):1583–1597.
[6] Camgoz, Necati Cihan, et al. 2017. "SubUNets: End-to-End Hand Shape and Continuous Sign Language Recognition." 2017 IEEE International Conference on Computer Vision (ICCV), doi:10.1109/iccv.2017.332.
[7] A. E. E. El Alfi, M. M. R. El Basuony, and S. M. El Atawy. 2014. Intelligent Arabic text to Arabic sign language translation for easy deaf communication. Int J Comput Appl, 92, pp. 22–29.
[8] S. M. Halawani and A. B. Zaitun. 2012. An avatar based translation system from Arabic speech to Arabic sign language for deaf people. Int J Inf Sci Educ, 2, pp. 13–20, ISSN 2231-1262.
[9] H. Al-Khalifa. 2010. Introducing Arabic sign language for mobile phones. Comput Help People Spec Needs, 6180, pp. 213–220, Springer Berlin Heidelberg.
[10] El-Gayyar, Mahmoud M., et al. 2016. "Translation from Arabic Speech to Arabic Sign Language Based on Cloud Computing." Egyptian Informatics Journal, vol. 17, no. 3, pp. 295–303, doi:10.1016/j.eij.2016.04.001.
[11] Stamp, R., Schembri, A., Fenlon, J., Rentelis, R., Woll, B., and Cormier, K. 2014. Lexical Variation and Change in British Sign Language. PLoS ONE 9(4): e94053, https://doi.org/10.1371/journal.pone.0094053.
[12] Stamp, Rose, et al. 2014. "Lexical Variation and Change in British Sign Language." PLoS ONE, vol. 9, no. 4, doi:10.1371/journal.pone.0094053.
[13] Minisi, M. N. 2015. Arabic sign language dictionary; http://www.menasy.com/
[14] @esl_zayed. 2016. "Emirati Sign Language (ESL)", Instagram photos and videos. Instagram, https://www.instagram.com/esl_zayed/?hl=en
[15] "The Stanford NLP Group." The Stanford Natural Language Processing Group, https://nlp.stanford.edu/projects/arabic.shtml.
[16] Mohataher. 2017. "mohataher/arabic-stop-words." GitHub, 22 Jan., https://github.com/mohataher/arabic-stop-words.
[17] Oudalab. "oudalab/Arabic-NER." GitHub, https://github.com/oudalab/Arabic-NER.
[18] Tsironi, Eleni, et al. 2017. "An Analysis of Convolutional Long Short-Term Memory Recurrent Neural Networks for Gesture Recognition." Neurocomputing, vol. 268, pp. 76–86, doi:10.1016/j.neucom.2016.12.088.
[19] Lin, Tsung-Yi, Maire, Michael, Belongie, Serge, Hays, James, Perona, Pietro, Ramanan, Deva, Dollár, Piotr, and Zitnick, C. Lawrence. 2014. Microsoft COCO: Common Objects in Context. Lecture Notes in Computer Science, vol. 8693, doi:10.1007/978-3-319-10602-1_48.
[20] Lisa Yuan. "Hearing Loss Facts and Demographics." HLAA, http://hlaa-la.org/better-hearing/hearing-loss-statistics-and-demographics/.
[21] Joanne Cripps. 2017. "What Is Deaf Culture?" DEAF CULTURE CENTRE, 27 Dec., https://deafculturecentre.ca/what-is-deaf-culture/.