STGCN

Abstract—Indian Sign Language (ISL) recognition poses unique challenges due to its intricate gestures and diverse linguistic variations. In this study, we propose a novel approach utilizing Spatial-Temporal Graph Convolutional Networks (ST-GCN) for accurate ISL recognition. Our method leverages the spatial and temporal dependencies inherent in sign language gestures, modeling them through graph convolutional networks. We construct a spatiotemporal graph representation of sign gestures, enabling effective feature extraction and learning of complex temporal dynamics. Extensive experiments on benchmark ISL datasets demonstrate the superiority of our proposed approach over existing methods, achieving state-of-the-art performance in ISL recognition accuracy. Additionally, our model exhibits robustness to variations in signing speed and style, highlighting its potential for real-world applications in assistive technology and communication aids for the deaf community.

Keywords—Indian Sign Language, Convolutional Neural Network, Spatial Temporal Graph, Graph Convolutional Network.

I. INTRODUCTION

Communication has always played a crucial role in the lives of human beings. The ability to interact with others and express ourselves is a fundamental necessity. However, our perspective and communication style can vary greatly from those around us, depending on factors such as upbringing, education, and society. It is also important to ensure that we are understood in the way we intend. Despite this, most people do not face significant difficulties when interacting with each other and can easily express themselves through speech, gestures, body language, reading, and writing. Speech, in particular, is the most widely used. However, individuals with speech impairments face challenges in communicating with the majority who rely on spoken language. They often rely solely on sign language, which further complicates their ability to communicate effectively. This highlights the need for sign language recognition systems that can convert sign language into spoken or written language, and vice versa. Unfortunately, such systems are currently limited in availability, expensive, and cumbersome to use. Researchers from various countries are now actively working on developing sign language recognition systems that aim to bridge the communication gap faced by individuals with speech impairments, allowing them to communicate more easily with others. However, despite India's diverse population, which accounts for nearly 17.7% of the world's population, limited research has been conducted in this area, in stark contrast to the efforts made in other countries [1], [2], [3].

The delay in standardization helps explain this gap. Indian Sign Language (ISL) studies commenced in India in 1978; however, due to the absence of a standardized form of ISL, its application remained limited to short-term courses. Furthermore, the gestures employed in the majority of deaf schools exhibited notable variations, with nearly 5% of the overall deaf population attending such institutions. It was not until 2003 that ISL underwent standardization and began garnering the interest of researchers [4].

ISL encompasses a diverse range of static and dynamic signs, including both single- and double-handed gestures. Because different regions in India employ numerous variations for the same alphabet, introducing a unified scheme becomes exceedingly challenging. Moreover, the absence of a standardized dataset further complicates matters, highlighting the intricate nature of Indian Sign Language. In recent times, researchers have begun delving into this domain, employing primarily two distinct approaches for sign language recognition: the sensor-based approach and the vision-based approach [5].

The sensor-based approach relies on gloves or other instruments capable of recognizing finger gestures and converting them into corresponding electrical signals for sign interpretation. Conversely, the vision-based approach utilizes web cameras to capture videos or images. This approach, devoid of specialized hardware requirements, offers spontaneity and is favored by signers [6]. However, hand segmentation in complex settings remains a significant challenge, and it is crucial for accurate identification. Hence, a framework to address this issue is proposed.

Advancements in machine learning and deep learning technologies are introducing novel methods and algorithms for efficient, accurate, and cost-effective recognition of Indian Sign Language alphabets. The end-to-end automation of these models overcomes the subjective and inconsistent limitations of traditional methods, enhancing accuracy and efficiency.

This study presents a methodology for constructing a large, diverse, and robust real-time system for recognizing alphabets (A–Z) and digits (0–9) in Indian Sign Language.
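As a rough illustration of the hand-segmentation step that vision-based pipelines typically begin with, the sketch below thresholds an HSV image on a skin-tone range. The function name and threshold values are illustrative assumptions, not parameters from this paper; real systems tune them per camera and lighting.

```python
import numpy as np

def segment_hand_hsv(hsv_image, h_range=(0, 30), s_range=(40, 255), v_range=(60, 255)):
    """Return a boolean mask of likely skin pixels in an HSV image.

    The default ranges are illustrative only; in practice they are
    calibrated for the camera and lighting conditions at hand.
    """
    h, s, v = hsv_image[..., 0], hsv_image[..., 1], hsv_image[..., 2]
    mask = (
        (h >= h_range[0]) & (h <= h_range[1])
        & (s >= s_range[0]) & (s <= s_range[1])
        & (v >= v_range[0]) & (v <= v_range[1])
    )
    return mask

# Tiny synthetic "frame": one skin-like pixel and one background pixel.
frame = np.array([[[15, 120, 200],    # skin-like HSV triple
                   [100, 20, 30]]])   # bluish, dark background
mask = segment_hand_hsv(frame)
print(mask)  # [[ True False]]
```

A fixed color threshold is exactly what breaks down in the complex backgrounds mentioned above, which is why learned, skeleton-based features are attractive.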
Unlike high-end technologies such as gloves or the Kinect, the authors recognize signs from images captured via a webcam. The paper also reports the accuracy achieved in the results. Real-time, accurate, and efficient recognition of ISL signs is crucial for bridging the communication gap between hearing- or speech-impaired individuals and the general population.

II. RELATED WORK

Different authors have employed diverse methodologies depending on the nature of the sign language and the signs involved.

Singha et al. [7] proposed a real-time recognition method utilizing eigenvalue-weighted Euclidean distance for sign classification. Kishore et al. [8] introduced a system employing an Artificial Neural Network (ANN) to classify signs by finding active contours from boundary edge maps. Another approach utilized the Viola-Jones algorithm with Local Binary Pattern (LBP) features for real-time hand gesture recognition, requiring less processing capacity for movement detection [9]. Segmentation, a crucial step in hand processing, often relied on Otsu's algorithm due to its high accuracy rate [10]. A moving-block distance parameterization method was explored in [11] to skip the initialization and segmentation steps, focusing on high-precision static symbols and basic word units.

While many works focused on pattern recognition and feature extraction [12], single-feature systems often proved insufficient, leading to the introduction of hybrid approaches. Nandy et al. [13] combined K-Nearest Neighbor (KNN) and Euclidean distance with oriented histogram features, albeit with poor performance on similar gestures. Manjushree et al. [14] employed histograms of oriented gradients and feature matching for single-handed sign classification, while Kanade et al. [15] achieved good accuracy using Principal Component Analysis (PCA) features and Support Vector Machines (SVM). Sahoo [16] proposed recognition for both single- and double-handed character signs, while Geetha M. et al. [17] utilized B-spline approximation for shape matching of static gestures.

Deep learning technologies have revolutionized image recognition for real-time systems. Jayadeep et al. [23] utilized Convolutional Neural Networks (CNN) for feature extraction and Long Short-Term Memory (LSTM) networks for classification and translation. Bin et al. [24] proposed the InceptionV3 model for identifying static signs using depth sensors, eliminating the segmentation and feature extraction steps. Bheda et al. [25] introduced a mini-batch supervised learning method with stochastic gradient descent, using deep convolutional neural networks to classify images of each digit and American Sign Language letter.

Lastly, and most importantly, the work by Shagun K. et al. on an Indian Sign Language recognition system using SURF features with SVM and CNN classifiers has provided the principal foundation for our project [26].

Motivated by these advancements, the authors opted to create a custom dataset and algorithm for accurate video detection. They chose SURF features to reduce measurement time and ensure the system's rotation invariance. Additionally, they addressed background dependency to enable system use in diverse environments, not just controlled settings.

The recognition of sign languages has made significant progress in recent years, motivated mainly by the advent of advanced sensors, new machine learning techniques, and more powerful hardware [28]. Moreover, approaches considered intrusive, requiring sensors such as gloves, accelerometers, and markers attached to the body of the interlocutor, have been gradually abandoned and replaced by new approaches using conventional cameras and computer vision techniques.

Owing to this shift, there has also been a notable increase in the adoption of feature-extraction techniques such as SIFT, HOG, HOF, and STIP to preprocess images obtained from cameras and provide more information to machine learning algorithms [30], [27].

Convolutional Neural Networks (CNNs), as in many computer vision applications, have obtained remarkable results in this field, with accuracy reaching 90% depending on the dataset [31], [32]. Variations also exist, such as 3D CNNs, combinations with other models such as Inception, and region-of-interest applications [33], [34], [35]. Recurrent Neural Networks and Temporal Residual Networks have likewise obtained interesting results for the same purpose [36], [37].

Despite the above advances, a large portion of these studies addresses static signs or single-letter images from the dactylology [31], [34]. The problem is that this neglects the intrinsic dynamics of the language, such as its movements, non-manual expressions, and articulations between parts of the body [35]. In this sense, it is extremely relevant that new studies observe such important characteristics.

With this purpose, we present an approach based on skeletal body movement to perform sign recognition. This technique, known as the Spatial-Temporal Graph Convolutional Network (ST-GCN), was introduced in [36]. The approach aims for methods capable of autonomously capturing the patterns contained in the spatial configuration of the body joints as well as their temporal dynamics.

III. THE PROPOSED WORK

Sign language recognition demands reliable and extensive data to develop a precise system beneficial for real-time users. In this study, the authors utilized a custom-built dataset to address the challenges of sign detection and classification. The data progresses through various stages in the sign language recognition process, as illustrated in Figure 1: Dataset Creation, Image Acquisition, Data Pre-processing, Feature Extraction, and Sign Classification.
1. Pre-processing (division of data into training data and testing data).
2. Feature extraction and training through ST-GCN.
3. Classification and testing.
4. Reverse recognition.
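To make the ST-GCN feature-extraction idea concrete, the following is a minimal sketch of the spatial step of a graph convolution over a hypothetical five-joint skeleton. The joint set, normalization scheme, and random weights are illustrative assumptions rather than this paper's configuration; a full ST-GCN additionally convolves these per-joint features along the time axis across video frames.

```python
import numpy as np

# Hypothetical 5-joint skeleton (not the paper's joint set):
# 0=head, 1=neck, 2=torso, 3=left hand, 4=right hand,
# with edges following the body's kinematic links.
edges = [(0, 1), (1, 2), (1, 3), (1, 4)]
num_joints = 5

# Adjacency with self-loops: A_hat = A + I.
A = np.eye(num_joints)
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

# Symmetric normalization: D^{-1/2} (A + I) D^{-1/2}.
deg = A.sum(axis=1)
D_inv_sqrt = np.diag(deg ** -0.5)
A_norm = D_inv_sqrt @ A @ D_inv_sqrt

# One spatial graph-convolution step on per-joint features X
# (joints x channels): neighbouring joints exchange information
# before a learned projection W (random stand-in here).
rng = np.random.default_rng(0)
X = rng.standard_normal((num_joints, 3))   # e.g. (x, y, confidence) per joint
W = rng.standard_normal((3, 8))            # learned weights in a real model
H = A_norm @ X @ W                         # aggregated, projected features
print(H.shape)  # (5, 8)
```

Stacking such layers, with an added 1-D convolution over the frame dimension, lets the network capture both the spatial configuration of the joints and their temporal dynamics, which is the property motivating its use here.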
Fig. 1. Flowchart
Fig. 3. Pre-Processing