
Indian Sign Language Recognition using Spatial-Temporal Graph Convolutional Networks


Rohit Majumder, Bhaskar Chaurasia, Sanika Khankale, Aditya Tyagi
Department of Computer Science and Engineering
Bharati Vidyapeeth College of Engineering, Navi Mumbai, INDIA
[email protected], [email protected], [email protected], [email protected]

Abstract—Indian Sign Language (ISL) recognition poses unique challenges due to its intricate gestures and diverse linguistic variations. In this study, we propose a novel approach utilizing Spatial-Temporal Graph Convolutional Networks (ST-GCN) for accurate ISL recognition. Our method leverages the spatial and temporal dependencies inherent in sign language gestures, modeling them through graph convolutional networks. We construct a spatiotemporal graph representation of sign gestures, enabling effective feature extraction and learning of complex temporal dynamics. Extensive experiments on benchmark ISL datasets demonstrate the superiority of our proposed approach over existing methods, achieving state-of-the-art performance in ISL recognition accuracy. Additionally, our model exhibits robustness to variations in signing speed and style, highlighting its potential for real-world applications in assistive technology and communication aids for the deaf community.

Keywords—Indian Sign Language, Convolutional Neural Network, Spatial Temporal Graph, Graph Convolutional Network.

I. INTRODUCTION

Communication has always played a crucial role in the lives of human beings. The ability to interact with others and express ourselves is a fundamental necessity. However, our perspective and communication style can vary greatly from those around us, depending on factors such as upbringing, education, and society. It is also important to ensure that we are understood in the way we intend. Despite this, most people do not face significant difficulties when interacting with each other and can easily express themselves through speech, gestures, body language, reading, and writing. Speech, in particular, is the most widely used. However, individuals with speech impairments face challenges in communicating with the majority who rely on spoken language. They often rely solely on sign language, which further complicates their ability to communicate effectively. This highlights the need for sign language recognition systems that can convert sign language into spoken or written language, and vice versa. Unfortunately, such systems are currently limited in availability, expensive, and cumbersome to use. Researchers from various countries are now actively working on developing sign language recognition systems. These systems aim to bridge the communication gap faced by individuals with speech impairments, allowing them to communicate more easily with others. However, it is worth noting that despite India's diverse population, which accounts for nearly 17.7% of the world's population, limited research has been conducted in this area. This stands in contrast to the efforts made by other countries [1], [2], [3].

The delay in standardization helps explain this gap. Indian Sign Language (ISL) studies commenced in India in 1978. However, due to the absence of a standardized form of ISL, its application remained limited to short-term courses. Furthermore, the gestures employed across deaf schools exhibited notable variations, and only about 5% of the overall deaf population attended such institutions. It was not until 2003 that ISL underwent standardization, garnering the interest of researchers [4].

Indian Sign Language encompasses a diverse range of static and dynamic signs, including both single- and double-handed gestures. Compounded by the fact that different regions in India employ numerous variations for the same alphabet, introducing a unified scheme becomes exceedingly challenging. Moreover, the absence of a standardized dataset further complicates matters, highlighting the intricate nature of Indian sign language. In recent times, researchers have begun delving into this domain, employing primarily two distinct approaches for sign language recognition: the sensor-based approach and the vision-based approach [5].

The sensor-based approach relies on gloves or other instruments capable of recognizing finger gestures and converting them into corresponding electrical signals for sign interpretation. Conversely, the vision-based approach utilizes web cameras to capture videos or images. This approach, free of specialized hardware requirements, offers spontaneity and is favored by signers [6]. However, hand segmentation in complex settings remains a significant challenge, crucial for accurate identification. Hence, a framework to address this issue is proposed.

Advancements in machine learning and deep learning are introducing novel methods and algorithms for efficient, accurate, and cost-effective recognition of Indian sign language alphabets. The end-to-end automation of these models overcomes the subjectivity and inconsistency of traditional methods, enhancing accuracy and efficiency.

This study presents a methodology for constructing a large, diverse, and robust real-time system for recognizing alphabets (A–Z) and digits (0–9) in Indian Sign Language.
Unlike high-end technologies such as gloves or Kinect, the authors recognize signs from images captured via a webcam. The paper also discusses the accuracy achieved in the results. Real-time, accurate, and efficient recognition of ISL signs is crucial for bridging the communication gap between hearing- or speech-impaired individuals and the general population.

II. RELATED WORK

Different authors have employed diverse methodologies depending on the nature of sign language and the signs involved.

Singha et al. [7] proposed a real-time recognition method utilizing eigenvalue-weighted Euclidean distance for sign classification. Kishore et al. [8] introduced a system employing an Artificial Neural Network (ANN) to classify signs by finding active contours from boundary edge maps. Another approach utilized the Viola-Jones algorithm with Local Binary Pattern (LBP) features for real-time hand gesture recognition, requiring less processing capacity for movement detection [9]. Segmentation, a crucial step in hand processing, often relied on Otsu's algorithm due to its high accuracy rate [10]. A moving-block distance parameterization method was explored in [11] to skip the initialization and segmentation steps, focusing on high-precision static symbols and basic word units.

While many works focused on pattern recognition and feature extraction [12], single-feature systems often proved insufficient, leading to the introduction of hybrid approaches. Nandy et al. [13] combined K-Nearest Neighbor (KNN) and Euclidean distance with oriented-histogram features, albeit showing poor performance on similar gestures. Manjushree et al. [14] employed histograms of oriented gradients and feature matching for single-handed sign classification, while Kanade et al. [15] achieved good accuracy using Principal Component Analysis (PCA) features and Support Vector Machines (SVM). Sahoo [16] proposed recognition for both single- and double-handed character signs, while Geetha M. et al. [17] utilized B-Spline approximation for shape matching of static gestures.

Deep learning technologies have revolutionized image recognition for real-time systems. Jayadeep et al. [23] utilized Convolutional Neural Networks (CNN) for feature extraction and Long Short-Term Memory (LSTM) networks for classification and translation. Bin et al. [24] proposed the InceptionV3 model for identifying static signs using depth sensors, eliminating the segmentation and feature extraction steps. Bheda et al. [25] introduced a mini-batch supervised learning method with stochastic gradient descent for classifying images of each digit and American Sign Language letter using deep convolutional neural networks.

Lastly, and most importantly, the work by Shagun K. et al. on an Indian Sign Language recognition system using SURF with SVM and CNN has provided the principal support and foundation for our project [26].

Motivated by these advancements, the authors opted to create a custom dataset and algorithm for accurate video detection. They chose SURF features to reduce measurement time and ensure the system's rotation invariance. Additionally, they addressed background dependency to enable system use in diverse environments, not just controlled settings.

The recognition of sign languages has made significant progress in recent years, motivated mainly by the advent of advanced sensors, new machine learning techniques, and more powerful hardware [28]. Moreover, approaches considered intrusive, requiring sensors such as gloves, accelerometers, and markers attached to the body of the interlocutor, have been gradually abandoned and replaced by new approaches using conventional cameras and computer vision techniques.

Owing to this shift, the increased adoption of feature extraction techniques such as SIFT, HOG, HOF, and STIP, which preprocess images obtained from cameras and provide more information to machine learning algorithms, is also notable [30], [27].

Convolutional Neural Networks (CNN), as in many computer vision applications, have obtained remarkable results in this field, with accuracy reaching 90% depending on the dataset [31], [32]. There are also variations such as 3D CNNs, combinations with other models such as Inception, and region-of-interest applications [33], [34], [35]. Recurrent Neural Networks and Temporal Residual Networks have also obtained interesting results for the same purpose [36], [37].

Despite the above advances, a large portion of these studies addresses static signs or single-letter images from dactylology (fingerspelling) [31], [34]. The problem is the resulting neglect of the intrinsic dynamics of the language, such as its movements, non-manual expressions, and articulations between parts of the body [35]. In this sense, it is extremely relevant that new studies observe such important characteristics.

With this purpose, we present an approach based on skeletal body movement to perform sign recognition. This technique, known as the Spatial-Temporal Graph Convolutional Network (ST-GCN), was introduced in [36]. The approach aims to provide methods capable of autonomously capturing the patterns contained in the spatial configuration of the body joints as well as their temporal dynamics.

III. THE PROPOSED WORK

Sign language recognition demands reliable and extensive data to develop a precise system beneficial for real-time users. In this study, the authors utilized a custom-built dataset to address the challenges of sign detection and classification. The data progresses through various stages in the sign language recognition process, as illustrated in Figure 1: Dataset Creation, Image Acquisition, Data Pre-processing, Feature Extraction, and Sign Classification.
1. Pre-processing (division of data into training data and testing data).
2. Feature extraction and training through ST-GCN.
3. Classification and testing.
4. Reverse recognition.
Fig. 1. Flowchart
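The core ST-GCN idea — joints as graph nodes, bones as spatial edges, with the same joint linked across consecutive frames in time — can be illustrated with a minimal NumPy sketch. The five-joint skeleton, bone list, and channel sizes below are our own toy assumptions, not the paper's configuration; a full ST-GCN additionally uses learned partitions, temporal convolutions, and many stacked layers.

```python
import numpy as np

# Toy skeleton: 5 joints (0 = wrist, 1-4 = fingertips) -- an illustrative
# layout, not the keypoint set used in the paper.
NUM_JOINTS = 5
BONES = [(0, 1), (0, 2), (0, 3), (0, 4)]  # spatial edges within one frame

def build_adjacency(num_joints, bones):
    """Symmetric spatial adjacency with self-loops, row-normalized by
    degree: A_hat = D^-1 (A + I). Temporal edges (the same joint across
    consecutive frames) are handled by temporal convolutions in a full
    ST-GCN and are omitted here for brevity."""
    a = np.eye(num_joints)
    for i, j in bones:
        a[i, j] = a[j, i] = 1.0
    return a / a.sum(axis=1, keepdims=True)

def spatial_graph_conv(x, a_hat, w):
    """One spatial graph-convolution step.
    x: (C_in, T, V) keypoint features, w: (C_out, C_in) channel mixing."""
    aggregated = np.einsum('uv,ctv->ctu', a_hat, x)  # average neighbour features
    return np.einsum('oc,ctv->otv', w, aggregated)   # mix channels

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 4, NUM_JOINTS))  # (x, y, conf) x 4 frames x 5 joints
a_hat = build_adjacency(NUM_JOINTS, BONES)
w = rng.standard_normal((8, 3))              # 3 input -> 8 output channels
y = spatial_graph_conv(x, a_hat, w)          # shape (8, 4, 5)
```

In the full model, stacks of such spatial units alternate with temporal convolutions over the frame axis, which is how ST-GCN captures both the spatial configuration of the joints and their temporal dynamics.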

3.1 Dataset Collection

Dataset collection is a pivotal aspect of research across various domains, serving as the foundation for the development of machine or deep learning models. However, it is fraught with challenges. One major hurdle we encountered during data collection was the absence of standard datasets for Indian sign language. Thus, as part of this project, we undertook the task of manually constructing a dataset to address this issue.

Initially, we recorded videos using a webcam, encompassing various sign gestures. We focused on the 26 alphabets (A–Z) and 10 numeric signs (0–9), captured from three individuals. The positioning of the camera was crucial for picture quality and background noise elimination. To introduce variability into the dataset, we employed two methods for capturing images.

The first method involved default settings, which utilized skin segmentation and plain-color backgrounds. In the second method, we utilized running averages, where initial frames were designated as background and subsequent objects were considered foreground, facilitating extraction. We incorporated both approaches to ensure the model's adaptability to diverse scenarios.

The signs from the live video were converted into frames and further extracted using a pixel-value threshold. The resulting frames had a resolution of 250×250 to minimize computational requirements for pre-processing. Each sign folder contained approximately 1,000 images per sign, totaling 36,000 images across both image acquisition methods. The signs involved both single- and dual-handed gestures, captured from different angles and stored in grayscale format with the .jpg extension. The dataset images are depicted in Figure 2 (refer to Figure 3).

3.2 Preprocessing

Preprocessing for hand sign recognition using ST-GCN (Spatio-Temporal Graph Convolutional Network) involves several key steps to prepare the input data for the model. First, videos of hand gestures are acquired, typically captured using webcams or depth sensors. Individual frames are then extracted from these videos, each representing a distinct moment in time. Hand detection techniques are employed to isolate the hand region within each frame, focusing the analysis on the relevant part of the image. Subsequently, hand segmentation is performed to separate the hand from the background using methods like thresholding or background subtraction. The segmented hand images are resized to a uniform size to ensure consistency in input dimensions.

Temporal sampling is then applied to select a fixed number of frames or a fixed duration from each video sequence, capturing the temporal dynamics of hand movements. Data normalization techniques are employed to bring the pixel values of the hand images within a similar range. Optionally, data augmentation methods such as rotation, translation, and flipping can be applied to increase the variability of the dataset. Finally, the preprocessed frames are represented as a spatio-temporal graph, with nodes corresponding to key points on the hand (e.g., joints) and edges representing spatial and temporal relationships between these points. The preprocessed data is then formatted into the input format expected by the ST-GCN model, facilitating effective hand sign recognition.

In this phase, the focus is on preparing the image for feature detection and extraction while ensuring consistency in scale across all images. In the default approach, where images are captured against a plain background, the initial step involves converting the captured video frame into the HSV color space. This conversion aids in isolating skin-colored pixels, given the distinct hue of skin compared to the background. Subsequently, an experimentally determined threshold is applied to the frame to filter out skin-colored pixels, followed by binarization and blurring to eliminate noise. The largest contour, presumed to correspond to the hands, is then extracted, with further refinement through a median filter and morphological operations to address any errors.
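The skin-segmentation pipeline just described (HSV thresholding, binarization, morphological clean-up) can be sketched in a few lines of NumPy. The threshold values below are illustrative placeholders, not the experimentally tuned values used in the paper, and the 3×3 opening stands in for the paper's median filter and morphological refinement.

```python
import numpy as np

def skin_mask(hsv, h_max=0.14, s_min=0.15, v_min=0.20):
    """Keep skin-coloured pixels of an HSV frame (all channels in [0, 1]).
    These cut-offs are illustrative placeholders, not the experimentally
    tuned thresholds of the paper."""
    h, s, v = hsv[..., 0], hsv[..., 1], hsv[..., 2]
    return (h <= h_max) & (s >= s_min) & (v >= v_min)

def neighborhood_op(mask, use_or=False):
    """3x3 binary erosion (AND over the neighbourhood) or, with
    use_or=True, dilation (OR over the neighbourhood)."""
    rows, cols = mask.shape
    padded = np.pad(mask, 1, constant_values=False)
    out = np.zeros_like(mask) if use_or else np.ones_like(mask)
    for dy in (0, 1, 2):
        for dx in (0, 1, 2):
            window = padded[dy:dy + rows, dx:dx + cols]
            out = (out | window) if use_or else (out & window)
    return out

def binary_opening(mask):
    """Erosion followed by dilation: removes speckle noise smaller than
    the 3x3 structuring element while preserving large blobs."""
    return neighborhood_op(neighborhood_op(mask), use_or=True)

# Synthetic 16x16 HSV frame: an 8x8 skin-coloured block plus one noisy pixel.
frame = np.zeros((16, 16, 3))
frame[4:12, 4:12] = [0.05, 0.5, 0.8]  # hand-like region
frame[0, 0] = [0.05, 0.5, 0.8]        # isolated noise pixel
cleaned = binary_opening(skin_mask(frame))  # noise removed, block kept
```

In the actual pipeline, the cleaned mask would then feed the largest-contour extraction and median-filter refinement; in an OpenCV implementation those steps would typically use functions such as cv2.findContours and cv2.medianBlur.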
Fig. 2. ISL Signs

Fig. 3. Pre-Processing
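The fixed-length temporal sampling step from the preprocessing stage can be sketched as follows; the target length of 32 frames is an assumed value for illustration, as the text does not specify one.

```python
import numpy as np

def sample_frame_indices(num_frames, target_len=32):
    """Select `target_len` frame indices spread evenly over a clip so
    that clips of different lengths map to a fixed-size input sequence.
    The target length of 32 is an assumption for illustration; clips
    shorter than the target simply repeat frames."""
    idx = np.linspace(0, num_frames - 1, num=target_len)
    return idx.round().astype(int)

indices = sample_frame_indices(90)  # 90-frame clip -> 32 evenly spaced frames
```

The selected frames are then stacked into the fixed-size sequence from which the spatio-temporal graph is built.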
