Deep Learning Based Sign Language Recognition System Using Convolutional Neural Network


DEEP LEARNING BASED SIGN LANGUAGE

RECOGNITION SYSTEM FOR STATIC SIGNS USING


CONVOLUTIONAL NEURAL NETWORKS
A PROJECT REPORT

Submitted by

R.PRAVEEN (922018104303)
G.KANNAN (922018104301)
P.THANGAPANDI (922018104028)
in

partial fulfillment for the award of the degree

of

BACHELOR OF ENGINEERING

in

COMPUTER SCIENCE AND ENGINEERING

SRI VIDYA COLLEGE OF ENGINEERING & TECHNOLOGY

VIRUDHUNAGAR 626 005

ANNA UNIVERSITY, CHENNAI 600 025


MAY 2022

ANNA UNIVERSITY: CHENNAI 600 025

BONAFIDE CERTIFICATE

Certified that this project report “DEEP LEARNING BASED SIGN


LANGUAGE RECOGNITION SYSTEM FOR STATIC SIGNS USING
CONVOLUTIONAL NEURAL NETWORKS” is the bonafide work of
“PRAVEEN R (922018104303), KANNAN G (922018104301),
THANGAPANDI P (922018104028) ” who carried out the project work under
my supervision. Certified further that to the best of my knowledge the work
reported herein does not form part of any other thesis or dissertation on the basis
of which a degree or award was conferred on an earlier occasion on this or any
other candidate.

SIGNATURE SIGNATURE
Mrs.M.Mohana M.E., Mrs.M.Mohana M.E.,
HEAD OF THE DEPARTMENT, SUPERVISOR,
Associate Professor, Assistant Professor,
Department of CSE, Department of CSE,
Sri Vidya College of Sri Vidya College of
Engineering & Technology, Engineering & Technology,
Virudhunagar - 626 005. Virudhunagar - 626 005.

Submitted for the project viva-voce held on ___________ at Sri Vidya College
of Engineering & Technology, Virudhunagar.

Internal Examiner External Examiner

ACKNOWLEDGEMENT
First and foremost, we thank the Almighty for His gracious guidance
throughout the project.

We express our sincere and respectful regards to our Chairman,
Er. R. THIRUVENKADA RAMANUJA DOSS, B.E., for providing the necessary
facilities for carrying out this work.

We extend our thanks to our Principal, Dr. T. LOUIE FRANGO, M.Tech.,
Ph.D., of Sri Vidya College of Engineering and Technology, who gave us
permission and extended facilities for the successful completion of the project.

At the same time, we wish to record our deep sense of gratitude to
Mrs. M. MOHANA, M.E., Professor and Head of the Department, Computer
Science and Engineering, for her constant encouragement and valuable
suggestions throughout this project.

We wish to thank Mrs. M. MOHANA, M.E., Project Guide, who generously
offered valuable suggestions and guidance throughout the project. She has been
very friendly and kind in discussing various ideas with us.

We wish to thank Mrs.M.MOHANA M.E., Project Coordinator, who


never hesitated to help us during the course of the project.

Finally, we also thank our parents and friends for their moral support
throughout the project.

ABSTRACT

Sign language is a form of communication used by people with impaired


hearing and speech. People use sign language gestures as a means of non-verbal
communication to express their thoughts and emotions. But non-signers find it
extremely difficult to understand, hence trained sign language interpreters are
needed during medical and legal appointments, educational and training
sessions. Over the past five years, there has been an increasing demand for
interpreting services. Other means, such as video remote human interpreting
over high-speed Internet connections, have been introduced. These provide an
easy-to-use sign language interpreting service, but they have major limitations,
such as the need for Internet access and an appropriate device.
To address this, we use an ensemble of two models to recognize gestures
in sign language. We use a custom-recorded American Sign Language dataset
based on an existing dataset [1] for training the model to recognize gestures.
The dataset is very comprehensive and has 150 different gestures performed
multiple times, giving us variation in context and video conditions. For
simplicity, the videos are recorded at a common frame rate. We propose to use
a CNN (Convolutional Neural Network) named Inception to extract spatial
features from the video stream for Sign Language Recognition (SLR). Then by
using an LSTM (Long Short-Term Memory), an RNN (Recurrent Neural
Network) model, we can extract temporal features from the video sequences.
A proposed improvement is to test the model with more gestures to see how
accuracy scales with larger sample sizes and compare the performance of two
different outputs of a CNN.

Keywords - Sign language recognition, Segmentation, Feature extraction,


Neural networks, Recognition, CNN, RNN, ANN.

TABLE OF CONTENTS

CHAPTER NO    TITLE    PAGE NO
ABSTRACT iv
LIST OF FIGURES vii
LIST OF ABBREVIATIONS viii
1 INTRODUCTION 1
1.1 IMAGE PROCESSING 1
1.2 SIGN LANGUAGE 3
1.3 SIGN LANGUAGE AND HAND GESTURE RECOGNITION 4
1.4 MOTIVATION 5
1.5 PROBLEM STATEMENT 5
1.6 ORGANISATION OF THESIS 5
2 LITERATURE SURVEY 6
3 EXISTING SYSTEM 8
4 PROPOSED SYSTEM 9
4.1 INTRODUCTION 9
4.2 TRAINING MODULE 10
4.2.1 PREPROCESSING 11
4.3 ALGORITHM: CONVOLUTIONAL NEURAL NETWORK 13
4.4 SEGMENTATION 16
4.5 FEATURE EXTRACTION 17
5 MODULES DESCRIPTION 19
5.1 IMAGE ACQUISITION 19
5.2 SEGMENTATION 19
5.3 FEATURES EXTRACTION 19
5.4 PREPROCESSING 20
5.4.1 MORPHOLOGICAL TRANSFORM 20
5.4.2 BLURRING 20

5.4.3 THRESHOLDING 21
5.4.4 RECOGNITION 21
5.4.5 TEXT OUTPUT 21
6 SYSTEM SPECIFICATION 22
6.1 SYSTEM REQUIREMENTS 22
6.1.1 SOFTWARE REQUIREMENTS 22
6.1.2 HARDWARE REQUIREMENTS 22
6.2 PYTHON 23
6.3 ANACONDA 26
6.4 ANACONDA-PACKAGES 26
6.4.1 SPYDER (SOFTWARE) 32
6.4.2 TENSORFLOW 33
6.4.3 OPENCV 33
6.5 WEB CAMERA (HARDWARE) 34
7 IMPLEMENTATION AND RESULTS 36
8 CONCLUSION AND FUTURE SCOPE 39
APPENDIX I 40
APPENDIX II 57

LIST OF FIGURES

FIG.NO NAME OF THE FIGURES PAGE NO


1.1 Phases of pattern recognition 2

4.1 ASL Recognition System Architecture 9

4.2 Dataset used for training the model 12

4.3 Sample pictures of training data 12

4.4 Feature Extraction 18

6.1 Anaconda navigator 28

6.2 Anaconda environments 29

6.3 PIP package manager 29

6.4 Anaconda Jupyter Notebook installation wizard 31

6.5 Jupyter Notebook running 31

7.1 Dataset Collection 1 36

7.2 Dataset Collection 2 36

7.3 Capturing Sign from web camera 37

7.4 Dataset Location path 1 37

7.5 Dataset Location path 2 38

7.6 Recognition of the hand gestures 38

LIST OF ABBREVIATIONS

ACRONYMS ABBREVIATIONS
ASL - American Sign Language

AI - Artificial Intelligence

CNN - Convolutional Neural Network

CV - Computer Vision

PIP - Preferred Installer Program

DNN - Deep Neural Network

RNN - Recurrent Neural Network

FPS - Frames Per Second

RELU - Rectified Linear Unit

MNN - Matcher Neural Network

SRN - Simple Recurrent Network

TDNN - Time Delay Neural Network

CHAPTER 1

INTRODUCTION

Speech impaired people use hand signs and gestures to communicate.


Normal people face difficulty in understanding their language. Hence there is a
need of a system which recognizes the different signs, gestures and conveys the
information to the normal people. It bridges the gap between physically
challenged people and normal people.

1.1 IMAGE PROCESSING


Image processing is a method to perform some operations on an image, in
order to get an enhanced image or to extract some useful information from it. It
is a type of signal processing in which input is an image and output may be
image or characteristics/features associated with that image. Nowadays, image
processing is among rapidly growing technologies. It forms core research area
within engineering and computer science disciplines too.
Image processing basically includes the following three steps:
 Importing the image via image acquisition tools.
 Analyzing and manipulating the image.
 Output in which result can be altered image or report that is based on
image analysis.
There are two types of methods used for image processing namely,
Analogue and Digital Image Processing. Analogue image processing can be
used for the hard copies like printouts and photographs. Image analysts use
various fundamentals of interpretation while using these visual techniques.
Digital image processing techniques help in manipulation of the digital images
by using computers. The three general phases that all types of data have to
undergo while using the digital technique are pre-processing, enhancement and
display, and information extraction.

Digital image processing:
Digital image processing consists of the manipulation of images using
digital computers. Its use has been increasing exponentially in the last decades.
Its applications range from medicine to entertainment, passing by geological
processing and remote sensing. Multimedia systems, one of the pillars of the
modern information society, rely heavily on digital image processing.
Digital image processing consists of the manipulation of these finite-precision
numerical representations of images. The processing of digital images can be divided into several
classes: image enhancement, image restoration, image analysis, and image
compression. In image enhancement, an image is manipulated, mostly by
heuristic techniques, so that a human viewer can extract useful information from it.
In short, digital image processing means processing images by computer; it
can be defined as subjecting a numerical representation of an object
to a series of operations in order to obtain a desired result.
Pattern recognition
On the basis of image processing, it is necessary to separate objects from
images by pattern recognition technology, then to identify and classify these
objects through technologies provided by statistical decision theory. When an
image includes several objects, pattern recognition
consists of three phases, as shown in Fig. 1.1.

(Figure: an input image passes through segmentation to give an object image, then feature
extraction to give a feature vector x1, x2, …, xN, and finally classification to give the object type.)

Fig 1.1 Phases of pattern recognition

The first phase includes image segmentation and object separation. In
this phase, the different objects are detected and separated from the background.
The second phase is feature extraction. In this phase, objects are measured:
some important features of the objects are estimated quantitatively, and a group
of these features is combined to make up a feature vector. The third phase is
classification. In this phase, the output is a decision that determines which
category every object belongs to. Therefore, for pattern recognition, the input is
images and the output is object types together with a structural analysis of the
images. The structural analysis is a description of the images that allows the
important information in them to be correctly understood and judged.
1.2 SIGN LANGUAGE
It is a language that includes gestures made with the hands and other
body parts, including facial expressions and postures of the body. It is used
primarily by people who are deaf and dumb. There are many different sign
languages, such as British, Indian and American sign languages. British Sign
Language (BSL) is not easily intelligible to users of American Sign Language
(ASL), and vice versa.
A functioning sign recognition system could provide an opportunity for
deaf people to communicate with non-signing people without the need for an
interpreter. It might be used to generate speech or text, making the deaf more
independent. Unfortunately, there has not been any system with these
capabilities so far. In this project our aim is to develop a system which
can classify signs accurately.
American Sign Language (ASL) is a complete, natural language that has
the same linguistic properties as spoken languages, with grammar that differs
from English. ASL is expressed by movements of the hands and face. It is the
primary language of many North Americans who are deaf and hard of hearing,
and is used by many hearing people as well.

1.3 SIGN LANGUAGE AND HAND GESTURE RECOGNITION
The process of converting the signs and gestures shown by the user into
text is called sign language recognition. It bridges the communication gap
between people who cannot speak and the general public. Image processing
algorithms along with neural networks are used to map the gesture to the appropriate
text in the training data, and hence raw images/videos are converted into
corresponding text that can be read and understood.
Dumb people are usually deprived of normal communication with other
people in the society. It has been observed that they find it really difficult at
times to interact with normal people with their gestures, as only a very few of
those are recognized by most people. Since people with hearing impairment or
deaf people cannot talk like normal people, they have to depend on some sort
of visual communication most of the time.
Like any other language, sign language has its own grammar and vocabulary but
uses the visual modality for exchanging information. The problem arises when
dumb or deaf people try to express themselves to other people with the help of
this sign language grammar, because normal people are usually
unaware of it. As a result, it has been seen that the communication of
a dumb person is often limited to his/her family or the deaf community.
The importance of sign language is emphasized by the growing public approval
of, and funding for, international projects.
In this age of technology, the demand for a computer-based system for the
deaf and mute community is high. Researchers have been
attacking the problem for quite some time now and the results are showing
some promise. Interesting technologies are being developed for speech
recognition, but no real commercial product for sign recognition is actually
available in the current market. The idea is to make computers understand human
language and to develop user-friendly human-computer interfaces (HCI). Making
a computer understand speech, facial expressions and human gestures is a
step towards this.

1.4 MOTIVATION
The 2011 Indian census cites roughly 1.3 million people with “hearing
impairment”. In contrast, the National Association of the Deaf in India estimates
that 18 million people, roughly 1 per cent of the Indian
population, are deaf. These statistics formed the motivation for our project. As
speech-impaired and deaf people need a proper channel to communicate
with normal people, there is a need for such a system. Not all normal people can
understand the sign language of impaired people. Our project is therefore aimed at
converting sign language gestures into text that is readable by normal
people.
1.5 PROBLEM STATEMENT
Speech impaired people use hand signs and gestures to communicate.
Normal people face difficulty in understanding their language. Hence there is a
need of a system which recognizes the different signs, gestures and conveys the
information to the normal people. It bridges the gap between physically
challenged people and normal people.
1.6 ORGANISATION OF THESIS
The report is organized as follows:
Part 1: The various technologies that are studied are introduced and the
problem statement is stated along with the motivation to our project.
Part 2: The Literature survey is put forth which explains the various other
works and their technologies that are used for Sign Language Recognition.
Part 3: Explains the methodologies in detail, represents the architecture and
algorithms used.
Part 4: Represents the project in various designs.
Part 5: Provides the experimental analysis, the code involved and the results
obtained.
Part 6: Concludes the project and provides the scope to which the project can
be extended.

CHAPTER 2

LITERATURE SURVEY

Introduction
The various research works on the Sign Language recognition model
using neural networks are discussed and analyzed.
Sign Language Alphabet Recognition Using Convolution Neural Network
A system for sign language alphabet recognition using a convolutional neural
network was developed in a paper by Mayand Kumar, Piyush Gupta and Rahul
Kumar Jha [1]. The authors pre-processed the images by applying various
operations like hand colour filtering, morphological transformations, etc. The
CNN was then applied to the dataset. A Kalman estimator was used so that
whenever a gesture is identified the mouse pointer moves in response to it.
Lastly, the system was able to achieve an accuracy of 91.8% on 16 gestures.
Video-based Sign Language Recognition without Temporal Segmentation
In a paper by J. Huang, W. Zhou and Q. Zhang et al. [2], the authors
recognized problems in SLR, such as the difficulty of recognition when the signs
are broken down into individual words and the issues with continuous SLR. They
decided to solve the problem without isolating individual signs, which removes
an extra level of preprocessing (temporal segmentation) and an extra layer
of post-processing, because temporal segmentation is non-trivial and its errors
propagate into subsequent steps, and the strenuous labelling of individual words
adds a further challenge. They addressed these issues with a new
framework called LS-HAN, which eliminates the preprocessing step of temporal
segmentation. The framework consists of a two-stream CNN for video feature
representation generation, a Latent Space for semantic gap bridging, and a
Hierarchical Attention Network for latent-space-based recognition.

6
American Sign Language Recognition Using Leap Motion Controller with
Machine Learning Approach
Other approaches to SLR include using an external device such as a
Leap Motion controller to recognize movement and gestures such as the work
done by Chong et al. [3]. The study differs from other work because it includes
the complete grammar of the American Sign Language which consists of 26
letters and 10 digits. The work is aimed at dynamic movements and at extracting
features to study and classify them. The experimental results have been
promising with accuracies of 80.30% for Support Vector Machines (SVM) and
93.81% for Deep Neural Networks (DNN).

Dynamic hand gesture recognition using RGB-D data for natural human-
computer interaction
Research in the field of hand gesture recognition also aids SLR
research, such as the work by Linqin et al. [4]. In it, the authors have used RGB-
D data to recognize human gestures for human-computer interaction. They
approach the problem by calculating Euclidean distance between hand joints
and shoulder features to generate a unifying feature descriptor. An improved
dynamic time warping (IDTW) algorithm is proposed to obtain the final recognition
results; it works by applying weighted distances and a restricted search path to
avoid the major computation costs of conventional approaches. The experimental
results of this method show an average accuracy of 96.5% or better. The idea
is to develop real time gesture recognition which could also be extended to
SLR.

CHAPTER 3

EXISTING SYSTEM

A system for sign language alphabet recognition using a convolutional neural
network was developed in a paper by Mayand Kumar, Piyush Gupta and Rahul
Kumar Jha [1]. The authors pre-processed the images by applying various
operations like hand colour filtering, morphological transformations, etc. The
CNN was then applied to the dataset. A Kalman estimator was used so that
whenever a gesture is identified the mouse pointer moves in response to it.
Lastly, the system was able to achieve an accuracy of 91.8% on 16 gestures.

The first approach in relation to sign language recognition was by Bergh


in 2011 [7]. Haar wavelets and database searching were employed to build a
hand gesture recognition system. Although this system gives good results, it
only considers six classes of gestures. Many types of researches have been
carried out on different sign languages from different countries. For example, a
BSL recognition model, which understands finger-spelled signs from a video,
was built [8]. Initially, a histogram of oriented gradients (HOG) was used to recognize
letters, and then the system used hidden Markov models (HMM) to recognize
words. In another paper, a system was built to recognize sentences made of 3-5
words. Each word had to be one of 19 signs in their lexicon.

Problems Identified in Existing System

 They are costly and are difficult to be used commercially.


 Classification methods also vary from researcher to researcher.
 Researchers tend to develop their own concept, based on known
methods, to give better result in recognizing the sign language.

CHAPTER 4

PROPOSED SYSTEM

4.1 INTRODUCTION

Our proposed system is a sign language recognition system using
convolutional neural networks which recognizes various hand gestures by
capturing video and converting it into frames. The hand pixels are then
segmented, and the resulting image is sent to the trained model for comparison.
Thus our system is more robust in producing exact text labels for letters.

(Figure: the signer performs a sign in front of the camera; the frames pass through image
acquisition, hand detection and tracking, segmentation, preprocessing and feature
extraction; the features are compared with the training database for recognition,
and the recognized sign is output as text.)
Fig 4.1 ASL Recognition System Architecture

4.2 TRAINING MODULE:
Supervised machine learning: It is one of the ways of machine learning
where the model is trained using input data and the expected output data. To create
such a model, it is necessary to go through the following phases:
 model construction
 model training
 model testing
 model evaluation
Model construction: It depends on the machine learning algorithm. In this
project's case, it was a neural network. Such an algorithm looks like this:
 begin with the model object: model = Sequential()
 then add layers with their types: model.add(type_of_layer())
 after adding a sufficient number of layers the model is compiled. At this
moment Keras communicates with TensorFlow to construct the model.
During model compilation it is important to specify a loss function and an
optimizer algorithm. It looks like:
model.compile(loss='name_of_loss_function',
optimizer='name_of_optimizer_alg')
The loss function shows the accuracy of each prediction made by the model.
Before model training it is important to scale the data for further use.
Model training:
After model construction it is time for model training. In this phase, the
model is trained using training data and the expected output for this data. It looks
like this: model.fit(training_data, expected_output). Progress is visible on the
console while the script runs. At the end, it will show the final accuracy of the model.
Model Testing:
During this phase a second set of data is loaded. This data set has never
been seen by the model, and therefore its true accuracy will be verified. After
the model training is complete, and it is understood that the model shows the
right result, it can be saved by: model.save("name_of_file.h5"). Finally, the
saved model can be used in the real world. The name of this phase is model
evaluation. This means that the model can be used to evaluate new data.
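To tie the phases above together, the following is a minimal sketch assuming Keras with the TensorFlow backend; the layer sizes, the 64×64 input size, the 26 classes and the placeholder data are illustrative assumptions, not the project's actual configuration.

# Minimal sketch of model construction, compilation, training, testing and saving
import numpy as np
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense

# Model construction: begin with the object, then add layers
model = Sequential()
model.add(Dense(128, activation='relu', input_shape=(64 * 64,)))
model.add(Dense(26, activation='softmax'))   # one output per static sign (assumed)

# Model compilation: loss function and optimizer
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])

# Model training with (placeholder) scaled data and one-hot expected output
training_data = np.random.rand(100, 64 * 64)
expected_output = np.eye(26)[np.random.randint(0, 26, 100)]
model.fit(training_data, expected_output, epochs=5)

# Model testing on a held-out set the model has never seen
test_data = np.random.rand(20, 64 * 64)
test_labels = np.eye(26)[np.random.randint(0, 26, 20)]
model.evaluate(test_data, test_labels)

# Model evaluation: the saved model can later be reloaded to evaluate new data
model.save("name_of_file.h5")
reloaded = load_model("name_of_file.h5")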
4.2.1 Preprocessing:
Uniform aspect ratio
Understanding aspect ratios: An aspect ratio is a proportional relationship
between an image's width and height. Essentially, it describes an image's shape.
Aspect ratios are written as a ratio of width to height, i.e. width:height. For example,
a square image has an aspect ratio of 1:1, since the height and width are the
same. The image could be 500px × 500px, or 1500px × 1500px, and the aspect
ratio would still be 1:1. As another example, a portrait-style image might have a
ratio of 2:3. With this aspect ratio, the height is 1.5 times the width.
So the image could be 500px × 750px, 1500px × 2250px, etc.
Cropping to an aspect ratio
Aside from using built-in site style options, you may want to manually
crop an image to a certain aspect ratio. For example, if you use product images
that have the same aspect ratio, they'll all crop the same way on your site.
Option 1 - Crop to a pre-set shape
Use the built-in Image Editor to crop images to a specific shape. After opening
the editor, use the crop tool to choose from preset aspect ratios.
Option 2 - Custom dimensions
To crop images to a custom aspect ratio not offered by the built-in Image
Editor, use a third-party editor. Since images don't need to have the
same dimensions to have the same aspect ratio, it's better to crop them to a
specific ratio than to try to match their exact dimensions.
 For instance, if your image is 1500px × 1200px, and you want an aspect
ratio of 3:1, crop the shorter side to make the image 1500px × 500px.
 Don't scale up the longer side; this can make your image blurry.

11
Image scaling:
 In computer graphics and digital imaging, image scaling refers to the
resizing of a digital image. In video technology, the magnification of
digital material is known as upscaling or resolution enhancement.
 When scaling a raster graphics image, a new image with a higher or lower
number of pixels must be generated. In the case of decreasing the pixel
number (scaling down) this usually results in a visible quality loss.
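To make the cropping and scaling steps above concrete, here is a minimal sketch assuming OpenCV; the file names, the 1:1 target ratio and the 128×128 output size are illustrative assumptions, not the project's actual settings.

# Crop an image to a fixed aspect ratio, then scale it to a uniform size
import cv2

image = cv2.imread("sample_sign.jpg")          # hypothetical input image
h, w = image.shape[:2]

# Crop to a 1:1 aspect ratio by trimming the longer side (never scale it up)
side = min(h, w)
top = (h - side) // 2
left = (w - side) // 2
square = image[top:top + side, left:left + side]

# Scale down to a common resolution for the network
resized = cv2.resize(square, (128, 128), interpolation=cv2.INTER_AREA)
cv2.imwrite("sample_sign_128.jpg", resized)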
DATASETS USED FOR TRAINING

Fig 4.2 Dataset used for training the model


TRAINING DATA
After model construction it is time for model training. In this phase, the
model is trained using training data and expected output for this data. Progress
is visible on the console when the script runs.

Fig 4.3 Sample Pictures of Training Data

4.3 ALGORITHM: CONVOLUTIONAL NEURAL NETWORK
Image classification is the process of taking an input (like a picture) and
outputting its class or the probability that the input belongs to a particular class.
Neural networks are applied in the following steps:
 One hot encode the data: A one-hot encoding can be applied to the
integer representation. This is where the integer encoded variable is
removed and a new binary variable is added for each unique integer
value.
 Define the model: A model said in a very simplified form is nothing but
a function that is used to take in certain input, perform certain operation
to its best on the given input (learning and then predicting/classifying)
and produce the suitable output.
 Compile the model: The optimizer controls the learning rate. We will be
using 'adam' as our optimizer. Adam is generally a good optimizer to use
for many cases. The adam optimizer adjusts the learning rate throughout
training.
 Learning rate: The learning rate determines how fast the optimal
weights for the model are calculated. A smaller learning rate may lead to
more accurate weights (up to a certain point), but the time it takes to
compute the weights will be longer.
 Train the model: Training a model simply means learning (determining)
good values for all the weights and the bias from labeled examples. In
supervised learning, a machine learning algorithm builds a model by
examining many examples and attempting to find a model that minimizes
loss; this process is called empirical risk minimization.
 Test the model: A convolutional neural network convolves learned
features with the input data and uses 2D convolution layers; a short sketch of
these steps is given after this list.
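As referenced above, the listing below sketches one-hot encoding and the definition and compilation of a small convolutional model with the adam optimizer, assuming Keras; the 64×64 grayscale input size and the 26 classes are illustrative assumptions.

# One-hot encoding plus a small CNN definition and compilation
import numpy as np
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# One hot encode the data: integer labels 0..25 become binary vectors
labels = np.array([0, 5, 25])
one_hot = to_categorical(labels, num_classes=26)

# Define the model: a small stack of 2D convolution and pooling layers
model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 1)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation='relu'),
    Dense(26, activation='softmax'),
])

# Compile the model with the adam optimizer, which adapts the learning rate
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])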

Convolution Operation:
In purely mathematical terms, convolution is a function derived from two
given functions by integration which expresses how the shape of one is
modified by the other.
Convolution formula: (f ∗ g)(t) = ∫ f(τ) g(t − τ) dτ, i.e. the integral of the
product of the two functions after one of them is reversed and shifted.
Here are the three elements that enter into the convolution operation:
 Input image
 Feature detector
 Feature map
 You place it over the input image beginning from the top-left corner
within the borders you see demarcated above, and then you count the
number of cells in which the feature detector matches the input image.
 The number of matching cells is then inserted in the top-left cell of the
feature map
 You then move the feature detector one cell to the right and do the same
thing. This movement is called a stride, and since we are moving the feature
detector one cell at a time, that would be a stride of one pixel.
 What you will find in this example is that the feature detector's middle-
left cell with the number 1 inside it matches the cell that it is standing
over inside the input image.
 That's the only matching cell, and so you write “1” in the next cell in the
feature map, and so on and so forth.
 After you have gone through the whole first row, you can then move it
over to the next row and go through the same process.
 There are several uses that we gain from deriving a feature map. A small
numeric sketch of this stride-1 operation follows.
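The sketch below illustrates the stride-1 sliding-window operation described above on a made-up 5×5 binary input and 3×3 feature detector, expressed as the usual elementwise multiply-and-sum; all values are illustrative.

# Manual stride-1 convolution producing a feature map
import numpy as np

input_image = np.array([
    [1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 1, 0, 0],
    [1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1],
])
feature_detector = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
])

rows = input_image.shape[0] - feature_detector.shape[0] + 1
cols = input_image.shape[1] - feature_detector.shape[1] + 1
feature_map = np.zeros((rows, cols), dtype=int)
for i in range(rows):            # move down one row after finishing a row
    for j in range(cols):        # stride of one pixel to the right
        patch = input_image[i:i + 3, j:j + 3]
        feature_map[i, j] = np.sum(patch * feature_detector)

print(feature_map)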

Relu Layer:
The rectified linear unit is used to map values to non-negative
values. Pixel (feature-map) values can be negative too; in this layer we set them
to 0. The purpose of applying the rectifier function is to increase the non-
linearity in our images.
The reason we want to do that is that images are naturally non-linear. The
rectifier serves to break up the linearity even further in order to make up for the
linearity that we might impose on an image when we put it through the convolution
operation. What the rectifier function does to an image like this is remove all
the black elements from it, keeping only those carrying a positive value (the
grey and white colors).
The essential difference between the non-rectified version of the image
and the rectified one is the progression of colors. After we rectify the image,
you will find the colors changing more abruptly. The gradual change is no
longer there. That indicates that the linearity has been disposed of.
Pooling Layer:
The pooling (POOL) layer reduces the height and width of the input. It
helps reduce computation, as well as helping make feature detectors more
invariant to their position in the input. This process is what provides the
convolutional neural network with the "spatial variance" capability. In addition
to that, pooling serves to minimize the size of the images as well as the number
of parameters which, in turn, prevents an issue of “over fitting” from coming
up. Over fitting in a nutshell is when you create an excessively complex model
in order to account for the idiosyncrasies we just mentioned.
The result of using a pooling layer and creating down sampled or pooled
feature maps is a summarized version of the features detected in the input. They
are useful as small changes in the location of the feature in the input detected by
the convolutional layer will result in a pooled feature map with the feature in the
same location.
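A tiny sketch of the ReLU and 2×2 max-pooling steps described in this and the previous section, applied to a made-up 4×4 feature map; NumPy only, all values illustrative.

# ReLU followed by 2x2 max pooling on a small feature map
import numpy as np

feature_map = np.array([
    [ 1.0, -0.5,  2.0,  0.0],
    [-1.0,  3.0, -2.0,  1.5],
    [ 0.5, -0.5,  1.0, -1.0],
    [ 2.0,  0.0, -3.0,  0.5],
])

# ReLU: negative values become 0, positive values pass through unchanged
rectified = np.maximum(feature_map, 0)

# 2x2 max pooling with stride 2: keep the largest value in each block
pooled = rectified.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # 2x2 down-sampled summary of the 4x4 map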

Fully Connected Layer:
The role of the artificial neural network is to take this data and combine
the features into a wider variety of attributes that make the convolutional
network more capable of classifying images, which is the whole purpose of
creating a convolutional neural network. It has neurons linked to each other; a
neuron activates if it identifies a pattern and sends signals to the output layer,
and the output layer gives the output class based on the weight values.
We then use this in optimizing our network in order to increase its
effectiveness. That requires certain things to be altered in our network. These
include the weights and the feature detector, since the network often turns out to
be looking for the wrong features and has to be reviewed multiple times for the
sake of optimization.

This full connection process practically works as follows:


 The neuron in the fully-connected layer detects a certain feature; say, a nose.
 It preserves its value.
 It communicates this value to the classes of the trained images.

4.4 SEGMENTATION
Image segmentation is the process of partitioning a digital image into
multiple segments (sets of pixels, also known as image objects). The goal of
segmentation is to simplify and/or change the representation of an image into
something that is more meaningful and easier to analyze. Modern image
segmentation techniques are powered by deep learning technology, and several
deep learning architectures are used for segmentation. The following explanations
describe the segmentation process in more detail.

Importance of Image Segmentation
If we take an example of Autonomous Vehicles, they need sensory input
devices like cameras, radar, and lasers to allow the car to perceive the world
around it, creating a digital map. Autonomous driving is not even possible
without object detection which itself involves image
classification/segmentation.
Image Segmentation working principle :
Image Segmentation involves converting an image into a collection of
regions of pixels that are represented by a mask or a labeled image. By dividing
an image into segments, you can process only the important segments of the
image instead of processing the entire image. A common technique is to look
for abrupt discontinuities in pixel values, which typically indicate edges that
define a region. Some techniques that follow this approach are region growing,
clustering, and thresholding. A variety of other approaches to perform image
segmentation have been developed over the years using domain-specific
knowledge to effectively solve segmentation problems in specific application
areas.
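As a minimal sketch of the threshold-based approach mentioned above, the listing below segments a hand region from a single frame, assuming OpenCV; the input file name is hypothetical and the exact segmentation pipeline used in the project may differ.

# Threshold-based segmentation of a hand region
import cv2

frame = cv2.imread("hand_frame.jpg")                     # hypothetical frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
blurred = cv2.GaussianBlur(gray, (5, 5), 0)

# Otsu's method picks the threshold automatically from the histogram
_, mask = cv2.threshold(blurred, 0, 255,
                        cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Keep only the masked (hand) pixels for later feature extraction
segmented = cv2.bitwise_and(frame, frame, mask=mask)
cv2.imwrite("hand_mask.png", mask)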
4.5 Feature Extraction
Predefined features such as form, contour, geometrical feature (position,
angle, distance, etc. ), colour feature, histogram, and others are extracted from
the preprocessed images and used later for sign classification or recognition.
Feature extraction is a step in the dimensionality reduction process that divides
and organises a large collection of raw data into smaller, easier-to-
manage classes; as a result, processing becomes simpler. The most important
characteristic of these massive data sets is that they have a large number of
variables, and a large amount of computational power is needed to process them.
Feature extraction therefore helps to obtain the best features from
large data sets by selecting and combining variables into features, reducing the
size of the data.

These features are simple to use while still accurately and uniquely
describing the actual data collection.

Fig 4.4 Feature Extraction (a) original image with hand; (b) image of hand after
skin color detection; (c) after morphological operations and binarization; (d)
image of hand after background extraction ; (e) after binarization and
morphological operations; (f) hand’s contour made of image c and image e
concatenation

For feature extraction, a pre-trained model is used by adding fully-
connected layers on top of it. The model is then trained on the original dataset
after loading the saved weights.
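A sketch of this transfer-learning idea, assuming Keras; InceptionV3 is chosen here only because the abstract mentions Inception, and the input size and number of classes are assumptions rather than the project's actual choices.

# Pre-trained CNN as a feature extractor with new fully-connected layers on top
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.models import Model
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense

base = InceptionV3(weights='imagenet', include_top=False,
                   input_shape=(299, 299, 3))
base.trainable = False                     # keep the pre-trained weights fixed

x = GlobalAveragePooling2D()(base.output)  # spatial features -> feature vector
x = Dense(256, activation='relu')(x)
outputs = Dense(26, activation='softmax')(x)   # assumed number of sign classes

model = Model(inputs=base.input, outputs=outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(...) would then be run on the project's own dataset.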

CHAPTER 5

MODULES DESCRIPTION

5.1 Image Acquisition:


It is the action of extracting an image from a source, typically a hardware-
based source, for image processing. A web camera is the hardware-
based source in our project. It is the first step in the workflow sequence because
no processing can be done without an image. The picture that is obtained has
not been processed in any way.
5.2 Segmentation:
The method of separating objects or signs from the background of a captured
image is known as segmentation. Background subtraction, skin-colour detection,
and edge detection are all used in the segmentation process. The motion and
location of the hand must be detected and segmented in order to recognize
gestures.
5.3 Features Extraction:
Predefined features such as form, contour, geometrical feature (position,
angle, distance, etc. ), color feature, histogram, and others are extracted from the
preprocessed images and used later for sign classification or recognition.
Feature extraction is a step in the dimensionality reduction process that divides
and organizes a large collection of raw data into smaller, easier-to-
manage classes; as a result, processing becomes simpler.
The fact that these massive data sets have a large number of variables is
the most important feature. To process these variables, a large amount of
computational power is needed.
Feature extraction therefore helps to obtain the best features
from large data sets by selecting and combining variables into features,
reducing the size of the data.

5.4 Preprocessing:
Each picture frame is preprocessed to eliminate noise using a variety of
filters including erosion, dilation, and Gaussian smoothing, among others. The
size of an image is reduced when a color image is transformed to gray scale. A
common method for reducing the amount of data to be processed is to convert
an image to grey scale. The phases of preprocessing are as follows:
5.4.1 Morphological Transform:
Morphological operations use a structuring feature on an input image to
create a similar-sized output image. It compares the corresponding pixel in the
input image with its neighbors to determine the value of each pixel in the output
image. There are two different kinds of morphological transformations: Erosion
and Dilation.
i. Dilation:
The maximum value of all pixels in the neighbourhood is the value of the
output pixel. A pixel in a binary image is set to 1 if any of its neighbours has
the value 1. Morphological dilation increases the visibility of objects and fills
in small gaps.
ii. Erosion:
The output pixel's value is the minimum of all pixels in the neighbourhood. A
pixel in a binary image is set to 0 if any of its neighbours has the value 0. Small
artefacts are eroded away by morphological erosion, leaving behind substantial
objects.
5.4.2 Blurring:
Applying a low-pass filter to an image is an example of blurring. In computer
vision, the term "low-pass filter" refers to removing noise from an image while
leaving the rest of the image intact. A blur is a simple operation that is often
performed before other tasks such as edge detection.

5.4.3 Thresholding:
Thresholding is a form of image segmentation in which the pixels of an
image are changed to make it easier to interpret the image. Thresholding is the
process of converting a color or gray scale image into a binary image, which is
simply black and white. We most commonly use thresholding to pick areas of
interest in a picture while ignoring the sections we are not concerned with.
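The listing below is a minimal sketch that chains the preprocessing phases of 5.4.1-5.4.3, assuming OpenCV; the file name, kernel size and threshold value are illustrative assumptions.

# Blur, threshold and morphological clean-up of a single grayscale frame
import cv2
import numpy as np

gray = cv2.imread("frame.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical frame

# 5.4.2 Blurring: low-pass filter to suppress noise before other steps
blurred = cv2.GaussianBlur(gray, (5, 5), 0)

# 5.4.3 Thresholding: convert the grayscale frame to a black-and-white mask
_, binary = cv2.threshold(blurred, 70, 255, cv2.THRESH_BINARY)

# 5.4.1 Morphological transform: erosion removes small artefacts,
# dilation fills small gaps and restores the hand region
kernel = np.ones((3, 3), np.uint8)
eroded = cv2.erode(binary, kernel, iterations=1)
cleaned = cv2.dilate(eroded, kernel, iterations=1)

cv2.imwrite("preprocessed.png", cleaned)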
5.4.4 Recognition:
We will use classifiers in this case. Classifiers are the methods or
algorithms that are used to interpret the signs. Popular classifiers that identify
or understand sign language include the Hidden Markov Model (HMM),
K-Nearest Neighbour classifiers, Support Vector Machines (SVM), Artificial
Neural Networks (ANN), and Principal Component Analysis (PCA), among
others. However, in this project, the classifier will be a CNN. Because of their high
precision, CNNs are used for image classification and recognition. The CNN
uses a hierarchical model that builds a network, similar to a funnel, and then
outputs a fully-connected layer in which all neurons are connected to each other
and the output is processed.
5.4.5 Text output:
Understanding human behavior and identifying various postures and body
movements, as well as translating them into text.

CHAPTER 6

SYSTEM SPECIFICATION

6.1: System Requirements:

6.1.1: Software Requirements:

Operating System: Windows 7 or above

SDK: Anaconda, OpenCV, TensorFlow, Keras, NumPy, Spyder.

6.1.2: Hardware Requirements:

The Hardware Interfaces Required are:

Web Camera: Good quality,3MP

Ram: Minimum 4GB or higher

GPU: 4GB dedicated

Processor: Intel Core i3 or higher

HDD: 100GB or higher

Monitor: 15” or 17” color monitor

Mouse: Scroll or Optical Mouse or Touch Pad

Keyboard: Standard 110 keys keyboard

6.2 PYTHON:
Python Features:
 Easy-to-learn: Python has few keywords, simple structure, and a clearly
defined syntax. This allows the student to pick up the language quickly.
 Easy-to-read: Python code is more clearly defined and visible to the
eyes.
 Easy-to-maintain: Python's source code is fairly easy-to-maintain.
 A broad standard library: Python's bulk of the library is very portable
and cross platform compatible on UNIX, Windows, and Macintosh.
Python has a large standard library that provides a rich set of modules
and functions so you do not have to write your own code for every single
thing. 
 Interactive Mode: Python has support for an interactive mode which
allows interactive testing and debugging of snippets of code.
 Portable: Python can run on a wide variety of hardware platforms and
has the same interface on all platforms.
 Extendable: You can add low-level modules to the Python interpreter.
These modules enable programmers to add to or customize their tools to
be more efficient.
 Databases: Python provides interfaces to all major commercial databases
 GUI Programming: Python supports GUI applications that can be
created and ported to many system calls, libraries, and windowing systems,
such as Windows MFC, Macintosh, and the X Window system of Unix.
 Scalable: Python provides a better structure and support for large
programs than shell scripting.
 Python is an Integrated Language: Python is also an integrated language
because we can easily integrate Python with other languages like C, C++,
etc.

Characteristics of python:
 It supports functional and structured programming methods as well as
OOP.
 It can be used as a scripting language or can be compiled to byte-code for
building large applications.
 It provides very high-level dynamic data types and supports dynamic type
checking.
 It supports automatic garbage collection.
 It can be easily integrated with C, C++, COM, ActiveX, CORBA, and
Java.
Installing Python:
Python distribution is available for a wide variety of platforms. You need
to download only the binary code applicable for your platform and install
Python.
If the binary code for your platform is not
available, you need a C compiler to compile the source code manually.
Compiling the source code offers more flexibility in terms of the choice of
features that you require in your installation. Here is a quick overview of
installing Python on the Windows platform:
Windows Installation:
Here are the steps to install Python on Windows machine. Open a Web
browser and go to http://www.python.org/download/
Follow the link for the Windows installer python-XYZ.msi file, where XYZ is
the version you need to install.
 To use the installer python-XYZ.msi, the Windows system must support
Microsoft Installer 2.0. Save the installer file to your local machine and then run
it to find out if your machine supports MSI.
 Run the downloaded file. This brings up the Python install wizard,
which is really easy to use.
Setting path at Windows:
To add the Python directory to the path for a particular session in Windows:
At the command prompt: type path %path%;C:\Python and press Enter.
Note: C:\Python is the path of the Python directory.
Application of Python:
 Easy-to-learn − Python has few keywords, simple structure, and a
clearly defined syntax. This allows the student to pick up the language
quickly.
 Easy-to-read − Python code is more clearly defined and visible to the
eyes.
 Easy-to-maintain − Python's source code is fairly easy-to-
maintain.
 A broad standard library − Python's bulk of the library is very
portable and cross-platform compatible on UNIX, Windows, and
Macintosh.
 Interactive Mode − Python has support for an interactive mode which
allows interactive testing and debugging of snippets of code.
 Portable − Python can run on a wide variety of hardware platforms and
has the same interface on all platforms.
 Extendable − You can add low-level modules to the Python interpreter.
These modules enable programmers to add to or customize their tools to
be more efficient.
 Databases − Python provides interfaces to all major commercial
databases.
 Frontend and Backend development: With a new project pyscript you
can run and write python codes in HTML with the help of some tags.

 GUI Programming − Python supports GUI applications that can be
created and ported to many system calls, libraries and windows systems,
such as Windows MFC, Macintosh, and the X Window system of Unix.
 Scalable − Python provides a better structure and support for large
programs than shell scripting.
6.3 ANACONDA
Anaconda is a distribution of the Python and R programming
languages for scientific computing (data science, machine learning applications,
large-scale data processing, predictive analytics, etc.), that aims to simplify
package management and deployment. The distribution includes data-science
packages suitable for Windows, Linux, and macOS. It is developed and
maintained by Anaconda, Inc., which was founded by Peter Wang and Travis
Oliphant in 2012. As an Anaconda, Inc. product, it is also known as Anaconda
Distribution or Anaconda Individual Edition.
6.4 ANACONDA-PACKAGES:
Package versions in Anaconda are managed by the package management
system conda. This package manager was spun out as a separate open-source
package as it ended up being useful on its own and for other things than Python.
There is also a small, bootstrap version of Anaconda called Miniconda, which
includes only conda, Python, the packages they depend on, and a small number
of other packages.
Anaconda distribution comes with over 250 packages automatically
installed, and over 7,500 additional open-source packages can be installed from
PyPI as well as the conda package and virtual environment manager. It also
includes a GUI, Anaconda Navigator, as a graphical alternative to the
command line interface (CLI).
The big difference between conda and the pip package manager is in how
package dependencies are managed, which is a significant challenge for Python
data science and the reason conda exists.

When pip installs a package, it automatically installs any dependent
Python packages without checking if these conflict with previously installed
packages. It will install a package and any of its dependencies regardless of the
state of the existing installation. Because of this, a user with a working
installation of, for example, Google Tensorflow, can find that it stops working
having used pip to install a different package that requires a different version of
the dependent numpy library than the one used by Tensorflow. In some cases,
the package may appear to work but produce different results in detail.
In contrast, conda analyses the current environment including everything
currently installed, and, together with any version limitations specified (e.g. the
user may wish to have TensorFlow version 2.0 or higher), works out how to
install a compatible set of dependencies, and shows a warning if this cannot be
done.
Open source packages can be individually installed from the Anaconda
repository, Anaconda Cloud (anaconda.org), or the user's own private repository
or mirror, using the conda install command. Anaconda, Inc. compiles and
builds the packages available in the Anaconda repository itself, and provides
binaries for Windows 32/64 bit, Linux 64 bit and MacOS 64-bit. Anything
available on PyPI may be installed into a conda environment using pip, and
conda will keep track of what it has installed itself and what pip has installed.
Custom packages can be made using the conda build command, and can
be shared with others by uploading them to Anaconda Cloud, PyPI or other
repositories.
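For example, typical conda commands look like this (the environment name, package names and recipe path below are only illustrative):

conda create -n sign-env python=3.8

conda install -n sign-env tensorflow opencv

conda build ./my-package-recipe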
The default installation of Anaconda2 includes Python 2.7 and
Anaconda3 includes Python 3.7. However, it is possible to create new
environments that include any version of Python packaged with conda.
Anaconda Navigator
Anaconda Navigator is a desktop graphical user interface (GUI) included
in Anaconda distribution that allows users to launch applications and manage

conda packages, environments and channels without using command-line
commands. Navigator can search for packages on Anaconda Cloud or in a local
Anaconda Repository, install them in an environment, run the packages and
update them. It is available for Windows, macOS and Linux. The following
applications are available by default in Navigator:
 JupyterLab
 Jupyter Notebook
 QtConsole
 Spyder
 Glue
 Orange
 RStudio
 Visual Studio Code












Fig 6.1 Anaconda navigator

CONDA:
Conda is an open source, cross-platform, language-agnostic package
manager and environment management system that installs, runs, and updates
packages and their dependencies. It was created for Python programs, but it can
package and distribute software for any language (e.g., R), including multi-
language projects.

Fig 6.2 Anaconda environments

Fig 6.3 PIP package manager

PIP (package manager):
One major advantage of pip is the ease of its command-line interface,
which makes installing Python software packages as easy as issuing a
command:

pip install some-package-name


Users can also easily remove the package:

pip uninstall some-package-name


Most importantly, pip has a feature to manage full lists of packages and
corresponding version numbers, possible through a "requirements" file. This
permits the efficient re-creation of an entire group of packages in a separate
environment (e.g. another computer) or virtual environment. This can be
achieved with a properly formatted file and the following command, where
requirements.txt is the name of the file:

pip install -r requirements.txt


To install some package for a specific python version, pip provides the
following command, where ${version} is replaced by 2, 3, 3.4, etc.:
pip${version} install some-package-name
Anaconda Jupyter Notebook:
Installation Instructions
 Download the Anaconda distribution for your OS
 Download the individual version. Run the installer.
 Once downloaded and installed, launch the Anaconda Navigator
graphical interface. You should see Jupyter Notebook as one of the pre-
installed applications.

Fig 6.4 Anaconda Jupyter Notebook installation wizard

Fig 6.5 Jupyter Notebook running

6.4.1 SPYDER (SOFTWARE):
Spyder is an open-source cross-platform integrated development
environment (IDE) for scientific programming in the Python language.
Spyder integrates with a number of prominent packages in the scientific
Python stack, including NumPy, SciPy, Matplotlib, pandas, IPython,
SymPy and Cython, as well as other open-source software. It is released
under the MIT license.
Initially created and developed by Pierre Raybaut in 2009, since
2012 Spyder has been maintained and continuously improved by a team of
scientific Python developers and the community.
Spyder is extensible with first-party and third-party plugins, includes
support for interactive tools for data inspection and embeds Python-specific
code quality assurance and introspection instruments, such as Pyflakes, Pylint
and Rope.
It is available cross-platform through Anaconda, on Windows, on
macOS through MacPorts, and on major Linux distributions such as Arch
Linux, Debian, Fedora, Gentoo Linux, openSUSE and Ubuntu. Spyder uses Qt
for its GUI and is designed to use either of the PyQt or PySide Python
bindings. QtPy, a thin abstraction layer developed by the Spyder project and
later adopted by multiple other packages, provides the flexibility to use either
backend.
FEATURE:
 An editor with syntax highlighting, introspection, code completion
 Support for multiple IPython consoles
 The ability to explore and edit variables from a GUI
 A Help pane able to retrieve and render rich text documentation on
functions, classes and methods automatically or on-demand
 A debugger linked to IPdb, for step-by-step execution
 Static code analysis, powered by Pylint

 A run-time Profiler, to benchmark code
 Project support, allowing work on multiple development efforts
simultaneously
 A built-in file explorer, for interacting with the filesystem and managing
projects
 A "Find in Files" feature, allowing full regular expression search over a
specified scope
 An online help browser, allowing users to search and view Python and
package documentation inside the IDE
 A history log, recording every user command entered in each console
 An internal console, allowing for introspection and control over Spyder's
own operation
6.4.2 TENSORFLOW
TensorFlow is a Python-friendly open source library for numerical
computation that makes machine learning faster and easier.
TensorFlow is a free and open-source software library for machine
learning. It can be used across a range of tasks but has a particular focus on
training and inference of deep neural networks.
6.4.3 OPENCV:
OpenCV is a cross-platform library using which we can develop real-
time computer vision applications. It mainly focuses on image processing,
video capture and analysis including features like face detection and object
detection. In this tutorial, we explain how you can use OpenCV in your
applications.
OpenCV is the huge open-source library for the computer vision,
machine learning, and image processing and now it plays a major role in real-
time operation, which is very important in today's systems. By using it, one can
process images and videos to identify faces and even the handwriting of a human.
When it is integrated with various libraries, such as NumPy, Python is capable of
processing the OpenCV array structure for analysis. To identify image patterns
and their various features we use vector space and perform mathematical
operations on these features.
6.5 WEB CAMERA (Hardware) :
A webcam is a video camera that feeds or streams an image or video in
real time to or through a computer to a computer network, such as the Internet.
Webcams are typically small cameras that sit on a desk, attach to a user's
monitor, or are built into the hardware. Webcams can be used during a video
chat session involving two or more people, with conversations that include live
audio and video.

We use the live video stream from the webcam. OpenCV provides a
video capture object which handles everything related to the opening and closing
of the webcam. All we need to do is create that object and keep reading frames
from it.
The following code will open the webcam, capture the frames, scale them
down by a factor of 2, and then display them in a window. You can press the
Esc key to exit.
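The listing below is a minimal reconstruction of the code this paragraph refers to (the original listing is not reproduced in this extract); it uses only standard OpenCV calls and matches the variables discussed under "Under the hood" below.

# Open the webcam, scale each frame down by a factor of 2, display, exit on Esc
import cv2

cap = cv2.VideoCapture(0)

while True:
    ret, frame = cap.read()
    if not ret:
        break

    # Scale the frame down by a factor of 2
    frame = cv2.resize(frame, None, fx=0.5, fy=0.5,
                       interpolation=cv2.INTER_AREA)
    cv2.imshow('Webcam', frame)

    c = cv2.waitKey(1)
    if c == 27:          # Esc key
        break

cap.release()
cv2.destroyAllWindows()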

Under the hood

As we can see in the preceding code, we use OpenCV's VideoCapture


function to create the video capture object, cap. Once it's created, we start an
infinite loop and keep reading frames from the webcam until we encounter a
keyboard interrupt. In the first line within the while loop, we have the following
line:

ret, frame = cap.read()

Here, ret is a Boolean value returned by the read function, and it indicates
whether or not the frame was captured successfully. If the frame is captured

correctly, it's stored in the variable frame. This loop will keep running until we
press the Esc key. So we keep checking for a keyboard interrupt in the
following line:

if c == 27:

As we know, the ASCII value of Esc is 27. Once we encounter it, we


break the loop and release the video capture object. The line cap.release() is
important because it gracefully closes the webcam.

Capture Video from Camera

Often, we have to capture live stream with camera. OpenCV provides a


very simple interface to this. Let's capture a video from the camera (I am using
the in-built webcam of my laptop), convert it into grayscale video and display it.
Just a simple task to get started.

To capture a video, you need to create a VideoCapture object. Its


argument can be either the device index or the name of a video file. Device
index is just the number to specify which camera. Normally one camera will be
connected (as in my case). So I simply pass 0 (or -1). You can select the second
camera by passing 1 and so on. After that, you can capture frames one by one. But
at the end, don't forget to release the capture.
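A minimal sketch of this grayscale-capture task, assuming the default webcam at device index 0; the quit key is an illustrative choice.

# Capture frames, convert to grayscale, display, and release the camera at the end
import cv2

cap = cv2.VideoCapture(0)        # device index 0: the default camera

while True:
    ret, frame = cap.read()      # capture frame-by-frame
    if not ret:
        break

    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    cv2.imshow('Grayscale video', gray)

    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()                    # don't forget to release the capture
cv2.destroyAllWindows()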

CHAPTER 7

IMPLEMENTATION AND RESULTS

DATASET COLLECTION 1

Fig 7.1 Dataset collection 1

DATASET COLLECTION 2

Fig 7.2 Dataset collection 2

CAPTURING SIGN FROM WEB CAMERA

Fig 7.3 Capturing Sign from web camera

DATASETS LOCATION

Fig 7.4 Dataset location path 1

Fig 7.5 Dataset location path 2

RECOGNITION OF THE HAND GESTURES

Fig 7.6 Recognition of the hand gestures
CHAPTER 8

CONCLUSION AND FUTURE SCOPE

Nowadays, applications need several kinds of images as sources of
information for elucidation and analysis. Several features have to be extracted
in order to perform various applications. When an image is transformed from
one form to another, such as during digitizing, scanning, communicating and
storing, degradation occurs. Therefore, the output image has to undergo a
process called image enhancement, which consists of a group of methods that
seek to improve the visual appearance of an image. Image enhancement is
fundamentally about improving the interpretability or perception of the
information in images for human viewers and providing better input for other
automatic image processing systems. The image then undergoes feature
extraction using various methods to make it more readable by the computer.
The sign language recognition system is a powerful tool that combines expert
knowledge, edge detection and the fusion of imprecise information from
different sources, and the convolutional neural network is designed to obtain
the appropriate classification.
Future work
The proposed sign language recognition system, which currently recognizes
sign language letters, can be further extended to recognize gestures and facial
expressions. Instead of displaying letter labels, it would be more natural to
display complete sentences as a better translation of the language; this also
improves readability. The scope can be widened to cover different sign
languages, and more training data can be added to detect the letters with higher
accuracy. The project can further be extended to convert the recognized signs
to speech.
APPENDIX-I

CODING

TrainingDataCollection.py:
# Importing the Libraries Required
import cv2
import numpy as np
import os
# Creating and Collecting Training Data
mode = 'trainingData'
directory = 'dataSet/' + mode + '/'
minValue = 70
capture = cv2.VideoCapture(0)
interrupt = -1
while True:
    _, frame = capture.read()
    # Simulating mirror Image
    frame = cv2.flip(frame, 1)
    # Getting count of existing images
    count = {
        'zero': len(os.listdir(directory + "/0")),
        'a': len(os.listdir(directory + "/A")),
        'b': len(os.listdir(directory + "/B")),
        'c': len(os.listdir(directory + "/C")),
        'd': len(os.listdir(directory + "/D")),
        'e': len(os.listdir(directory + "/E")),
        'f': len(os.listdir(directory + "/F")),
        'g': len(os.listdir(directory + "/G")),
        'h': len(os.listdir(directory + "/H")),
        'i': len(os.listdir(directory + "/I")),
        'j': len(os.listdir(directory + "/J")),
        'k': len(os.listdir(directory + "/K")),
        'l': len(os.listdir(directory + "/L")),
        'm': len(os.listdir(directory + "/M")),
        'n': len(os.listdir(directory + "/N")),
        'o': len(os.listdir(directory + "/O")),
        'p': len(os.listdir(directory + "/P")),
        'q': len(os.listdir(directory + "/Q")),
        'r': len(os.listdir(directory + "/R")),
        's': len(os.listdir(directory + "/S")),
        't': len(os.listdir(directory + "/T")),
        'u': len(os.listdir(directory + "/U")),
        'v': len(os.listdir(directory + "/V")),
        'w': len(os.listdir(directory + "/W")),
        'x': len(os.listdir(directory + "/X")),
        'y': len(os.listdir(directory + "/Y")),
        'z': len(os.listdir(directory + "/Z")),
    }
    # Printing the count of each set on the screen
    cv2.putText(frame, "ZERO : " + str(count['zero']), (10, 60), cv2.FONT_HERSHEY_PLAIN, 1, (0, 255, 255), 1)
    cv2.putText(frame, "a : " + str(count['a']), (10, 70), cv2.FONT_HERSHEY_PLAIN, 1, (0, 255, 255), 1)
    cv2.putText(frame, "b : " + str(count['b']), (10, 80), cv2.FONT_HERSHEY_PLAIN, 1, (0, 255, 255), 1)
    cv2.putText(frame, "c : " + str(count['c']), (10, 90), cv2.FONT_HERSHEY_PLAIN, 1, (0, 255, 255), 1)
    cv2.putText(frame, "d : " + str(count['d']), (10, 100), cv2.FONT_HERSHEY_PLAIN, 1, (0, 255, 255), 1)
    cv2.putText(frame, "e : " + str(count['e']), (10, 110), cv2.FONT_HERSHEY_PLAIN, 1, (0, 255, 255), 1)
    cv2.putText(frame, "f : " + str(count['f']), (10, 120), cv2.FONT_HERSHEY_PLAIN, 1, (0, 255, 255), 1)
    cv2.putText(frame, "g : " + str(count['g']), (10, 130), cv2.FONT_HERSHEY_PLAIN, 1, (0, 255, 255), 1)
    cv2.putText(frame, "h : " + str(count['h']), (10, 140), cv2.FONT_HERSHEY_PLAIN, 1, (0, 255, 255), 1)
    cv2.putText(frame, "i : " + str(count['i']), (10, 150), cv2.FONT_HERSHEY_PLAIN, 1, (0, 255, 255), 1)
    cv2.putText(frame, "k : " + str(count['k']), (10, 160), cv2.FONT_HERSHEY_PLAIN, 1, (0, 255, 255), 1)
    cv2.putText(frame, "l : " + str(count['l']), (10, 170), cv2.FONT_HERSHEY_PLAIN, 1, (0, 255, 255), 1)
    cv2.putText(frame, "m : " + str(count['m']), (10, 180), cv2.FONT_HERSHEY_PLAIN, 1, (0, 255, 255), 1)
    cv2.putText(frame, "n : " + str(count['n']), (10, 190), cv2.FONT_HERSHEY_PLAIN, 1, (0, 255, 255), 1)
    cv2.putText(frame, "o : " + str(count['o']), (10, 200), cv2.FONT_HERSHEY_PLAIN, 1, (0, 255, 255), 1)
    cv2.putText(frame, "p : " + str(count['p']), (10, 210), cv2.FONT_HERSHEY_PLAIN, 1, (0, 255, 255), 1)
    cv2.putText(frame, "q : " + str(count['q']), (10, 220), cv2.FONT_HERSHEY_PLAIN, 1, (0, 255, 255), 1)
    cv2.putText(frame, "r : " + str(count['r']), (10, 230), cv2.FONT_HERSHEY_PLAIN, 1, (0, 255, 255), 1)
    cv2.putText(frame, "s : " + str(count['s']), (10, 240), cv2.FONT_HERSHEY_PLAIN, 1, (0, 255, 255), 1)
    cv2.putText(frame, "t : " + str(count['t']), (10, 250), cv2.FONT_HERSHEY_PLAIN, 1, (0, 255, 255), 1)
    cv2.putText(frame, "u : " + str(count['u']), (10, 260), cv2.FONT_HERSHEY_PLAIN, 1, (0, 255, 255), 1)
    cv2.putText(frame, "v : " + str(count['v']), (10, 270), cv2.FONT_HERSHEY_PLAIN, 1, (0, 255, 255), 1)
    cv2.putText(frame, "w : " + str(count['w']), (10, 280), cv2.FONT_HERSHEY_PLAIN, 1, (0, 255, 255), 1)
    cv2.putText(frame, "x : " + str(count['x']), (10, 290), cv2.FONT_HERSHEY_PLAIN, 1, (0, 255, 255), 1)
    cv2.putText(frame, "y : " + str(count['y']), (10, 300), cv2.FONT_HERSHEY_PLAIN, 1, (0, 255, 255), 1)
    cv2.putText(frame, "z : " + str(count['z']), (10, 310), cv2.FONT_HERSHEY_PLAIN, 1, (0, 255, 255), 1)
    # Coordinates of the ROI
    x1 = int(0.5 * frame.shape[1])
    y1 = 10
    x2 = frame.shape[1] - 10
    y2 = int(0.5 * frame.shape[1])
    # Drawing the ROI
    # The increment/decrement by 1 is to compensate for the bounding box
    cv2.rectangle(frame, (x1 - 1, y1 - 1), (x2 + 1, y2 + 1), (255, 0, 0), 1)
    # Extracting the ROI
    roi = frame[y1:y2, x1:x2]
    cv2.imshow("Frame", frame)
    # Image Processing
    gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
    blur = cv2.GaussianBlur(gray, (5, 5), 2)
    th3 = cv2.adaptiveThreshold(blur, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY_INV, 11, 2)
    ret, test_image = cv2.threshold(th3, minValue, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Output Image after the Image Processing that is used for data collection
    test_image = cv2.resize(test_image, (300, 300))
    cv2.imshow("test", test_image)
    # Data Collection
    interrupt = cv2.waitKey(10)
    if interrupt & 0xFF == 27:
        # esc key
        break
    if interrupt & 0xFF == ord('0'):
        cv2.imwrite(directory + '0/' + str(count['zero']) + '.jpg', roi)
    if interrupt & 0xFF == ord('a'):
        cv2.imwrite(directory + 'A/' + str(count['a']) + '.jpg', roi)
    if interrupt & 0xFF == ord('b'):
        cv2.imwrite(directory + 'B/' + str(count['b']) + '.jpg', roi)
    if interrupt & 0xFF == ord('c'):
        cv2.imwrite(directory + 'C/' + str(count['c']) + '.jpg', roi)
    if interrupt & 0xFF == ord('d'):
        cv2.imwrite(directory + 'D/' + str(count['d']) + '.jpg', roi)
    if interrupt & 0xFF == ord('e'):
        cv2.imwrite(directory + 'E/' + str(count['e']) + '.jpg', roi)
    if interrupt & 0xFF == ord('f'):
        cv2.imwrite(directory + 'F/' + str(count['f']) + '.jpg', roi)
    if interrupt & 0xFF == ord('g'):
        cv2.imwrite(directory + 'G/' + str(count['g']) + '.jpg', roi)
    if interrupt & 0xFF == ord('h'):
        cv2.imwrite(directory + 'H/' + str(count['h']) + '.jpg', roi)
    if interrupt & 0xFF == ord('i'):
        cv2.imwrite(directory + 'I/' + str(count['i']) + '.jpg', roi)
    if interrupt & 0xFF == ord('j'):
        cv2.imwrite(directory + 'J/' + str(count['j']) + '.jpg', roi)
    if interrupt & 0xFF == ord('k'):
        cv2.imwrite(directory + 'K/' + str(count['k']) + '.jpg', roi)
    if interrupt & 0xFF == ord('l'):
        cv2.imwrite(directory + 'L/' + str(count['l']) + '.jpg', roi)
    if interrupt & 0xFF == ord('m'):
        cv2.imwrite(directory + 'M/' + str(count['m']) + '.jpg', roi)
    if interrupt & 0xFF == ord('n'):
        cv2.imwrite(directory + 'N/' + str(count['n']) + '.jpg', roi)
    if interrupt & 0xFF == ord('o'):
        cv2.imwrite(directory + 'O/' + str(count['o']) + '.jpg', roi)
    if interrupt & 0xFF == ord('p'):
        cv2.imwrite(directory + 'P/' + str(count['p']) + '.jpg', roi)
    if interrupt & 0xFF == ord('q'):
        cv2.imwrite(directory + 'Q/' + str(count['q']) + '.jpg', roi)
    if interrupt & 0xFF == ord('r'):
        cv2.imwrite(directory + 'R/' + str(count['r']) + '.jpg', roi)
    if interrupt & 0xFF == ord('s'):
        cv2.imwrite(directory + 'S/' + str(count['s']) + '.jpg', roi)
    if interrupt & 0xFF == ord('t'):
        cv2.imwrite(directory + 'T/' + str(count['t']) + '.jpg', roi)
    if interrupt & 0xFF == ord('u'):
        cv2.imwrite(directory + 'U/' + str(count['u']) + '.jpg', roi)
    if interrupt & 0xFF == ord('v'):
        cv2.imwrite(directory + 'V/' + str(count['v']) + '.jpg', roi)
    if interrupt & 0xFF == ord('w'):
        cv2.imwrite(directory + 'W/' + str(count['w']) + '.jpg', roi)
    if interrupt & 0xFF == ord('x'):
        cv2.imwrite(directory + 'X/' + str(count['x']) + '.jpg', roi)
    if interrupt & 0xFF == ord('y'):
        cv2.imwrite(directory + 'Y/' + str(count['y']) + '.jpg', roi)
    if interrupt & 0xFF == ord('z'):
        cv2.imwrite(directory + 'Z/' + str(count['z']) + '.jpg', roi)
capture.release()
cv2.destroyAllWindows()
FoldersCreation.py:
# Importing the Libraries Required
import os
import string
# Creating the directory Structure
if not os.path.exists("dataSet"):
    os.makedirs("dataSet")
if not os.path.exists("dataSet/trainingData"):
    os.makedirs("dataSet/trainingData")
if not os.path.exists("dataSet/testingData"):
    os.makedirs("dataSet/testingData")
# Making folder 0 (i.e. blank) in the training and testing data folders respectively
# range(1) creates only the single folder named "0"
for i in range(1):
    if not os.path.exists("dataSet/trainingData/" + str(i)):
        os.makedirs("dataSet/trainingData/" + str(i))
    if not os.path.exists("dataSet/testingData/" + str(i)):
        os.makedirs("dataSet/testingData/" + str(i))
# Making Folders from A to Z in the training and testing data folders respectively
for i in string.ascii_uppercase:
    if not os.path.exists("dataSet/trainingData/" + i):
        os.makedirs("dataSet/trainingData/" + i)
    if not os.path.exists("dataSet/testingData/" + i):
        os.makedirs("dataSet/testingData/" + i)
Application.py:
# Importing Libraries
import numpy as np
import cv2
import os, sys
import time
import operator
from string import ascii_uppercase
import tkinter as tk
from PIL import Image, ImageTk
from hunspell import Hunspell
import enchant
from keras.models import model_from_json
os.environ["THEANO_FLAGS"] = "device=cuda, assert_no_cpu_op=True"
#Application :
class Application:
    def __init__(self):
        self.hs = Hunspell('en_US')
        self.vs = cv2.VideoCapture(0)
        self.current_image = None
        self.current_image2 = None
        self.json_file = open("Models\model_new.json", "r")
        self.model_json = self.json_file.read()
        self.json_file.close()
        self.loaded_model = model_from_json(self.model_json)
        self.loaded_model.load_weights("Models\model_new.h5")
        self.json_file_dru = open("Models\model-bw_dru.json", "r")
        self.model_json_dru = self.json_file_dru.read()
        self.json_file_dru.close()
        self.loaded_model_dru = model_from_json(self.model_json_dru)
        self.loaded_model_dru.load_weights("Models\model-bw_dru.h5")
        self.json_file_tkdi = open("Models\model-bw_tkdi.json", "r")
        self.model_json_tkdi = self.json_file_tkdi.read()
        self.json_file_tkdi.close()
        self.loaded_model_tkdi = model_from_json(self.model_json_tkdi)
        self.loaded_model_tkdi.load_weights("Models\model-bw_tkdi.h5")
        self.json_file_smn = open("Models\model-bw_smn.json", "r")
        self.model_json_smn = self.json_file_smn.read()
        self.json_file_smn.close()
        self.loaded_model_smn = model_from_json(self.model_json_smn)
        self.loaded_model_smn.load_weights("Models\model-bw_smn.h5")
        self.ct = {}
        self.ct['blank'] = 0
        self.blank_flag = 0
        for i in ascii_uppercase:
            self.ct[i] = 0
        print("Loaded model from disk")
        self.root = tk.Tk()
        self.root.title("Sign Language To Text Conversion")
        self.root.protocol('WM_DELETE_WINDOW', self.destructor)
        self.root.geometry("900x900")
        self.panel = tk.Label(self.root)
        self.panel.place(x = 100, y = 10, width = 580, height = 580)
        self.panel2 = tk.Label(self.root)  # initialize image panel
        self.panel2.place(x = 400, y = 65, width = 275, height = 275)
        self.T = tk.Label(self.root)
        self.T.place(x = 60, y = 5)
        self.T.config(text = "Sign Language To Text Conversion", font = ("Courier", 30, "bold"))
        self.panel3 = tk.Label(self.root)  # Current Symbol
        self.panel3.place(x = 500, y = 540)
        self.T1 = tk.Label(self.root)
        self.T1.place(x = 10, y = 540)
        self.T1.config(text = "Character :", font = ("Courier", 30, "bold"))
        self.panel4 = tk.Label(self.root)  # Word
        self.panel4.place(x = 220, y = 595)
        self.T2 = tk.Label(self.root)
        self.T2.place(x = 10, y = 595)
        self.T2.config(text = "Word :", font = ("Courier", 30, "bold"))
        self.panel5 = tk.Label(self.root)  # Sentence
        self.panel5.place(x = 350, y = 645)
        self.T3 = tk.Label(self.root)
        self.T3.place(x = 10, y = 645)
        self.T3.config(text = "Sentence :", font = ("Courier", 30, "bold"))
        self.T4 = tk.Label(self.root)
        self.T4.place(x = 250, y = 690)
        self.T4.config(text = "Suggestions :", fg = "red", font = ("Courier", 30, "bold"))
        self.bt1 = tk.Button(self.root, command = self.action1, height = 0, width = 0)
        self.bt1.place(x = 26, y = 745)
        self.bt2 = tk.Button(self.root, command = self.action2, height = 0, width = 0)
        self.bt2.place(x = 325, y = 745)
        self.bt3 = tk.Button(self.root, command = self.action3, height = 0, width = 0)
        self.bt3.place(x = 625, y = 745)
        self.str = ""
        self.word = " "
        self.current_symbol = "Empty"
        self.photo = "Empty"
        self.video_loop()
    def video_loop(self):
        ok, frame = self.vs.read()
        if ok:
            cv2image = cv2.flip(frame, 1)
            x1 = int(0.5 * frame.shape[1])
            y1 = 10
            x2 = frame.shape[1] - 10
            y2 = int(0.5 * frame.shape[1])
            cv2.rectangle(frame, (x1 - 1, y1 - 1), (x2 + 1, y2 + 1), (255, 0, 0), 1)
            cv2image = cv2.cvtColor(cv2image, cv2.COLOR_BGR2RGBA)
            self.current_image = Image.fromarray(cv2image)
            imgtk = ImageTk.PhotoImage(image = self.current_image)
            self.panel.imgtk = imgtk
            self.panel.config(image = imgtk)
            cv2image = cv2image[y1:y2, x1:x2]
            gray = cv2.cvtColor(cv2image, cv2.COLOR_BGR2GRAY)
            blur = cv2.GaussianBlur(gray, (5, 5), 2)
            th3 = cv2.adaptiveThreshold(blur, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY_INV, 11, 2)
            ret, res = cv2.threshold(th3, 70, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
            self.predict(res)
            self.current_image2 = Image.fromarray(res)
            imgtk = ImageTk.PhotoImage(image = self.current_image2)
            self.panel2.imgtk = imgtk
            self.panel2.config(image = imgtk)
            self.panel3.config(text = self.current_symbol, font = ("Courier", 30))
            self.panel4.config(text = self.word, font = ("Courier", 30))
            self.panel5.config(text = self.str, font = ("Courier", 30))
            predicts = self.hs.suggest(self.word)
            if(len(predicts) > 1):
                self.bt1.config(text = predicts[0], font = ("Courier", 20))
            else:
                self.bt1.config(text = "")
            if(len(predicts) > 2):
                self.bt2.config(text = predicts[1], font = ("Courier", 20))
            else:
                self.bt2.config(text = "")
            if(len(predicts) > 3):
                self.bt3.config(text = predicts[2], font = ("Courier", 20))
            else:
                self.bt3.config(text = "")
        self.root.after(5, self.video_loop)
    def predict(self, test_image):
        test_image = cv2.resize(test_image, (128, 128))
        result = self.loaded_model.predict(test_image.reshape(1, 128, 128, 1))
        result_dru = self.loaded_model_dru.predict(test_image.reshape(1, 128, 128, 1))
        result_tkdi = self.loaded_model_tkdi.predict(test_image.reshape(1, 128, 128, 1))
        result_smn = self.loaded_model_smn.predict(test_image.reshape(1, 128, 128, 1))
        prediction = {}
        prediction['blank'] = result[0][0]
        inde = 1
        for i in ascii_uppercase:
            prediction[i] = result[0][inde]
            inde += 1
        # LAYER 1
        prediction = sorted(prediction.items(), key = operator.itemgetter(1), reverse = True)
        self.current_symbol = prediction[0][0]
        # LAYER 2
        if(self.current_symbol == 'D' or self.current_symbol == 'R' or self.current_symbol == 'U'):
            prediction = {}
            prediction['D'] = result_dru[0][0]
            prediction['R'] = result_dru[0][1]
            prediction['U'] = result_dru[0][2]
            prediction = sorted(prediction.items(), key = operator.itemgetter(1), reverse = True)
            self.current_symbol = prediction[0][0]
        if(self.current_symbol == 'D' or self.current_symbol == 'I' or self.current_symbol == 'K' or self.current_symbol == 'T'):
            prediction = {}
            prediction['D'] = result_tkdi[0][0]
            prediction['I'] = result_tkdi[0][1]
            prediction['K'] = result_tkdi[0][2]
            prediction['T'] = result_tkdi[0][3]
            prediction = sorted(prediction.items(), key = operator.itemgetter(1), reverse = True)
            self.current_symbol = prediction[0][0]
        if(self.current_symbol == 'M' or self.current_symbol == 'N' or self.current_symbol == 'S'):
            prediction1 = {}
            prediction1['M'] = result_smn[0][0]
            prediction1['N'] = result_smn[0][1]
            prediction1['S'] = result_smn[0][2]
            prediction1 = sorted(prediction1.items(), key = operator.itemgetter(1), reverse = True)
            if(prediction1[0][0] == 'S'):
                self.current_symbol = prediction1[0][0]
            else:
                self.current_symbol = prediction[0][0]
        if(self.current_symbol == 'blank'):
            for i in ascii_uppercase:
                self.ct[i] = 0
        self.ct[self.current_symbol] += 1
        if(self.ct[self.current_symbol] > 60):
            for i in ascii_uppercase:
                if i == self.current_symbol:
                    continue
                tmp = self.ct[self.current_symbol] - self.ct[i]
                if tmp < 0:
                    tmp *= -1
                if tmp <= 20:
                    self.ct['blank'] = 0
                    for i in ascii_uppercase:
                        self.ct[i] = 0
                    return
            self.ct['blank'] = 0
            for i in ascii_uppercase:
                self.ct[i] = 0
            if self.current_symbol == 'blank':
                if self.blank_flag == 0:
                    self.blank_flag = 1
                    if len(self.str) > 0:
                        self.str += " "
                    self.str += self.word
                    self.word = ""
            else:
                if(len(self.str) > 16):
                    self.str = ""
                self.blank_flag = 0
                self.word += self.current_symbol
    def action1(self):
        predicts = self.hs.suggest(self.word)
        if(len(predicts) > 0):
            self.word = ""
            self.str += " "
            self.str += predicts[0]

    def action2(self):
        predicts = self.hs.suggest(self.word)
        if(len(predicts) > 1):
            self.word = ""
            self.str += " "
            self.str += predicts[1]

    def action3(self):
        predicts = self.hs.suggest(self.word)
        if(len(predicts) > 2):
            self.word = ""
            self.str += " "
            self.str += predicts[2]

    def action4(self):
        predicts = self.hs.suggest(self.word)
        if(len(predicts) > 3):
            self.word = ""
            self.str += " "
            self.str += predicts[3]

    def action5(self):
        predicts = self.hs.suggest(self.word)
        if(len(predicts) > 4):
            self.word = ""
            self.str += " "
            self.str += predicts[4]

    def destructor(self):
        print("Closing Application...")
        self.root.destroy()
        self.vs.release()
        cv2.destroyAllWindows()

print("Starting Application...")
(Application()).root.mainloop()
APPENDIX-II

REFERENCES

[1] Mayand Kumar, Piyush Gupta and Rahul Kumar Jha, "Sign Language
Alphabet Recognition using Convolutional Neural Network," 2021 IEEE Fifth
International Conference on Intelligent Computing and Control Systems
(ICICCS 2021).

[2] J. Huang, W. Zhou and Q. Zhang, "Video-based Sign Language Recognition
without Temporal Segmentation," arXiv, 2018.

[3] T.-W. Chong and B.-G. Lee, "American Sign Language Recognition Using
Leap Motion Controller with Machine Learning Approach," Sensors, vol. 18,
2018.

[4] C. Linqin, C. Shuangjie and X. Min, "Dynamic hand gesture recognition
using RGB-D data for natural human-computer interaction," Journal of
Intelligent and Fuzzy Systems, 2017.

[5] F. Ronchetti, Q. Facundo and A. E. Cesar, "Handshape recognition for
Argentinian sign language using ProbSom," Journal of Computer Science &
Technology, 2016.

[6] R. R. Itkarkar and A. V. Nandi, "Hand gesture to speech conversion using
Matlab," 2013 4th International Conference on Computing, Communications
and Networking Technologies (ICCCNT 2013), pp. 1-4, doi:
10.1109/ICCCNT.2013.6726505.

[7] M. Van den Bergh and L. Van Gool, "Combining RGB and ToF cameras
for real-time 3D hand gesture interaction," 2011 IEEE Workshop on
Applications of Computer Vision (WACV), pp. 66-72, IEEE, 2011.

[8] S. Liwicki and M. Everingham, "Automatic recognition of fingerspelled
words in British Sign Language," 2009 IEEE Computer Society Conference on
Computer Vision and Pattern Recognition Workshops, pp. 50-57, IEEE, 2009.

[9] Z. Zafrulla, H. Brashear, T. Starner, H. Hamilton and P. Presti, "American
Sign Language recognition with the Kinect," Proceedings of the 13th
International Conference on Multimodal Interfaces, pp. 279-286, 2011.

[10] N. Pugeault and R. Bowden, "Spelling it out: Real-time ASL fingerspelling
recognition," 2011 IEEE International Conference on Computer Vision
Workshops (ICCV Workshops), pp. 1114-1119, IEEE, 2011.

[11] K. Dutta and S. Bellary, "Machine Learning Techniques for Indian Sign
Language Recognition," 2017, pp. 333-336, doi: 10.1109/CTCEEC.2017.8454988.

[12] J. Pansare, S. Gawande and M. Ingle, "Real-Time Static Hand Gesture
Recognition for American Sign Language (ASL) in Complex Background,"
Journal of Signal and Information Processing, vol. 3, pp. 364-367, 2012, doi:
10.4236/jsip.2012.33047.

[13] P. K. Athira, C. J. Sruthi and A. Lijiya, "Sign Language Recognition with
Co-articulation Elimination from Live Videos: An Indian Scenario," Journal of
King Saud University - Computer and Information Sciences.

[14] P. Breuer, C. Eckes and S. Müller, "Hand Gesture Recognition with a
Novel IR Time-of-Flight Range Camera–A Pilot Study," International
Conference on Computer Vision / Computer Graphics Collaboration
Techniques and Applications (MIRAGE 2007), pp. 247-260.

[15] Z. Li and R. Jarvis, "Real time hand gesture recognition using a range
camera," Proceedings of the Australasian Conference on Robotics and
Automation (ACRA), 2009.

[16] P. Vijayalakshmi and M. Aarthi, "Sign language to speech conversion,"
2016 International Conference on Recent Trends in Information Technology
(ICRTIT), Chennai, 2016, pp. 1-6.

[17] M. Hasan and P. Mishra, "HSV Brightness Factor Matching for Gesture
Recognition System," International Journal of Image Processing, 2010.

[18] K. Warrier, J. Sahu, H. Halder, R. Koradiya and V. Raj, "Software based
sign language converter," 2016, pp. 1777-1780, doi: 10.1109/ICCSP.2016.7754472.

[19] R. R. Itkarkar and A. V. Nandi, "Hand gesture to speech conversion using
Matlab," 2013 Fourth International Conference on Computing, Communications
and Networking Technologies (ICCCNT), Tiruchengode, 2013, pp. 1-4.

[20] G. C. Lee, F.-H. Yeh and Y.-H. Hsiao, "Kinect-based Taiwanese
sign-language recognition system," Multimedia Tools and Applications, 75(1),
2016.

[21] Sy Bor Wang, A. Quattoni, L.-P. Morency, D. Demirdjian and T. Darrell,
"Hidden Conditional Random Fields for Gesture Recognition," 2006 IEEE
Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR'06), New York, NY, USA, 2006, pp. 1521-1527.

[22] M. Elmezain, A. Al-Hamadi, J. Appenrodt and B. Michaelis, "A hidden
Markov model-based continuous gesture recognition system for hand motion
trajectory," 19th International Conference on Pattern Recognition (ICPR 2008),
Tampa, FL, USA, pp. 1-4, 2008.

[23] R. H. Liang and M. Ouhyoung, "A real-time continuous gesture
recognition system for sign language," Third IEEE International Conference on
Automatic Face and Gesture Recognition, pp. 558-567, Nara, Japan, 1998.

U. Shrawankar and S. Dixit, "Framing Sentences from Sign Language
Symbols using NLP," 2016.

[24] H. Wang, M. C. Leu and C. Oz, "American Sign Language Recognition
Using Multi-dimensional Hidden Markov Models," Journal of Information
Science and Engineering, 22(5), pp. 1109-1123, 2006.