Cognitive Model For Object Detection Based On Speech-to-Text Conversion
Shahana Bano
Department of CSE
Koneru Lakshmaiah Education Foundation
Vaddeswaram, India
shahanabano@icloud.com

Ramini Manideep
Department of CSE
Koneru Lakshmaiah Education Foundation
Vaddeswaram, India
manideepramini@gmail.com
Abstract— The goal of this paper is to develop a model that is an integrated version of both Speech Recognition and Object Detection. This model was developed after a survey of the literature and of the existing models related to Object Detection and Speech Recognition. Several types of Speech Recognition and Object Detection models are available so far. In addition to the existing models, this paper proposes a new model named “Cognitive Model for Object Detection based on Speech-to-Text Conversion,” which is an integrated version of both Speech Recognition and Object Detection models. First, a speech command is provided as input to the model; the model takes the command, processes the data, and then detects the specified object from a source of images. The detected object is represented with a rectangular box. This approach is implemented with the help of the Google Speech Recognition and YOLO object detection models, utilizing the Darknet and OpenCV frameworks.

Keywords— SpeechRecognition, YOLO, Object Detection, Google cloud GPU, Darknet, LabelImg, OpenCV.

I. INTRODUCTION

Cognitive models play a vital role in the enhancement of AI-enabled technologies. Speech is one of the finest means of communication; it is used for exchanging feelings and thoughts with one another. Object detection enables us to identify and locate objects in an image or video.

As mentioned earlier, speech is the finest means of communication, and the way of speaking needs to be clear to avoid ambiguity. Some people can communicate orally but cannot type and interact with a computer or device. Considering this aspect, this model is developed to detect a specified object from an image that contains multiple objects, based on the speech commands provided by the user. This approach can be implemented in several ways, but this model receives the speech input from the user through a microphone; the SpeechRecognition package then converts it into text, and the backend process is performed by the Darknet framework and the YOLO object detection model, which provides the output as the object detected among the other objects available in the image. When a user provides voice input to the system, it converts the speech to text, and the text is stored as the name of the object in a file named coco names. Later, the images will be trained using the Darknet framework on the cloud using the free GPU, and then the weights file and configuration file are generated. Now the YOLO object detection model comes into the picture. The model checks the confidence score of each pixel of the image. If the object is near the score of the trained image, then a boundary box is generated and the required object is detected.

The user needs to import a few packages required for implementing the SpeechRecognition and Object Detection modules, namely the SpeechRecognition and PyAudio packages, and for performing computer vision operations on an image the OpenCV library is to be imported. The training of the dataset of images is done with the Darknet framework. This approach is implemented using the Google SpeechRecognition model and a Deep Neural Network algorithm, the YOLO object detection model.

II. LITERATURE SURVEY

MIT computer scientists [1] have developed a model that identifies and detects objects within an image based on a spoken description of the image. The input of the model is an image and an audio caption; the model highlights in real time the relevant regions or boundary boxes of the image.

Bishal Heuju, Bishal Lakha, Dipkamal Bhusal, and Kanhaiya Lal Shrestha [2] in 2016 developed a voice-command-based object-recognizing robot using speech and image feature extraction. The robot is developed to recognize the command ‘identify’ and then identify the object; when the command given is ‘follow’, the robot tracks the object.

Swetha V Patil and Pradeep N [3] in 2019 proposed a speech-to-speech system. First, the user gives the speech input in any of four languages among English, Kannada, Hindi, and Telugu, and then the model converts the speech to text using the Google API.

Satoshi Nakamura, Konstantin Markov, Takatoshi Jitsuhiro, Jin-Song Zhang, Hirofumi Yamamoto, and Genichiro Kikui [4] are developing a speech-to-speech translation system at the Advanced Telecommunication
Authorized licensed use limited to: Nottingham Trent University. Downloaded on June 24,2021 at 04:14:20 UTC from IEEE Xplore. Restrictions apply.
Proceedings of the Third International Conference on Intelligent Sustainable Systems [ICISS 2020]
IEEE Xplore Part Number: CFP20M19-ART; ISBN: 978-1-7281-7089-3
Research Institute, Kyoto, Japan. It is a multilingual speech recognition system that supports the Japanese, English, and Chinese languages. A tighter integration of speech-to-text with image recognition or identification lends the method proposed in this research greater accuracy and novelty.

Sandeep Kumar, Aman Balyan, and Manvi Chawla [5] in 2017 proposed an object detection model named the “Easynet model”. The Easynet model looks at the complete image during the testing phase, so its predictions are informed by the global context. At prediction time, the model generates confidence scores for the presence of an object in a particular category. The model makes predictions with a single network evaluation.

Krishnaveni, G., Lalitha Bhavani, B., and Vijaya Lakshmi, N.V.S.K. [6] proposed an enhanced approach for object detection using a wavelet-based neural network. In their work, a novel characterization system called Affluence-based Image Classification (AIC) is proposed utilizing a wavelet-based neural network system (WNS).

Inthiyaz, S., Ahammad, S.H., Sai Krishna, A., Bhargavi, V., Govardhan, D., and Rajesh, V. [7] proposed a YOLO medical image model which computes the network as a regression of an input image and builds individual bounding boxes for every associated class object with greater accuracy. The model convolves itself into a single neural network and maps the image pixels straight to bounding box coordinates and object classes.

Mandhala, V.N., Bhattacharyya, D., Vamsi, B., and Thirupathi Rao, N. [8] in 2018 developed an object detection model to assist visually impaired people. Their contribution focused on developing computer vision algorithms combined with a deep neural network to assist visually impaired individuals’ mobility in clinical environments by accurately detecting doors, stairs, and signage, the most remarkable landmarks.

J. Olabe, A. Santos, R. Martinez, E. Munoz, M. Martinez, A. Quilis, and J. Bernstein [9] developed a real-time text-to-speech conversion system for Spanish that accepts a continuous source of alphanumeric characters (up to 250 words per minute) and produces good-quality, natural Spanish output as described by the user.

Kishan Kumar, Shyam Nandan, Ashutosh Mishra, Kanv Kumar, and V. K. Mittal [10] developed a voice-controlled object-tracking smart robot. The robot navigates as per the voice-command signal, and it also tracks the desired object. The voice-command signal processing is carried out in real time, using an on-time cloud server that converts it to text format. The command text is then transferred to the robot via a Bluetooth network to control its differential drive.

Speech Recognition:
Speech recognition is the ability of a device or system to recognize spoken words and convert them into readable text. Speech recognition is utilized in different fields of research in computer science, linguistics, natural language processing, and computer engineering. Many modern devices come with speech recognition functions for enabling hands-free use of the device.

Existing models for Speech Recognition are Speech Recognition Using the Google Cloud Speech API, Speech Recognition Using Deep Neural Networks, and Speech Recognition Using Hidden Markov Models. These existing models can be utilized as an application of speech-to-text translation for further identification of objects.

Object Detection:
Object detection is a computer vision technique that allows us to locate or detect particular objects with a bounding box.
• Input: An image with multiple objects.
• Output: A bounding box around the specified object and a class label for the bounding box.

Existing models for Object Detection are Custom Object Detection using TensorFlow, YOLO Object Detection using OpenCV, and SpeechYOLO: Detection and Localization of Speech Objects.

III. PROPOSED WORK

A. Importing all the packages:
To run this model, you need to import all the essential packages and libraries. To obtain the speech input, the user needs to import the SpeechRecognition package, which contains many inbuilt methods such as the listen method and the recognize_google method. The recognizer helps to recognize the speech and convert it into the desired text.

Later, you need to install the PyAudio package, which is used to record the voice data of the user through the microphone of the device. When you run the SpeechRecognition module, it takes the speech input, converts it to text, and stores it in a file; the content of the file is used as a coco name.

The YOLO object detection model helps us to detect the object and to generate a boundary box around it. The Darknet framework helps to train the model. You need to download and compile Darknet to work with the GPU in the cloud.

B. Methodology:
This algorithm for Object Detection based on Speech-to-Text conversion works when the user gives a speech command to the system to detect a particular object among the various objects available on the screen. Google speech recognition learns from real search-engine strings, can recognize different kinds of accents, and currently works based on Long Short-Term Memory Recurrent Neural Networks (LSTM RNNs). These LSTM RNNs have recurrent connections and memory cells that allow them to remember the data provided by the user.

Once the model collects your speech data, it breaks the whole audio down into individual sound waves; these sound waves are converted into a digital format, and
then the model finds the most probable word fit in that language by looking at the different treebanks available online; this entire process is done using models such as the Hidden Markov Model and natural language processing techniques. The next step performed by the model is converting the speech into text. For converting the speech into text, the recognize_google method is used, which is available in the speech_recognition package.

After receiving the input and converting it into text, the remaining process is done in the backend with the help of the Darknet framework. The converted text data is stored in a file, and the text in the file is used as a coco name, which acts as training data for the YOLO object detection model. Later, the images are labeled using the LabelImg software, an image annotation tool. The user needs to install this tool, and for this model the annotations should be saved in YOLO format instead of PASCAL VOC format. You then click and release the mouse to choose a region in the image to annotate, label the rectangular box with your desired object name, and the annotation is saved to the specified folder. After labeling the images, the user has to upload the labeled image into the Google cloud along with the classes file and the file containing the coordinates of the image. Now the image is trained using the Darknet framework in the cloud using a free GPU.

First, the user needs to install the Darknet framework and then mount Google Drive so that the model can access the training data, and the user needs to customize the configuration file based on the number of classes in their classes file. Now, with all the training data and the configuration file, the user needs to train the model using Darknet, and after running the model for a few iterations the weights file is obtained. This weights file is utilized in running the YOLO object detection model. Now, with the help of the coco names file, the weights file, the configuration file, and the trained image, the user needs to run the object detection model. It checks the confidence score of each pixel of the image. If the object is near the confidence score of the trained image, then a boundary box is generated and the required object is detected and represented with a rectangular box specifying the name of the object.

The dimensions of the rectangular boundary box are generated based on the equations below:

center_x = int(detection[0] * width)
center_y = int(detection[1] * height)
w = int(detection[2] * width)
h = int(detection[3] * height)

Step 1: Start.
Step 2: Place the microphone near the person or device by which the speech input is going to be given.
Step 3: The user gives the speech data to the model.
Step 4: If the system recognizes the speech, it is recorded and used for translation of speech to text using the Google Speech Recognition API; else, ask the user to speak clearly.
Step 5: The SpeechRecognition package performs the recognition of speech.
Step 6: The speech is converted to text using the recognize_google method.
Step 7: The obtained text is stored in a coco names file.
Step 8: Label the images using the LabelImg software.
Step 9: Train the image using the Darknet framework in the cloud using a free GPU.
Step 10: Obtain the weights file after training the image using the Darknet framework.
Step 11: Run the YOLO object detection model with all the training data.
Step 12: The model checks the confidence score of each pixel of the image.
Step 13: If the object is near the confidence score of the trained image, a bounding box is generated and the required object is detected.
Step 14: Else, repeat steps 1 through 13.
Step 15: End.

C. Flow chart:
The workflow of this model begins when the user gives input to the system, and the recognized speech is converted into text using the Google Speech Recognition API. The obtained text is stored in a file named coco names. The labelling of the images is done using the LabelImg software, and then the images are trained using the Darknet framework in the cloud using a free GPU. The weights file is obtained after training the image with the Darknet framework. Later, the user needs to run the YOLO object detection model with all the training data and the configuration file. The model checks the confidence score of each pixel of the image. If the object is near the score of the trained image, then a boundary box is generated.
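The box equations and Steps 11–13 above can be sketched in Python with OpenCV's dnn module. This is a minimal sketch, not the authors' exact implementation: the file names (yolov3.cfg, yolov3.weights, coco.names, input.jpg), the 416×416 blob size, and the 0.5 confidence threshold are illustrative assumptions, not values fixed by the paper.

```python
def detection_to_box(detection, width, height):
    """Map one YOLO detection row (normalized center-x, center-y, w, h, ...)
    to pixel coordinates using the equations from the text, returning the
    top-left corner plus size that OpenCV's rectangle drawing expects."""
    center_x = int(detection[0] * width)
    center_y = int(detection[1] * height)
    w = int(detection[2] * width)
    h = int(detection[3] * height)
    return center_x - w // 2, center_y - h // 2, w, h

def detect_and_draw(image_path="input.jpg", cfg="yolov3.cfg",
                    weights="yolov3.weights", names="coco.names",
                    threshold=0.5):
    # cv2 and numpy are imported here so the box math above stays
    # dependency-free; all file names are illustrative placeholders.
    import cv2
    import numpy as np

    net = cv2.dnn.readNetFromDarknet(cfg, weights)
    classes = open(names).read().strip().split("\n")
    image = cv2.imread(image_path)
    height, width = image.shape[:2]

    # Forward pass over a normalized 416x416 blob of the input image.
    blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416),
                                 swapRB=True, crop=False)
    net.setInput(blob)
    for output in net.forward(net.getUnconnectedOutLayersNames()):
        for detection in output:
            scores = detection[5:]
            class_id = int(np.argmax(scores))
            if scores[class_id] > threshold:  # Steps 12-13: confidence check
                x, y, w, h = detection_to_box(detection, width, height)
                cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
                cv2.putText(image, classes[class_id], (x, y - 5),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    cv2.imwrite("output.jpg", image)
    return image
```

Keeping the coordinate arithmetic in its own small function makes the paper's four equations directly testable without loading any network weights.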
IV. RESULTS
Initially, when you run the model, the process starts by receiving the speech input, and then the speech is converted to text. This converted text acts as the name of the object that is to be detected by the model. As shown in Fig. 4, if the user gives the input as ‘Clock’, the output is a bounding box named clock displayed on the output screen.

Fig. 5: Output of the Cat object if the given input is Cat.
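The speech-capture stage described here (Steps 2–7 of the methodology) can be sketched as follows. This is an illustrative sketch using the SpeechRecognition package's documented API, not the authors' exact code; the helper names and the coco.names path are assumptions, and microphone access requires the PyAudio package.

```python
def normalize_command(text):
    """Reduce a recognized phrase to the lower-case object name that is
    written into the coco names file."""
    return text.strip().lower()

def capture_object_name(names_file="coco.names"):
    # Imported here; this module is provided by the SpeechRecognition package.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.Microphone() as source:  # microphone capture needs PyAudio
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)
    try:
        # recognize_google sends the audio to the Google Speech Recognition API.
        text = recognizer.recognize_google(audio)
    except sr.UnknownValueError:
        # Step 4, else branch: the speech was not understood.
        raise RuntimeError("Please speak clearly and try again.")
    name = normalize_command(text)
    with open(names_file, "w") as f:
        f.write(name)  # Step 7: store the text in the coco names file
    return name
```

For example, a spoken command recognized as “Clock” would be stored as the class name clock and then looked up by the detection stage.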