Cognitive Model For Object Detection Based On Speech-to-Text Conversion
Shahana Bano
Department of CSE
Koneru Lakshmaiah Education Foundation
Vaddeswaram, India
shahanabano@icloud.com

Ramini Manideep
Department of CSE
Koneru Lakshmaiah Education Foundation
Vaddeswaram, India
manideepramini@gmail.com
Abstract— The goal of this paper is to develop a model that is an integrated version of both Speech Recognition and Object Detection. This model was developed after a survey of the literature and of the existing models related to Object Detection and Speech Recognition. Several types of Speech Recognition and Object Detection models are available so far. In addition to the existing models, this paper proposes a new model named “Cognitive Model for Object Detection based on Speech-to-Text Conversion,” which is an integrated version of both Speech Recognition and Object Detection models. First, a speech command is provided as input to the model; the model takes the command, processes the data, and then detects the specified object from a source of images. The detected object is represented with a rectangular box. This approach is implemented with the help of the Google Speech Recognition and YOLO object detection models, utilizing the Darknet and OpenCV frameworks.

Keywords— SpeechRecognition, YOLO, Object Detection, Google cloud GPU, Darknet, LabelImg, OpenCV.

I. INTRODUCTION

Cognitive models play a vital role in the enhancement of AI-enabled technologies. Speech is one of the finest means of communication; it is used for exchanging feelings and thoughts with one another. Object detection enables us to identify and locate objects in an image or video.

As mentioned earlier, speech is the finest means of communication, and the way of speaking needs to be clear to avoid ambiguity. Some people can communicate orally but cannot type and interact with a computer or device. Considering this aspect, this model is developed to detect a specified object from an image that contains multiple objects, based on the speech commands provided by the user. This approach can be implemented in several ways, but this model receives the speech input from the user through a microphone; the SpeechRecognition package then converts it into text, and the backend process is performed by the Darknet framework and the YOLO object detection model, which provides the output as the object detected among the other objects available in the image. When a user provides voice input to the system, it converts the speech to text, and the text is stored as the name of the object in a file named coco names. Later, the images will be trained using the Darknet framework on the cloud using the free GPU, and then the weights file and configuration file are generated. Now the YOLO object detection model comes into the picture. The model checks the confidence score of each pixel of the image. If the object is near the score of the trained image, then a boundary box is generated and the required object is detected.

The user needs to import a few packages required for implementing the SpeechRecognition and Object Detection modules, namely the SpeechRecognition and PyAudio packages, and for performing computer vision operations on an image the OpenCV library is to be imported. The training of the dataset of images is done with the Darknet framework. This approach is implemented using the Google SpeechRecognition model and a Deep Neural Network algorithm, the YOLO object detection model.

II. LITERATURE SURVEY

MIT computer scientists [1] have developed a model that identifies and detects objects within an image based on a spoken description of the image. The input of the model is an image and an audio caption; the model highlights in real time the relevant regions or boundary boxes of the image.

Bishal Heuju, Bishal Lakha, Dipkamal Bhusal, and Kanhaiya Lal Shrestha [2] in 2016 developed a voice-command-based object-recognizing robot using speech and image feature extraction. The robot is developed to recognize the command ‘identify’ and then identify the object; when the command given is ‘follow’, the robot tracks the object.

Swetha V Patil and Pradeep N [3] in 2019 proposed a speech-to-speech system. First, the user gives the speech input in any of four languages among English, Kannada, Hindi, and Telugu, and then the model converts the speech to text using the Google API.

Satoshi Nakamura, Konstantin Markov, Takatoshi Jitsuhiro, Jin-Song Zhang, Hirofumi Yamamoto, and Genichiro Kikui [4] are developing a speech-to-speech translation system at the Advanced Telecommunication
Authorized licensed use limited to: Nottingham Trent University. Downloaded on June 24,2021 at 04:14:20 UTC from IEEE Xplore. Restrictions apply.
Proceedings of the Third International Conference on Intelligent Sustainable Systems [ICISS 2020]
IEEE Xplore Part Number: CFP20M19-ART; ISBN: 978-1-7281-7089-3
Research Institute, Kyoto, Japan. It is a multilingual speech recognition system that supports the Japanese, English, and Chinese languages. A tighter integration of speech-to-text with image recognition or identification lends the method proposed in this research greater accuracy and novelty.

Sandeep Kumar, Aman Balyan, and Manvi Chawla [5] in 2017 proposed an object detection model named the “Easynet model”. The Easynet model looks at the complete image during the testing phase, so its predictions are informed by the global context. At prediction time, the model generates confidence scores for the presence of an object in a particular category. The model makes predictions with a single network evaluation.

Krishnaveni, G., Lalitha Bhavani, B., and Vijaya Lakshmi, N.V.S.K. [6] proposed an enhanced approach for object detection using a wavelet-based neural network. In their work, a novel characterization system called Affluence-based Image Classification (AIC) is proposed utilizing a wavelet-based neural network system (WNS).

Inthiyaz, S., Ahammad, S.H., Sai Krishna, A., Bhargavi, V., Govardhan, D., and Rajesh, V. [7] proposed a YOLO medical image model which computes the network as a regression of an input image and builds individual bounding boxes for every associated class object with greater accuracy. The model convolves itself into a single neural network and maps the image pixels straight to bounding box coordinates and object classes.

Mandhala, V.N., Bhattacharyya, D., Vamsi, B., and Thirupathi Rao, N. [8] in 2018 developed an object detection model to assist visually impaired people. Their contribution focused on developing computer vision algorithms combined with a deep neural network to assist visually impaired individuals’ mobility in clinical environments by accurately detecting doors, stairs, and signage, the most remarkable landmarks.

J. Olabe, A. Santos, R. Martinez, E. Munoz, M. Martinez, A. Quilis, and J. Bernstein [9] developed a real-time text-to-speech conversion system for Spanish that accepts a continuous source of alphanumeric characters (up to 250 words per minute) and produces good-quality, natural Spanish output as described by the user.

Kishan Kumar, Shyam Nandan, Ashutosh Mishra, Kanv Kumar, and V. K. Mittal [10] developed a voice-controlled object-tracking smart robot. The robot navigates as per the voice-command signal, and it also tracks the desired object. The voice-command signal processing is carried out in real time, using an on-time cloud server that converts it to text format. The command text is then transferred to the robot via a Bluetooth network to control its differential drive.

Speech Recognition:
Speech recognition is the ability of a device or system to recognize spoken words and convert them into readable text. Speech recognition is utilized in different fields of research in computer science, linguistics, natural language processing, and computer engineering. Many modern devices come with speech recognition functions for enabling hands-free use of the device.

Existing models for Speech Recognition are Speech Recognition Using the Google Cloud Speech API, Speech Recognition Using Deep Neural Networks, and Speech Recognition Using Hidden Markov Models. These existing models can be utilized as an application of speech-to-text translation for further identification of objects.

Object Detection:
Object detection is a computer vision technique that allows us to locate or detect particular objects with a bounding box.
• Input: An image with multiple objects.
• Output: A bounding box around the specified object and a class label for the bounding box.

Existing models for Object Detection are Custom Object Detection using TensorFlow, YOLO Object Detection using OpenCV, and SpeechYOLO: Detection and Localization of Speech Objects.

III. PROPOSED WORK

A. Importing all the packages:
To run this model, you need to import all the essential packages and libraries. To obtain the speech input, the user needs to import the SpeechRecognition package, which contains many inbuilt methods such as the listen method and the recognize_google method. The recognizer helps to recognize the speech and convert it into the desired text.

Later, you need to install the PyAudio package, which is used to record the voice data of the user through the microphone of the device. When you run the SpeechRecognition module, it takes the speech input, converts it to text, and stores it in a file; the content of the file is used as a coco name.

The YOLO object detection model helps us to detect the object and to generate a boundary box around it. The Darknet framework helps to train the model. You need to download and compile Darknet to work with the GPU in the cloud.

B. Methodology:
This algorithm for Object Detection based on Speech-to-Text conversion works when the user gives a speech command to the system to detect a particular object among the various objects available on the screen. Google speech recognition learns from real search-engine strings, can recognize different kinds of accents, and currently works based on Long Short-Term Memory Recurrent Neural Networks (LSTM RNNs). These LSTM RNNs have recurrent connections and memory cells that allow them to remember the data provided by the user.

Once the model collects your speech data, it breaks the whole audio down into individual sound waves; these sound waves are converted into a digital format, and
then the model finds the most probable word fit in that language by looking at the different treebanks available online; this entire process is done using models such as the Hidden Markov Model and natural language processing techniques. The next step performed by the model is converting the speech into text. For converting the speech into text, the recognize_google method is used, which is available in the speech_recognition package.

After receiving the input and converting it into text, the remaining process is done in the backend with the help of the Darknet framework. The converted text data is stored in a file, and the text in the file is used as a coco name, which acts as training data for the YOLO object detection model. Later, the images are labeled using the LabelImg software, an image annotation tool. The user needs to install this tool, and for this model the annotations should be saved in YOLO format instead of PASCAL VOC format. You then click and release the mouse to choose a region in the image to annotate, label the rectangular box with your desired object name, and the annotation is saved to the specified folder. After labeling the images, the user has to upload the labeled image into the Google cloud along with the classes file and the file containing the coordinates of the image. Now the image is trained using the Darknet framework in the cloud using a free GPU.

First, the user needs to install the Darknet framework and then mount Google Drive so that the model can access the training data, and the user needs to customize the configuration file based on the number of classes in their classes file. Now, with all the training data and the configuration file, the user needs to train the model using Darknet, and after running the model for a few iterations the weights file is obtained. This weights file is utilized in running the YOLO object detection model. Now, with the help of the coco names file, the weights file, the configuration file, and the trained image, the user needs to run the object detection model. It checks the confidence score of each pixel of the image. If the object is near the confidence score of the trained image, then a boundary box is generated and the required object is detected and represented with a rectangular box specifying the name of the object.

The dimensions of the rectangular boundary box are generated based on the equations below:

center_x = int(detection[0] * width)
center_y = int(detection[1] * height)
w = int(detection[2] * width)
h = int(detection[3] * height)

Step 1: Start.
Step 2: Place the microphone near the person or device by which the speech input is going to be given.
Step 3: The user gives the speech data to the model.
Step 4: If the system recognizes the speech, it is recorded and used for translation of speech to text using the Google Speech Recognition API; else, ask the user to speak clearly.
Step 5: The SpeechRecognition package performs the recognition of speech.
Step 6: The speech is converted to text using the recognize_google method.
Step 7: The obtained text is stored in a coco names file.
Step 8: Label the images using the LabelImg software.
Step 9: Train the image using the Darknet framework in the cloud using a free GPU.
Step 10: Obtain the weights file after training the image using the Darknet framework.
Step 11: Run the YOLO object detection model with all the training data.
Step 12: The model checks the confidence score of each pixel of the image.
Step 13: If the object is near the confidence score of the trained image, a bounding box is generated and the required object is detected.
Step 14: Else, repeat steps 1 through 13.
Step 15: End.

C. Flow chart:
The workflow of this model begins when the user gives input to the system, and the recognized speech is converted into text using the Google Speech Recognition API. The obtained text is stored in a file named coco names. The labelling of the images is done using the LabelImg software, and then the images are trained using the Darknet framework in the cloud using a free GPU. The weights file is obtained after training the image with the Darknet framework. Later, the user needs to run the YOLO object detection model with all the training data and the configuration file. The model checks the confidence score of each pixel of the image. If the object is near the score of the trained image, then a boundary box is generated.
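The box equations and Steps 11–13 above can be sketched in Python with OpenCV's dnn module. This is a minimal sketch, not the authors' exact implementation: the file names (yolov3.cfg, yolov3.weights, coco.names, input.jpg), the 416×416 blob size, and the 0.5 confidence threshold are illustrative assumptions, not values fixed by the paper.

```python
def detection_to_box(detection, width, height):
    """Map one YOLO detection row (normalized center-x, center-y, w, h, ...)
    to pixel coordinates using the equations from the text, returning the
    top-left corner plus size that OpenCV's rectangle drawing expects."""
    center_x = int(detection[0] * width)
    center_y = int(detection[1] * height)
    w = int(detection[2] * width)
    h = int(detection[3] * height)
    return center_x - w // 2, center_y - h // 2, w, h

def detect_and_draw(image_path="input.jpg", cfg="yolov3.cfg",
                    weights="yolov3.weights", names="coco.names",
                    threshold=0.5):
    # cv2 and numpy are imported here so the box math above stays
    # dependency-free; all file names are illustrative placeholders.
    import cv2
    import numpy as np

    net = cv2.dnn.readNetFromDarknet(cfg, weights)
    classes = open(names).read().strip().split("\n")
    image = cv2.imread(image_path)
    height, width = image.shape[:2]

    # Forward pass over a normalized 416x416 blob of the input image.
    blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416),
                                 swapRB=True, crop=False)
    net.setInput(blob)
    for output in net.forward(net.getUnconnectedOutLayersNames()):
        for detection in output:
            scores = detection[5:]
            class_id = int(np.argmax(scores))
            if scores[class_id] > threshold:  # Steps 12-13: confidence check
                x, y, w, h = detection_to_box(detection, width, height)
                cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
                cv2.putText(image, classes[class_id], (x, y - 5),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    cv2.imwrite("output.jpg", image)
    return image
```

Keeping the coordinate arithmetic in its own small function makes the paper's four equations directly testable without loading any network weights.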
IV. RESULTS
Initially, when you run the model, the process starts by receiving the speech input, and then the speech is converted to text. This converted text acts as the name of the object that is to be detected by the model. As shown in Fig. 4, if the user gives the input as ‘Clock’, the output is a bounding box named clock displayed on the output screen.

Fig. 5: Output of the Cat object if the given input is Cat.
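The speech-capture stage described here (Steps 2–7 of the methodology) can be sketched as follows. This is an illustrative sketch using the SpeechRecognition package's documented API, not the authors' exact code; the helper names and the coco.names path are assumptions, and microphone access requires the PyAudio package.

```python
def normalize_command(text):
    """Reduce a recognized phrase to the lower-case object name that is
    written into the coco names file."""
    return text.strip().lower()

def capture_object_name(names_file="coco.names"):
    # Imported here; this module is provided by the SpeechRecognition package.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.Microphone() as source:  # microphone capture needs PyAudio
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)
    try:
        # recognize_google sends the audio to the Google Speech Recognition API.
        text = recognizer.recognize_google(audio)
    except sr.UnknownValueError:
        # Step 4, else branch: the speech was not understood.
        raise RuntimeError("Please speak clearly and try again.")
    name = normalize_command(text)
    with open(names_file, "w") as f:
        f.write(name)  # Step 7: store the text in the coco names file
    return name
```

For example, a spoken command recognized as “Clock” would be stored as the class name clock and then looked up by the detection stage.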