
International Journal of Computer Science Trends and Technology (IJCST) – Volume 9 Issue 2, Mar-Apr 2021

RESEARCH ARTICLE OPEN ACCESS

Voice Recognition and Text to Speech


Swati [1], Harpreet Kaur [2]
Supporting Team: Appidi Moni, Battu Mercy, Rotta Deven, P. Nishanth, G. Chandra Sekhar Naidu
Computer Science and Engineering Department
Lovely Professional University - Punjab

ABSTRACT
This paper presents software with which the user can control computer functions and dictate text by voice. The project has two
components: the first processes the signal captured by a microphone, converting words to signals; the second captures the
signals and converts them back into words. Voice is the basic method of communication for interacting with other people. This
technology helps systems respond correctly to human voices and provide valuable services. Communicating with a computer
using voice commands is fast.
Keywords: - Voice Recognition, Text to Speech

I. INTRODUCTION
Voice is one of the basic forms of communication and the natural way of interacting with people. Many speech technologies exist today, each serving different tasks. People prefer these kinds of services because communicating with a computer by voice is faster than the alternatives.
By developing a voice recognition and text to speech system this task can be accomplished. Such a system allows the computer to translate voice into text: it is the process of converting a signal into words and vice versa.

II. VOICE RECOGNITION
Speech recognition is a technique used to identify spoken words and convert them into a format the machine understands. This technology is based on parameters [1] such as vocal sound and vocabulary. Each person has a different tone of voice, so our project design supports different tone modes that match different speakers' voices.
1. Vocal Sound: One of the main roles of this project is recognizing the voice of the speaker. Some people have a habit of speaking continuously while others leave gaps between the words, so the project needs to catch all kinds of tones and phrases.
2. Vocabulary: Vocabulary is the most important part of understanding what a person is speaking. The system needs to perform well, and the vocabulary determines the complexity of the system. We will now look at the basic speech recognition model and the speech to text conversion methods.

Speech Recognition Model:
The speech recognition model follows these steps: pre-processing, feature extraction, acoustic modelling, language modelling, and pattern classification. These work as follows:
1. Pre-processing:
The analog voice signal is transformed into a digital signal, which is later used for processing. First, the digital signal is passed through a filter that flattens the spectrum; this helps increase the signal energy at higher frequencies.

Fig.1. Architecture of Speech Recognition

2. Feature Extraction:
In this process, parameters related to the speech signal are computed from the waveforms; the main purpose of these parameters is to represent the input signal. The technique most commonly used for feature extraction is Linear Predictive Coding (LPC).
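The filtering described in the pre-processing step is typically realized as a first-order pre-emphasis filter. The paper gives no implementation, so the following is a minimal sketch; the coefficient value 0.97 is a conventional default, not taken from the paper.

```python
import numpy as np

def pre_emphasize(signal, alpha=0.97):
    """First-order pre-emphasis filter: y[n] = x[n] - alpha * x[n-1].

    Flattens the spectrum of the digitized speech signal, boosting
    the energy at higher frequencies before feature extraction.
    """
    x = np.asarray(signal, dtype=float)
    # Keep the first sample unchanged; difference the rest.
    return np.concatenate(([x[0]], x[1:] - alpha * x[:-1]))
```

A slowly varying (low-frequency) signal is almost entirely suppressed, while rapid sample-to-sample changes pass through largely unchanged.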

ISSN: 2347-8578 www.ijcstjournal.org Page 114



Linear Predictive Coding (LPC): [2] This is the basic speech recognition technique. Using this technique, the digital signal is blocked into frames of some X samples. Each frame is windowed to minimize any discontinuities in the signal, and the analysis is performed in the last step.

Fig.2. LPC Feature Extraction Process

3. Acoustic Models:
For any Automated Speech Recognition (ASR) system this is the fundamental part, because this is where the phonetics is established. The acoustic model establishes a relation between phonetics and the basic speech.
4. Language Models:
This is used to find the probability of occurrence of a word after a sequence of words. It is mainly used to distinguish words and phrases which have a similar sound.
5. Pattern Classification:
The purpose of this step is to find the similarity between unknown patterns and the existing sounds. After the completion of training we test the system. At testing time the patterns are classified, and the system is able to recognize the speech. Before this process we store representations of all dictionary words and train the system on all relevant aspects.

III. 3A. SPEECH TO TEXT CONVERSION

It is the process of converting spoken words into written text. In the last part we talked about the voice recognition process; speech to text conversion is similar to it. It is used to understand what we have spoken, after which the system converts those voice signals into the written format. Speech to text follows the same steps as discussed in Fig.1.
In this process we use models suited for real-time speech to text conversion, which is most useful for mobile users. The model is based on the following parameters: recognition speed and accuracy.
1. Recognition Speed:
If the system takes a long time to recognize the speech, it loses its significance. The signal undergoes these steps: [3]
a. Pre-processing: This is the first step, in which the speech signal is converted into frames and a unique sample is produced.
b. Training: This part builds representative features using one or more training patterns that correspond to speech signals.
c. Recognition: This is the process of comparing an unknown test pattern with the original sound pattern; the system measures their similarity.
2. Recognition Accuracy:
Recognition is the process of comparing an unknown test pattern with the original pattern, as discussed above, and measuring the similarity between the patterns. This is an important factor for recognition.
Pre-processing is an important part of speech recognition. It removes unwanted waveforms from a signal; the signal is passed through filters which remove the noise coming from the background.

3B. TEXT TO SPEECH CONVERSION
Text to speech conversion is a process in which the system converts input text into voice output. First the input text is analyzed, then it is processed and the system understands the text. Finally the text is converted into digital audio and the system speaks the text. The process of text to speech conversion is as follows:

Fig.3. Text-to-speech conversion
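The LPC technique shown in Fig.2 blocks the signal into frames and analyzes each one. As a rough illustration of what that analysis computes, the sketch below estimates the LPC coefficients of a single frame using the autocorrelation method with the Levinson-Durbin recursion, a standard way of realizing LPC; the frame and model order in the example are placeholders.

```python
import numpy as np

def lpc(frame, order):
    """Levinson-Durbin solution of the LPC normal equations for one
    windowed speech frame (autocorrelation method).

    Returns coefficients a with a[0] = 1; the predictor is
    x[n] ~= -sum(a[k] * x[n - k] for k in 1..order).
    """
    x = np.asarray(frame, dtype=float)
    # Autocorrelation at lags 0..order.
    r = np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for the i-th order model.
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a
```

For a frame that decays as x[n] = 0.9 * x[n-1], a first-order model recovers a predictor coefficient close to -0.9, as expected.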

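The recognition step above compares an unknown test pattern with stored patterns and measures their similarity. One classical way to do this for template-based recognition is dynamic time warping (DTW), sketched below under the simplifying assumption that each pattern is a one-dimensional feature sequence; the templates in the example are invented.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences.

    DTW tolerates the timing differences between slow speakers and
    fast speakers by allowing a non-linear alignment of the sequences.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # d[i][j] = cheapest cost of aligning a[:i] with b[:j].
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def recognize(test, templates):
    """Return the label of the stored template most similar to `test`."""
    return min(templates, key=lambda label: dtw_distance(test, templates[label]))
```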

Text to speech conversion follows these steps:
Processing of Text: First the input text is analyzed; the system then handles all the abbreviations, matches the text, and finally converts it into a phonetic representation. After conversion into phonetics, the next stage is speech synthesis. [10]

Speech Synthesis: [5] There are several speech synthesis techniques, as follows:
1. Formant Synthesis:
Representations of speech are stored on the basis of parameters. For better performance a combination of the cascade and parallel structures is used. The two structures can also be used individually, but the combination of both gives the best results.
2. Concatenative Synthesis:
This type of speech synthesis generates the sequence of sounds from recordings of different users, which are stored in a database. The recordings may be phones, diphones, or triphones, where a phone is a single unit of sound, a diphone is the signal from the midpoint of one phone to the midpoint of the next, and a triphone is the signal taken in sequence from one phone through the next.

3C. LANGUAGE TRANSLATION
There are many languages in India and all over the world. Because there are so many languages, we need applications and processes which convert a text from one language to another. Machine Translation is a field of Artificial Intelligence which deals with translating from one language to another; this is done by a machine translation system. Let us now look at some of the machine translation models.
1. Rule Based Machine Translation: [8]
The translation is generated on the basis of analyzing both the source and the target language, so the system consists of a set of rules. The most important rules are the grammar rules, which basically cover syntax, semantics, and part-of-speech features. Along with these grammar structures, the system should also contain dictionary words for translation.
2. Example Based Machine Translation:
In this model we use texts which have already been translated. The translated texts are aligned with the original texts, and the original sentences can then be translated into any target language. To form a complete translation, all the phrases are put together.
3. Statistical Machine Translation:
This model is characterized by the use of Machine Learning methods and treats translation as a mathematical problem: every sentence in the target language is a translation of the source-language sentence with some probability. The higher the probability, the higher the accuracy of the translation, and vice versa.

IV. 4A. OBSERVATIONS

Observations on the different techniques discussed above:

MODELS | TECHNIQUES | FINDINGS
Speech recognition: Feature Extraction | Linear Predictive Coding (LPC) | Feature extraction method is used; analysis is done using a fixed resolution along with a frequency scale.
Speech recognition: Pattern Matching | Template Based | Errors like segmentation or classification errors are avoided.
Speech recognition: Pattern Matching | Knowledge Based | The system uses information such as phonetics.
Speech recognition: Pattern Matching | Neural Based | Complicated recognition tasks use this kind of method.
Speech to text conversion | Artificial neural network based | Increases the accuracy of speech recognition. [4]
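Statistical machine translation, as described above, chooses the target sentence with the highest probability. The toy sketch below scores candidate translations by multiplying an assumed translation probability with an assumed language-model probability; the sentences and probability values are invented purely for illustration.

```python
def best_translation(candidates):
    """Pick the candidate whose combined score is highest.

    `candidates` maps a candidate target sentence to a pair
    (translation_prob, language_model_prob); both values are assumed
    to come from previously trained translation and language models.
    """
    def score(sentence):
        t_prob, lm_prob = candidates[sentence]
        # Higher probability => more accurate translation.
        return t_prob * lm_prob
    return max(candidates, key=score)
```

The language-model factor is what prefers fluent word order: two candidates with equal translation probability are ranked by how likely each sentence is in the target language.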

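The "Processing of Text" stage of text to speech expands abbreviations before the text is converted to a phonetic representation. A minimal sketch, assuming a tiny hypothetical abbreviation table (a real TTS front end would use a much larger, language-specific dictionary):

```python
import re

# Hypothetical abbreviation table, invented for illustration.
ABBREVIATIONS = {"dr": "doctor", "st": "street", "etc": "et cetera"}

def normalize_text(text):
    """Expand abbreviations so every token can be given a phonetic form."""
    words = re.findall(r"[A-Za-z]+", text.lower())
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)
```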

MODELS | TECHNIQUES | FINDINGS (continued)
Text to speech conversion | Formant synthesis | This method is used to filter the speech.
Text to speech conversion | Concatenative synthesis | The duration varies with the implementation and is not fixed.
Machine Translation | Rule based machine translation | This method collects grammar rules and structure.
Machine Translation | Example based machine translation | This method's main concept is parallel texts.
Machine Translation | Statistical machine translation | This is based on statistics and probability.

All features of Voice Recognition and Text to Speech:

Vocabulary | Voice recognition and text to speech should have language support in all languages.
Speech Recognition | We need to receive real-time speech recognition results as the API processes audio input.
Speech Adaption | To provide hints and to improve the accuracy of transcription, speech recognition needs to transcribe specific terms, so we can automatically convert spoken names, numbers, addresses, years and more.
Speech to text on premises | While using Google's speech recognition technology, the system needs to have full control over infrastructure and speech data protection in our data centers.
Multichannel Recognition | Speech to text recognizes the distinct channels in multichannel situations like video conferences, and annotates the transcripts to preserve the order.
Noise robustness | Many environments have noise problems; speech to text can handle noisy audio from such environments.
Domain specific models | For domain-specific quality requirements, we can select trained models for voice control, phone calls and video transcription.
Content filtering | This is useful for detecting inappropriate content in audio data and filtering out unprofessional words in the text results.
Auto Detect Language | We can specify up to 4 languages for the speech to text conversion; in this case, even if there were mistakes in the spoken input, the text output does not contain mistakes.
Automatic Punctuation | In speech to text, the system automatically inserts punctuation like commas, question marks etc.

4B. KEY FEATURES
Speech Adaption:
To provide hints and to improve the accuracy of transcription, speech recognition needs to transcribe specific terms, so we can automatically convert spoken names, numbers, addresses, years and more.

Domain Specific Models:
For domain-specific quality requirements, we can select trained models for voice control, phone calls and video transcription.
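The content-filtering feature listed above masks unprofessional words in the text results. A minimal sketch, with a caller-supplied blocklist standing in for a real inappropriate-content detector:

```python
def filter_transcript(text, blocked_words):
    """Replace each blocked word in the transcript with asterisks,
    keeping the surrounding text and word lengths intact."""
    blocked = {w.lower() for w in blocked_words}
    def mask(word):
        return "*" * len(word) if word.lower() in blocked else word
    return " ".join(mask(w) for w in text.split())
```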


Speech Recognition:
We need to receive real-time speech recognition results as the API processes audio input. [17]
Speech to Text On-premises:
While using Google's speech recognition technology, the system needs to have full control over infrastructure and speech data protection in our data centers.

V. VOICE RECOGNITION AND TEXT TO SPEECH

Voice Recognition:
Technology in the field of communication is rapidly evolving. It has become very simple to use voice recognition to enter code, correct pronunciation, and dictate texts. The microphone icon is found on most on-screen keyboards, allowing users to quickly switch from typing to voice recognition. Speech recognition opens up a world of productive possibilities for certain disabled people who find it difficult or impossible to work with a mouse or keyboard. It may benefit people with physical disabilities and reduce the risk of repetitive strain injury from repeated typing or mouse use by freeing them from typing and keyboard use. Dyslexic users can write more fluently, correctly, and concisely. Speech recognition can be done easily and is less painful than traditional handwriting or typing. Enabling voice recognition in systems and promoting its use in the workplace can be a "positive adjustment" for employers, avoiding discrimination against disabled employees and increasing their productivity.
Most devices with capable hardware have voice recognition, so higher-end phones and tablets have strong microphones that enable voice input. Computers, too, often have built-in cameras, microphones, and speakers. Instead of typing on a keyboard, voice recognition can be used. At its most basic level, it offers a quick way to write on a computer, tablet, or smartphone. The user speaks through a headset, external microphone, or built-in microphone, and their words appear on the screen as text. This may be in a search engine text box, a chat or messenger programme, or an email or paper.
Speech recognition is a function of certain systems and programmes that can be set up to do more than just input text. It is possible to use it to power computers: with the right configuration, simple spoken commands will start and shut down a computer, as well as open and run various programmes and applications. This is especially important for people with physical disabilities who can control their devices using only voice commands. Speech recognition software is now integrated into many modern computers, laptops, and smartphones. However, depending on the system or device, specialist software can be required to achieve a high degree of control and functionality. A wide range of potential users will benefit greatly from voice recognition. Obviously, someone with a physical disability who finds typing challenging, painful, or impossible will benefit greatly from it. It may also help to reduce the risk of developing a repetitive strain injury (RSI) or to better control any upper limb condition.

Text to Speech:
Text-to-speech is a common assistive technology in which a monitor or tablet reads the words on the page out loud to the user. This technology is common among students who have reading disabilities, especially those who have trouble decoding. By receiving the words in an audible form, the student can reflect on the meaning of the words rather than devoting all of their mental resources to deciphering the sentences. While this technology helps students overcome their reading challenges and gain access to instructional materials, it does not aid in the development of reading skills. The amount of TTS software installed on both Android and Apple devices has steadily increased in recent years. It has also become common in the office as a method of helping users proofread their work.

VI. CONCLUSION

We have learned about voice recognition and text to speech along with their techniques, as well as some of their applications and usages. From all the techniques we have studied, we can conclude the following:
In voice recognition, also known as speech to text, the system works well at converting the speech signal to text; the only drawback of this technique is its feasibility.
Text to speech makes use of parallel synthesis, which works as the best text to speech converter. In text to speech conversion we also learned about hybrid machine translation; this translation technique is widely used because both rule-based and statistical machine translation techniques are used simultaneously.

ACKNOWLEDGEMENT

We humbly take this opportunity to express our gratitude to all of the guideposts who served as lighting pillars in guiding us through this project, resulting in the fruitful and satisfactory completion of this research.

REFERENCES

1. Bansal, Dipali, Neelam Turk, and Sunanda Mendiratta. "Automatic speech recognition by cuckoo search optimization based artificial neural network classifier." In 2015 International Conference on Soft Computing Techniques and Implementations (ICSCTI), pp. 29-34. IEEE, 2015.
2. Seide, Frank, Gang Li, Xie Chen, and Dong Yu. "Feature engineering in context-dependent deep neural networks for conversational speech transcription." In 2011 IEEE Workshop on Automatic Speech Recognition & Understanding, pp. 24-29. IEEE, 2011.
3. Saksamudre, Suman K., P. P. Shrishrimal, and R. R. Deshmukh. "A review on different approaches for speech recognition system." International Journal of Computer Applications 115, no. 22 (2015).


4. Jadhav, Ms Anuja, and Arvind Patil. "Real Time Speech to Text Converter for Mobile Users."
5. Mache, Suhas R., Manasi R. Baheti, and C. Namrata Mahender. "Review on text-to-speech synthesizer." International Journal of Advanced Research in Computer and Communication Engineering 4, no. 8 (2015): 54-59.
6. Kurzekar, Pratik K., Ratnadeep R. Deshmukh, Vishal B. Waghmare, and Pukhraj P. Shrishrimal. "A comparative study of feature extraction techniques for speech recognition system." International Journal of Innovative Research in Science, Engineering and Technology 3, no. 12 (2014): 18006-18016.
7. Kalyani, Aditi, and Priti S. Sajja. "A review of machine translation systems in India and different translation evaluation methodologies." International Journal of Computer Applications 121, no. 23 (2015).
8. Alawneh, Mouiad Fadiel, and Tengku Mohd Sembok. "Rule-based and example-based machine translation from English to Arabic." In 2011 Sixth International Conference on Bio-Inspired Computing: Theories and Applications, pp. 343-347. IEEE, 2011.
9. Rabiner, Lawrence. "Fundamentals of speech recognition." Fundamentals of speech recognition (1993).
10. Tokuda, Keiichi, Yoshihiko Nankaku, Tomoki Toda, Heiga Zen, Junichi Yamagishi, and Keiichiro Oura. "Speech synthesis based on hidden Markov models." Proceedings of the IEEE 101, no. 5 (2013): 1234-1252.
11. Anusuya, M. A., and Shriniwas K. Katti. "Speech recognition by machine, a review." arXiv preprint arXiv:1001.2267 (2010).
12. Price, Michael, James Glass, and Anantha P. Chandrakasan. "A low-power speech recognizer and voice activity detector using deep neural networks." IEEE Journal of Solid-State Circuits 53, no. 1 (2017): 66-75.
13. Fohr, Dominique, Odile Mella, and Irina Illina. "New paradigm in speech recognition: deep neural networks." In IEEE International Conference on Information Systems and Economic Intelligence, 2017.
14. Benkerzaz, Saliha, Youssef Elmir, and Abdeslam Dennai. "A study on automatic speech recognition." Journal of Information Technology Review 10, no. 3 (2019): 77-85.
15. Katyal, Anchal, Amanpreet Kaur, and Jasmeen Gill. "Automatic speech recognition: a review." International Journal of Engineering and Advanced Technology (IJEAT) 3, no. 3 (2014): 71-74.
16. Hain, Thomas, and Asmaa El Hannani. "Automatic speech recognition for scientific purposes."
17. Anusuya, M. A., and Shriniwas K. Katti. "Speech recognition by machine, a review." arXiv preprint arXiv:1001.2267 (2010).
18. Reddy, D. Raj. "Approach to computer speech recognition by direct analysis of the speech wave." The Journal of the Acoustical Society of America 40, no. 5 (1966): 1273-1273.
19. Weintraub, Mitch, Hy Murveit, Michael Cohen, Patti Price, Jared Bernstein, Gay Baldwin, and Don Bell. "Linguistic constraints in hidden Markov model based speech recognition." In International Conference on Acoustics, Speech, and Signal Processing, pp. 699-702. IEEE, 1989.
20. Abdulla, Waleed H. "HMM-based techniques for speech segments extraction." Scientific Programming 10, no. 3 (2002): 221-239.
