Hindi Speech Recognition System Using HTK
Abstract: Speech recognition is the process of converting an acoustic waveform into text
that corresponds to the information conveyed by the speaker. At present, most speech
recognizers are based on Hidden Markov Models (HMMs). This paper aims to build a speech
recognition system for the Hindi language. The Hidden Markov Model Toolkit (HTK) is used
to develop the system, which recognizes isolated words using acoustic word models. The
system is trained for 30 Hindi words, with training data collected from eight speakers.
The experimental results show that the overall accuracy of the presented system is
94.63%.
Keywords: HMM; HTK; Mel Frequency Cepstral Coefficient (MFCC); Automatic Speech
Recognition (ASR); Hindi; Isolated word ASR.
1. Introduction
Speech is the most natural way of communication. Everyone learns a mother tongue from
childhood. Speech also provides an efficient means of man-machine communication.
Generally, information is transferred between human and machine via keyboard, mouse
etc., but humans can speak more quickly than they can type. Speech input offers
high-bandwidth information and relative ease of use. It also permits the user’s hands
and eyes to be busy with a task, which is particularly valuable when users are in motion
or in natural field settings (Al-Qatab et al., 2010). Similarly, speech output is
International Journal of Computing and Business Research
ISSN (Online) : 2229-6166
Volume 2 Issue 2 May 2011
more impressive and understandable than text output. Speech interfacing addresses these
needs. It involves speech synthesis and speech recognition. A speech synthesizer takes
text as input and converts it into speech output, i.e. it acts as a text-to-speech
converter. A speech recognizer converts the spoken word into text. This paper aims to
develop and implement a speech recognition system for the Hindi language.
1.1 Motivation
At present, owing to its versatile applications, speech recognition is one of the most promising
fields of research. Many daily-life activities, such as mobile applications, weather forecasting,
agriculture and healthcare, involve speech recognition. Asking vocally for information about
weather, agriculture etc. on the internet or on a mobile phone is much easier than communicating
via keyboard or mouse. Many international organizations and products, such as Microsoft (SAPI)
and Dragon NaturallySpeaking, as well as research groups are working in this field, especially
for European languages. Some work has also been done for South Asian languages, including Hindi
(Pruthi et al., 2000; Gupta, 2006; Rao et al., 2007; Deivapalan and Murthy, 2008; Elshafei
et al., 2008; Syama, 2008; Al-Qatab et al., 2010), but none provides an efficient solution for
the Hindi language. The lack of an effective Hindi speech recognition system and its local
relevance has motivated the authors to develop such a small-vocabulary system.
The authors have developed a Hindi speech recognition system for isolated words. Hidden Markov
Models (HMMs) are used to train on and recognize the speech, with MFCCs used to extract features
from the speech utterances. To accomplish this, the Hidden Markov Model Toolkit (HTK) (Young et
al., 2009; Hidden Markov Model Toolkit, 2011), designed for speech recognition, is used. HTK
was developed in 1989 by Steve Young at the Speech Vision and Robotics Group of the
Cambridge University Engineering Department (CUED). Initially, HTK training tools are used
to train HMMs using training utterances from a speech corpus. Then, HTK recognition tools are
used to transcribe unknown utterances and to evaluate system performance by comparing them
to reference transcriptions.
Apart from the introduction in section 1, the paper is organized as follows. Related work is
presented in section 2. Section 3 presents the architecture and functioning of the proposed ASR
system. Section 4 describes Hidden Markov Models and HTK. The Hindi character set is shown in
section 5. Section 6 deals with the implementation. Section 7 concludes the paper.
2. Related work
In the past decade, much work has been done in the field of speech recognition for the Hindi
language. Pruthi et al. (2000) describe a speaker-dependent, real-time, isolated word
recognizer for Hindi. The system uses a standard implementation: features are extracted using
LPC and recognition is carried out using HMMs. It was designed for two male speakers.
The recognition vocabulary consists of the Hindi digits (0, pronounced “shoonya”, to 9,
pronounced “nau”). Although the system gives good performance, its design is
speaker-specific and it uses a very small vocabulary.
An isolated word speech recognition tool for the Hindi language was designed by Gupta (2006)
using continuous HMMs. The system uses word acoustic models for recognition. Again, the
vocabulary contains the Hindi digits. The recognizer gives good results when tested on the
sounds used for training the model, and the results are satisfactory for other sounds too. The
system is highly efficient, but its vocabulary is too small. This paper tries to overcome these
shortcomings by using a vocabulary of thirty words. The system shows good performance in
speaker-independent environments.
The architecture of the developed speech recognition system is shown in figure 1. It consists of
two modules, a training module and a testing module. The training module generates the system
model that is used during testing. The various phases used during ASR are:
Preprocessing: The analog speech signal is first digitized. The digitized (sampled) speech
signal is then passed through a first-order filter to spectrally flatten the signal. This
process, known as pre-emphasis, increases the magnitude of the higher frequencies relative to
the lower frequencies. The next step is to block the speech signal into frames, with a frame
size ranging from 10 to 25 milliseconds and an overlap of 50%−70% between consecutive frames.
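The pre-emphasis and framing steps described above can be sketched as follows. This is a minimal illustration, not the toolkit's implementation: the function names are hypothetical, the pre-emphasis coefficient of 0.97 is a commonly used value assumed here (the paper does not state one), and the 25 ms frame with 60% overlap is one choice within the ranges given in the text.

```python
def pre_emphasize(samples, coeff=0.97):
    """First-order filter y[n] = x[n] - coeff * x[n-1].

    coeff = 0.97 is an assumed, commonly used value.
    """
    if not samples:
        return []
    out = [samples[0]]
    for n in range(1, len(samples)):
        out.append(samples[n] - coeff * samples[n - 1])
    return out


def frame_signal(samples, sample_rate, frame_ms=25.0, overlap=0.6):
    """Split the signal into overlapping frames of frame_ms milliseconds.

    overlap is the fraction shared by consecutive frames (0.5 to 0.7 in the text).
    Trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    shift = max(1, int(round(frame_len * (1.0 - overlap))))
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames


# One second of (silent) audio at 16 kHz, pre-emphasized and framed.
signal = pre_emphasize([0.0] * 16000)
frames = frame_signal(signal, sample_rate=16000)
print(len(frames), "frames of", len(frames[0]), "samples")
```

With these assumed settings a 25 ms frame holds 400 samples and consecutive frames start 160 samples (10 ms) apart.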
Feature Extraction: The goal of feature extraction is to find a set of properties of an
utterance that have acoustic correlates in the speech signal, that is, parameters that can
somehow be computed or estimated by processing the signal waveform. Such parameters are termed
features. The feature extraction process is expected to discard information irrelevant to the
task while keeping the useful information. It includes the process of measuring some important
characteristic of the signal such as energy or frequency response (i.e. signal measurement),
[Figure 1: Architecture of the developed speech recognition system. Blocks: spoken word,
preprocessing, feature extraction, acoustic models, language model, corpus.]
Model Generation: The model is generated using various approaches such as Hidden Markov
Model (HMM) (Huang et al., 1990), Artificial Neural Networks (ANN) (Wilinski et al., 1998),
Dynamic Bayesian Networks (DBN) (Deng, 2006), Support Vector Machine (SVM) (Guo and Li,
2003) and hybrid methods (i.e. a combination of two or more approaches). The Hidden Markov Model
has been used in some form or another in virtually every state-of-the-art speech and speaker
recognition system (Aggarwal and Dave, 2010).
Pattern Classifier: The pattern classifier component recognizes the test samples based on the
acoustic properties of the word. The classification problem can be stated as finding the most
probable sequence of words W given the acoustic input O (Jurafsky and Martin, 2009), which is
computed as:
$$P(W \mid O) = \frac{P(O \mid W)\, P(W)}{P(O)} \qquad (1)$$
Given an acoustic observation sequence O, the classifier finds the sequence W of words which
maximizes the probability P(O | W) P(W). The quantity P(W) is the prior probability of the word
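The decision rule of equation (1) can be sketched for the isolated-word case as below. Since P(O) is the same for every candidate word, it can be dropped from the maximization. The function name, the word labels and the log-likelihood values are hypothetical illustrations, not figures from the paper.

```python
import math


def classify(log_likelihoods, priors):
    """Return the word W that maximizes log P(O|W) + log P(W).

    log_likelihoods: word -> log P(O|W), e.g. as scored by each word model.
    priors:          word -> P(W).
    P(O) is constant over words, so it is omitted.
    """
    best_word, best_score = None, -math.inf
    for word, log_lik in log_likelihoods.items():
        score = log_lik + math.log(priors[word])
        if score > best_score:
            best_word, best_score = word, score
    return best_word


# Hypothetical acoustic scores for one utterance against three word models.
scores = {"ek": -120.5, "do": -118.2, "teen": -125.0}
uniform = {w: 1.0 / len(scores) for w in scores}
print(classify(scores, uniform))
```

With a uniform prior the choice reduces to picking the word model with the highest acoustic likelihood.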
A Hidden Markov Model (HMM) (Rabiner, 1989) is a doubly stochastic process in which one of the
processes is not directly observable. This hidden stochastic process can be observed only
through another set of stochastic processes that produce the observation sequence. HMMs are so
far the most widely used acoustic models, because they provide better performance than other
methods. HMMs are widely used for both training and recognition in speech systems.
HMMs are statistical frameworks based on a Markov chain with unknown parameters. A Hidden
Markov Model is a system consisting of nodes that represent hidden states. The nodes are
interconnected by links that describe the conditional transition probabilities between the
states. Each hidden state has an associated set of probabilities of emitting particular visible
states.
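To make the description concrete, the following sketch evaluates the probability of an observation sequence under a small discrete HMM using the forward algorithm (Rabiner, 1989). The two-state model and its probabilities are invented for illustration; the recognizer in this paper uses continuous-density models trained with HTK, not this toy model.

```python
def forward_probability(init, trans, emit, observations):
    """Forward algorithm: P(observations | model) for a discrete HMM.

    init[s]     : probability of starting in hidden state s
    trans[p][s] : probability of moving from state p to state s
    emit[s][o]  : probability that state s emits visible symbol o
    """
    states = list(init)
    # Initialization with the first observation.
    alpha = {s: init[s] * emit[s][observations[0]] for s in states}
    # Induction over the remaining observations.
    for obs in observations[1:]:
        alpha = {
            s: sum(alpha[p] * trans[p][s] for p in states) * emit[s][obs]
            for s in states
        }
    # Termination: sum over all final states.
    return sum(alpha.values())


# Toy model with two hidden states and two visible symbols (assumed values).
init = {"A": 0.6, "B": 0.4}
trans = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emit = {"A": {"x": 0.5, "y": 0.5}, "B": {"x": 0.1, "y": 0.9}}
print(forward_probability(init, trans, emit, ["x", "y"]))
```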
HTK is a toolkit for building Hidden Markov Models (HMMs). It is an open source set of modules
written in ANSI C which deal with speech recognition using the Hidden Markov Model. HTK
mainly runs on the Linux platform. To run it on Windows, the interfacing package Cygwin
(Cygwin, 2011) is used.
Hindi is mostly written in a script called Nagari or Devanagari, which is phonetic in nature.
Hindi sounds are broadly classified as vowels and consonants (Velthuis, 2011).
Vowels: In Hindi, there is a separate symbol for each vowel. There are 12 vowels in the Hindi
language. The consonants themselves have an implicit vowel (अ). To indicate a vowel sound
other than the implicit one (i.e. अ), a vowel-sign (Matra) is attached to the consonant. The
vowels and their Matras are listed below.
Vowel   अ   आ   इ   ई   उ   ऊ   ए   ऐ   ओ   औ   ऋ   ॠ
Matra   -   ◌ा   ◌ि   ◌ी   ◌ु   ◌ू   ◌े   ◌ै   ◌ो   ◌ौ   ◌ृ   ◌ॄ
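The vowel-to-Matra correspondence above can be expressed as a simple lookup on Unicode code points. This is an illustrative sketch only; the names MATRA and attach_vowel are hypothetical and are not part of the described system.

```python
# Independent vowel -> dependent vowel sign (Matra), as Unicode code points.
# The implicit vowel (U+0905) needs no sign.
MATRA = {
    "\u0905": "",        # a (implicit)
    "\u0906": "\u093E",  # aa
    "\u0907": "\u093F",  # i
    "\u0908": "\u0940",  # ii
    "\u0909": "\u0941",  # u
    "\u090A": "\u0942",  # uu
    "\u090F": "\u0947",  # e
    "\u0910": "\u0948",  # ai
    "\u0913": "\u094B",  # o
    "\u0914": "\u094C",  # au
    "\u090B": "\u0943",  # vocalic r
    "\u0960": "\u0944",  # vocalic rr
}


def attach_vowel(consonant, vowel):
    """Return the written syllable for a consonant followed by a vowel sound."""
    return consonant + MATRA[vowel]


ka = "\u0915"
print([attach_vowel(ka, v) for v in MATRA])
```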
Consonants: The consonant set in Hindi is divided into different categories according to the
place and manner of articulation. The consonants are divided into 5 Vargs (groups) and 9
non-Varg consonants. Each Varg contains 5 consonants, the last of which is a nasal. The first
four consonants of each Varg constitute the primary and secondary pairs. The primary consonants
are unvoiced, whereas the secondary consonants are voiced. The second consonant of each pair is
the aspirated counterpart (it has an additional "h" sound) of the first one. Thus the first four
consonants of each Varg are [unvoiced], [unvoiced, aspirated], [voiced], [voiced, aspirated]
respectively. The remaining 9 non-Varg consonants are divided into 5 semivowels, 3 sibilants and
1 aspirate (Rai, 2005). The complete Hindi consonant set with its phonetic properties is given
in table 3.
Table 3: Hindi consonant set
Category               Consonants
Gutturals (कवर्ग)      क ख ग घ ङ
Palatals (चवर्ग)       च छ ज झ ञ
Cerebrals (टवर्ग)      ट ठ ड ढ ण
Dentals (तवर्ग)        त थ द ध न
Labials (पवर्ग)        प फ ब भ म
Semivowels             य, र, ल, व
Sibilants              श, ष, स
Aspirate               ह
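Because every Varg follows the same ordering, the phonetic properties of a Varg consonant follow from its position in the row. A small sketch of that rule is given below; the identifiers are hypothetical and the English place names are those used in the table.

```python
# The five Vargs in table order, written as Unicode escapes.
VARGS = {
    "guttural": "\u0915\u0916\u0917\u0918\u0919",
    "palatal": "\u091A\u091B\u091C\u091D\u091E",
    "cerebral": "\u091F\u0920\u0921\u0922\u0923",
    "dental": "\u0924\u0925\u0926\u0927\u0928",
    "labial": "\u092A\u092B\u092C\u092D\u092E",
}

# Properties by position within a Varg, as described in the text.
POSITION_PROPERTIES = [
    ("unvoiced",),
    ("unvoiced", "aspirated"),
    ("voiced",),
    ("voiced", "aspirated"),
    ("nasal",),
]


def describe(consonant):
    """Return (place, properties) for a Varg consonant, or None otherwise."""
    for place, members in VARGS.items():
        index = members.find(consonant)
        if index != -1:
            return place, POSITION_PROPERTIES[index]
    return None


print(describe("\u0916"))  # second guttural consonant
print(describe("\u0939"))  # the aspirate is not a Varg consonant
```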
Other Characters: Apart from consonants and vowels, some other characters are used in the
Hindi language: anuswar (◌ं), visarga (◌ः), chandrabindu (◌ँ), >, ऽ, @, ौ. Anuswar indicates
a nasal consonant sound that depends upon the character following it: it represents the nasal
consonant of the Varg to which the following character belongs.
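The anuswar rule can be sketched as a lookup of the nasal (the fifth member) of the Varg of the following character. The function names are hypothetical, and characters outside the five Vargs are simply reported as unresolved (None), since the text does not describe that case.

```python
# The five Vargs; the last member of each is its nasal consonant.
VARGS = [
    "\u0915\u0916\u0917\u0918\u0919",
    "\u091A\u091B\u091C\u091D\u091E",
    "\u091F\u0920\u0921\u0922\u0923",
    "\u0924\u0925\u0926\u0927\u0928",
    "\u092A\u092B\u092C\u092D\u092E",
]
ANUSWAR = "\u0902"


def nasal_for(next_char):
    """Nasal consonant of the Varg containing next_char, or None."""
    for varg in VARGS:
        if next_char in varg:
            return varg[-1]
    return None


def anuswar_sounds(word):
    """For each anuswar in word, the nasal consonant it stands for."""
    sounds = []
    for i, ch in enumerate(word):
        if ch == ANUSWAR:
            following = word[i + 1] if i + 1 < len(word) else ""
            sounds.append(nasal_for(following) if following else None)
    return sounds


# "Ganga": the anuswar precedes a guttural, so it sounds as the guttural nasal.
print(anuswar_sounds("\u0917\u0902\u0917\u093E"))
```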
6. IMPLEMENTATION
In this section, the implementation of the speech system based upon the developed system
architecture is presented.
The Hindi speech recognition system is developed using the HTK toolkit on the Linux platform.
HTK v3.4 and Ubuntu 10.04 are used. Firstly, the HTK training tools are used to estimate the
parameters of a set of HMMs using training utterances and their associated transcriptions.
Secondly, unknown utterances are transcribed using the HTK recognition tools (Hidden Markov
Model Toolkit, 2011). The system is trained for 30 Hindi words. A word model is used to
recognize the speech.
Training and testing a speech recognition system needs a collection of utterances. The system
uses a data set of 30 words. The data is recorded using unidirectional microphones, with a
distance of approximately 5-10 cm between the mouth of the speaker and the microphone.
Recording is carried out in a room environment. Sounds are recorded at a sampling rate of
16000 Hz. The voices of eight people (5 male and 3 female) are used to train the system. Each
speaker is asked to utter each word four times, giving a total of 960 ((8*4)*30) speech files.
Speech files are stored in .wav format. The Velthuis transliteration (Velthuis, 2011),
developed in 1996 by Frans Velthuis, is used for transcription.
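The size of the training corpus follows directly from the recording plan, as the sketch below enumerates. The directory layout and file-naming scheme are assumptions made for illustration; the paper does not describe how the .wav files were named.

```python
from itertools import product


def corpus_files(num_speakers=8, repetitions=4, num_words=30):
    """List one .wav path per (speaker, repetition, word) combination.

    The 'spkNN/wordNN_repN.wav' naming is hypothetical.
    """
    return [
        f"spk{s:02d}/word{w:02d}_rep{r}.wav"
        for s, r, w in product(
            range(1, num_speakers + 1),
            range(1, repetitions + 1),
            range(1, num_words + 1),
        )
    ]


files = corpus_files()
print(len(files))  # (8 * 4) * 30 = 960
```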
During this step, the recorded data is parameterized into a sequence of features. For this
purpose, the HTK tool HCopy is used. The technique used for parameterization of the data is Mel
Frequency Cepstral Coefficients (MFCC). The input speech is sampled at 16 kHz and then
processed at a 10 ms frame rate with a Hamming window of 25 ms. The acoustic parameters are
39 MFCCs: 12 mel cepstral coefficients plus log energy, together with their first- and
second-order derivatives.
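The parameterization settings stated above map onto an HCopy configuration file roughly as sketched below. The configuration variable names (SOURCEFORMAT, TARGETKIND, TARGETRATE, WINDOWSIZE, USEHAMMING, NUMCEPS) are standard HTK settings, and HTK expresses times in units of 100 ns. Only values given in the text are included; the authors' actual configuration file is not shown in the paper, so this is an assumed reconstruction, and the Python helper names are hypothetical.

```python
def htk_time(milliseconds):
    """Convert milliseconds to HTK time units of 100 ns."""
    return milliseconds * 1e4


def build_hcopy_config(frame_shift_ms=10.0, window_ms=25.0, num_ceps=12):
    """Return the text of an HCopy configuration for 39-dimensional MFCCs."""
    settings = [
        ("SOURCEFORMAT", "WAV"),
        ("TARGETKIND", "MFCC_E_D_A"),  # cepstra + energy + deltas + accelerations
        ("TARGETRATE", f"{htk_time(frame_shift_ms):.1f}"),
        ("WINDOWSIZE", f"{htk_time(window_ms):.1f}"),
        ("USEHAMMING", "T"),
        ("NUMCEPS", str(num_ceps)),
    ]
    return "\n".join(f"{key} = {value}" for key, value in settings) + "\n"


def feature_dimension(num_ceps=12):
    """12 cepstra + log energy, each with first and second derivatives."""
    static = num_ceps + 1
    return static * 3


print(build_hcopy_config())
print("feature dimension:", feature_dimension())
```

Such a file would typically be passed to HCopy with its -C option, together with a script file (-S) listing source and target files; the file names used for that are up to the user.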
For training the HMMs, prototype HMM models are created, which are then re-estimated using the
data from the speech files. Apart from the models of the vocabulary words, a model for silence
(sil) must be included.
For the prototype models, the authors use 5-11 state HMMs in which the first and last states are
non-emitting. The prototype models are initialized using the HTK tool HInit, which initializes
an HMM based on one of the speech recordings. Then HRest is used to re-estimate the
parameters of the HMM based on the other speech recordings in the training set.
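A prototype definition of the kind described above can be generated with a short script. The sketch below writes a left-to-right prototype in HTK's text model format with non-emitting first and last states, zero means and unit variances. The self-loop and forward transition values (0.6 and 0.4) are assumed placeholder values, since the paper does not give the topology details, and the function names are hypothetical.

```python
def transition_matrix(num_states, stay=0.6):
    """Left-to-right transitions; the first and last states are non-emitting."""
    rows = [[0.0] * num_states for _ in range(num_states)]
    rows[0][1] = 1.0  # entry state always moves to the first emitting state
    for i in range(1, num_states - 1):
        rows[i][i] = stay            # self-loop (assumed value)
        rows[i][i + 1] = 1.0 - stay  # move to the next state
    return rows  # last row stays all zero: exit state


def make_prototype(name, num_states=5, vec_size=39, kind="MFCC_E_D_A"):
    """Return the text of an HTK prototype HMM definition."""
    zeros = " ".join(["0.0"] * vec_size)
    ones = " ".join(["1.0"] * vec_size)
    lines = [
        f"~o <VecSize> {vec_size} <{kind}>",
        f'~h "{name}"',
        "<BeginHMM>",
        f"<NumStates> {num_states}",
    ]
    for state in range(2, num_states):  # emitting states only
        lines += [
            f"<State> {state}",
            f"<Mean> {vec_size}",
            zeros,
            f"<Variance> {vec_size}",
            ones,
        ]
    lines.append(f"<TransP> {num_states}")
    for row in transition_matrix(num_states):
        lines.append(" ".join(f"{p:.1f}" for p in row))
    lines.append("<EndHMM>")
    return "\n".join(lines) + "\n"


print(make_prototype("proto", num_states=5))
```

One such prototype per vocabulary word (and one for sil) would then serve as the starting point for HInit and HRest.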
During evaluation, the system is responsible for generating the transcription of an unknown
utterance, using the model generated during the training phase. In order to evaluate the system
performance, speakers are asked to utter each word at least once. Five speakers are used for
testing. The recognition results are shown in table 4. The overall word accuracy and word error
rate of the system are 94.63% and 5.37% respectively.
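For isolated-word recognition, where each test utterance yields exactly one recognized word, the two reported figures are complementary. The sketch below shows the calculation on made-up labels; it is not the evaluation script used by the authors, and the function names and example words are hypothetical.

```python
def word_accuracy(references, hypotheses):
    """Percentage of utterances whose recognized word matches the reference."""
    if not references or len(references) != len(hypotheses):
        raise ValueError("need equally long, non-empty label lists")
    correct = sum(ref == hyp for ref, hyp in zip(references, hypotheses))
    return 100.0 * correct / len(references)


def word_error_rate(references, hypotheses):
    """With one word per utterance, the error rate is 100% minus accuracy."""
    return 100.0 - word_accuracy(references, hypotheses)


# Hypothetical labels for four test utterances.
refs = ["ek", "do", "teen", "chaar"]
hyps = ["ek", "do", "paanch", "chaar"]
print(word_accuracy(refs, hyps), word_error_rate(refs, hyps))
```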
7. CONCLUSION
In this paper, a speech recognition system for the Hindi language has been developed. The
presented system recognizes isolated words using acoustic word models. The system has been
trained using 30 Hindi words, with training data collected from eight different speakers. The
system has been tested in a room environment. The implementation has been done using the
Hidden Markov Model Toolkit (HTK). The experiments show that the accuracy and word error rate
of the proposed system are 94.63% and 5.37% respectively. Future work involves developing the
system for a larger vocabulary and improving its accuracy.
REFERENCES
Rabiner, L R (1989) A Tutorial on Hidden Markov Models and Selected Applications in Speech
Recognition, Proceedings of the IEEE, Vol.77, No.2, pp. 257-286.
Huang, X D, Ariki, Y and Jack, M A (1990) Hidden Markov Models for Speech Recognition.
Edinburgh University Press.
Wilinski, P, Solaiman, B, Hillion, A and Czamecki, W (1998) Towards the Border between Neural
and Markovian Paradigms. IEEE Transactions on Systems, Man and Cybernetics. Vol.
28, No. 2, pp. 146-159.
Indian Script Code for Information Interchange – ISCII (1999) Bureau of Indian Standards. New
Delhi. India.
Pruthi, T, Saksena, S and Das, P K (2000) Swaranjali: Isolated Word Recognition for Hindi
Language using VQ and HMM. International Conference on Multimedia Processing and
Systems (ICMPS), IIT Madras.
Guo, G and Li, S Z (2003) Content Based Audio Classification and Retrieval by SVMs. IEEE
Trans. Neural Networks, 14, (January 2003), pp. 209-215.
Rai, N (2005) Isolated word speaker Independent Speech recognition for Indian Languages,
Department of Computer Science and Engineering, Indian Institute of Technology,
Kanpur.
Deng, Li (2006) Dynamic Speech Models: Theory, Applications, and Algorithms. Morgan and
Claypool.
Gupta, R (2006) Speech Recognition for Hindi, M. Tech. Project Report, Department of
Computer Science and Engineering, Indian Institute of Technology, Bombay, Mumbai.
Becchetti, C and Ricotti, L P (2008) Speech Recognition Theory and C++ Implementation, John
Wiley & Sons.
Deivapalan, P G and Murthy, H A (2008) A syllable-based isolated word recognizer for Tamil
handling OOV words, The National Conference on Communications, pp. 267-271.
Syama, R (2008) Speech Recognition System for Malayalam. Department of Computer Science,
Cochin University of Science & Technology, Cochin.
Jurafsky, D and Martin, J H (2009) Speech and Language Processing, Pearson Education, New
Delhi, India.
Aggarwal, R K and Dave, M (2010) Fitness Evaluation of Gaussian Mixtures in Hindi Speech
Recognition System, First International Conference on Integrated Intelligent Computing,
SJB Institute of Technology, Bangalore.
Al-Qatab, B A Q and Ainon, R N (2010) Arabic Speech Recognition Using Hidden Markov
Model Toolkit (HTK), International Symposium in Information Technology (ITSim). June
15-17, Kuala Lumpur.
Jain, A, Aggarwal, R, Garg, A and Kumar, K (2010) Speech Recognition System using MFCC,
Proceedings of All India Conference on Recent Emergence and Scope of Electronics
Architecture, Haryana, India.
Hidden Markov Model Toolkit (HTK), Retrieved Jan 10, 2011, from http://htk.eng.cam.ac.uk.