Sameti et al. EURASIP Journal on Audio, Speech, and Music Processing 2011, 2011:6
http://asmp.eurasipjournals.com/content/2011/1/6

RESEARCH Open Access

A large vocabulary continuous speech recognition system for Persian language

Hossein Sameti*, Hadi Veisi, Mohammad Bahrani, Bagher Babaali and Khosro Hosseinzadeh

Abstract
The first large vocabulary speech recognition system for the Persian language is introduced in this paper. This continuous speech recognition system uses most standard and state-of-the-art speech and language modeling techniques. The development of the system, called Nevisa, started in 2003 with a dominant academic theme. The engine incorporates customized established components of traditional continuous speech recognizers, and its parameters have been optimized for real applications of the Persian language. For this purpose, we had to identify the computational challenges of the Persian language, especially for text processing, and extract statistical and grammatical language models for the Persian language. To achieve this, we had to either generate the necessary speech and text corpora or modify the primitive corpora available for the Persian language. In the proposed system, acoustic modeling is based on hidden Markov models, and optimized decoding, pruning and language modeling techniques were used in the system. Both statistical and grammatical language models were incorporated in the system. An MFCC representation with some modifications was used as the speech signal feature. In addition, a VAD was designed and implemented based on signal energy and zero-crossing rate. Nevisa is equipped with out-of-vocabulary capability for applications with medium or small vocabulary sizes. Powerful robustness techniques were also utilized in the system. Model-based approaches like PMC, MLLR and MAP, along with feature robustness methods such as CMS, PCA, RCC and VTLN, and speech enhancement methods like spectral subtraction and Wiener filtering, together with their modified versions, were diligently implemented and evaluated in the system. A new robustness method called PC-PMC was also proposed and incorporated in the system. To evaluate the performance and optimize the parameters of the system in noisy-environment tasks, four real noisy speech data sets were generated. The final performance of Nevisa in noisy environments is similar to that in clean conditions, thanks to the various robustness methods implemented in the system. The overall recognition performance of the system in clean and noisy conditions assures us that the system is a real-world product as well as a competitive ASR engine.

* Correspondence: [email protected]
Department of Computer Engineering, Sharif University of Technology, Tehran, Iran

© 2011 Sameti et al; licensee Springer. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1 Introduction
Since the start of developing speech recognizers at AT&T Bell Labs in the 1950s, enormous efforts and investments have been directed towards automatic speech recognition (ASR) research and development. In the 1960s, ASR research was focused on phonemes and isolated word recognition. Later, in the 70s and 80s, connected words and continuous speech recognition were the major trends of ASR research. To accomplish these targets, researchers introduced linear predictive coding (LPC) and used pattern recognition and clustering methods. Hidden Markov models (HMM), cepstral analysis and neural networks were employed in the 80s. In the next decade, robust continuous speech recognition and spoken language understanding were popular topics. In the last decade, researchers and investors introduced spoken dialogue systems and tried to implement conversational speech recognition systems capable of recognizing and understanding spontaneous speech. Machine learning techniques and artificial intelligence (AI) concepts entered the ASR research literature and contributed considerably to fulfilling human speech recognition needs. Until recent years, speech recognition systems were considered luxury tools or services and were not usually taken seriously by users. In the past 5-10 years, we have seen ASR engines play genuinely beneficial roles in several areas, especially in telecommunication

services and important enterprise applications such as customer relationship management (CRM) frameworks. Several successful ASR systems with good performance are found in the literature [1-3]. The most successful approaches to ASR are those based on pattern recognition using statistical and AI techniques [1,3,4]. The front end of a speech recognizer is a feature extraction block. The most common features used for ASR are Mel-frequency cepstral coefficients (MFCC) [4]. Once the features are extracted, modeling is usually performed based on artificial neural networks (ANN) or HMMs. Linguistic information is also used extensively in an ASR system. Statistical (n-gram) and grammatical (i.e., structural) language models [4,5] are used for this purpose.

One essential problem with putting speech recognition systems into practice is the variety of languages people around the world speak. ASR systems are highly dependent on the language spoken. We can categorize the research areas of speech recognition into two major classes: first, acoustic and signal processing, which is very much the same for ASR in every language; second, natural language processing (NLP), which is dependent on the language. Obviously, this language dependency hinders the implementation and utilization of ASR systems for any new language.

We have focused our research on Persian speech recognition in recent years. Persian ASR systems have been addressed and developed to different extents [6-10]. There are other works on the development of Persian continuous speech recognition systems [11-14]; however, most of them present a medium vocabulary continuous speech recognition system with a high word error rate. Our large vocabulary continuous speech recognition system for Persian, called Nevisa, was first introduced in [6,7] as the Sharif speech recognition system. It employs cepstral coefficients as the acoustic features and continuous density hidden Markov models (CDHMM) as the acoustic model [4,15]. A time-synchronous left-to-right Viterbi beam search, in combination with a tree-organized pronunciation lexicon, is used for decoding [16,17]. To limit the search space, two pruning techniques are employed in the decoding process. Due to our practical approach in using this system, Nevisa is equipped with established robustness techniques for handling speaker variation and environmental noise. Various data compensation and model compensation methods are used to achieve this objective. Also, class-based n-gram language models (LM) [18,19] and a generalized phrase structure grammar (GPSG)-based Persian grammar [20] are utilized as word-level and sentence-level linguistic information. The frameworks for testing and comparing the effects of the implemented methods, and for optimizing the parameters, were gradually built up. This enabled us to move towards a practical ASR system capable of being utilized as Persian dictation software, also called Nevisa [10].

In the remainder of this paper, Sect. 2 reviews the characteristics of the Persian language and the speech and text corpora of the Persian language. An overview of the Nevisa Persian speech recognition system and its overall features is given in Sect. 3; this section provides a review of acoustic modeling, the robustness techniques used in the system, and the building of statistical and grammatical language models for Persian. In Sect. 4 the details of the experiments and the recognition results are given. Finally, Sect. 5 gives a brief summary and conclusion of the paper.

2 Persian language and corpora
2.1 Persian language
The Persian language, also known as Farsi, is an Iranian language within the Indo-Iranian branch of the Indo-European languages. It is natively spoken by about seventy million people in Iran, Afghanistan and Tajikistan as the official language. It is also widely spoken in Uzbekistan and, to some extent, in Iraq and Bahrain. The language has remained remarkably stable since the eighth century, although local environments, such as the Arabic language, have influenced it. Arabic has heavily influenced Persian but has not changed its structure; Persian has only borrowed a large number of lexical words from Arabic. Therefore, in spite of this influence, Arabic has not affected the syntactic and morphological forms of Persian; as a result, the language models of Persian and Arabic are fundamentally different. Although there are several similar phonemes in Arabic and Persian, and they use similar scripts, the phonetic structures of these languages have principal differences; therefore, the acoustic models of Persian and Arabic are not the same. Consequently, the development of speech recognition systems in Arabic and Persian differs due to distinctions in their acoustic and language models.

The grammar of the Persian language is similar to that of many contemporary European languages. Normal declarative sentences in Persian are structured as "(S) (O) V". This means sentences can comprise optional subjects and objects, followed by a required verb. If the object is specific, it is followed by the word /r∂/. Despite this normal structure, there is large potential in the language for free word order, especially in preposition adjunction and complements. For example, adverbs can be placed at the beginning, at the end or in the middle of sentences, often without changing the meaning of the sentences. This flexibility in word ordering makes the task of Persian grammar extraction a difficult one. The written style of Persian is right to left and it uses Arabic script. In Arabic script, short vowels (/a/, /e/, /o/) are not

usually written. This results in ambiguities in the pronunciation of Persian words. Persian has 6 vowels and 23 consonants. Three vowels of the language are considered long (/i/, /u/, /∂/) and the other three are short vowels or diacritics (/e/, /o/, /a/). Although usually named long and short vowels, the three long vowels are currently distinguished from their short counterparts by position of articulation rather than by length. The phonemes of Persian are shown in Table 1, where the characters, codes and IPA notations are given.

Persian uses the same alphabet as Arabic with four additional letters. Therefore, the number of letters in the Persian alphabet is 32 as compared to 28 in Arabic. Each additional Persian letter represents a phoneme not present in the Arabic phoneme set, namely /p/, /t∫/, /ℑ/ and /g/. In addition, Persian has four other phonemes (/v/, /k/, /?/, /G/) which are pronounced differently from their Arabic counterparts. On the other hand, Arabic has its own unique phonemes (about ten) not defined in the Persian language. Persian makes extensive use of word building, combining affixes, stems, nouns and adjectives. Persian frequently uses derivational agglutination to form new words from nouns, adjectives and verbal stems. New words are extensively formed by compounding two existing words, as is common in German. Suffixes predominate in Persian morphology, though there are a small number of prefixes. Verbs can express tense and aspect, and they agree with the subject in person and number. There is no gender in Persian, nor are pronouns marked for natural gender.

2.2 Corpora
2.2.1 Speech corpus
Small Farsdat In this paper, two speech databases, small Farsdat [21] and large Farsdat [22], are used. Small Farsdat is a database hand-segmented at the phoneme level which contains 6080 Persian sentences read by 304 speakers. Each speaker has uttered 18 randomly chosen sentences (from a set of 405 sentences) plus two sentences which are common to all speakers. The sentences are formed using over 1,000 Persian words and are designed artificially to cover the acoustic variations of the Persian language. The speakers are chosen from ten different dialect regions in Iran, and the corpus contains the ten most common dialects of the Persian language. The male to female population ratio is 2:1. The database is recorded in a low-noise environment featuring an average signal-to-noise ratio of 31 dB, with a sampling rate of 22,050 Hz. A clean test set, called the small Farsdat test set (sFarsdat test), is selected from this database; it contains 140 sentences from seven speakers. All the other sentences are used as the train set (sFarsdat train). Small Farsdat, as its name indicates, is a small speech corpus and can be used only for training and evaluating limited speech recognition systems in laboratories. This speech corpus is comparable with the TIMIT corpus in English. Large Farsdat is another Persian speech database that removes some of the deficiencies of small Farsdat.

Large Farsdat Large Farsdat [22] includes about 140 h of speech signals, all segmented and labeled at the word level. This corpus is uttered by 100 speakers from the most common dialects of the Persian language. Each speaker utters 20-25 pages of text on various subjects. In contrast with small Farsdat, which is recorded in a quiet and reverberation-free room, large Farsdat is recorded in an office environment. Four microphones, a unidirectional desktop microphone, two lapel microphones and a headset microphone, are used to record the speech signals. All the speech signals in this corpus are recorded using two microphones simultaneously: the desktop microphone is used in all of the recording sessions, and each of the other three microphones is used in about one-third of the sessions. In total, the desktop microphone is used for about 70 h of recorded speech and the other three microphones are used for the 70 remaining hours. The average SNR of the desktop microphone is about 28 dB. The sampling rate is 16 kHz for the whole corpus.

The test set contains 750 sentences from seven speakers (four male and three female) and is recorded using the desktop microphone of the large Farsdat database. We call this set gFarsdat test. The average sentence length of this test set is 7.5 s. This set includes numbers, names and some grammar-free sentences, and contains about 5000 different words. All other speech signals in large Farsdat recorded with the desktop microphone are used here as the train set, i.e. gFarsdat train. In this research, only those speech files of large Farsdat that are recorded using the desktop microphone are used in the evaluations.

Farsi noisy speech corpus To evaluate the performance of Nevisa in real applications and in noisy environments, the Farsi Noisy Speech (FANOS) database is recorded and transcribed [23,24]. This database consists of four paired sets providing four tasks. As adaptation techniques are used in our robustness methods, each task in this database includes two subsets, identified as the adaptation subset and the test subset. Each adaptation subset is arranged as follows: 175 sentences (selected from Farsdat sentences) are uttered by seven speakers consisting of five male and two female speakers; each speaker reads 10 identical sentences (read by all speakers) plus 15 randomly selected sentences. In addition, each test subset consists of 140 sentences uttered by five male and two female speakers, each speaker reading 20 sentences. The average length of the sentences is 3.5 s. The transcriptions are at the word level for test data and at the phoneme level for adaptation data. Each task demonstrates a new environment which

Table 1 Phonemes of the Persian language

IPA   Char   Code   Phonetic description
i     i      105    high front unrounded
e     e      101    mid front unrounded
a     a      97     low front unrounded
u     u      117    high back rounded
o     o      111    mid back rounded
      /      47     low back rounded
      \      92     unvoiced bilabial plosive closure
p     p      112    unvoiced bilabial plosive
      `      96     voiced bilabial plosive closure
b     b      98     voiced bilabial plosive
      -      45     unvoiced alveolar plosive closure
t     t      116    unvoiced dental plosive
      =      61     voiced dental plosive closure
d     d      100    voiced dental plosive
      @      64     unvoiced palatal plosive closure
c     c      99     unvoiced palatal plosive
      *      42     unvoiced velar plosive closure
k     k      107    unvoiced velar plosive
      !      33     voiced palatal plosive closure
      ;      59     voiced palatal plosive
      &      38     voiced velar plosive closure
g     g      103    voiced velar plosive
      ^      94     voiced uvular plosive closure
G     q      113    voiced uvular plosive
      (      40     glottal stop closure
      ]      93     glottal stop
      $      36     unvoiced alveopalatal affricate closure
      '      39     unvoiced alveopalatal affricate
      #      35     voiced alveopalatal affricate closure
      ,      44     voiced alveopalatal affricate
f     f      102    unvoiced labiodental fricative
v     v      118    voiced labiodental fricative
s     s      115    unvoiced alveolar fricative
Z     z      122    voiced alveolar fricative
      .      46     unvoiced alveopalatal fricative
      [      91     voiced alveopalatal fricative
      x      120    unvoiced uvular fricative
h     h      104    unvoiced glottal fricative
l     l      108    lateral alveolar
r     r      114    trill alveolar
m     m      109    nasal bilabial
n     n      110    nasal alveolar
j     y      121    approximant palatal
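A side note on Table 1: the Code column is simply the ASCII code of the single-character transcription symbol in the Char column. A minimal sketch of that correspondence (the dictionary below is a hypothetical excerpt, not the full Nevisa transcription alphabet):

```python
# Each phoneme is transcribed with one ASCII character; the "Code"
# column of Table 1 is that character's ASCII value.
# Hypothetical excerpt of the table:
PHONES = {
    "i": "high front unrounded",
    "p": "unvoiced bilabial plosive",
    "q": "voiced uvular plosive",   # IPA G in Table 1
    "y": "approximant palatal",     # IPA j in Table 1
}

# The codes printed in Table 1 follow directly from ord():
codes = {ch: ord(ch) for ch in PHONES}
```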

differs from the training environment. Tasks A and B are recorded in an office environment with condenser and dynamic microphones, respectively, with average SNR levels of 18 and 26 dB. Tasks C and D are both recorded with the condenser microphone in an office environment, in the presence of exhibition and car noise respectively; the corresponding SNR levels of these sets are 9 and 7 dB. Table 2 summarizes the properties of the tasks in the FANOS database.

2.2.2 Text corpus
In this research, we have used the two editions of the Persian text corpus called "Peykare" [25,26]. The first edition of this corpus consists of about ten million words; it was increased to about 100 million words in the second

Table 2 The specifications of the tasks in the FANOS database

                                    Task A          Task B          Task C          Task D
Environment                         Office          Office          Exhibition      Car noise
Microphone                          Condenser       Dynamic         Condenser       Condenser
SNR (dB)                            18              26              9               7
Number of files (adapt + test)      315 (175+140)   315 (175+140)   315 (175+140)   315 (175+140)
Number of speakers (male + female)  7 (5+2)         7 (5+2)         7 (5+2)         7 (5+2)

edition [26]. All words in the first edition are annotated with part-of-speech (POS) tags. The texts of this corpus are gathered from various data sources such as newspapers, magazines, journals, books, letters, hand-written texts, movie scripts, news, etc. This corpus is a complete set of Persian contemporary texts. The texts cover different subjects including politics, arts, culture, economics, sports, stories, etc. The tag set of the Persian Text Corpus has 882 POS tags [18,19], which are reduced to 166 POS tags in this work.

3 Nevisa speech recognition system
3.1 Overview
Nevisa is a Persian continuous speech recognition (CSR) system that integrates state-of-the-art techniques of the field. The architecture of this system, including feature extraction, training and decoding (i.e. recognition) blocks, is shown in Figure 1. As this figure shows, each block represents a module that can be easily modified or replaced. The modularity of the system makes it very flexible for developing CSR systems for various applications and for trying out new ideas in different modules in research work. The modules shown with dotted blocks are robustness modules and can be used optionally. The MFCC module is used as the core of the feature extraction unit and is supplied with the vocal tract length normalization (VTLN) [27-29], cepstral mean subtraction (CMS) [3,23] and principal component analysis (PCA) [30] robustness methods. In addition, a voice activity detector (VAD) is used to separate speech segments from non-speech ones. Nevisa uses an energy and zero-crossing based VAD in the pre-processing of the speech signal. The VAD is a useful block in ASR systems, especially in real applications: it specifies the beginning and the end of an utterance and reduces the processing cost of the feature extraction and decoding blocks. The modified VAD is

Figure 1 The architecture of Nevisa.



also used in the spectral subtraction (SS) [3] and PC-PMC [23,31,32] robustness methods to detect noise segments in the speech signal. In addition to speech enhancement and feature robustness techniques, the MLLR [33], MAP [34] and PC-PMC model adaptation methods can optionally be applied to the acoustic models to adapt the acoustic model parameters to speaker variations and environmental noise.

The system uses context-dependent (CD) and context-independent (CI) acoustic models that are represented by continuous density hidden Markov models. These models are mixtures of Gaussian distributions in the cepstral domain. In this system, forward, skip and loop transitions between the states are allowed, and the covariance matrices are assumed diagonal [6,9,10]. The parameters of the emission probabilities are trained using the maximum likelihood criterion, and the training procedure is initialized by a linear segmentation. Each iteration of the training procedure consists of time alignment by dynamic programming (the Viterbi algorithm) followed by parameter estimation, resulting in a segmental k-means training procedure [3,4]. In the decoding phase, a Viterbi-based search with beam and histogram pruning techniques is used. In this module, the recognized acoustic units are used to make active hypotheses via the word decoder. The word decoder searches the lexicon tree simultaneously, in interaction with the acoustic decoder and the pruning modules. The final active hypotheses are rescored using language models. Both statistical and grammatical language models can be used either in the word decoder or in the rescoring module. In Nevisa, by default, the statistical LM is used in the word decoder, i.e., during the search, and the grammatical model is optionally used in the n-best rescoring module. Dotted arrows in Figure 1 mean that the statistical LM can be used in the rescorer module, and the grammatical LM can optionally be utilized during the search.

3.2 Acoustic modeling
For acoustic modeling we employ two approaches: context-independent (CI) and context-dependent (CD) modeling. The standard phoneme set of the Persian language contains 29 phonemes. This phoneme set, plus extra HMM models for silence, noise and aspiration, is considered in the CI modeling. In Sect. 4, where recognition results are given, the details of the modeling process, including the number of states and Gaussian mixtures, are presented.

For context-dependent modeling, we use triphones as the phone units. The major problem in triphone modeling is the trade-off between the number of triphones and the size of the available training data. There are a large number of triphones in a language, but many of them are unseen or rarely used in speech corpora, so the amount of training data is insufficient for many triphones. To solve this problem, state tying methods are used [35,36]. Two prevalent methods for state tying are data-driven clustering [35] and decision tree-based state tying [36,37]. In these methods, in the first stage, all triphones that occur in a speech corpus are trained using the available data. Then the states of similar triphones (triphones that share the same middle phoneme) are clustered into a small number of classes. In the last stage, the states that lie in each cluster are tied together; the tied states are called senones [38].

Different numbers of senones and different numbers of Gaussian distributions were evaluated in the Nevisa system. The experimental results showed that clustering triphone states into 500 senones for small Farsdat and 4,000 senones for large Farsdat leads to the best WER. The evaluation results are given in Sect. 4.

3.2.1 Robustness methods
Like that of all speech recognizers, the performance of Nevisa degrades in real applications and in the presence of noise [23,31,39,40]. In order to make the system robust to speaker and environment variations, many recent advanced robustness methods are incorporated. Differences between speakers, in background noise characteristics and in channel noise (i.e. microphones) are considered and dealt with. Nevisa uses data compensation and model compensation approaches as well as their combinations. In the data compensation approach, clean data are estimated from their noisy samples so as to make them similar to the training data. Nevisa uses spectral subtraction (SS) and Wiener filtering [23], cepstral mean subtraction (CMS) [3,23], principal component analysis (PCA) [30] and vocal tract length normalization (VTLN) [27-29,41] for this purpose. In the model-based approach, the models of the various sounds used by the classifier are modified to become similar to the test data models. Maximum likelihood linear regression (MLLR) [33,42], maximum a posteriori (MAP) adaptation [24,34], parallel model combination (PMC) [23,31,33] and a novel enhanced version of PMC, PCA- and CMS-based PMC (PC-PMC) [30], are incorporated in the system. The PC-PMC algorithm takes advantage of the additive noise compensation ability of PMC and the convolutional noise removal capability of both the PCA and CMS methods. The first problem to be solved in combining these methods is that the PMC algorithm requires invertible modules in the front end of the system, while CMS normalization is not an invertible process. In addition, a framework has to be designed for the adaptation of the PCA transform matrix in the presence of noise. The PC-PMC method provides solutions to these problems [30].

The integration of these robustness modules in Nevisa is shown in Figure 1. The modularity of the system makes it very flexible to remove any one of the system

blocks, add new blocks, and change or replace the existing ones.

3.3 Language modeling
Linguistic knowledge is as important as acoustic knowledge in recognizing natural speech. Language models depict the constraints on word sequences imposed by the syntax, semantics or pragmatics of the language [5]. In recognizing continuous speech, the acoustic signal is too weak to narrow down the number of word candidates. Hence, speech recognizers employ a language model that prunes out acoustic alternatives by taking the previously recognized words into account. In most applications of speech recognition, it is crucial to exploit vast information about the order of the words. For this purpose, statistical and grammatical language modeling methods are common approaches utilized in spoken human-computer interaction. These methods are used by Nevisa to improve its accuracy.

3.3.1 Statistical language modeling
In statistical approaches, we take a probabilistic viewpoint of language modeling and estimate the probability P(W) for a given word sequence W = w1 w2 ... wn. The simplest and most successful statistical language models are the Markov chain (n-gram) source models, first explored by Shannon [43]. To build statistical language models, we have used both the first edition [25] and the second edition [26] of the Peykare corpus. As mentioned in Sect. 2.2.2, the first edition of this corpus contains about ten million words that are annotated with POS tags. Using this corpus, we constructed different types of n-gram language models. Since the size of this edition of the corpus was not enough for making a reliable word-based n-gram language model, we built POS-based and class-based n-gram language models in addition to the word-based n-gram model. These language models were used in the intermediate version of Nevisa. The final language model of Nevisa has been constructed from the second edition of the Peykare corpus.

In building the language models using the Peykare corpus, we faced two problems. The first problem was orthographic inconsistency in the texts of the corpus. This problem arises from the fact that the Persian writing system allows certain morphemes to appear either as bound to the host or as free affixes. Free affixes can be separated by a final form character or by an intervening space. As examples, three possible cases for the plural suffix "h/" and the imperfective prefix "mi" are illustrated in Table 3. In these examples, the tilde (~) is used to indicate the final form marker, which is represented as the control character \u200C in Unicode, also known as the zero-width non-joiner. All the different surface forms of Table 3 are found in the Persian text corpus.

Table 3 Examples of different writing styles for the plural suffix "h/" and the imperfective prefix "mi"

Word             Attached    Intervening space   Final form
Books            ket/bh/     ket/b h/            ket/b~h/
They are going   miravand    mi ravand           mi~ravand

Another issue arises from the use of Arabic script in Persian writing, which gives some words different orthographic realizations. For example, three possible forms of the words "mas]uliyat" (responsibility) and "majmu]eye" (the set of) are shown in Table 4.

Table 4 Examples of different orthographic realizations of the words "mas]uliyat" and "majmu]eye"

Word             form 1   form 2   form 3
Responsibility
The set of

A further issue is the inconsistency of text encoding in Persian electronic texts. This problem arises from the use of different code pages by online publishers and people. As a result, some letters such as 'ye' and 'ke' have various encodings. For example, the letter 'ye' has three different encodings in Unicode, i.e., U+0649 and U+064A (Arabic letters 'ye') and U+06CC (Persian letter 'ye').

To solve these problems, we must replace the different orthographic forms of a word by a unique form. The main corrections applied to the corpus texts are as follows:

• All affixes that are attached to the host word or separated by an intervening space are replaced with affixes separated by the final form character (the zero-width non-joiner). For example, the words "ket/b h/" (the books) and "miravand" (they are going) in the examples above are replaced by "ket/b~h/" and "mi~ravand".
• Different orthographic realizations of a single word are replaced with their standard form according to the standards of the APLL (Academy of the Persian Language and Literature) [44]. For example, all different forms of the words "mas]uliyat" and "majmu]eye" in the above example are replaced with their standard forms (form 1 in Table 4).
• Different encodings of a specific character are changed to a unique form. For example, all letters 'ye' that are encoded as U+0649 and U+064A are changed to the letter 'ye' encoded as U+06CC.
• All diacritics (bound graphemes) appearing in the texts are removed. For example, the consonant gemination marker in the word "fann/vari" (technology) is removed, resulting in the word "fan/vari" [19].
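The corrections described above can be sketched as a small normalization routine. The character map and the suffix rule below are illustrative assumptions that cover only the examples given in the text (the 'ye' folding, diacritic removal, and the plural suffix "h/"), not the full APLL normalization used by Nevisa:

```python
import re

ZWNJ = "\u200c"  # zero-width non-joiner, the "final form" marker

# Fold variant encodings onto a single code point (illustrative subset):
CHAR_MAP = str.maketrans({
    "\u0649": "\u06cc",  # Arabic alef maksura -> Persian ye (U+06CC)
    "\u064a": "\u06cc",  # Arabic ye (U+064A)  -> Persian ye (U+06CC)
})

# Combining diacritics (bound graphemes), e.g. the gemination
# marker shadda (U+0651), are removed outright.
DIACRITICS = re.compile("[\u064b-\u0652\u0670]")

def normalize(text: str) -> str:
    text = text.translate(CHAR_MAP)
    text = DIACRITICS.sub("", text)
    # Replace the space before a free plural suffix (heh + alef) with
    # a ZWNJ, producing the standard "final form" spelling.
    text = re.sub(r" (?=\u0647\u0627\b)", ZWNJ, text)
    return text
```

For example, `normalize` rewrites the intervening-space spelling of "ket/b h/" to the ZWNJ-joined form "ket/b~h/" described in the first bullet above.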

The multiplicity of the POS tags in the corpus was the next problem to be solved. As mentioned earlier, the tag set includes 882 POS tags. While many of them contain detailed information about the words, they are rarely used in the corpus. This results in many different tags for verbs, adjectives, nouns, etc. As a solution, we decreased the number of POS tags by clustering them manually according to their syntactic similarity. In addition, rare and syntactically insignificant POS tags were mapped to an IGNORE tag, and a NULL tag was defined to mark the beginning of a sentence. These modifications reduced the size of the tag set to 166. Finally, the following statistics were extracted from the corpus to build the LMs [18,19]: unigram statistics of words (the 20,000 most frequent words in the corpus were chosen as the vocabulary set); bigram statistics of words; trigram statistics of words; unigram statistics of POS tags (for the 166 tags); bigram statistics of POS tags; trigram statistics of POS tags; and the frequency of assigning each POS tag to each word in the corpus (lexical generation statistics). After extracting the word-based n-gram statistics, the back-off trigram language model was built using the Katz smoothing method [45].

In addition to the word-based and POS-based bigram and trigram models, class-based language models can optionally be used [46]. Class-based language modeling can tackle the sparseness of data in the corpus. In this approach, words are grouped into classes and each word is assigned to one or more classes. To determine the word classes, one can use automatic word clustering methods such as Brown's and Martin's algorithms [46,47]. In these clustering methods, information-theoretic criteria, such as average mutual information, are used to form the classes. In Nevisa, the basic idea of Martin's algorithm [47] is used for word clustering: the words are clustered initially and are then moved between classes iteratively in the direction of perplexity improvement. Although POS-based and class-based n-grams reduce the sparseness of the extracted bigram and trigram models, in many cases the probabilities remain zero or close to zero. To overcome this problem, various smoothing methods [48], such as add-one, Katz [45] and Witten-Bell [49] smoothing, were evaluated on the POS-based and class-based n-gram probabilities.

The various LMs mentioned above are incorporated in Nevisa in the word decoding phase (Figure 1). Language model scores and acoustic model scores are combined during the search in a semi-coupled manner [50]: when the search process recognizes a new word while expanding the different hypotheses, the new hypothesis score is computed as the product of three terms: the n-gram score of the new word, the acoustic model score of the new word and the current hypothesis score. If Sn is the current hypothesis score after recognizing the word wn, and wn+1 is the next recognized word after expanding the hypothesis, then the new hypothesis score in the logarithm domain is given by Eq. 1, where SAM(wn+1) is the acoustic model score for the word wn+1 and SLM(wn+1) is its language model score. Since the scales of SAM(wn+1) and SLM(wn+1) are different, a weight parameter αLM is usually applied as the language model weight:

log S_{n+1} = log S_n + log S_{AM}(w_{n+1}) + α_{LM} · log S_{LM}(w_{n+1})   (1)

The scores of the POS-based bigram and trigram language models are computed as Eq. 2 and Eq. 3, respectively, in which T_n and T_{n-1} are the most probable POS tags for the words w_n and w_{n-1}:

S^{pos}_{bi}(w_{n+1}) = max_i [ P(T_i | T_n) · P(w_{n+1} | T_i) ]   (2)

S^{pos}_{tri}(w_{n+1}) = max_i [ P(T_i | T_{n-1} T_n) · P(w_{n+1} | T_i) ]   (3)

The language model scores for the class-based bigram and trigram models can be computed analogously [19]. As shown by the dotted line in Figure 1, the statistical LM can also be applied at the end of the search by the n-best re-scorer.

3.3.2 Grammatical language models
A grammar is a formal specification of the permissible structures of a language; it is used as another important linguistic knowledge source besides the statistical language models in speech recognition systems. In Nevisa, as in most developed speech recognition systems, the output is a set of n-best hypotheses that are ordered based on their acoustic and language model scores. The output sentences do not necessarily have a correct syntactic structure. To produce high-scoring syntactic outputs, a grammatical model of the language and a syntactic parser are necessary. The grammatical model includes a set of rules and syntactic features for each word in the vocabulary. The rule set describes the syntactic structures of permissible sentences in the language. The syntactic parser analyzes the output hypotheses of the recognition system and rejects the non-grammatical ones.

Various methods have been presented for specifying the syntactic structure of a language in the last two decades [51-53]. Generalized phrase structure grammar (GPSG) [52] is a syntactic formalism that treats the sentences of a language as sets of phrases, each phrase being a combination of smaller phrases. Using linguistic expertise and consultation, about 170 grammatical rules for the Persian language were extracted following the GPSG idea [20]. The employed GPSG was modified to be consistent with the Persian language. A slightly modified X-bar theory [54] was used for defining the syntactic categories.
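Returning briefly to the statistical integration above, Eqs. 1 and 2 can be made concrete with a short sketch. The function names and the toy probability tables are illustrative only; in the system the probabilities would come from the trained n-gram and lexical generation tables.

```python
import math

def update_hypothesis_score(log_s_n, s_am, s_lm, alpha_lm):
    """Eq. 1: combine the current hypothesis score with the acoustic
    score and the weighted language model score, in the log domain."""
    return log_s_n + math.log(s_am) + alpha_lm * math.log(s_lm)

def pos_bigram_score(w_next, t_n, p_tag_given_tag, p_word_given_tag, tags):
    """Eq. 2: POS-based bigram LM score, maximized over candidate tags T_i."""
    return max(p_tag_given_tag[(t_n, t_i)] * p_word_given_tag[(w_next, t_i)]
               for t_i in tags)

# Toy probability tables (illustrative numbers, not from the paper)
p_tt = {("N", "V"): 0.4, ("N", "N"): 0.2}      # P(T_i | T_n)
p_wt = {("raft", "V"): 0.05, ("raft", "N"): 0.01}  # P(w | T_i)
score = pos_bigram_score("raft", "N", p_tt, p_wt, ["V", "N"])
```

The trigram variant of Eq. 3 differs only in conditioning the tag transition on the pair (T_{n-1}, T_n) instead of T_n alone.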
Noun (N), verb (V), adjective (ADJ), adverb (ADV) and preposition (P) were selected as the basic syntactic categories. These basic categories can serve as the head of larger syntactic categories such as noun phrases, verb phrases, adjective phrases, etc. For each syntactic category and phrase we specify features; the features describe the lexical, syntactic and semantic characteristics of the words. Each feature is assigned a name and a set of possible values. For example, Plurality (PLU) is a binary^a feature whose possible values are + (plural) and - (singular); Person (PER) is an atomic^b feature whose possible values are 1, 2 and 3. After specifying the categories and phrases, the syntactic structures of the various phrases are described in terms of smaller syntactic categories. As an example, the following rule is one of the grammatical rules that describe noun phrases (N1) in Persian. It shows the noun phrase structure when the noun combines with another noun phrase as a genitive.

N1 → *N1^- [GEN+, PRO-] N2 (P2) (S [COMP+, GAP])   (4)

In this rule, N1^- (a noun, possibly with an adjective) must have an Ezafe^c enclitic (GEN+) and a non-pronoun (PRO-) head. N2 denotes a complete noun phrase (a noun with pre-modifiers and post-modifiers); that is, a complete noun phrase can play the role of genitive for the noun. In addition, the rule shows that the other post-modifiers of the noun (P2 and S) can optionally be combined. P2 denotes a prepositional phrase and S[COMP+] denotes a complement sentence (relative clause); the feature COMP with the value + indicates that the sentence must contain the Persian complementizer "ke" (that, which). Similarly to this rule, we wrote other rules describing the various syntactic structures of Persian. Furthermore, a 1,000-word vocabulary with syntactic features was annotated.

Analyzing a sentence and checking the compatibility of its structure with the grammar requires a parsing technique. A parsing algorithm searches through the various ways of combining grammatical rules to find a combination that generates a tree illustrating the structure of the input sentence; this is similar to the search problem in speech recognition. A top-down chart parser [5] is incorporated in Nevisa.

The grammatical language model is integrated in Nevisa in a loosely-coupled manner, as shown in Figure 1, at the end of the search process. The parser takes the n-best list from the word decoder, analyzes each sentence according to the grammatical rules and accepts the grammatically correct sentences as the output of the system.

4 Experiments and results
4.1 System parameters
In the acoustic front-end, the speech signal is blocked into 20 ms frames with 12 ms overlap in the case of a 22050 Hz sampling rate, and into 25 ms frames with 15 ms overlap in the case of a 16 kHz sampling rate. A pre-emphasis filter with a factor of 0.97 is applied to each frame of speech. A Hamming window is also applied in order to reduce the effect of frame-edge discontinuities. After the fast Fourier transform (FFT), the magnitude spectrum is warped according to the signal's warping factor if the VTLN option is used. The spectral magnitude values are then weighted and summed using the coefficients of 40 triangular filters arranged on the Mel-frequency scale; each filter output is the logarithm of the sum of the weighted spectral magnitudes. The discrete cosine transform (DCT) is then applied, resulting in 13 cepstral coefficients. The first and second derivatives of the cepstral coefficients are calculated with the linear regression method [23] over a window covering seven neighboring cepstrum vectors. This makes up vectors of 39 coefficients per speech frame. Finally, PCA and/or CMS are applied when these options are activated.

Nevisa uses phone (context-independent) and triphone (context-dependent) HMM modeling. All HMMs are left-to-right; forward, skip and self-loop transitions are allowed. The elements of the feature vectors are assumed to be uncorrelated, resulting in diagonal covariance matrices. The parameters are initialized using linear segmentation, and the segmental k-means re-estimation algorithm then finalizes them after ten iterations. The beam width in the decoding process is 70 and the stack size is 300.

4.2 Results of language model incorporation
In this section, the evaluation results of incorporating the language models in the Nevisa system are reported. An intermediate version of Nevisa is used in the experiments of this section. The system is trained on 29 Persian phonemes with silence as the 30th phoneme. All HMMs are left-to-right and composed of six states with 16 Gaussian mixture components per state. The vocabulary size is about 1,000 words and the first edition of the text corpus is used for building the statistical language models. In these evaluations, sFarsdat train and sFarsdat test are used as the train and test sets, respectively. Two different criteria were used to evaluate the efficiency of the language model variants: the perplexity and the word error rate (WER) of the system.

Table 5 shows the results of the Nevisa system on the sFarsdat test set using WER as the evaluation criterion. As mentioned in Sect. 2.1, the test set contains 140 sentences from seven speakers. The Witten-Bell smoothing technique [49] was used for the POS-based and class-based language models; in the class-based evaluation, we used 200 classes. As the results show, the baseline (BL) with no language model results in a high WER.
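The front-end of Section 4.1 can be sketched as follows for the 16 kHz case (25 ms frames, 15 ms overlap, hence a 10 ms hop). The FFT size, the filterbank edge frequencies and the regression weights for the deltas are common-practice assumptions; Nevisa's exact values are not stated in the text.

```python
import numpy as np

def mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, n_filt=40, n_ceps=13, nfft=512):
    frame_len, hop = int(0.025 * sr), int(0.010 * sr)
    # pre-emphasis with factor 0.97
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    n_frames = 1 + (len(sig) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(frame_len)   # Hamming-windowed frames
    mag = np.abs(np.fft.rfft(frames, nfft))     # magnitude spectrum
    # 40 triangular filters spaced uniformly on the mel scale
    pts = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), n_filt + 2))
    bins = np.floor((nfft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_filt, nfft // 2 + 1))
    for i in range(n_filt):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    feat = np.log(np.maximum(mag @ fbank.T, 1e-10))  # log filterbank energies
    # DCT-II -> 13 cepstral coefficients
    n = np.arange(n_filt)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2.0 * n_filt))
    return feat @ dct.T

def add_deltas(feats, w=3):
    """Linear-regression derivatives over 2*w+1 = 7 neighboring frames."""
    pad = np.pad(feats, ((w, w), (0, 0)), mode="edge")
    num = sum(k * (pad[w + k:w + k + len(feats)] - pad[w - k:w - k + len(feats)])
              for k in range(1, w + 1))
    return num / (2 * sum(k * k for k in range(1, w + 1)))

sig = np.sin(2 * np.pi * 440.0 * np.arange(16000) / 16000.0)  # 1 s test tone
c = mfcc(sig)
feats = np.hstack([c, add_deltas(c), add_deltas(add_deltas(c))])  # 39 dims
```

For one second of 16 kHz audio this yields 98 frames of 39 coefficients each, matching the 13 cepstra plus first and second derivatives described above.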
The word-based statistical LM provides a higher improvement than the other statistical LMs. Therefore, in all of the experiments in the following sections, we use the word-based LM. In the results of Table 5, the WER reduction obtained by using the grammar in the system is also noticeable.

Table 5 Performance of Nevisa in clean condition (word level)
LM Method                    WER%
BL, No LM                    38.14
POS-based trigram            24.68
Class-based trigram          23.40
Word-based trigram           21.76
POS-based trigram+Grammar    18.2

Table 6 shows the perplexity computed on the 750 sentences (about 10,000 words) of the gFarsdat test set based on the word-based n-gram model. In order to reduce the memory required for the language model, infrequent n-grams were removed from the model. The counts at or below which the n-grams are discarded are referred to as cutoffs [55]. Table 6 shows how the bigram and trigram cutoffs affect the size (in megabytes) and the perplexity of a trigram language model. The cutoffs noticeably reduce the size of the language model but do not increase the perplexity significantly. Considering Table 6, we chose cutoffs of 0 and 1 for the bigram and trigram counts, respectively.

Table 6 The effect of cutoffs on the size and perplexity of a back-off trigram language model
Cutoffs (bigram)   Cutoffs (trigram)   Perplexity   Size (MB)
0                  0                   134.54       36
0                  1                   134.76       20
0                  2                   135.82       17
1                  1                   143.18       10
1                  2                   143.26       7.8

4.3 Results for robustness techniques
The recognition system described in Section 4.2 is used to provide the results of this section. Here, sFarsdat train is used to train phone models with six states per model and 16 Gaussian mixtures per state. The vocabulary contains about 1,000 words and the word-based trigram language model is used. The evaluation test sets of the FANOS database are used in these experiments.

Like all other recognition systems, the performance of Nevisa degrades in adverse noisy conditions. Equipping the system with various compensation methods has made it robust to different noise types. Table 7 shows the recognition results of the system on four noisy tasks of the FANOS corpus. The baseline WERs of the system on this speech corpus are very high; the recognition rates on tasks C and D are negative due to the high insertion error rate. The performance of the system is considerably improved by using speaker and environment compensation methods. Table 7 shows the improvements in WER obtained by applying the robustness methods. VTLN provides better compensation in less-noisy environments such as tasks A and B, while PMC and PC-PMC result in higher compensation in noisier environments. In the PC-PMC method, the number of features is reduced from 36 to 25. MLLR and MAP adapt the acoustic models to the environmental conditions and to the microphone and speaker signal properties. MAP provides high adaptation ability whenever enough adaptation data is available, while MLLR provides better adaptation in less-noisy conditions than in noise-dominant conditions. The combination of PC-PMC and MLLR results in high system robustness in the presence of all noise types.

Table 7 Evaluation of Nevisa and the robustness methods on the FANOS noisy tasks (WER% at word level)
Robustness     Task A   Task B   Task C   Task D
None           74.04    75.32    116.41   105.94
VTLN+MLLR      30.37    32.87    82.52    60.07
PMC-MAP        38.63    50.49    69.36    50.22
PC-PMC+MLLR    31.33    28.70    56.17    42.11

4.4 Final results
The final results of continuous speech recognition using the Nevisa system are summarized in Table 8.

Table 8 WER% of Nevisa on small and large Farsdat using context-independent (phone) and context-dependent (triphone) modeling
Train database   Context       Test: gFarsdat   Test: sFarsdat
sFarsdat         Independent   29.60            25.77
sFarsdat         Dependent     20.51            16.79
gFarsdat         Independent   6.10             37.39
gFarsdat         Dependent     5.21             26.85

Based on the intermediate experiments, some of which were reported in the previous sections, the final parameters of the system were optimized. The parameters of the front-end are the values described in Section 4.1. CMS normalization is used as a permanent processing unit in the system. Context-independent (phone) and context-dependent (triphone) modeling are performed using both the small and the large Farsdat corpus. In all experiments, the HMMs are made up of five states with eight Gaussian mixtures per state. 29 phone models and a silence model are used for the context-independent task with small Farsdat.
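The WER figures in Tables 5-8 follow the standard definition: substitutions, deletions and insertions from a Levenshtein alignment of the hypothesis against the reference, divided by the number of reference words. Because insertions are counted, WER can exceed 100% (equivalently, the recognition rate becomes negative), as on tasks C and D in Table 7. A minimal sketch:

```python
def wer(ref, hyp):
    """Word error rate in percent: 100 * (S + D + I) / N, where N is the
    number of reference words. Computed by Levenshtein alignment."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                       # all deletions
    for j in range(len(h) + 1):
        d[0][j] = j                       # all insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return 100.0 * d[len(r)][len(h)] / len(r)
```

A hypothesis with three spurious words against a two-word reference already gives a WER of 150%, which is how the baseline figures above 100% in Table 7 arise.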
The same acoustic models, with two additional models for noise and breath, are used in the context-independent modeling with large Farsdat. In the context-dependent modeling with small Farsdat (sFarsdat train) the states are tied into 500 senones, while they are tied into four thousand senones in the modeling with large Farsdat (gFarsdat train). In the experiments given in Table 8, a word-based back-off trigram language model extracted from the second edition of the text corpus and a vocabulary size of 20,000 words are used.

As shown in Table 8, the performance of the system with the sFarsdat test set is generally lower than with the gFarsdat test set. This is due to the mismatch of the language model between the sentences of sFarsdat test and the text corpus. As indicated in Sect. 4.1, the sentences of small Farsdat are designed artificially to cover the Persian acoustic variations, and their language is not compatible with regular Persian texts such as the Peykare. Training the triphone models with small Farsdat yields a higher WER than with large Farsdat because the training data in small Farsdat is not enough for context-dependent modeling; due to the small size of sFarsdat train, the number of final tied states is reduced to 500. Furthermore, the acoustic mismatch between the train and test conditions (training with sFarsdat train and testing with gFarsdat test, or vice versa) intensifies the increase in WER. The best performance of the system was obtained with context-dependent modeling using the large Farsdat database.

5 Summary and conclusion
The Nevisa system was introduced as the first large vocabulary speaker-independent continuous speech recognition system for the Persian language. Conventional and customized techniques for the different modules of the system were incorporated. For each module, the necessary modifications and parameter optimizations were performed; the parameter set for each part of the system was found by separately evaluating the performance of that part with different parameter values. The system was developed through academic and industrial teamwork and was intended to be an exploitable product. Therefore, the problems of noisy environments and speaker variations had to be handled, and various robustness techniques were tried and optimized for this purpose. We also customized and utilized statistical and grammatical language models for the Persian language; the general n-gram statistics of Persian were extracted and incorporated for the first time. Our evaluation results and real environmental tests show that the system performs well enough to be used by typical users.

We are now continuing our research for improved versions of Nevisa. We are using context-dependent acoustic phone units (e.g. triphones), increasing the vocabulary size and improving our language models for this purpose. We are also working on specific language models for medical, legal, banking and office automation applications.

Notes
a. Binary features are features that take only two possible values.
b. Atomic features are features that take more than two possible values.
c. Ezafe is a short vowel that forms genitives in Persian.

Competing interests
The authors declare that they have no competing interests.

Received: 18 January 2011; Accepted: 5 October 2011; Published: 5 October 2011

References
1. LR Rabiner, Challenges in speech recognition and natural language processing, in SPECOM (June 25 2006)
2. S Furui, 50 years of progress in speech and speaker recognition research. Trans Comput Information Technology ECTI-CIT. 1(2), 64-74 (2005)
3. X Huang, A Acero, HW Hon, Spoken Language Processing (Prentice Hall, Upper Saddle River, NJ, USA, 2001)
4. L Rabiner, BH Juang, Fundamentals of Speech Recognition (Prentice Hall, Upper Saddle River, NJ, USA, 1993)
5. J Allen, Natural Language Understanding (Benjamin-Cummings Publishing Co. Inc., Redwood City, CA, USA, 1995)
6. B Babaali, H Sameti, The Sharif speaker-independent large vocabulary speech recognition system, in The 2nd Workshop on Information Technology & Its Disciplines (WITID 2004), (Kish Island, 2004), pp. 24-26
7. H Sameti, H Movasagh, B Babaali, M Bahrani, K Hosseinzadeh, A Fazel Dehkordi, HR Abu-talebi, H Veisi, Y Mokri, N Motazeri, M Nezami Ranjbar, Large vocabulary Persian speech recognition system, in 1st Workshop on Persian Language and Computer, 69-76 (May 24-26 2004)
8. H Movasagh, Design and implementation of an optimized search method for HMM-based Persian continuous speech recognition. MS thesis, Sharif University of Technology (2004)
9. B Babaali, Incorporating pruning techniques for improving the performance of an HMM-based continuous speech recognizer. MS thesis, Sharif University of Technology (2004)
10. H Sameti, H Veisi, M Bahrani, B Babaali, K Hosseinzadeh, Nevisa, a Persian continuous speech recognition system, in Communications in Computer and Information Science (Springer Berlin Heidelberg, 2008), pp. 485-492
11. M Sheikhan, M Tebyani, M Lotfizad, Continuous speech recognition and syntactic processing in Iranian Farsi language. Inter J Speech Technol. 1(2), 135 (1997). doi:10.1007/BF02277194
12. SM Ahadi, Recognition of continuous Persian speech using a medium-sized vocabulary speech corpus, in European Conference on Speech Communication and Technology (Eurospeech'99), (Geneva, Switzerland, 1999), pp. 863-866
13. N Srinivasamurthy, SS Narayanan, Language-adaptive Persian speech recognition, in European Conference on Speech Communication and Technology (Eurospeech'03), Geneva (2003)
14. F Almasganj, SA Seyyed Salehi, M Bijankhan, H Razizade, M Asghari, Shenava 2: a Persian continuous speech recognition software, in The First Workshop on Persian Language and Computer (Tehran, 2004), pp. 77-82
15. LR Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE. 77(2), 257-286 (1989). doi:10.1109/5.18626
16. S Ortmanns, A Eiden, H Ney, Improved lexical tree search for large vocabulary speech recognition, in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'98) (1998)
17. H Ney, R Haeb-Umbach, BH Tran, M Oerder, Improvements in beam search for 10000-word continuous speech recognition. IEEE Trans Acoust Speech Signal Process. 2, 353-356 (1992)
18. M Bahrani, H Sameti, M Hafezi Manshadi, A computational grammar for Persian based on GPSG, in 2nd Workshop on Persian Language and Computer, Tehran (2006)
19. M Bahrani, H Sameti, Building statistical language models for Persian continuous speech recognition systems using the Peykare corpus. Intern J Comp Process Lang. 23(1), 1-20 (2011). doi:10.1142/S1793840611002188
20. M Bahrani, H Sameti, M Hafezi Manshadi, A computational grammar for Persian based on GPSG. Lang Resour Eval, 1-22 (2011)
21. M Bijankhan, J Sheikhzadegan, MR Roohani, Y Samareh, C Lucas, M Tebyani, FARSDAT: the speech database of Farsi spoken language, in Proceedings of the 5th Australian International Conference on Speech Science and Technology, 826-831 (1994)
22. J Sheikhzadegan, M Bijankhan, Persian speech databases, in 2nd Workshop on Persian Language and Computer, 247-261 (2006)
23. H Veisi, Model-based methods for noise robust speech recognition systems. MS thesis, Sharif University of Technology (2005)
24. K Hosseinzadeh, Improving the accuracy of continuous speech recognition in noisy environments. MS thesis, Sharif University of Technology (2004)
25. M Bijankhan, Persian text corpus, in 1st Workshop on Persian Language and Computer, Tehran (2004)
26. M Bijankhan, J Sheykhzadegan, M Bahrani, M Ghayoomi, Lessons from building a Persian written corpus: Peykare. Lang Resour Eval. 45(2), 143-164 (2011). doi:10.1007/s10579-010-9132-x
27. P Zhan, M Westphal, M Finke, A Waibel, Speaker normalization and speaker adaptation: a combination for conversational speech recognition, in European Conference on Speech Communication and Technology (EUROSPEECH'97), Greece, ISCA, 2087-2090 (1997)
28. D Pye, PC Woodland, Experiments in speaker normalisation and adaptation for large vocabulary speech recognition, in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'97), Munich, 1047-1050 (1997)
29. H Veisi, H Sameti, B Babaali, K Hosseinzadeh, MT Manzuri, Improving the robustness of Persian large vocabulary continuous speech recognition system for real applications, in IEEE International Conference on Information and Communication Technologies (ICTTA'06), 1293-1297 (April 24-26 2006)
30. H Veisi, H Sameti, The integration of principal component analysis and cepstral mean subtraction in parallel model combination for robust speech recognition. Digit Signal Process. 21(1), 36-53 (2011). doi:10.1016/j.dsp.2010.07.004
31. MJF Gales, Model-based Techniques for Noise Robust Speech Recognition. PhD thesis, University of Cambridge (1995)
32. H Veisi, H Sameti, The combination of CMS with PMC for improving robustness of speech recognition systems, in Communications in Computer and Information Science (Springer Berlin Heidelberg, 2008), pp. 825-829
33. CJ Leggetter, PC Woodland, Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Comput Speech Lang. 9(2), 171 (1995). doi:10.1006/csla.1995.0010
34. PC Woodland, Speaker adaptation: techniques and challenges, in IEEE Workshop on Automatic Speech Recognition and Understanding, 85-90 (1999)
35. SJ Young, PC Woodland, The use of state tying in continuous speech recognition, in European Conference on Speech Communication and Technology (EUROSPEECH'93), ISCA, Berlin, 2203-2206 (22-25 September 1993)
36. SJ Young, JJ Odell, PC Woodland, Tree-based state tying for high accuracy acoustic modeling, in Proceedings of the Workshop on Human Language Technology, Association for Computational Linguistics, Morristown, NJ, 307-312 (1994)
37. JJ Odell, The Use of Context in Large Vocabulary Speech Recognition. PhD thesis, Cambridge University (1995)
38. MY Hwang, F Alleva, X Huang, Senones, multi-pass search, and unified stochastic modeling in Sphinx-II, in European Conference on Speech Communication and Technology (EUROSPEECH'93), Berlin, ISCA (22-25 September 1993)
39. PJ Moreno, Speech Recognition in Noisy Environments. PhD thesis, Carnegie Mellon University (1996)
40. A Acero, Acoustical and environmental robustness in automatic speech recognition. PhD thesis, Carnegie Mellon University (1990)
41. L Welling, H Ney, S Kanthak, Speaker adaptive modeling by vocal tract normalization. IEEE Trans Speech Audio Process. 10(6), 415-426 (2002). doi:10.1109/TSA.2002.803435
42. MJF Gales, PC Woodland, Mean and variance adaptation within the MLLR framework. Comput Speech Lang. 10(4), 249-264 (1996). doi:10.1006/csla.1996.0013
43. C Shannon, A mathematical theory of communication. Bell Sys Tech J. 27, 398-403 (1948)
44. A Ashraf Sadeghi, Z Zandi Moghadam, The Dictionary of Persian Orthography (The Academy of Persian Language and Literature, 2005)
45. S Katz, Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Trans Acoust Speech Signal Process. 35(3), 400-401 (1987). doi:10.1109/TASSP.1987.1165125
46. PF Brown, RL Mercer, VJ Della Pietra, JC Lai, Class-based n-gram models of natural language. Comput Linguist. 18(4), 467-479 (1992)
47. S Martin, J Liermann, H Ney, Algorithms for bigram and trigram word clustering. Speech Commun. 24(1), 19-37 (1998). doi:10.1016/S0167-6393(97)00062-9
48. SF Chen, J Goodman, An empirical study of smoothing techniques for language modeling, in Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, Santa Cruz, California, 310-318 (1996)
49. IH Witten, TC Bell, The zero-frequency problem: estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory. 37(4), 1085-1094 (1991). doi:10.1109/18.87000
50. MP Harper, LH Jamieson, CD Mitchell, G Ying, S Potisuk, PN Srinivasan, R Chen, CB Zoltowski, LL McPheters, B Pellom, Integrating language models with speech recognition, in Proceedings of the AAAI-94 Workshop on the Integration of Natural Language and Speech Processing, 139-146 (1994)
51. RM Kaplan, The formal architecture of lexical functional grammar, in Formal Issues in Lexical-Functional Grammar, Center for the Study of Language (CSLI), 7-28 (1995)
52. G Gazdar, E Klein, G Pullum, IA Sag, Generalized Phrase Structure Grammar (Harvard University Press, 1985)
53. AK Joshi, L Levy, M Takahashi, Tree adjunct grammars. Journal of Computer and System Sciences. 10(1), 136-163 (1975). doi:10.1016/S0022-0000(75)80019-5
54. A Radford, Transformational Grammar: A First Course (Cambridge University Press, Cambridge, 1988)
55. P Clarkson, R Rosenfeld, Statistical language modeling using the CMU-Cambridge toolkit, in European Conference on Speech Communication and Technology (EUROSPEECH'97), ISCA, Rhodes, 2707-2710 (September 22-25 1997)

doi:10.1186/1687-4722-2011-426795
Cite this article as: Sameti et al.: A large vocabulary continuous speech recognition system for Persian language. EURASIP Journal on Audio, Speech, and Music Processing 2011, 2011:6.