My Lecture Notes in Speech Processing and Recognition Course
5th Edition
2010
Summary
In this chapter the basic skills needed to work with the human speech signal are introduced. The human speech signal is one of the human interface tools: the human body can interact with its environment using hands, eyes, hearing, touch, smell and speech. To the electrical engineer, the Human Speech Signal (HSS) is treated as a digital signal with some exclusive features. It is a random signal, and it carries a lot of information encoded into patterns. This chapter introduces the mathematical skills needed for speech signal manipulation, and in addition the speech production mechanism is illustrated.
Objectives
Understanding the human speech mechanism.
Recall knowledge of digital filters.
Recall knowledge of statistical processes and random variables.
References
[1] L.R. Rabiner, R.W. Schafer, "Digital Processing of Speech Signals", Prentice-Hall, ISBN 0-13-213603-1.
[2] L.R. Rabiner, "Fundamentals of Speech Recognition", Prentice-Hall, ISBN 0-13-285826-6.
[3] Alan V. Oppenheim, "Signals and Systems", 2nd Edition, Prentice-Hall International, Inc., ISBN 7-302-03058-8.
Speech is the main method by which humans interact with each other; it is the primary way of communicating between humans. Figure 1 introduces a basic model of a speech communication system.
The produced speech signal carries the human message. The information is encoded within that signal, and the human brain has the ability to encode and decode it. The process starts in the talker's brain by formulating a certain message. The message is then encoded into a certain phonetic sequence. The phoneme is the smallest information unit in speech technology; it is like the character in any written language. Each phone is then produced by sending a certain sequence of signals to the muscles controlling the vocal tract and the vocal cords. The resulting signal is an analog signal. It is radiated from the mouth through the surrounding air, affecting the air particles in such a way that the disturbance is continuously transferred to the listener's ears. The listener's ears act as a receptor for this air-pressure deformation and reverse the process, regenerating the pressure and velocity waveform that was generated by the speech apparatus of the talker. The signal is analyzed by the basilar membrane (part of the auditory system), and some features are extracted for the subsequent recognition process. Figure 2 provides a classification of speech applications.
2. Statistical process
Referring to figure 1, part of the speech recognition system depends on the features extracted from the perceived speech. Features are physical properties that identify a certain phenomenon. For example, male voices typically sound deeper than female voices. This is a physical phenomenon that we can measure by extracting the fundamental frequency from the speech signal: deeper voices evaluate to a lower fundamental frequency, while female voices evaluate to a higher one.
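As a minimal illustration of measuring this feature (an addition to these notes, not from the original text), the Matlab sketch below estimates the fundamental frequency of one voiced frame from the first peak of its autocorrelation. The file name and the 50-400 Hz search range are assumptions.

[x, fs] = wavread('voiced.wav');   % 'audioread' in newer Matlab releases
x = x(:,1);
frame = x(1:round(0.03*fs));       % one 30 ms frame, assumed voiced
frame = frame - mean(frame);       % remove any dc offset
r = xcorr(frame, 'coeff');         % normalized autocorrelation
r = r(numel(frame):end);           % keep lags 0, 1, 2, ...
minLag = round(fs/400);            % assumed search range: 50-400 Hz
maxLag = round(fs/50);
[~, k] = max(r(minLag+1:maxLag+1));
lag = minLag + k - 1;              % lag of the strongest peak
F0 = fs / lag;                     % estimated fundamental frequency (Hz)
fprintf('Estimated F0 = %.1f Hz\n', F0);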
One can imagine that the muscles responsible for controlling the vocal tract may not reproduce a 100% match every time the talker wants to communicate the same message. Certainly it will deviate a little, and this is reflected in small differences in the values of the extracted features. It is therefore not possible to treat the features as deterministic values, although they are generated through a deterministic process. A random process is a function of random variables, and random variables are modeled using probability distribution functions. Figure 5 gives a close look at the speech signal as an information source. The figure has three parts: the top is the waveform, the middle is the spectrogram and the bottom is the annotation. The speech waveform is the graph of the analog values of the speech signal. The y-axis represents the signal value, which is a function of the microphone used to record the signal. The x-axis is the time in this graph and in the other two graphs. The middle graph is the spectrogram. This is a three-dimensional graph that cross-references time, frequency and power: the y-axis is the frequency and the z-axis, normal to the paper, represents the signal power; the more power, the darker the points. The x-axis is the time, as mentioned before. Looking at the figure, we can see that the signal has a non-homogeneous frequency distribution over time. The signal is the Arabic word خمسة, pronounced "khamsah".
The observer can notice that this 4-Arabic-letter word includes 6 different homogeneous areas. Each area represents a stable duration in time with stable features. Those stable areas are called sounds, or in other words phonemes. So the phonemes play the same role in spoken speech as the letters play in the written language. The above discussion gives us the idea of segmenting the speech signal into small durations so that it can be treated as stationary during those small durations, as shown in figure 6.
The segmentation process is very important to make it possible to model the speech production system. A phoneme may consist of a single frame or a sequence of frames. The statistical model should be suitable for the statistical process being described. Consider the following example (for a list of symbols of Arabic phonemes refer to figure 6): take the three Arabic vowels {a, o, e}. Formant features will be used. Formants are the resonant frequencies of the vocal tract; they appear as dark bands in figure 5. The first two formants of the three Arabic vowels are used to build three Gaussian pdfs. Viewing the {F1-F2} space (figure 9), it clearly appears that we have three classes, and we can model each class using a single Gaussian distribution.
F1 and F2 are random variables. Frames of {F1, F2} are modeled using a (bivariate) Gaussian distribution:

N(x; \mu, \Sigma) = \frac{1}{2\pi\sqrt{|\Sigma|}} \exp\left(-\tfrac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right)

To use this function you should evaluate two parameters:
1- The covariance matrix.
2- The mean vector.
The covariance matrix indicates the correlation between the different random variables {F1, F2}:

\Sigma = \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{bmatrix}
The parameter values are estimated from the available training data. We should collect a suitably descriptive set of frames that covers all situations of {F1, F2}. Then we can estimate the model parameters, and after that we can use the pdf to evaluate the probability of a certain frame of {F1, F2} against the trained model.

The variance of F1, for example, is estimated from the training set as

\sigma_{11} = \sum_{F_1} (F_1 - \mu_1)^2\, P(F_1)

P(F_1, F_2) = c / N, where c is the number of recurrences of the vector and N is the total number of training vectors in the data set. P(F_1, F_2) is the probability of having a vector containing both values F_1 and F_2 at the same time. If the variables are independent, the cross (off-diagonal) covariance terms evaluate to zero and the joint probability factors as P(F_1, F_2) = P(F_1)\, P(F_2). In the case where all vectors are equally probable (which should be almost the normal case), P(F_1) = 1/N.
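As a hedged illustration of this estimation step (not part of the original notes), the following Matlab sketch estimates the mean vector and covariance matrix from a hypothetical set of {F1, F2} training frames and then evaluates the trained Gaussian pdf on a new frame; the numeric formant values are invented for the example.

% Hypothetical training frames: each row is one {F1, F2} vector in Hz.
train = [700 1200; 680 1150; 720 1250; 690 1180; 710 1220];   % invented values

mu    = mean(train);          % estimated mean vector (1x2)
Sigma = cov(train);           % estimated covariance matrix (2x2)

% Evaluate the trained Gaussian pdf for a new observed frame.
newFrame = [705 1210];                        % invented test vector
p = mvnpdf(newFrame, mu, Sigma);              % likelihood under the model
fprintf('Likelihood of the new frame = %g\n', p);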
Figure 10: (a) F1-F2 graph for two Arabic vowels. (b) Gaussian pdf for the training data. (c) 3D graph of the Gaussian pdf.
mu = [0 0];
Sigma = [.25 .3; .3 1];
F1 = -3:.2:3; F2 = -3:.2:3;
[F1,F2] = meshgrid(F1,F2);
F = mvnpdf([F1(:) F2(:)],mu,Sigma);
F = reshape(F,length(F2),length(F1));
surf(F1,F2,F);
caxis([min(F(:))-.5*range(F(:)),max(F(:))]);
axis([-3 3 -3 3 0 .4])
xlabel('F1'); ylabel('F2'); zlabel('Probability Density');
Figure 11: Matlab script that evaluates the multivariate normal distribution of certain random variables.
Figure 12: Hypothetical case where the F1-F2 data belong to a single class, for example a single phone.
The previous figure shows data that cannot be fitted by a single Gaussian. As shown in the figure, the data fall into two groups: the data for the same phoneme are concentrated in two separate areas, or in other words there appear to be two points that may be taken as centers. A Gaussian mixture can represent such multimodal data. Figure 10-b indicates the contours of a Gaussian mixture pdf.
%%[m s p ll] = GMI(data,n)
% Initialize Gaussian mixture using data and n. n is the number of Gaussians.
% This function returns on success the mean vector m, covariance matrix s,
% portions vector p and negative log likelihood ll.
% data is row based. Each row is a features vector.
function [m s p ll] = GMI(data,n)
    options = statset('Display','final');
    obj = gmdistribution.fit(data,n,'Options',options);
    s = obj.Sigma;
    m = obj.mu;
    p = obj.PComponents;
    ll = obj.NlogL;
end

%%[p] = GMcalProp(o,seg,mu,p)
% Calculate the Gaussian mixture probability for observation vectors o. The
% output is stored in the vector p.
% o   : Observation matrix. Each row corresponds to a certain features vector.
% seg : Covariance matrix of nxn elements. n is the size of the features vector.
% mu  : The mean vector. It is 1xn.
% p   : The portions vector. It is 1xm. m is the number of Gaussian mixtures.
function [p] = GMcalProp(o,seg,mu,p)
    obj = gmdistribution(mu,seg,p);
    p = pdf(obj,o);
end
Figure 13: Matlab code that estimates a Gaussian mixture from a dataset.
4. Digital signal processing and digital filters
In this section the basics of signal processing will be recalled, starting with the Fourier series and ending with digital filters. Signal processing involves the transformation of a signal into a form which is in some sense more desirable. Thus we are concerned with discrete systems, or equivalently, transformations of an input sequence into an output sequence. Linear shift-invariant systems are useful for performing filtering operations on speech signals and, perhaps more importantly, they are useful as models for speech production.
1. Fourier series of a periodic continuous-time signal
The Fourier series is a link to the frequency domain for periodic time signals. Consider the following equation:

f(t) = a_0 + \sum_{n=1}^{\infty} \left( a_n \cos(n\omega_0 t) + b_n \sin(n\omega_0 t) \right)
This is a general closed-form equation that expresses any periodic function of time f(t) as a sum of harmonics (sine and cosine signals), where \omega_0 = 2\pi/T (rad/sec) and T is the fundamental period. To make the above equation true and valid, the coefficients {a_n, b_n, a_0} should be calculated.

Integrating both sides over one period T, all terms in the right-hand side (RHS) evaluate to zero excluding the first term, because T is the fundamental period:

a_0 = \frac{1}{T} \int_T f(t)\, dt

In the same way we can evaluate a_n and b_n. Multiplying both sides by \cos(n\omega_0 t) and integrating over one period,

\int_T f(t) \cos(n\omega_0 t)\, dt = \int_T \left( a_0 + \sum_{m=1}^{\infty} \big( a_m \cos(m\omega_0 t) + b_m \sin(m\omega_0 t) \big) \right) \cos(n\omega_0 t)\, dt

all terms in the RHS evaluate to zero excluding the term of a_n, hence

a_n = \frac{2}{T} \int_T f(t) \cos(n\omega_0 t)\, dt, \qquad b_n = \frac{2}{T} \int_T f(t) \sin(n\omega_0 t)\, dt
This shows that any periodic time signal of main period T can be expressed as a sum of sine and cosine signals of the main period T and integer multiples of the main frequency \omega_0. This is a very important gate that opens new horizons in processing any periodic time signal by considering its components instead of the function itself; filters are the first application of this evolutionary step. Keeping in mind that

e^{jn\omega_0 t} = \cos(n\omega_0 t) + j \sin(n\omega_0 t)

the series can be rewritten in complex (phasor) form:

f(t) = \sum_{n=-\infty}^{\infty} C_n\, e^{jn\omega_0 t}

This is the same representation of the signal, but it is much more familiar to engineers to express it in phasor form (magnitude and phase components). In the same way we can obtain the complex coefficients C_n:

C_n = \frac{1}{T} \int_T f(t)\, e^{-jn\omega_0 t}\, dt

Hence C_n = (a_n - j b_n)/2 and C_0 = a_0.

Let us list the following points:
1- The frequency-domain coefficients C_n are discrete. They are also called spectral coefficients.
2- The difference in the \omega domain between two successive coefficients is \omega_0.
3- T = 2\pi / \omega_0 is the period of the time-domain periodic function.
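To make the spectral-coefficient idea concrete, here is a small Matlab sketch (added for illustration, not part of the original notes) that numerically approximates C_n for a square wave and compares it with the known analytic value; the waveform and period are chosen arbitrarily.

% Numerically approximate the complex Fourier series coefficients C_n of a
% +/-1 square wave over one period and compare with the analytic magnitude.
T  = 1e-2;                     % fundamental period: 10 ms (arbitrary choice)
w0 = 2*pi/T;                   % fundamental angular frequency
t  = linspace(0, T, 10000);    % fine time grid over one period
f  = sign(sin(w0*t));          % +/-1 square wave

for n = 1:2:7                                    % odd harmonics only
    Cn = trapz(t, f .* exp(-1j*n*w0*t)) / T;     % C_n = (1/T) * integral over T
    fprintf('n = %d: |Cn| numeric = %.4f, analytic 2/(n*pi) = %.4f\n', ...
            n, abs(Cn), 2/(n*pi));
end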
2. Non-periodic continuous-time functions and the Fourier transform
Recalling the above points, let us think about the effect on the above treatment of the time function f(t) in case the function is not periodic.
In that case we have no fundamental period T to start with. We should think in a different way to get the frequency-domain components of the time signal. The basic and straightforward direction is to handle it as a periodic signal with a period equal to infinity:

T \to \infty

We can expect that the spectral coefficients are going to get much closer to each other, until they touch. This means that the frequency-domain function is going to be a continuous function instead of a discrete one as in the case of periodic time signals. Let us set some values according to the new situation:

\omega_0 = \frac{2\pi}{T} \to d\omega, \qquad n\omega_0 \to \omega

Substituting the expression of C_n into the series,

f(t) = \sum_{n=-\infty}^{\infty} \left( \frac{1}{T} \int_T f(\tau)\, e^{-jn\omega_0 \tau}\, d\tau \right) e^{jn\omega_0 t}

and letting T \to \infty,

f(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \left( \int_{-\infty}^{\infty} f(\tau)\, e^{-j\omega \tau}\, d\tau \right) e^{j\omega t}\, d\omega

F(\omega) = \int_{-\infty}^{\infty} f(t)\, e^{-j\omega t}\, dt

f(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} F(\omega)\, e^{j\omega t}\, d\omega
The last two equations are called the Fourier transform pair. As shown, they follow directly from the Fourier series of a periodic signal whose period tends to infinity, or in simple words they apply to non-periodic signals. Let us list the following point:
1- Non-periodic signals have continuous spectral coefficients in the frequency domain.

Example 1
Consider an impulse in the frequency domain, F(\omega) = \delta(\omega - \omega_0). Then

f(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \delta(\omega - \omega_0)\, e^{j\omega t}\, d\omega = \frac{1}{2\pi}\, e^{j\omega_0 t}

Example 2
Now let us consider a spectral function given by an impulse train, as follows:

F(\omega) = \sum_{k=-\infty}^{\infty} 2\pi\, C_k\, \delta(\omega - k\omega_0)

f(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} F(\omega)\, e^{j\omega t}\, d\omega = \sum_{k=-\infty}^{\infty} C_k\, e^{jk\omega_0 t}

This indicates that f(t) is a periodic function of time with period T = 2\pi/\omega_0.

Example 3
Find the Fourier transform of the following time function (an impulse train in time):

x(t) = \sum_{n=-\infty}^{\infty} \delta(t - nT)

Compare x(t) to the Fourier series of a periodic signal: they are identical, with spectral coefficients

C_n = \frac{1}{T} \int_T \delta(t)\, e^{-jn\omega_0 t}\, dt = \frac{1}{T}

By using the result of example 2, we can handle this function as a periodic function of period T. Then we can calculate

X(\omega) = \sum_{k=-\infty}^{\infty} \frac{2\pi}{T}\, \delta(\omega - k\omega_0) = \omega_0 \sum_{k=-\infty}^{\infty} \delta(\omega - k\omega_0)

This is a very important result: the spectrum of an impulse train in the time domain with period T is also an impulse train in the frequency domain, with spacing \omega_0 = 2\pi/T.
3. Discrete Fourier transform for a periodic sequence
Let us continue our analysis toward non-continuous time functions. The cause of discontinuity is that the time signal is sampled; this is the first step toward the digital world. What will be the effect in the frequency domain for periodic and non-periodic sequences? Let us start with a periodic sequence of period N. As before, let us start from the Fourier series representation:
f(t) = \sum_{n} C_n\, e^{jn\omega_0 t}, \qquad C_n = \frac{1}{T} \int_T f(t)\, e^{-jn\omega_0 t}\, dt

Now we have:
\Delta t = T_s, as this is the minimum time difference between two successive samples.
T = N T_s (sec)
t = k T_s
\omega_0 = \frac{2\pi}{N T_s} \left(\frac{\text{rad}}{\text{sec}}\right)

Let

\Omega_0 = \omega_0 T_s = \frac{2\pi}{N} \left(\frac{\text{rad}}{\text{sample}}\right)

so that

f(kT_s) = \sum_{n} C_n\, e^{jn\Omega_0 k}

We can remove the independent variable T_s from the function brackets as it is a constant:

f(k) = \sum_{n} C_n\, e^{jn\Omega_0 k}

C_n = \frac{1}{N T_s} \sum_{k=0}^{N-1} f(kT_s)\, e^{-jn\Omega_0 k}\, T_s = \frac{1}{N} \sum_{k=0}^{N-1} f(k)\, e^{-jn\Omega_0 k}

Let

C_n \equiv F(n\Omega_0) = \frac{1}{N} \sum_{k=0}^{N-1} f(k)\, e^{-jn\Omega_0 k}
Let us list the following points:
1- F(n\Omega_0) is a discrete function.
2- The distance between any two successive points in the frequency domain is \Omega_0 (rad/sample).
3- F(n\Omega_0) is a periodic function in the \Omega domain, with period 2\pi. This is a direct consequence of the fact that

F(n\Omega_0) = \frac{1}{N} \sum_{k=0}^{N-1} f(k)\, e^{-jn(\Omega_0 + 2\pi)k} = \frac{1}{N} \sum_{k=0}^{N-1} f(k)\, e^{-jn\Omega_0 k}

4- The relation between the real frequency \omega (rad/sec) and the digital frequency \Omega (rad/sample) is \omega = \Omega\, f_s.
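The short Matlab sketch below (added here for illustration, not part of the original notes) evaluates F(n\Omega_0) for an arbitrary periodic sequence directly from the summation above and checks points 1 and 3 against Matlab's fft.

% Evaluate F(n*Omega0) for one period of a sequence and verify it matches
% fft (up to the 1/N factor) and repeats with period 2*pi in Omega.
N  = 8;                                  % period of the sequence (arbitrary)
f  = [1 2 3 4 0 0 0 0];                  % one period of an example sequence
k  = 0:N-1;
Omega0 = 2*pi/N;

Fd = zeros(1, N);
for n = 0:N-1                            % direct evaluation of the summation
    Fd(n+1) = sum(f .* exp(-1j*n*Omega0*k)) / N;
end

disp(max(abs(Fd - fft(f)/N)));           % ~0: the summation is the DFT/fft scaled by 1/N
n = 3;                                   % point 3: shifting n by N (Omega by 2*pi)
disp(abs(sum(f .* exp(-1j*(n+N)*Omega0*k))/N - Fd(n+1)));   % ~0: same value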
Going on from this point, we can start to consider the case of a non-periodic sampled time signal. As we did before, we can assume it is periodic with an infinite period. Let us start with the Fourier series pair for sampled periodic time signals:

f(k) = \sum_{n=0}^{N-1} F(n\Omega_0)\, e^{jn\Omega_0 k}, \qquad F(n\Omega_0) = \frac{1}{N} \sum_{k=0}^{N-1} f(k)\, e^{-jn\Omega_0 k}
This is the same issue that happened in the case of the Fourier transform: the distance between successive points in the frequency domain tends to zero, which leads to a continuous frequency-domain function. We also have the following:

\Omega_0 = \frac{2\pi}{N} \to d\Omega, \qquad n\Omega_0 \to \Omega \left(\frac{\text{rad}}{\text{sample}}\right), \qquad \omega = \Omega f_s \left(\frac{\text{rad}}{\text{sec}}\right)

Let us start from the analysis summation; in the limit it becomes

F(\Omega) = \sum_{k=-\infty}^{\infty} f(k)\, e^{-j\Omega k}

Let us list the following points:
1- F(\Omega) is a continuous function of the digital frequency \Omega.
2- F(\Omega) is periodic in \Omega with period 2\pi.
3- Recalling \omega = \Omega f_s: at \Omega = 2\pi, f = \frac{\Omega f_s}{2\pi} = \frac{2\pi f_s}{2\pi} = f_s (Hz). The spectrum therefore repeats every f_s (Hz). This introduces the very important rule of sampling theory: we should sample at least at twice the maximum signal frequency to ensure that no overlap (aliasing) occurs in the frequency domain.
Figure: sampled-signal spectra for Fs > 2Fmax (no overlap), Fs = 2Fmax (critical sampling) and Fs < 2Fmax (aliasing).
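The sketch below (an added illustration, not from the original notes) demonstrates the sampling rule numerically: a 3 kHz sine sampled at 8 kHz keeps its frequency, while the same sine sampled at 4 kHz aliases down to 1 kHz; all numbers are chosen for the example.

% Demonstrate the sampling rule: a 3 kHz tone sampled above and below 2*f0.
f0 = 3000;                                  % tone frequency in Hz
for fs = [8000 4000]
    n = 0:1023;                             % 1024 samples
    x = sin(2*pi*f0*n/fs);                  % sampled sine wave
    X = abs(fft(x));
    [~, kmax] = max(X(1:512));              % strongest bin in 0 .. fs/2
    fprintf('fs = %d Hz -> apparent tone at %.0f Hz\n', fs, (kmax-1)*fs/1024);
end
% fs = 8000 Hz reports 3000 Hz; fs = 4000 Hz (< 2*f0) reports the alias at 1000 Hz.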
Following the same direction as in examples 1 through 3, let us first consider an impulse train in the (digital) frequency domain:

F(\Omega) = \sum_{k=-\infty}^{\infty} 2\pi\, C_k\, \delta(\Omega - k\Omega_0)

and a periodic train of impulses in discrete time:

f(k) = \sum_{n=-\infty}^{\infty} \delta(k - nP)

This is identical to the Fourier series of a periodic discrete signal, so we can obtain F(\Omega) by considering f(k) as a periodic delta function of period P, with \Omega_0 = 2\pi/P and coefficients

C_n = \frac{1}{P} \sum_{k=0}^{P-1} \delta(k)\, e^{-jn\Omega_0 k} = \frac{1}{P}

F(\Omega) = \sum_{k=-\infty}^{\infty} \frac{2\pi}{P}\, \delta(\Omega - k\Omega_0) = \Omega_0 \sum_{k=-\infty}^{\infty} \delta(\Omega - k\Omega_0)
5. Z-transform
The Z-transform converts a discrete-time signal x(n) into a function of the complex variable z:

X(z) = \sum_{n=-\infty}^{\infty} x(n)\, z^{-n}

Region of convergence
The region of convergence (ROC) is the set of points in the complex plane for which the Z-transform summation converges.
Example 1 (no ROC)
Let x(n) = 0.5^n for all n. Expanding the Z-transform on the interval (-\infty, \infty), it becomes

X(z) = \sum_{n=-\infty}^{\infty} 0.5^n z^{-n}

There are no values of z that satisfy this condition, i.e. the summation does not converge for any z.

Example 2 (causal ROC)
Let x(n) = 0.5^n u(n) on the interval [0, \infty) (where u is the Heaviside step function). Expanding it becomes

X(z) = \sum_{n=0}^{\infty} (0.5\, z^{-1})^n = \frac{1}{1 - 0.5\, z^{-1}}

The last equality arises from the infinite geometric series, and the equality only holds if |0.5\, z^{-1}| < 1, which can be rewritten in terms of z as |z| > 0.5. Thus, the ROC is |z| > 0.5. In this case the ROC is the complex plane with a disc of radius 0.5 at the origin "punched out". (In the figure, the ROC is shown in blue, the unit circle as a dotted grey circle and the circle |z| = 0.5 as a dashed black circle.)

Example 3 (anticausal ROC)
Let x(n) = -(0.5)^n u(-n-1) on the interval (-\infty, -1]. Expanding it becomes

X(z) = -\sum_{n=-\infty}^{-1} 0.5^n z^{-n} = -\sum_{m=1}^{\infty} (2z)^m = \frac{1}{1 - 0.5\, z^{-1}}

Using the infinite geometric series again, the equality only holds if |2z| < 1, which can be rewritten in terms of z as |z| < 0.5. Thus, the ROC is the disc of radius 0.5. (The figure again shows the ROC in blue, the unit circle as a dotted grey circle and the circle |z| = 0.5 as a dashed black circle.)

What differentiates this example from the previous example is only the ROC. This is intentional, to demonstrate that the transform result alone is insufficient.

Examples conclusion
Examples 2 and 3 clearly show that the Z-transform X(z) of x(n) is unique when and only when the ROC is specified. Creating the pole-zero plot for the causal and anticausal cases shows that the ROC for either case does not include the pole that is at 0.5. This extends to cases with multiple poles: the ROC will never contain poles. In example 2, the causal system yields an ROC that includes |z| = \infty, while the anticausal system in example 3 yields an ROC that includes z = 0.

In systems with multiple poles it is possible to have an ROC that includes neither |z| = \infty nor z = 0; the ROC then forms a circular band. For example, x(n) = 0.5^n u(n) - 0.75^n u(-n-1) has poles at 0.5 and 0.75. The ROC will be 0.5 < |z| < 0.75, which includes neither the origin nor infinity. Such a system is called a mixed-causality system as it contains a causal term 0.5^n u(n) and an anticausal term -0.75^n u(-n-1).
The stability of a system can also be determined from the ROC alone. If the ROC contains the unit circle (i.e., |z| = 1) then the system is stable. In the above systems the causal system (example 2) is stable because |z| > 0.5 contains the unit circle. If you are provided the Z-transform of a system without an ROC (i.e., an ambiguous x(n)), you can determine a unique x(n) provided you desire the following:
Stability: if you need stability then the ROC must contain the unit circle.
Causality: if you need a causal system then the ROC must contain infinity and the system function will be a right-sided sequence. If you need an anticausal system then the ROC must contain the origin and the system function will be a left-sided sequence.
If you need both stability and causality, all the poles of the system function must be inside the unit circle.
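As an added illustration (not from the original notes; it assumes the Signal Processing Toolbox), the Matlab sketch below takes the causal system of example 2, H(z) = 1/(1 - 0.5 z^-1), plots its pole-zero diagram and confirms stability by checking that all poles lie inside the unit circle, equivalently that the causal ROC |z| > 0.5 contains the unit circle.

% Causal system of example 2: H(z) = 1 / (1 - 0.5*z^-1)
b = 1;                 % numerator coefficients (in powers of z^-1)
a = [1 -0.5];          % denominator coefficients

zplane(b, a);          % pole-zero plot: single pole at z = 0.5
p = roots(a);          % poles of the system
if all(abs(p) < 1)
    disp('Causal system is stable: all poles are inside the unit circle.');
else
    disp('Causal system is unstable.');
end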
Properties of the Z-transform (time domain, Z-domain and ROC):
Notation: x(n) has Z-transform X(z); ROC: r_2 < |z| < r_1.
Time shifting: x(n - k) has Z-transform z^{-k} X(z); same ROC, except z = 0 if k > 0 or z = \infty if k < 0.
Time reversal: x(-n) has Z-transform X(z^{-1}); ROC: 1/r_1 < |z| < 1/r_2.
Conjugation: x^{*}(n) has Z-transform X^{*}(z^{*}); same ROC.
Real part: Re{x(n)} has Z-transform \tfrac{1}{2}[X(z) + X^{*}(z^{*})]; same ROC.
Imaginary part: Im{x(n)} has Z-transform \tfrac{1}{2j}[X(z) - X^{*}(z^{*})]; same ROC.
Differentiation: n\, x(n) has Z-transform -z \dfrac{dX(z)}{dz}; same ROC.
Convolution: x_1(n) * x_2(n) has Z-transform X_1(z) X_2(z); ROC: at least the intersection of ROC_1 and ROC_2.
Correlation: r_{x_1 x_2}(l) = x_1(l) * x_2(-l) has Z-transform X_1(z) X_2(z^{-1}); ROC: at least the intersection of the ROC of X_1(z) and X_2(z^{-1}).
Multiplication: x_1(n)\, x_2(n) has Z-transform \dfrac{1}{2\pi j} \oint_C X_1(v)\, X_2(z/v)\, v^{-1}\, dv; ROC: at least r_{1l} r_{2l} < |z| < r_{1u} r_{2u}.
Parseval's relation: \sum_n x_1(n)\, x_2^{*}(n) = \dfrac{1}{2\pi j} \oint_C X_1(v)\, X_2^{*}(1/v^{*})\, v^{-1}\, dv, where the contour C lies inside the intersection of the ROCs.
Summary
This chapter provides details about the information included in the speech signal. The speech signal conveys a great deal of information; some of it, such as the talker's identity, the message and the emotions, will be explored. Research is very active in modeling such information. This chapter introduces basic information models and goes further to provide some explanation of speech databases. The spectral characteristics of different speech components will be discussed, the classification of speech components will be introduced, how emotions can be modeled will be discussed, and the relation between prosodic characteristics and speech components will be discussed.
Objectives
Explore human speech from a linguistic perspective.
Understanding speech corpora.
Understanding the human speech signal as an information source.
1. Acoustical parameters
Most languages, including Arabic, can be described in terms of a set of distinctive sounds, or phonemes. In particular, for American English there are about 42 phonemes [2], including vowels, diphthongs, semivowels and consonants. There are a variety of ways of studying phonetics; e.g., linguists study the distinctive features or characteristics of the phonemes. For our purposes it is sufficient to consider an acoustic characterization of the various sounds, including the place and manner of articulation, waveforms, and spectrographic characterizations of these sounds. Figure 1 shows how the sounds of American English are broken into phoneme classes. The four broad classes of sounds are vowels, diphthongs, semivowels, and consonants. Each of these classes may be further broken down into subclasses that are related to the manner and place of articulation of the sound within the vocal tract. Each of the phonemes in Figure 1 (a) can be classified as either a continuant or a noncontinuant sound. Continuant sounds are produced by a fixed (non-time-varying) vocal tract configuration excited by the appropriate source. The class of continuant sounds includes the vowels, the fricatives (both unvoiced and voiced), and the nasals. The remaining sounds (diphthongs, semivowels, stops and affricates) are produced by a changing vocal tract configuration; these are therefore classed as noncontinuants.
Figure 1: (a) Classification of American English phonemes. (b) Arabic phonemes.
The Arabic language has basically 34 phonemes: 28 consonants and six vowels (see figure 1 b).

2. Phonological hierarchy
The phonological hierarchy describes a series of increasingly smaller regions of a phonological utterance. From larger to smaller units, it is as follows:
Utterance
Prosodic declination unit (DU) / intonational phrase (I-phrase)
Prosodic intonation unit (IU) / phonological phrase (P-phrase)
Prosodic list unit (LU)
Clitic group
Phonological word (P-word, ω)
Foot (F): "strong-weak" syllable sequences such as English ladder, button, eat it
Syllable (σ): e.g. cat (1), ladder (2)
Mora (μ) ("half-syllable")
Segment (phoneme): e.g. [k], [æ] and [t] in cat
Feature

Syllable
A syllable is a unit of organization for a sequence of speech sounds. For example, the word water is composed of two syllables: wa and ter. A syllable is typically made up of a syllable nucleus (most often a vowel) with optional initial and final margins (typically, consonants). Syllables are often considered the phonological "building blocks" of words. They can influence the rhythm of a language, its prosody, its poetic meter, its stress patterns, etc. A word that consists of a single syllable (like English cat) is called a monosyllable (such a word is monosyllabic), while a word consisting of two syllables (like monkey) is called a disyllable (such a word is disyllabic).
A word consisting of three syllables (such as indigent) is called a trisyllable (the adjective form is trisyllabic). A word consisting of more than three syllables (such as intelligence) is called a polysyllable (and could be described as polysyllabic), although this term is often used to describe words of two syllables or more.

Phoneme
In human language, a phoneme is the smallest posited structural unit that distinguishes meaning. Phonemes are not the physical segments themselves, but, in theoretical terms, cognitive abstractions or categorizations of them. An example of a phoneme is the /t/ sound in the words tip, stand, water, and cat. (In transcription, phonemes are placed between slashes, as here.) These instances of /t/ are considered to fall under the same sound category despite the fact that in each word they are pronounced somewhat differently. The difference may not even be audible to native speakers, or the audible differences may not be apparent.

Phones
A phoneme may cover several recognizably different speech sounds, called phones. In our example, the /t/ in tip is aspirated, [tʰ], while the /t/ in stand is not, [t]. (In transcription, speech sounds that are not phonemes are placed in brackets, as here.)

Allophones
Phones that belong to the same phoneme, such as [tʰ] and [t] for English /t/, are called allophones. A common test to determine whether two phones are allophones or separate phonemes relies on finding minimal pairs: words that differ by only the phones in question. For example, the words tip and dip illustrate that [t] and [d] are separate phonemes, /t/ and /d/, in English, whereas the lack of such a contrast in Korean (/tata/ is pronounced [tada], for example) indicates that in this language they are allophones of a phoneme /t/.
3. Corpus
Corpus linguistics is the study of language as expressed in samples (corpora) of "real world" text. This method represents a digestive approach to deriving a set of abstract rules by which a natural language is governed or else relates to another language. Originally done by hand, corpora are now largely derived by an automated process, which is then corrected.
Computational methods had once been viewed as a holy grail of linguistic research, which would ultimately manifest a ruleset for natural language processing and machine translation at a high level. Such has not been the case, and since the cognitive revolution, cognitive linguistics has been largely critical of many claimed practical uses for corpora. However, as computation capacity and speed have increased, the use of corpora to study language and term relationships en masse has gained some respectability. The corpus approach runs counter to Noam Chomsky's view that real language is riddled with performance-related errors, thus requiring careful analysis of small speech samples obtained in a highly controlled laboratory setting.
Corpus linguistics has generated a number of research methods, attempting to trace a path from data to theory:
Annotation consists of the application of a scheme to texts. Annotations may include structural markup, POS-tagging, parsing, and numerous other representations.
Abstraction consists of the translation (mapping) of terms in the scheme to terms in a theoretically motivated model or dataset. Abstraction typically includes linguist-directed search but may include e.g. rule-learning for parsers.
Analysis consists of statistically probing, manipulating and generalizing from the dataset. Analysis might include statistical evaluations, optimization of rule-bases or knowledge discovery methods.
Most lexical corpora today are POS-tagged. However, even corpus linguists who work with 'un-annotated plain text' inevitably apply some method to isolate the terms they are interested in from surrounding words. In such situations annotation and abstraction are combined in a lexical search. The advantage of publishing an annotated corpus is that other users can then perform experiments on the corpus. Linguists with other interests and differing perspectives than the originators can exploit this work.
Speech corpora are designed to provide a source of segmented sound samples for researchers or for application manufacturers. There are many famous databases on the market, and many languages are targeted by these databases. Any speech corpus consists of all or some of the following parts:
1. Annotation information.
2. Waveform samples.
3. Segmentation information.
4. Text that describes the spoken utterance.
5. Phoneme statistics.
6. Recording information.
7. Talker statistics.
3.1. TIMIT corpus
TIMIT is a corpus of phonemically and lexically(2) transcribed(3) speech of American English speakers of different sexes and dialects(4). Each transcribed element has been represented precisely in time. TIMIT was designed to advance acoustic-phonetic knowledge and automatic speech recognition systems. It was commissioned by DARPA and worked on by many sites, including Texas Instruments (TI) and the Massachusetts Institute of Technology (MIT), hence the corpus' name. There is also a telephone-bandwidth version called NTIMIT (Network TIMIT).
The TIMIT corpus of read speech is designed to provide speech data for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. TIMIT contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. The TIMIT corpus includes time-aligned orthographic, phonetic and word transcriptions as well as a 16-bit, 16 kHz speech waveform file for each utterance. Corpus design was a joint effort among the Massachusetts Institute of Technology (MIT), SRI International (SRI) and Texas Instruments, Inc. (TI). The speech was recorded at TI, transcribed at MIT and verified and prepared for CD-ROM production by the National Institute of Standards and Technology (NIST). The TIMIT corpus transcriptions have been hand verified. Test and training subsets, balanced for phonetic and dialectal coverage, are specified. Tabular computer-searchable information is included as well as written documentation.
Footnotes: (2) by the meaning of words; (3) to represent (speech sounds) by phonetic symbols; (4) a regional or social variety of a language distinguished by pronunciation, grammar, or vocabulary, especially a variety of speech differing from the standard literary language or speech pattern of the culture in which it exists.

3.2. SCRIBE corpus
The material consists of a mixture of read speech and spontaneous speech. The read speech material consists of sentences selected from a set of 200 'phonetically rich' sentences (SET-A) and 460 'phonetically compact' sentences (SET-B) and a two-minute continuous passage. The 'phonetically rich' sentences were designed at CSTR to be phonetically balanced. The 'phonetically compact' sentences were based on a British version of the MIT compact sentences (as in TIMIT) which were expanded to include relevant RP contrasts (the set contains at least one example of every possible triphone in British English). The passage was designed at UCL to contain accent sensitive material. The spontaneous speech material was collected from a constrained 'free speech' situation where a talker gave a verbal description of a picture.
The recordings were divided between a 'many talker' set and a 'few talker' set. In the 'many talker' set, each speaker recorded ten sentences from the 'phonetically rich' sentences and ten sentences from the 'phonetically compact' sentences. In the 'few talker' set, each speaker recorded 100 sentences from the 'phonetically rich' set and 100 from the 'phonetically compact' set. Speakers were recruited from four 'dialect areas': South East (DR1), Glasgow (DR2), Leeds (DR3) and Birmingham (DR4). The aim was to employ 5 male and 5 female speakers from each dialect area for the few-talker subcorpus, and 20 male and 20 female speakers from each dialect area for the many-talker corpus. In fact, this number of speakers was not fully achieved. The original aim of the project was to release the corpus as a collection of audio recordings with just orthographic transcription, but with a small percentage to be phonetically annotated in the style of the TIMIT corpus.
FILE EXTENSIONS
SES - Sentence(s) English Sampled - pressure microphone signal
PES - Passage English Sampled - pressure microphone signal
FES - Free-speech English Sampled - pressure microphone signal
SE2 - Sentence(s) Eng. 2nd channel - close-talking microphone signal
PE2 - Passage Eng. 2nd channel - close-talking microphone signal
FE2 - Free-speech Eng. 2nd channel - close-talking microphone signal
SET - Sentence English Text - text used to prompt the subject
PET - Passage English Text - text used to prompt the subject
SEO - Sentence English Orthographic labels - orthographic time-aligned labels
SEA - Sentence Eng. Acoustic labels - acoustic-phonetic time-aligned labels
PEA - Passage Eng. Acoustic labels - acoustic-phonetic time-aligned labels
SEB - Sentence Eng. Broad labels
PEB - Passage Eng. Broad labels
To annotate a speech sample:
1. Read the text file and record the speech.
2. Prepare a transcription for the text.
3. Open the speech using a suitable tool that enables you to view the speech waveform and the associated spectrogram (SFS).
4. With the aid of the spectrogram, locate each character of the transcription file: find the stable parts on the spectrogram, mark them, then write the symbol. (The number of stable parts should equal the number of symbols in the transcription file.)
5. Store the annotations into a suitable file. Name the file exactly as the original speech file, with a new descriptive file extension.
Each row indicates a start time, an end time and a transcription symbol. The time units are 100 (ns). For example, the first row tells us that there is a SIL segment that starts at time 0 and ends at 32829500 × 100 × 10^-9 ≈ 3.28 (s).
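As an illustration of this time format (an addition to the notes; the file name and the exact column layout are assumptions based on the description above), the following Matlab sketch reads such a label file and converts the 100 ns units to seconds:

% Read a SCRIBE-style label file where each row is: start end symbol,
% with times given in units of 100 ns (assumed layout based on the text).
fid = fopen('sample.sea', 'r');                 % hypothetical label file name
C = textscan(fid, '%f %f %s');                  % start, end, symbol columns
fclose(fid);

startSec = C{1} * 100e-9;                       % 100 ns units -> seconds
endSec   = C{2} * 100e-9;
symbols  = C{3};

for i = 1:numel(symbols)
    fprintf('%-6s %8.3f - %8.3f s\n', symbols{i}, startSec(i), endSec(i));
end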
Figure 3: The multi-view screen in SFS used to annotate a speech sample in the SCRIBE corpus.
In this section, the speech stream will be processed as an information source. To start, we first have to declare what information we need to extract from the speech stream; phonemes will be considered the information in this discussion. Let us start at the very beginning of the problem. The story starts in the first moments we face the world, the birthday of the human body. From the very beginning, the perception mechanisms start to collect all the sounds available in the environment. The first step is information packaging: speech samples are packaged into short-duration time frames. Each frame should not exceed 30 (ms); this ensures that the spectral parameters of the speech signal remain stationary within the frame. This stage is visualized in figure 4, where each frame is 10 (ms). Within its 10 (ms), a frame may fall into one of three different states:
1- The frame is a pure part of a single phone.
2- The frame includes a pure part of a single phone plus a successor or predecessor part that is a transition to or from an adjacent phone.
3- The frame covers all phone parts.
Each phone may be considered as three stationary parts: two transition parts to the adjacent phones and one middle stationary part. Frames are colored in figure 4 to keep this three-state model in mind.
Figure 4: The speech stream is converted into a stream of frames; each frame is 10 (ms), so 30 (ms) corresponds to 3 frames.
The first process is to locate the speech activities. It is the first intelligent process performed by the human child's brain: the brain starts to discriminate human speech from the available environmental sounds.
Figure 5: Baby stage - learning directly from the raw utterance stream.
Figure 6: Child stage - models are identified as phones and labeled; transcription is included in the learning process.
Growing up, the human brain starts to perform complex analytical functions. The brain starts to apply grammar and to understand the emotions in the speech signal. Altering the meaning according to the context begins, and huge word networks are constructed in the mind.
Figure 7: Adult stage - corpus, dictionary and language models are included in the training process. The brain starts to make complex processing of the speech stream: it applies grammar and recognizes words and meanings, and the lexical word net starts to be constructed.
Figure 8 introduces a simple mathematical model of the speech stream processing being discussed. First, the speech stream is divided into small time durations. This process is called time framing; it is important in order to have stationary signals for the subsequent mathematical operations, so we can process a stream of frames instead of a stream of samples. A frame may be 30 (ms) or less for the human speech signal, which minimizes the effect of the time-varying dynamics of the speech features. Step 1 is to mark frames as speech / non-speech; the output of this stage should be purely speech frames. Step 2 is to mark the stream of speech frames into phones; this is a coloring phase. Step 3 is to merge the frames of the same phone into one part. This is the information stream. A minimal sketch of the framing and speech/non-speech marking steps is given below.
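The following Matlab sketch (added for illustration; the file name, frame length and threshold are assumptions) splits a recording into 10 ms frames and marks each frame as speech or non-speech using a simple energy threshold:

% Frame a recording into 10 ms frames and mark speech / non-speech frames
% with a simple energy threshold (illustrative values, not from the notes).
[x, fs] = wavread('utterance.wav');     % hypothetical input file
x = x(:, 1);                            % use the first channel
frameLen = round(0.010 * fs);           % 10 ms frame length in samples
nFrames  = floor(numel(x) / frameLen);

energy = zeros(1, nFrames);
for k = 1:nFrames
    frame = x((k-1)*frameLen + 1 : k*frameLen);
    energy(k) = sum(frame .^ 2);        % short-time energy of the frame
end

threshold = 0.1 * mean(energy);         % assumed heuristic threshold
isSpeech  = energy > threshold;         % step 1: speech / non-speech mask
fprintf('%d of %d frames marked as speech\n', sum(isSpeech), nFrames);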
Figure 9 provides a simple pipeline process that describes the basic steps for information extraction model discussed in the previous paragraph.
Figure 9: Pipeline layout - digitized speech (samples/sec) is converted into frames/sec, then phones/sec, then words/sec through a chain of processes (A-D) and buffers (A-D). Each process is responsible for reading from a certain buffer and writing to another buffer; this way the total pipeline operation is constructed. Each buffer is an entity that self-manages its own store.
The wavelet packets technique is utilized to detect phone features. The best-tree-nodes method is used as a phone identifier. This gives a real representation of the phone frequency components over predetermined frequency bands. The node index is used as a feature; its presence indicates that the phone has a frequency component with suitable relative power in that band. This speeds up the process by avoiding feature extraction in bands that carry no information, and it also makes the process independent of the signal level. The following slide indicates how signals with different properties are discriminated using the best-tree algorithm.
An algorithm of tree matching is utilized to give a weight for the match, not just a [matched / not matched] condition. This is important for such stochastic signals. Figure 11 represents the overall system layout.
Figure (training flow chart): the training session takes as inputs (1) the number of training phones N, (2) the number of new CPC units needed, (3) references to CPC units already matched in previous training sessions and (4) the phone sequence list. For each phone from the input buffer, if the phone already exists its CPC is located, otherwise a new CPC is allocated; garbage frames are excluded, and a change in the average count indicates a group change.
References
[1] L.R. Rabiner, R.W. Schafer, "Digital Processing of Speech Signals", Prentice-Hall, ISBN 0-13-213603-1.
[2] Thomas W. Parsons, "Voice and Speech Processing", McGraw-Hill Inc., 1987, pp. 57-98, 136-192, 291-317.
[3] Nemat Sayed Abdel Kader, "Arabic Text-to-Speech Synthesis by Rule", Ph.D. thesis, Cairo University, Faculty of Engineering, Communication Dept., 1992, p. 165.
[4] Wikipedia contributors, "Phonological hierarchy", Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/w/index.php?title=Phonological_hierarchy&oldid=204742230 (accessed October 24, 2008).
[5] Amr M. Gody, "Speech Processing Using Wavelet Based Algorithms", Ph.D. thesis, Cairo University, Faculty of Engineering, Communication Dept., 1999.
[6] Amr M. Gody, "Human Hearing Mechanism Codec (HHMC)", CLE2007, The Sixth Conference on Language Engineering, Ain-Shams University, Cairo, Egypt, December 2007.
[7] Amr M. Gody, "Wavelet Packets Best Tree 4 Points Encoded (BTE) Features", The Eighth Conference on Language Engineering, Ain-Shams University, Cairo, Egypt, December 2008.
[8] Amr M. Gody, "Voiced/Unvoiced and Silent Classification Using HMM Classifier Based on Wavelet Packets BTE Features", The Eighth Conference on Language Engineering, Ain-Shams University, Cairo, Egypt, December 2008.
Speech Analysis
2nd Edition
2008
Summary
In this chapter Linear Prediction Coding (LPC) will be introduced. The importance of this method lies both in its ability to provide extremely accurate estimates of the speech features and in its relative speed of computation. Some of the issues involved in using it in practical speech applications will be discussed.
1. Linear Prediction Coding (LPC)
Linear prediction is one of the most important tools in speech processing. It can be utilized in many ways, but regarding speech processing the most important property is the ability to model the vocal tract. It can be shown that the lattice-structured model of the vocal tract is an all-pole filter, which means a filter that has only poles. One can also think that the lack of zeros restricts the filter to boosting certain frequencies, which in this case are the formant frequencies of the vocal tract. In reality the vocal tract is not composed of lossless uniform tubes, but in practice, modeling the vocal tract with an all-pole filter works fine. Linear prediction (LP) is a useful method for estimating the parameters of this all-pole filter from a recorded speech signal.
This system is excited by an impulse train for voiced speech or a random noise sequence for unvoiced speech. Thus, the parameters of this model are: the voiced/unvoiced classification, the pitch period for voiced speech, the gain parameter G, and the coefficients (a_k) of the digital filter. These parameters vary slowly with time.
The term linear prediction refers to the prediction of the output of a linear system based on its input and previous outputs:

s(n) = \sum_{k=1}^{p} a_k\, s(n-k) + \sum_{m=0}^{L} b_m\, u(n-m)    [1]
For systems where the input is unknown, it is better to express the output in terms of the current input and the previous outputs. This is achieved by putting b_m = 0 for m > 0:

s(n) = \sum_{k=1}^{p} a_k\, s(n-k) + b_0\, u(n)    [2]
Once we know the current input and the previous outputs we can predict the current output. The target is to evaluate the behavior of the unknown system H(z) once we have such information {previous outputs and current input}:

H(z) = \frac{S(z)}{U(z)}    [3]

S(z) = H(z)\, U(z)    [4]
To go to the model that may generate such an output as in equation 2, we consider an all-pole model. Assume the following transfer function:

H(z) = \frac{G}{A(z)}    [5]

while

A(z) = 1 + \sum_{k=1}^{p} a_k\, z^{-k}    [6]

S(z) \left( 1 + \sum_{k=1}^{p} a_k\, z^{-k} \right) = G\, U(z)    [7]

S(z)\, A(z) = G\, U(z)    [8]

s(n) = G\, u(n) - \sum_{k=1}^{p} a_k\, s(n-k)    [9]

The predicted sample is

\hat{s}(n) = -\sum_{k=1}^{p} a_k\, s(n-k)    [10]

and the prediction error is

e(n) = s(n) - \hat{s}(n)    [11]
Let us recall what we have so far:
1- The vocal tract is modeled as an all-pole model. This implies that the output is a function of the previous outputs and the current input.
2- We are targeting to model the vocal tract filter using the observed utterance. The input is unknown.
3- The mission now is to find the optimal parameters {a_k} that ensure the minimum error:

E = \sum_{n} \big( s(n) - \hat{s}(n) \big)^2 = \sum_{n} e(n)^2    [12]
Let us start to process equation 12. As the number of available samples is limited due to the framing process, e(n) has a value only at certain time indexes (instances defined by the index n). Hence the energy of e(n) will not change if we sum over an infinite range:

E = \sum_{n=-\infty}^{\infty} e(n)^2 = \sum_{n=-\infty}^{\infty} \left( s(n) + \sum_{k=1}^{p} a_k\, s(n-k) \right)^2 = \sum_{n=-\infty}^{\infty} \left( \sum_{k=0}^{p} a_k\, s(n-k) \right)^2    [13]
a_0 = 1    [14]
To minimize E, differentiate with respect to each coefficient a_i and set the result to zero:

\frac{\partial E}{\partial a_i} = 0, \qquad i = 1, 2, \ldots, p    [15]

\frac{\partial E}{\partial a_i} = \sum_{n=-\infty}^{\infty} 2 \left( \sum_{k=0}^{p} a_k\, s(n-k) \right) s(n-i)    [16]

\sum_{n=-\infty}^{\infty} \left( \sum_{k=0}^{p} a_k\, s(n-k) \right) s(n-i) = 0, \qquad i = 1, 2, \ldots, p    [17]

Let

\phi(i,k) = \sum_{n=-\infty}^{\infty} s(n-i)\, s(n-k)    [19]

Equation 19 is the autocorrelation of s(n) with a delay (k - i). We can simplify \phi(i,k) further by substituting m = n - i:

\phi(i,k) = \sum_{m=-\infty}^{\infty} s(m)\, s\big(m-(k-i)\big) = R(k-i)    [20]

so that equation 17 becomes

\sum_{k=0}^{p} a_k\, \phi(i,k) = \sum_{k=0}^{p} a_k\, R(k-i) = 0, \qquad i = 1, 2, \ldots, p    [21]

Recalling from equations 19 and 20 that R(k) = R(-k), and from equation 14 that a_0 = 1, the equation for i = 1 may be rearranged as follows:

a_0 R(1) + a_1 R(0) + a_2 R(1) + \cdots + a_p R(p-1) = 0 \ \Rightarrow\ \sum_{k=1}^{p} a_k\, R(k-1) = -R(1)    [25]

Following the same direction, we can write the other equations (i = 1, \ldots, p) as:

\sum_{k=1}^{p} a_k\, R(k-1) = -R(1)
\sum_{k=1}^{p} a_k\, R(k-2) = -R(2)
\vdots
\sum_{k=1}^{p} a_k\, R(k-p) = -R(p)    [26]

In matrix form:

\begin{bmatrix} R(0) & R(1) & \cdots & R(p-1) \\ R(1) & R(0) & \cdots & R(p-2) \\ \vdots & & \ddots & \vdots \\ R(p-1) & R(p-2) & \cdots & R(0) \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix} = - \begin{bmatrix} R(1) \\ R(2) \\ \vdots \\ R(p) \end{bmatrix}    [27]
Equation 27 is a symmetric (Toeplitz) matrix of autocorrelation coefficients. Let us recall what we have until now:
1- The linear prediction coefficients may be calculated using equation 27. It describes a linear system that we can utilize to regenerate the output with minimum error. This model does not assume the real input that generated the original signal; rather, it assumes an impulse input.
2- a_0 = 1.
3- E(z) = G\,U(z).
From this point, let us figure out the possible ways to solve equation 27.
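One practical way to solve equation 27 (added here as an illustration, not part of the original notes; it assumes the Signal Processing Toolbox) is the Levinson-Durbin recursion, which exploits the Toeplitz structure; Matlab's levinson function implements it, and lpc wraps the whole procedure:

% Solve the LPC normal equations (eq. 27) for one speech frame.
% Assumed inputs: 'frame' is a column of speech samples, p is the LPC order.
p = 12;
frame = randn(640, 1);                  % placeholder frame for the sketch

r = xcorr(frame, p, 'biased');          % autocorrelation lags -p .. p
r = r(p+1:end);                         % keep lags 0 .. p: R(0) .. R(p)

A  = levinson(r, p);                    % Levinson-Durbin: A = [1 a1 ... ap]
A2 = lpc(frame, p);                     % same result via the lpc wrapper
disp(max(abs(A - A2)));                 % essentially zero: both solve eq. 27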
%%[A] = LPCDemo(file,N,T,p)
% This function demonstrates LPC by plotting the frame spectrum, the estimated
% frame spectrum, the frame waveform and the estimated frame waveform. The frame
% is extracted from the speech samples in the WAV file 'file'. The function
% returns the LPC parameter vector of the requested frame.
% file : WAV file
% N    : Frame number. The first frame is frame number 0. Default = 1
% T    : Frame size in ms. Default = 10 (ms)
% p    : LPC order. Default = 12
%%-----------------------------------------------------------------------
function [A] = LPCDemo(file,N,T,p)
nbIn = nargin;
if nbIn < 1
    error('Not enough input arguments.');
elseif nbIn == 1
    N = 1; T = 10; p = 12;
elseif nbIn == 2
    T = 10; p = 12;
elseif nbIn == 3
    p = 12;
end
[y fs] = wavread(file);
frame_size = T * 1e-3 * fs;
Start = N * frame_size;
End = Start + frame_size - 1;
Frame = y(Start:End);
temp = abs(fft(Frame));
Spectral_Parameters = temp(1:int32(frame_size/2));
dltaf = fs / frame_size;
dltat = 1/fs;
t = Start*dltat:dltat:End*dltat;
f = 0:dltaf:fs/2-dltaf;
A = lpc(Frame,p);
% The next lines reconstruct the estimated frame; this step is missing in the
% source text and is filled in here as the impulse response of the all-pole
% filter 1/A(z) excited by a single impulse.
Estimated_frame = filter(1, A, [1; zeros(int32(frame_size)-1, 1)]);
temp2 = abs(fft(Estimated_frame));
Estimated_Spectral_Parameters = temp2(1:int32(frame_size/2));

subplot(2,2,1);plot(t,Estimated_frame);xlabel('Time(sec)');ylabel('Amplitude');title('Estimated Frame for single impulse');
subplot(2,2,2);plot(t,Frame);xlabel('Time(sec)');ylabel('Amplitude');title('Frame');
subplot(2,2,4);plot(f,Spectral_Parameters);xlabel('Frequency(Hz)');ylabel('Spectral value');title('Spectrum of Frame');
subplot(2,2,3);plot(f,Estimated_Spectral_Parameters);xlabel('Frequency(Hz)');ylabel('Spectral value');title('Spectrum of Estimated Frame');
end
Figure 3 indicates that the speech waveform changes shape over time. As shown in the figure, each step on the time axis is 5 (ms). The LPC parameters provide filter parameters for a certain time duration. The duration should be selected such that the speech waveform has stationary properties; typical values for the analysis period are 10, 15, 20, 25 and 30 (ms). Let us consider a simple example. Assume a speech signal sampled at 32000 (Hz) and an analysis period of 20 (ms). The frame length in samples will be 32000 × 20 × 10^-3 = 640 (samples).
Assume that the LPC model considered in this chapter is used to model the speech production process, and that the LPC order is 12. This means that 12 parameters will be used to generate an equivalent speech waveform in the designated analysis period, here 20 (ms). The model in figure 1 will be excited by a train of impulses with a period equal to the fundamental period (pitch) in the case of a voiced sound. The above example indicates that a frame of 640 samples is replaced by 12 parameters plus 1 extra parameter that indicates whether this frame is a voiced or an unvoiced sound. This implies a 640:13 compression ratio.
The process is illustrated in figure 4. The speech waveform is sliced into frames; each frame is 20 (ms), i.e. 640 samples. Each frame is applied to the LPC process, which extracts the LPC parameters, here 12 parameters. The pitch information is also packaged into the parameter frame. So each frame of time samples is converted into 12 LPC parameters plus 2 parameters that include the pitch period and the type of that frame (voiced/unvoiced). The sequence of parameter frames is applied to the model, and the predicted speech waveform is then constructed. The MUX block is a multiplexer: it multiplexes the two inputs using the voiced/unvoiced selector that is included in the parameter frame.
References
[1] L.R. Rabiner, R.W. Schafer, "Digital Processing of Speech Signals", Prentice-Hall, ISBN 0-13-213603-1.
[2] Thomas W. Parsons, "Voice and Speech Processing", McGraw-Hill Inc., 1987.
Summary
In this chapter some of the popular speech signal features will be explained. Speech features are the basis of speech recognition; they are used to discriminate different speech sounds. There are many features that may be utilized in automatic speech recognition systems. In this chapter we will pass through pitch, formants, cepstrum, energy, zero crossing and Mel cepstrum. The algorithms to evaluate the mentioned features will be explained and provided in the C# or Matlab scripting programming languages.
Objectives
Understanding pitch.
Understanding reflection coefficients.
Understanding formants.
Understanding cepstrum and Mel cepstrum.
Understanding energy and zero-crossing rate of different speech sounds.
Practice using Matlab and C#.
1. Energy
Energy is a property that can be used to discriminate voiced speech from unvoiced speech. Let us consider the short-time energy; "short time" means the energy of the analysis period:

E = \sum_{n=-\infty}^{\infty} s(n)^2    [1]

Equation 1 evaluates the energy of the signal s(n). We are dealing here with the short-time energy of a single frame.
As shown in figure 1, the impulse train has a Fourier transform that is also an impulse train, with a frequency spacing of \omega_0 = 2\pi/T. This can be utilized to understand all the events in the frequency domain. Let us start by analyzing the windowing effect. It is the process of cutting a certain period off the signal in order to extract features; this process is called the framing stage. The cut signal is called the analysis period. Figure 2 indicates the model of the framing process.
The speech frame S_f(n) is the result of applying a frame of the impulse train U(n) to the filter H(z). H(z) is a model of the glottal pulse shape filter multiplied by the vocal tract filter. The frame of the impulse train is the result of multiplying an infinite impulse train P(n) by the window W(n):

U(n) = P(n)\, W(n)

The multiplication in the time domain is a convolution in the frequency domain between W(z) and P(z). Figure 3 indicates the frequency response of a rectangular window of length 30 samples.
As shown in figure 3, the window in the frequency domain is a low-pass filter with a cut-off frequency of 2\pi/30 (rad/sample). The convolution between P(z) and W(z) will produce a train of W(z) replicas in the frequency domain.
As shown in figure 4, the frequency response of U(n) consists of impulses shaped by the filter of figure 3. Now U(n) is the excitation of H(z). The multiplication between the shaped impulses and H(z) will be much more sensitive to the frequency contents than an impulse train of almost zero width in the frequency domain, as shown in figure 5. The pulse width increases as the window length decreases:

\Delta\omega = 2 \times \frac{2\pi}{N}

When the pulse width increases, more frequency content is included. This is certainly a significant factor if it is needed to catch any phenomenon that fluctuates in the time domain or has many adjacent frequency components in the frequency domain. Also, if the window is very narrow, for example on the order of the impulse period, the shaped impulses will overlap in the frequency domain and may cause noisy information. So choosing a suitable window length helps to detect more frequency components from E(z), which in turn helps to detect the fast fluctuations in E(z).
Figure 5: U(n)·H(n) in the frequency domain. (a) U(n) is the multiplication between the window function and the glottal impulses. (b) No window function is applied to the glottal impulses.
E_n = \sum_{m=-\infty}^{\infty} \big( s(m)\, w(n-m) \big)^2 = \sum_{m=-\infty}^{\infty} s(m)^2\, h(n-m)    [2]

where h(n) = w(n)^2. For a rectangular window such as the one shown in figure 2, the window and its frequency response are given by

w(n) = \begin{cases} 1 & 0 \le n \le N-1 \\ 0 & \text{otherwise} \end{cases}, \qquad W(e^{j\omega}) = \frac{\sin(\omega N / 2)}{\sin(\omega / 2)}\, e^{-j\omega (N-1)/2}    [3]
Figure 6: S(n) is convolved in the time domain with the window filter to obtain the frame samples.
Figure 7: Rectangular window and its impulse response. Window width 220.
Figure 8: Rectangular window and its impulse response. Window width 40.
Figure 8 explains the frequency response of a rectangular window of length 220 (samples); the cut-off frequency of the filter is 0.028 (rad/sample). In figure 9, where the window width is 40 (samples), the cut-off frequency is 0.157 (rad/sample). The following Matlab script is used to draw figures 8 and 9.
%% function [omega,ys] = WFD(N)
% WFD is for Window Filter Demo. This function calculates the frequency
% response of the rectangular window. It returns the digital frequency
% omega and the associated absolute spectrum for the window of width N.
%%----------------------------------------------------------------------
% Amr M. Gody, Fall 2008
%%
function [omega,ys] = WFD(N)
t = -N * 10:N*10;
M = size(t,2)-1;
x = rectpuls(t,N);
y = fft(x);
ys = abs(y);
dltaOmega = 2*pi/M;
L = 0:M;
omega = dltaOmega .* L;
tt = 1;
yy = 1.4;
As shown in figure 10, energy is a function of the window width. When the window is too short, the cut-off frequency is somewhat large, so the fluctuation in the energy is much more noticeable. The energy function may be used to discriminate voiced sounds from unvoiced sounds: voiced sounds have noticeable energy values with respect to unvoiced sounds.
Figure 9: Energy as a function of window width for a certain speech sample. The speech utterance is "What she said".
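The following Matlab sketch (an added illustration; the file name and window lengths are assumptions) computes the short-time energy contour of an utterance with a short and a long rectangular window, showing that the shorter window tracks faster fluctuations:

% Short-time energy with two rectangular window widths (illustrative values).
[x, fs] = wavread('what_she_said.wav');   % hypothetical utterance file
x = x(:,1);

for winMs = [5 50]                         % short (5 ms) vs long (50 ms) window
    N = round(winMs * 1e-3 * fs);          % window length in samples
    h = ones(N, 1);                        % rectangular window, h(n) = w(n)^2
    E = conv(x.^2, h, 'same');             % E(n) = sum of s(m)^2 over the window
    subplot(2, 1, find(winMs == [5 50]));
    plot((0:numel(E)-1)/fs, E);
    title(sprintf('Short-time energy, %d ms rectangular window', winMs));
    xlabel('Time (sec)'); ylabel('Energy');
end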
1- The rectangular window affects the frequency contents of the analysis frame. The window acts as a low-pass filter that shapes the impulse train of the vocal cords. The cut-off frequency is inversely proportional to the window width: approximately 2\pi/N (rad/sample).
2- The window length is a key factor in detecting the fluctuations of the energy function. The fluctuations may be modeled as frequency components spread over a certain frequency band. The window length shapes the impulse train in the frequency domain; it causes the output signal to be much denser in frequency components than the zero-width impulse train.
3- The energy function is used to discriminate voiced sounds from unvoiced sounds.
2. Zero Crossing
Speech signals are broadband signals, and the interpretation of the average zero-crossing rate is therefore much less precise. However, rough estimates of spectral properties can be obtained using a representation based on the short-time average zero-crossing rate. Before discussing the interpretation of the zero-crossing rate for speech, let us first define and discuss the required computations:

Z_n = \sum_{m=-\infty}^{\infty} \big|\, \mathrm{sgn}(s(m)) - \mathrm{sgn}(s(m-1)) \,\big|\; w(n-m)    [4]

where

\mathrm{sgn}(s(n)) = \begin{cases} 1 & s(n) \ge 0 \\ -1 & s(n) < 0 \end{cases}    [5]

w(n) = \begin{cases} \dfrac{1}{2N} & 0 \le n \le N-1 \\ 0 & \text{otherwise} \end{cases}    [6]
If a zero crossing occurred, the term in equation 4 evaluates to 2. To get the total number of zero crossings over the N samples we should first divide by 2 to obtain the right number of zero crossings, and then divide by N to get the average. Now let us see how the short-time average zero-crossing rate applies to speech signals. The model for speech production suggests that the energy of voiced speech is concentrated below about 3 kHz because of the spectrum fall-off introduced by the glottal wave, whereas for unvoiced speech most of the energy is found at higher frequencies. Since high frequencies imply high zero-crossing rates, and low frequencies imply low zero-crossing rates, there is a strong correlation between the zero-crossing rate and the energy distribution with frequency. A reasonable generalization is that if the zero-crossing rate is high, the speech signal is unvoiced, while if the zero-crossing rate is low, the speech signal is voiced. This however is a very imprecise statement because we have not said what is high and what is low, and of course it really is not possible to be precise. Figure 7 shows a histogram of average zero-crossing rates (averaged over 10 (msec)) for both voiced and unvoiced speech. Note that a Gaussian curve provides a reasonably good fit to each distribution. The mean short-time average zero-crossing rate is 49 per 10 (msec) for unvoiced and 14 per 10 (msec) for voiced. Clearly the two distributions overlap, so an unequivocal voiced/unvoiced decision is not possible based on the short-time average zero-crossing rate alone. Nevertheless, such a representation is quite useful in making this distinction.
Figure 11: Histogram of the zero-crossing rate averaged over 10 ms for voiced and unvoiced samples. The solid curves are Gaussian distribution fits.
Clearly, the zero-crossing rate is strongly affected by dc offset in the analog-to-digital converter, 60 Hz hum in the signal, and any noise that may be present in the digitizing system. Therefore, extreme care must be taken in the analog processing prior to sampling to minimize these effects. For example, it is often preferable to use a bandpass filter, rather than a lowpass filter, as the anti-aliasing filter in order to eliminate the dc and 60 Hz components in the signal.
3. Reflection coefficients
A widely used model for speech production is based upon the assumption that the vocal tract can be represented as a concatenation of lossless acoustic tubes. Figure 9 lists the main organs that are responsible for speech production. Starting from this model, we can see that the speech waveform is a flow of air inside a series of tubes. The tubes are not uniform; the cross-section varies along the tube path to articulate the different sounds. These changing properties cause turbulence in the air flow.
To study the air flow inside those tubes, we should consider the velocity of the air, the pressure along the tube path, and the properties of the tubes. Let us consider the following parameters: the sound pressure p(x,t), the volume velocity u(x,t), the cross-sectional area of the tube A, the air density ρ, and the speed of sound c.
Considering the hypothetical tube of constant cross-section indicated in figure 10, we have the following relations that govern the pressure and the velocity:
$$-\frac{\partial p}{\partial x} = \frac{\rho}{A}\,\frac{\partial u}{\partial t}, \qquad -\frac{\partial u}{\partial x} = \frac{A}{\rho c^{2}}\,\frac{\partial p}{\partial t} \qquad [7]$$
Now let us extend the discussion to cover the whole vocal tract. Referring to figure 10, the whole vocal tract, excluding the nasal cavity, is modeled as a long straight tube with a continuously changing cross-sectional area.
Let us solve the differential equations in 7. Equation 7 is derived for a uniform cross-section tube such as the one indicated in figure 10. To overcome this limitation, let us segment the tube of figure 11 into segments, each of uniform cross-section, as shown in figure 12.
The solution will be derived for a single segment. Assume a time-harmonic input as in equation 8. This assumption is quite practical, since any signal may be considered a combination of sines and cosines; moreover the system is linear, so the output is the sum of the outputs produced by the individual input harmonics. The common factor $e^{j\Omega t}$ will then appear in all terms of the equations.
$$p(x,t) = P(x)\,e^{j\Omega t}, \qquad u(x,t) = U(x)\,e^{j\Omega t} \qquad [8]$$
Applying $\dfrac{d^{2}}{dx^{2}}$ to equation 10 (the frequency-domain form of equation 7 obtained by substituting equation 8: $-\dfrac{dP}{dx}=j\Omega\dfrac{\rho}{A}U$ and $-\dfrac{dU}{dx}=j\Omega\dfrac{A}{\rho c^{2}}P$) gives: [11]

$$\frac{d^{2}P(x)}{dx^{2}} = -\frac{\Omega^{2}}{c^{2}}P(x), \qquad \frac{d^{2}U(x)}{dx^{2}} = -\frac{\Omega^{2}}{c^{2}}U(x) \qquad [12],[13]$$

The general solutions are sums of forward- and backward-travelling waves, with $\beta = \Omega/c$:

$$P(x) = P^{+}e^{-j\beta x} + P^{-}e^{+j\beta x}, \qquad U(x) = U^{+}e^{-j\beta x} + U^{-}e^{+j\beta x} \qquad [14],[15]$$

so that

$$u(x,t) = \left(U^{+}e^{-j\beta x} + U^{-}e^{+j\beta x}\right)e^{j\Omega t}, \qquad p(x,t) = \left(P^{+}e^{-j\beta x} + P^{-}e^{+j\beta x}\right)e^{j\Omega t} \qquad [16]$$

The + and − signs are for the incident and reflected waves respectively; the +x direction is taken as the incident direction. Substituting the solutions back into equation 10 relates the pressure amplitudes to the velocity amplitudes:

$$P^{+} = R\,U^{+}, \qquad P^{-} = -R\,U^{-} \qquad [17],[18],[19]$$

R is the speech wave impedance inside the lossless transmission-line medium (the vocal tract model):

$$R = \frac{\rho c}{A} \qquad [20],[21]$$

Recalling that the speech wave velocity has a direction, equation 16 may be written as:

$$u(x,t) = \left(U^{+}e^{-j\beta x} - U^{-}e^{+j\beta x}\right)e^{j\Omega t}, \qquad p(x,t) = R\left(U^{+}e^{-j\beta x} + U^{-}e^{+j\beta x}\right)e^{j\Omega t} \qquad [22],[23]$$

Equations 22 and 23 will be denoted with the index k of the corresponding segment of figure 12:

$$u_k(x,t) = \left(U_k^{+}e^{-j\beta x} - U_k^{-}e^{+j\beta x}\right)e^{j\Omega t}, \qquad p_k(x,t) = R_k\left(U_k^{+}e^{-j\beta x} + U_k^{-}e^{+j\beta x}\right)e^{j\Omega t} \qquad [24],[25]$$
The propagation within a segment of length $l_k$ causes a shift in the velocity wave, as indicated in equation 27, equal to $e^{-j\beta l_k}$: each travelling-wave component at the output of the segment ($x = l_k$) equals its value at the input ($x = 0$) multiplied by this phase factor. For a complete picture refer to figure 13.

$$U_k^{+}(l_k) = U_k^{+}(0)\,e^{-j\beta l_k}, \qquad U_k^{-}(l_k) = U_k^{-}(0)\,e^{+j\beta l_k} \qquad [26],[27],[28]$$

The reflection coefficient at the interface between segments k and k+1 is obtained from the following boundary conditions between any two successive segments, which state that the pressure and the volume velocity are continuous at the junction:

$$p_k(l_k,t) = p_{k+1}(0,t), \qquad u_k(l_k,t) = u_{k+1}(0,t) \qquad [29]$$

The reflection and transmission coefficients for the wave propagating from medium 1 to medium 2 are found by solving equations 24 and 25 for $U_k^{+}$ and $U_k^{-}$ under these boundary conditions.
Figure 13 and the differential equations 10 lead to the transmission-line analogy: each segment may be modeled as a lossless transmission line, in which P and U play the roles of V and I, and the characteristic impedance of the line is R.
Figure 18: Analogous circuit for the vocal tract lossless tube model. A source Vg with internal impedance Rg drives cascaded line sections of impedances Z1 ... Zk, Zk+1 and lengths lk, terminated by Zlips; each elemental length dx of a line contains a series inductance L and a shunt capacitance C.
$$-\frac{\partial v}{\partial x} = L\,\frac{\partial i}{\partial t}, \qquad -\frac{\partial i}{\partial x} = C\,\frac{\partial v}{\partial t} \qquad [30],[31]$$

The acoustic analogy gives the per-unit-length line parameters:

$$L = \frac{\rho}{A}, \qquad C = \frac{A}{\rho c^{2}} \qquad [32],[33]$$

$$Z_{0} = \sqrt{\frac{L}{C}} = \frac{\rho c}{A} \qquad [34]$$
Equation 34 matches equation 21, as expected. The above discussion leads us to use transmission-line calculations. Each segment in figure 13 is a loaded segment, as shown in figure 14. The wave turns into a standing wave due to the reflection at the load points. The reflection and transmission coefficients are given by equations 35.
$$\rho_{1,2} = \frac{Z_{2} - Z_{1}}{Z_{2} + Z_{1}}, \qquad \tau_{1,2} = \frac{2Z_{2}}{Z_{2} + Z_{1}} \qquad [35]$$
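As a small numerical illustration of equation 35 (my own sketch; the area values, air density and speed of sound are rough assumptions, and the segment impedances are taken as R_k = ρc/A_k from equation 21):

% Sketch: reflection/transmission coefficients along a lossless tube model.
% The area function A (cm^2 per segment) is a made-up example.
rho = 1.2e-3;                 % air density (g/cm^3), approximate
c   = 35000;                  % speed of sound (cm/s), approximate
A   = [2.6 1.8 1.0 1.6 3.2];  % hypothetical cross-sectional areas
R   = rho * c ./ A;           % characteristic impedance of each segment (eq 21)
% Coefficients at the junction between segment k and segment k+1 (eq 35 form)
rho_k = (R(2:end) - R(1:end-1)) ./ (R(2:end) + R(1:end-1));
tau_k = 1 + rho_k;            % transmission coefficient, tau = 1 + rho
disp([rho_k; tau_k]);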
4. Pitch
The opening and closing of the vocal cords break the air stream up into pulses, as shown in figure 1. The repetition rate of the pulses is termed the pitch.
5. Cepstrum
The output of the characteristic system in figure 18, $\hat{x}(n)$, is called the "complex cepstrum". The term "complex cepstrum" implies that the complex logarithm is involved. Given that it is possible to compute the complex logarithm so as to satisfy equation 36, the output of the characteristic system for convolution is the inverse transform of the complex logarithm of the Fourier transform of the input, i.e.:
Figure 22: Complex cepstrum
$$\hat{x}(n) = \frac{1}{2\pi}\int_{-\pi}^{\pi} \log\!\left[X(e^{j\omega})\right] e^{j\omega n}\, d\omega \qquad [37]$$
Equation 37 reveals a very important aspect of the cepstrum. The Fourier transform of a certain process may be considered the result of cascaded filters, as shown in equation 36. Each filter in the cascade may hold a certain piece of information; these are hypothetical filters. We can assume that the total information is a compound of smaller information pieces. The log operator breaks the compound information down into superimposed pieces, and the final stage of the process is the inverse transform. This makes the cepstrum parameters the output of a linear system driven by those superimposed pieces of information. Hence the cepstrum domain acts as a parallel time domain that gives a new perspective on the original speech signal. For example, the signal in the time domain is a sum of sines and cosines, and we can separate those components in the frequency domain using suitable filters. Similarly, the signal in the parallel time domain (the quefrency domain) results from a sum of signals in the log-frequency domain, and those signals may be separated using suitable filters (called lifters) in the parallel time domain. Much information can be emphasized from this new perspective. The keywords in this new parallel domain are slightly altered to remind us of the way it was created; figure 19 provides the keywords of the newly established domain. This is homomorphic filtering.
Homomorphic filtering is a generalized technique for signal and image processing, involving a nonlinear mapping to a different domain in which linear filtering techniques are applied, followed by mapping back to the original domain. Homomorphic filtering is used in the log-spectral domain to separate filter effects from excitation effects, for example in the computation of the cepstrum as a sound representation; enhancements in the log-spectral domain can improve sound intelligibility.
7.1. Speech signal in the complex cepstrum domain
Recall that the model for speech production consists essentially of a slowly time-varying linear system excited by either a quasi-periodic impulse train or by random noise. Thus it is appropriate to think of a short segment of voiced speech as having been generated by exciting a linear time-invariant system with a periodic impulse train. Similarly, a short segment of unvoiced speech can be thought of as resulting from the excitation of a linear time-invariant system by random noise. That is, a short segment of voiced speech can be thought of as a segment from the waveform of equation 38, which is the result of convolving a periodic impulse train $p(n)$ with the glottal pulse shape filter $g(n)$, the vocal tract filter $v(n)$ and the radiation effects filter $r(n)$:

$$s(n) = p(n) * g(n) * v(n) * r(n) \qquad [38]$$

For unvoiced speech the waveform segment is given by equation 39 below, where $u(n)$ is the random noise excitation:

$$s(n) = u(n) * v(n) * r(n) \qquad [39]$$
From equations 38 and 39, and from the definition of the complex cepstrum, we find that the cepstrum makes it possible to isolate the excitation from the vocal tract model. This is very important in the automatic speech recognition process: the excitation is speaker dependent, so by isolating that effect it is expected that the recognition efficiency will be highly enhanced.
A segment of voiced speech is shown in figure 20. The Hamming window is applied to minimize the effect of sharp changes at the segment ends. The effect of the excitation pulse train clearly appears as high peaks in the cepstrum, while the effect of the filters appears in the shape of the pulses.
%%Cepstrum_Demo(file,start_time,duration)
% This function demonstrates cepstrum analysis. It applies complex cepstrum
% analysis to a short segment of duration in (ms). To use it, take a wav
% file and choose a period of voiced speech and periods of unvoiced speech.
% Save the figures and make a conclusion. Try to focus on the pitch period.
% file       : WAV file
% start_time : starting time in seconds of the segment
% duration   : duration of the segment in milliseconds
%%-----------------------------------------------------------------------
% Amr M. Gody, Fall 2008
%%
function Cepstrum_Demo(file,start_time,duration)
[y,fs] = wavread(file);
LastSample = size(y,1);
t = linspace(0,LastSample/fs,size(y,1));
end_time = start_time + duration * 1e-3;
start_sample = max(1, round(start_time * fs));
end_sample = round(end_time * fs);
S = y(start_sample:end_sample);
t_s = t(start_sample:end_sample);
t_s = t_s - t_s(1);                 % shift the time axis to 0
S_H = S .* hamming(size(S,1));      % apply the Hamming window
C = cceps(S_H);                     % complex cepstrum
subplot(3,2,1); plot(t,y);    xlabel('Time (Sec)'); ylabel('Amplitude'); title('Original wav file');
subplot(3,2,2); plot(t_s,S);  xlabel('Time (Sec)'); ylabel('Amplitude'); title('WAV Segment');
subplot(3,2,3); plot(t_s,S_H); xlabel('Time (Sec)'); ylabel('Amplitude'); title('WAV Segment after Hamming');
subplot(3,2,4); plot(t_s,C);  xlabel('Time (Sec)'); ylabel('Amplitude'); title('Complex Cepstrum of Hamming segment');
end
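Building on Cepstrum_Demo above, the following fragment (my own sketch, intended to be appended inside the function after C is computed; the 2-20 ms search range is an assumption for adult speech) shows how the pitch period can be read off as the location of the dominant cepstral peak:

% Sketch: estimate the pitch period from the cepstrum C of the windowed segment.
qmin = round(0.002 * fs);              % lower bound of the pitch-period search (samples)
qmax = round(0.020 * fs);              % upper bound of the search (samples)
[pk, rel] = max(C(qmin:qmax));         % dominant peak in the allowed quefrency range
pitch_period = qmin + rel - 1;         % pitch period in samples
f0 = fs / pitch_period;                % fundamental frequency in Hz
fprintf('Pitch period = %d samples, F0 = %.1f Hz\n', pitch_period, f0);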
6. mel Cepstrum
It is the same as the cepstrum with only one difference: the log is applied to the mel-scale spectrum instead of the linear spectrum.
It is believed that the human hearing system is the best recognition system, so by trying to simulate the human hearing system, good practical results may be achieved. The speech signal is therefore processed in such a manner that low-frequency components have more weight than high-frequency components [3]. The human ear responds to speech as indicated by the mel scale in figure 23. This curve explains a very important fact: human ears cannot differentiate well between different sounds at high frequencies, while they can at low frequencies. The mel scale is a scale that reflects what humans can hear. As shown in figure 23, a change in frequency from 4000 Hz to 8000 Hz makes only about 1000 mel of change on the mel scale. This is not the case in the low-frequency range from 0 Hz to 1000 Hz, where a 1000 Hz change is equivalent to about a 1000 mel change. This shows that human hearing is very sensitive to frequency variation in the low range, while this is not the case in the high range.
Figure 27: Mel scale curve that models the human hearing response to different frequencies[4]
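A commonly used analytic approximation of the mel curve (one of several published variants, so the constants 2595 and 700 are an assumption and may not match the exact curve of figure 23) is m = 2595·log10(1 + f/700). The following sketch reproduces the compression effect described above:

% Sketch: mel-scale mapping using the common 2595*log10(1 + f/700) formula.
hz2mel = @(f) 2595 * log10(1 + f/700);
f = 0:10:8000;                       % frequency axis in Hz
plot(f, hz2mel(f)); xlabel('Frequency (Hz)'); ylabel('Mel');
% The compression at high frequencies mentioned in the text:
fprintf('0-1000 Hz spans %.0f mel\n', hz2mel(1000) - hz2mel(0));
fprintf('4000-8000 Hz spans %.0f mel\n', hz2mel(8000) - hz2mel(4000));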
7. Exercises
Exercise 1
Use the Matlab functions listed at the end of this chapter to do the following exercises. Find the frequency response of an impulse train with a repetition frequency of 120 Hz. Assume the sampling rate is 10 kHz.
Answer: write the following command lines:
g = 1;
Tp = 1/120;
Ts = 1/10000;
P = Tp/Ts;
[f,h] = cr_fr(g,P);
Exercise 2
Now apply a rectangular window of length 20 ms.
Answer:
Observations:
1- The distance between any two successive impulses in the frequency domain is 2π/83.3 = 0.0754 (rad/sample), which evaluates to 120 (Hz) when scaled by the 10000 (Hz) sampling rate.
2- Impulses in the frequency domain are shaped by the window frequency response.
Exercise 3: Now consider the following glottal pulse shape and find its frequency response.
g = [[0:0.1:5]'; [5:-0.1:0]'];
[f,h] = cr_fr(g);
Exercise 4: Now let us find the frequency response of the glottal pulse of exercise 3 repeated as an impulse train with a period of 150 samples.
[f,h] = cr_fr(g,150);
Observations:
1- The frequency response is repeated with a period corresponding to the impulse train period.
2- The frequency response is a discrete signal.
3- The frequency response of exercise 3 is the envelope of the frequency response in this exercise.
4- The distance between any two successive impulses in the frequency domain is 2π/150 = 0.0419 (rad/sample).
Exercise 5: Now apply a window of length 600 samples to the signal in exercise 4.
[f,h] = cr_fr(g,150,600);
Observations:
1- The window of length 600 causes a shaping of the impulses in the frequency domain.
2- The frequency response of exercise 3 is the envelope of this frequency response.
3- The distance between any two successive peaks in the frequency domain is 2π/150 = 0.0419 (rad/sample).
4- The bandwidth of each peak is 2π/600 = 0.0105 (rad/sample).
5- The signal power is distributed over more frequency components; compare the frequency response of exercise 4 to this frequency response.
Exercise 6: Now evaluate the vocal tract model for a certain voiced speech segment. Use the function:
v = cr_vtf_1(a2);
where a2 is the voiced speech frame.
Exercise 7: Using the filter obtained in exercise 6 and a pitch period of 286 samples, obtain the corresponding response to an impulse train whose period equals the pitch period.
[f,h] = cr_fr(v,286);
Exercise 8: Using the following chart, which provides the spectrum of the vocal tract filter of the signal in exercise 6, locate the formants and the pitch on the chart. Pitch frequency = 2π/286 = 0.0220 (rad/sample).
Observations:
1- The frequency response of the vocal tract filter is sampled at the pitch frequency.
2- The formant locations have the maximum power.
Exercise 9: Include the window effect in the previous example by using a window of length equal to 4 pitch periods, Wl = 4 × 286 = 1144 samples.
[f2 h2] = cr_fr(v,286);
[f1 h1] = cr_fr(v,286,1144);
[f h] = cr_fr(v);
h = cr_TR(h2,h);
h1 = cr_TR(h2,h1);
subplot(3,1,1);plot(f2(1:100000),h(1:100000));
subplot(3,1,2);plot(f2(1:100000),h2(1:100000));
subplot(3,1,3);plot(f2(1:100000),h1(1:100000));
Observations:
1- The power is distributed over more frequency components. This explains the smaller spectrum values with respect to the figure in the middle.
2- The bandwidth of each impulse is 2π/1144 = 0.0055 (rad/sample).
7.1. Matlab functions
function CR_fr
%% function [o,h]=CR_fr(g,p,Wn,Wt) % FR evaluates Frequency Response of digital function. %The function is sample base. The frequency response is in digital frequency domain. It do not include %the sampling period. If you want to evaluate the real frequency you should multiply the frequency axis by %sampling frequency. The function plot the frequency domain as well as the time domain. % G: The Discrete time signal for one period. % P: The period in samples. Default is 0 means that it is not a periodic signal. % O: The digital frequency values associated to the frequency response values (H). % H: the frequency response. % Wn : Window length in samples. Default = Signal length % Wt : Default is rectangular window. %----------------------------------------------------------------------% Amr M. Gody, Fall 2008 %% function [o,h]=CR_fr(g,p,Wn,Wt) nbIn = nargin; if nbIn < 1 , error('Not enough input arguments.'); elseif nbIn==1 , %% Period = 0 , Rect window , window length = Signal length g = ToColumn(g); Ng = size(g,1); p = 0; Wn = Ng * 10; Wt = 1; N = Ng * 10; x = [g;zeros(N-Ng,1)];
elseif nbIn==2 , %% P ~= 0 , Rect window , window length = Signal length g = ToColumn(g); Ng = size(g,1); if p > Ng, Xg = [g;zeros(p-Ng,1)]; else Xg = g(1:p); end x = Xg; for k =1 : 100, x = [x;Xg]; end N = size(x,1);
elseif nbIn==3, %% P~=0 , Rect window , window length ~= Signal length g = ToColumn(g); Ng = size(g,1); if p > Ng, Xg = [g;zeros(p-Ng,1)]; else Xg = g(1:p); end x = Xg; for k =1 : 100, x = [x;Xg]; end x = x(1:Wn); N = Wn; elseif nbIn==4, %% P~=0 , Hamming window , window length ~= Signal length g = ToColumn(g); Ng = size(g,1); if p > Ng, Xg = [g;zeros(p-Ng,1)]; else Xg = g(1:p); end x = Xg; for k =1 : 100, x = [x;Xg]; end x = x(1:Wn) .* hamming(Wn); N = Wn; end %% N = N * 10; % this is to increase the calculation resolution. This step do not affect the results. h = abs(fft(x,N)); dlta = 2*pi/N; o = 0:dlta:(N-1) * dlta; n = [0:max(size(x)-1)]; if(p ~=0)' n = [0:max(size(x)-1)]; subplot(2,1,1);bar(n(1:4*p),x(1:4*p),0.001);Xlabel('Time (Sample)'); else subplot(2,1,1);bar(n,x,0.001);Xlabel('Time (Sample)'); end subplot(2,1,2);plot(o,h);xlabel('Frequency (Rad/sample)'); end
function cr_VTF_1
%[v] = cr_VTF_1(frame)
%% VTF = Vocal Tract Filter estimation. The function calculates the
%% frequency response of the vocal tract for a certain frame and returns
%% the vocal tract filter in the time domain.
% Frame : waveform frame
%%----------------------------------------------------------------------
% Amr M. Gody, Fall 2008
%%
function [v] = cr_VTF_1(Frame)
nbIn = nargin;
if nbIn < 1 ,
    error('Not enough input arguments.');
end;
p = 12;                           % LPC order
% H(z) = B/A. B and A are the coefficients of the Z series in the
% numerator and denominator.
A = LPC(Frame,p);
frame_size = max(size(Frame));
B = 1;
x = zeros(frame_size,1);
x(1) = 1;                         % unit impulse
v = filter(B,A,x);                % impulse response of the all-pole model
subplot(2,1,1);plot(Frame);xlabel('Time (Sample)');title('Speech frame');
subplot(2,1,2);plot(v);xlabel('Time (Sample)');title('Vocal tract impulse response');
end
function cr_TR
%%[f] = cr_TR(f1,f2)
%% TR = Time warping. This function aligns f2 to f1. It resamples f2 such
%% that f2 length = f1 length. The function returns f2 at the new sampling
%% rate.
%%----------------------------------------------------------------------
% Amr M. Gody, Fall 2008
%%
function [f] = cr_TR(f1,f2)
nbIn = nargin;
if nbIn < 2 ,
    error('Not enough input arguments. TR = Time warping: this function aligns f2 to f1.');
end;
s1 = max(size(f1));
s2 = max(size(f2));
f = interp1(1:s2, f2, linspace(1,s2,s1));   % minimal completion (assumption): resample f2 to the length of f1
end
8. References
[1] LR. Rabiner, R.W. Schafer, "Digital Processing Of Speech Signals", PrenticeHall, ISBN: 0-13-213603-1. [2] Thomas W. Parsons, "Voice and Speech Processing " ,McGraw-Hill inc.,1987 [3] Alessia Paglialonga, "Speech Processing for Cochlear Implants with the DiscreteWavelet Transform: Feasibility Study and Performance Evaluation", Proceedings of the 28th IEEE EMBS Annual International Conference New York City, USA, Aug 30-Sept 3, 2006 [4] Mel scale, http://en.wikipedia.org/wiki/Mel_scale
Summary: In this chapter the basic ideas in speech recognition will be illustrated. Automatic Speech Recognition (ASR) techniques range widely, from phone recognition to context understanding. Many applications nowadays benefit from ASR. Speech is considered a new human user interface, just like the mouse and keyboard in modern computer systems. Moreover, speech understanding machines are widely used in systems that require intensive interaction with clients, such as airline ticket reservation and restaurant ordering. Speech dialog is replacing the human-human interface. These types of predefined speech dialogs are very stable because the speech stream is guided by a predefined grammar. The idea of distance measures for speech features will be introduced through different types of measures. Time alignment methods such as dynamic time warping will be discussed, as will the different types of template training methods.
Objectives: Understanding the idea of speech normalization. Understanding source coding using vector quantization. Understanding clustering methods and how they are used in recognition. Becoming familiar with using C# and Matlab in speech recognition.
1. Time alignment
Time alignment is the process of normalizing the length of all speech samples of the same spoken word. This alignment should account for speech variability due to the different articulation of speakers, or to within-speaker variability, which affects the way a word is spoken. In this case all speech samples belong to the same word, but they differ in length. In order to perform recognition we should first carry out a time alignment, and the alignment should align the features along the time axis of the spoken word. This dynamic process is called dynamic programming, or Dynamic Time Warping (DTW). Figure 1 illustrates two samples of the same word: although the two samples contain the same phone sequence, the duration of each phone is not linearly scaled. This leads us to dynamic programming; we must consider this fact during time alignment, so it is not a linear process.
To discuss dynamic programming, let us consider the following problem: a space of N states, where each state represents a certain stationary time frame of the speech signal. Assume that a cost function is defined to best align the frame sequence to a reference utterance based on phonetic content, as illustrated in figure 1. In figure 1, assume that the frame sequence is the top stream and the reference utterance is the bottom stream, and that the reference stream is M frames long while our utterance is N frames long. We need to re-form the sequence of N frames at the top to best align with the reference word of M frames, based on the phonetic content. So our target is to re-form the sequence to be M frames long, starting at frame number 1 and ending at frame number n.

Table 1: d(i,j), the cost function for a certain phone sequence.
        1     2     3
1      57    58    59
2      54    86    22
3      53     3    90
Table 1 illustrates a cost function for a certain sequence over the three frames {1, 2, 3}. Let us assume that the sequence should be aligned in 3 steps, that the first frame is number 2, and that the sequence ends with frame number 3. What is the best sequence that aligns the utterance to the reference 4-frame utterance with minimum cost? To answer this question, let us go through the following discussion.
Figure 2: The problem of finding the minimum cost path between point 1 and point i.
Each point in figure 2 represents a time frame of the speech signal. Consider the space of N states indicated in figure 2. We need to evaluate the best path between two given states. For practical situations, the cost function may be the distance between the two feature vectors:

$$d(i,j) = d(v_i, v_j) = \sqrt{\sum_{k=1}^{M} \left(v_i[k] - v_j[k]\right)^{2}} \qquad [1]$$
Equation 1 is the Euclidean distance between two vectors of length M. Let us assume that we are given a cost function that expresses the cost of moving from vector $v_i$ to vector $v_j$. Now let us return to our problem: we need to find the optimal path between two given vectors in the space, assuming that the maximum number of allowable hops between the two vectors is M.
Let us arrange the nodes in a trellis diagram as shown in figure 3. The x direction in the diagram represents steps in time, and the y direction represents the available nodes. The arrow between any two nodes represents the hop direction from the first (left) node to the second (right) node. The cost of jumping from node i to node j in one time step can be expressed as:

$$\xi_1(i,j) = d(i,j) \qquad [2]$$

If we consider the possibility of 2 hops between i and j, equation 2 should be restructured as follows:

$$\xi_2(i,j) = \min_{k}\left[\xi_1(i,k) + d(k,j)\right] \qquad [3]$$

Equation 3 evaluates the best path between i and j in two hops. Following the same direction we can evaluate $\xi_3(i,j), \xi_4(i,j), \ldots, \xi_M(i,j)$.
Now let us apply what we have obtained to our problem, which is defined by the cost function given in table 1. Figure 4 indicates the method of finding the best path for the cost function of table 1 when the number of steps is 3, the start node is 2 and the end node is 3. The square above each circle is $\xi_t(i,j)$ and the number above the arrow is $d(i,j)$. The best path is marked by a star symbol.
$\xi_1(2,1) = 58$   $\xi_2(2,1) = 62$   $\xi_3(2,1) = \ldots$
$\xi_1(2,2) = 86$   $\xi_2(2,2) = 25$   $\xi_3(2,2) = \ldots$
$\xi_1(2,3) = 3$    $\xi_2(2,3) = 89$   $\xi_3(2,3) = 28$
Figure 4: Tracing the problem of dynamic programming for N=3 and according to the cost function given by table 1
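The trace in figure 4 can be checked with a small brute-force MATLAB sketch (my own; note that, to reproduce the worked values ξ₁(2,3) = 3 and ξ₃(2,3) = 28, the cost of a hop from frame i to frame j has to be read from table 1 as d(j,i) — that indexing is my inference from the figure):

% Sketch: brute-force check of the dynamic-programming example of table 1.
d = [57 58 59;
     54 86 22;
     53  3 90];
hopCost = @(i,j) d(j,i);                 % cost of the hop i -> j (assumed indexing, see text)
startNode = 2; endNode = 3;              % 3 hops -> a path of 4 frames
best = inf; bestPath = [];
for a = 1:3
    for b = 1:3
        c = hopCost(startNode,a) + hopCost(a,b) + hopCost(b,endNode);
        if c < best
            best = c; bestPath = [startNode a b endNode];
        end
    end
end
fprintf('Best path: %s  (total cost %d)\n', num2str(bestPath), best);
% With table 1 this reproduces the path 2 3 2 3 and the cost xi_3(2,3) = 28.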
Exercise 1: Write a C# function that evaluates the best path between two given nodes. Assume that the cost function is given. Assume that the maximum possible number of hops is M.
<summary> Get the best path for certain number of hops. The path is returned into Int array of m elements. The first element is the starting node and the last element is the end node. </summary> /// <remarks> /// chapter 5 /// Amr M. Gody /// sites.google.com/site/agdomains /// 2009 /// </remarks> /// <param name="start">Start Node. The first one is node number 0</param> /// <param name="end">End Node.</param> /// <returns> /// 0: if succsess /// -1: if false /// </returns> public int GetBestPath(int start, int end, int m, ref int[] path, ref int ct) {
if(m>S ) throw new Exception ("m should be less than or equal S.");
try
if(path .GetLength (0) != m+1) throw new Exception ("The path array should be m elemnts in size."); epsai = new int[S,N, N];
//1 - Initialize path[0] = start; int min = epsai[0, start, 0] = cost[start, 0]; for (int j = 1; j < N; j++) { epsai[0, start, j] = cost[start, j]; if (epsai[0, start, j] < min) { min = epsai[0, start, j]; path[1] = j; } } //2- Recurresion for (int hops = 1; hops < m-1; hops++) { min = epsai[hops, start, 0] = GetMinmum(epsai, hops - 1, start, 0); path[hops+1] = 0; for(int n=1;n<N;n++) { epsai[hops, start, n] = GetMinmum(epsai, hops - 1, start, n); if (epsai[hops, start, n] < min) { min = epsai[hops, start, n]; path[hops+1] = n; } } } // 3- termination epsai[m - 1, start, end] = GetMinmum(epsai, m - 2, start, end); path[m ] = end; ct = epsai[m - 1, start, end]; return 0; } catch (Exception e) { throw e ; } }
public int GetMinmum(int[, ,] epsai, int hops, int start, int end)
{
    int min_sum = epsai[hops, start, 0] + cost[0, end];
    int sum;
    for (int n = 1; n < N; n++)   // n iterates over all N nodes (assumed loop header)
    {
        sum = epsai[hops, start, n] + cost[n, end];
        if (sum < min_sum)
        {
            min_sum = sum;
        }
    }
    return min_sum;
}
Now let us turn to the second question: given a sequence of feature vectors that composes a certain word, and given the feature space, what is this word? This is a speech recognition problem. To answer this question let us reformulate it. Figure 5 introduces the recognition process.
The recognition process has two main phases: the training phase and the testing phase. In the dynamic programming method, the training phase deals with constructing a prototype sequence for each word considered in the recognition system. The prototype is considered the best sequence of feature vectors that produces that word.
It is the average in sequence length and the centroid in feature space of that word. The process of building a prototype of a certain word is illustrated in figure 6. The process starts by time-aligning the available training samples of that word; then all aligned samples are averaged.
The second phase is the testing phase. In this phase, the unknown sequence is tested against all available prototypes, and the prototype that gives the minimum distance is taken as the decision. The dynamic programming introduced so far should be modified to account for some practical considerations. For example, the speech sequence is a temporal process, which means that the time alignment should preserve the temporal information of the aligned sample: it is not acceptable, for instance, for frame number 4 to be followed by frame number 3 in the aligned sequence, since this would destroy the temporal information of the sample being aligned. The following sections provide more information about the temporal information and some other practical issues that should be considered during the time alignment process.
2. Time alignment and normalization
This process of time normalization is very important in pattern recognition. The recognition process will then be concerned with spectral distortion only, without taking time differences into consideration. Equation 4 explains the linear normalization of two signals during comparison; in this case, signal X is considered the reference:

$$d(X,Y) = \sum_{i=1}^{T_x} d(x_i, y_j), \qquad j = i\,\frac{T_y}{T_x} \qquad [4],[5]$$

In the linear normalization we considered a linear relation that joins the frame indices of both signals, as indicated in equation 5. This linear relationship cannot be kept in practical situations. We should make it more practical by considering a common reference signal of a central common length. For example, assume that we have the following database:

#       Utterance symbol    Duration (frames)
1       Cat                 20
...     ...                 ...
1000    Cat                 15

The database contains 1000 different utterances of the word Cat. Let us assume that the average length is 18 frames. Then we can normalize all utterances to a certain sample that is closest in length to 18. This process of time normalization is called the time warping process. Figure 7 indicates the time warping of two signals x and y to a common reference signal of length T.
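A minimal sketch of the linear normalization of equations 4 and 5 (my own illustration; it assumes the feature vectors are stored one frame per row and uses simple linear interpolation) resamples a feature sequence to the common reference length T:

% Sketch: linear time normalization of a feature sequence to T frames.
Nx = 20;  d = 12;  T = 18;                   % example lengths (assumptions)
F  = randn(Nx, d);                           % stand-in feature vectors, one frame per row
i  = 1:T;
j  = 1 + (i - 1) * (Nx - 1) / (T - 1);       % linear index mapping, j(1) = 1, j(T) = Nx
Fn = interp1(1:Nx, F, j, 'linear');          % T x d normalized sequence
size(Fn)                                     % -> [18 12]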
$\varphi_x(k)$ is the time index in signal X: if signal X is warped to a signal of length T, this means that

$$\varphi_x(1) = 1, \qquad \varphi_x(T) = T_x \qquad [6],[7]$$

Figure 7 illustrates a non-linear time warping process. Let us define a warping function $\varphi(k)$. This function relates the time index of signal X (and, likewise, of signal Y) to the common time index k:

$$i_x = \varphi_x(k), \qquad i_y = \varphi_y(k), \qquad k = 1 \ldots T$$

As shown in figure 7, the time index for signal X on the x-axis is calculated based on the warping function $\varphi_x(k)$, and similarly the time index of signal Y on the y-axis is calculated using $\varphi_y(k)$. For example, the points of the figure that are warped based on the above warping functions are evaluated as in the following table:

k                    1  2  3  4  5  6  7  8
i_x = phi_x(k)       1  2  4  4  5  7  8  8
i_y = phi_y(k)       1  2  3  4  6  7  7  8
Dynamic programming is used to align two time signals based on feature matching; this is called temporal matching. Suppose that we have two signals that represent the same utterance. The two signals have different temporal properties: they convey the same information but over different time frames (recall figure 1 for more details). Typical warping constraints that are considered necessary and reasonable for time alignment between utterances include the following:
- Endpoint constraints
- Monotonicity conditions
- Local continuity constraints
- Global path constraints
- Slope weighting
The endpoint constraint deals with the end points: the two signals that represent the same utterance should have the same end points, which means that the first and the last frames of both signals are aligned together. Let us define a function called the warping function; this function gives the corresponding frame number during the warping process.
$\varphi_x$ is the warping function of signal X. Equation 8 is read as: frame number 1 of signal X is warped to frame number 1 of the common axis (and similarly for signal Y). Returning to the endpoint constraints, we can write them as:

$$\varphi_x(1) = 1, \qquad \varphi_x(T) = T_x \qquad [8]$$
$$\varphi_y(1) = 1, \qquad \varphi_y(T) = T_y \qquad [9]$$
Equations 8 and 9 reflect that the end points of the signals should be respected during the warping process; they imply that signals X and Y are warped to a length of T frames.
Monotonicity conditions: suppose that we need to align the two signals on a temporal basis. This is a very important step before any further recognition processing; it is the normalization process. Dynamic programming can be used to make such a time alignment. The method discussed in section 1 is very generic and may not be practical. Let us work through an example: recall figure 7. We need to align signal Y to signal X in a way that ensures minimum temporal distortion.
Figure 8: Trellis diagram. The x axis represents signal X and the y axis represents signal Y. The solid path represents the best path that ensures minimum temporal disturbance.
It is not logical to go backward: if frame 3 of signal Y is aligned to frame 2 of signal X, then the subsequent frames of signal Y should be aligned to frames with time index greater than or equal to 2 in signal X. Recalling figure 4, the best sequence was 2, 3, 2, 3. This is very strange: frame number 3 is followed by frame number 2 to best align with the reference signal. This is what comes out of the pure mathematics without applying the monotonicity constraint.
The monotonicity constraint, as shown in figure 8, implies that any evaluated path will not have a negative slope. The constraint eliminates the possibility of (time-)reversed warping along the time axis, even within a short time interval. Local continuity constraints deal with the maximum increment allowed between frames during the alignment process.
Let us define the path function. This function describes the path along the trellis diagram. Recalling figure 9, P1 is a path that can be described as:

$$P_1 \to (1,1)(1,0) \qquad [10]$$
Equation 10 describes the path as the increments in both directions: path P1 is read as one increment in the x direction together with one increment in the y direction, followed by one increment in the x direction and zero increments in the y direction.
Exercise: Considering the path functions in figure 10, describe the following path:
Figure 10 illustrates some popular path functions. Because of the local continuity constraints, certain portions of the $(i_x, i_y)$ plane are excluded from the region the optimal warping path can traverse. To evaluate the global constraints, let us define the following slope parameters, where the ratios are taken over the allowed local paths:

$$Q_{\max} = \max\left\{\frac{\Delta\varphi_y}{\Delta\varphi_x}\right\} \qquad [11]$$
$$Q_{\min} = \min\left\{\frac{\Delta\varphi_y}{\Delta\varphi_x}\right\}\,;\; \text{the slope of the minimum asymptote.} \qquad [12]$$

To calculate the asymptotes of the possible moves, consider equations 11 and 12 and figure 11. The maximum asymptote, taking the point (1,1) as the beginning, is given by equation 13 (the minimum asymptote is the corresponding line with slope $Q_{\min}$):

$$\varphi_y(k) = 1 + Q_{\max}\left(\varphi_x(k) - 1\right) \qquad [13]$$

For the local paths of figure 10, $Q_{\min} = 1/2$ and $Q_{\max} = 2$. This makes the region allowed from the starting point:

$$1 + \frac{\varphi_x(k) - 1}{Q_{\max}} \;\le\; \varphi_y(k) \;\le\; 1 + Q_{\max}\left(\varphi_x(k) - 1\right) \qquad [14]$$

We should remember that there is another constraint, $\varphi_x(T) = T_x$ and $\varphi_y(T) = T_y$; this is the endpoint constraint. We should find the effect of this constraint, combined with the path constraint, on the global path constraint. Using the same method as above, the asymptotes drawn from the end point give:

$$T_y + Q_{\max}\left(\varphi_x(k) - T_x\right) \;\le\; \varphi_y(k) \;\le\; T_y + \frac{\varphi_x(k) - T_x}{Q_{\max}} \qquad [15]$$

Figure 12 illustrates the whole picture.
Equations 14 and 15 give the global constraints according to the given local paths. This global constraint reflects another point: equation 15 depends on $T_x$ and $T_y$. What does this imply? To answer this question let us test the above inequalities. In our case we have $Q_{\max} = 2$ and $Q_{\min} = 0.5$. If $T_y = 2\,T_x$, the area enclosed by 14 and 15 collapses into a straight line; this is the maximum possible allowable difference in length between the two signals to be warped according to the local paths defined above.
4. Vector quantization
Vector quantization is a very efficient source-coding technique. Vector quantization is a procedure that encodes an input vector into an integer (index) that is associated with an entry of a collection (codebook) of reproduction vectors. The reproduction vector chosen is the one that is closest to the input vector in a specified distortion sense. The coding efficiency is obviously achieved in the process of converting the (continuously valued) vector into a compact integer representation, which ranges, for example, from 1 to N, with N being the size of (number of entries in) the codebook. The performance of the vector quantizer, however, depends on whether the set of reproduction vectors, often called code words, is properly chosen such that the distortion is minimized. The block diagram of figure 14 shows a binary-split codebook generation algorithm that produces a good codebook based on a given training data set. The characteristics of the information source that produced the training data are embedded in the codebook; this is the key to using vector quantization in recognition. The problem ends up with L codebooks, one codebook for each class being recognized, for example a codebook for each word. Then the unknown word is simply scored against each codebook and the distortion is calculated. The codebook that gives the minimum distortion is used to make the decision: the information source associated with that codebook determines the recognized word. Figure 15 illustrates the process.
Figure 14: Flow chart of the binary-split codebook generation algorithm. Starting from a single centroid, each code word is split (m becomes 2m), the training vectors are classified into the m classes, the centroids are updated and the distortion is recomputed until it converges (D_prev close to D_current); the split is repeated while m < M, and the procedure is done when m = M.
Figure 15: Recognition using one codebook per word. The unknown utterance U is scored against every codebook, giving the distortions D1(U), D2(U), ..., Dn(U); the index of the minimum distortion (Min_Index) selects the recognized word W(min_index).
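The binary-split codebook generation of figure 14 can be sketched in MATLAB as follows (a simplified LBG-style sketch of mine; the perturbation factor, the convergence threshold and the use of squared Euclidean distortion are assumptions, and the function name vq_binary_split is hypothetical):

% Sketch: binary-split (LBG-style) codebook generation.
% X: N x d training vectors, M: desired codebook size (power of two).
function CB = vq_binary_split(X, M)
    eps_split = 0.01;                          % perturbation factor (assumption)
    CB = mean(X, 1);                           % start with one centroid
    while size(CB, 1) < M
        CB = [CB * (1 + eps_split); CB * (1 - eps_split)];   % split every code word
        prevD = inf;
        while true                             % k-means refinement at this codebook size
            D2 = bsxfun(@plus, sum(X.^2,2), sum(CB.^2,2)') - 2*X*CB';  % squared distances
            [dmin, idx] = min(D2, [], 2);      % classify each training vector
            for m = 1:size(CB, 1)
                if any(idx == m)
                    CB(m, :) = mean(X(idx == m, :), 1);      % update centroid m
                end
            end
            D = mean(dmin);                    % average distortion
            if abs(prevD - D) / max(D, eps) < 1e-3, break; end
            prevD = D;
        end
    end
end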
5. References
[1] LR. Rabiner, R.W. Schafer, "Digital Processing Of Speech Signals", PrenticeHall, ISBN: 0-13-213603-1. [2] Thomas W. Parsons, "Voice and Speech Processing " ,McGraw-Hill inc.,1987 [3] Alessia Paglialonga, "Speech Processing for Cochlear Implants with the DiscreteWavelet Transform: Feasibility Study and Performance Evaluation", Proceedings of the 28th IEEE EMBS Annual International Conference New York City, USA, Aug 30-Sept 3, 2006 [4] Mel scale, http://en.wikipedia.org/wiki/Mel_scale
Summary: This chapter introduces speech recognition using a very powerful statistical model called the Hidden Markov Model (HMM). The concept and methodology of HMM will be discussed in this chapter. HMM is the most popular practical tool used in Automatic Speech Recognition (ASR) applications. HMM captures the phonetic content as well as the temporal properties of the phonemes inside the phrase or word, while time itself is not a direct factor in the recognition process. In this chapter the theory of HMM will be discussed, along with how HMM can be used to model discrete and continuous word recognition, and how HMM can use Gaussian mixture probability distributions to model multi-modal phoneme properties.
Objectives: Understanding HMM. Understanding the speech signal as a statistical process. Understanding Gaussian mixture modeling for multi-modal phoneme properties. Practice using Matlab and C#.
1. HMM
To understand HMM, let us start with the discrete Markov process. Consider the following example: a 3-state sound generator. This is a music box that generates only 3 sounds. When the user presses the generate button, the music box generates one of the 3 sounds at random. The outputs are the sound itself and the associated light indicator, as shown in figure 1.
Figure 1: 3-state sound generator. Let us go through the process by considering the state diagram that represents the above stochastic process. States are: 1- Sound 1; Green color. 2- Sound 2; Yellow color. 3- Sound 3; Red color.
The process is: the user presses the button to issue a new sound, and the observer registers both the sound ID and the color of the light indicator. Assume that the following state diagram, shown in figure 2, is obtained from this experiment:
Figure 2: State diagram of the music machine.

Let us tabulate the information from figure 2:

$$T = \begin{bmatrix} 0.3 & 0.3 & 0.4 \\ 0.5 & 0.2 & 0.3 \\ 0.2 & 0.4 & 0.4 \end{bmatrix} \qquad [1]$$
Equation 1 is the transition matrix. It gives the transition probability between any two states in the state diagram. The sum of any row should evaluate to 1; this is logical, as the sum covers all possible transitions out of a state. Consider also the matrix in equation 2:

$$\pi = \begin{bmatrix} 0.3 \\ 0.2 \\ 0.5 \end{bmatrix} \qquad [2]$$

The matrix in 2 gives the probability of being in a certain state at the first press of the button. It is called the initial state probability matrix π. For example, the probability of being in state 1, according to equation 2, is p = 0.3. Together, T and π construct the model that represents the experiment, and we can give the model a symbol:

$$\lambda = \{T, \pi\} \qquad [3]$$
Now, after defining the system model parameters, we can answer the following questions:
Q1) What is the probability of observing the following sequence, given the system model λ? {Red, Red, Red, Green, Yellow, Red, Green, Yellow}
A1) It is preferable to put the question in probabilistic terms: calculate $P(O|\lambda)$, where O is the observation, $O = \{o_1, o_2, o_3, \ldots, o_t\}$, an observation of t symbols. In our example $o_1 = \text{Red}$, $o_2 = \text{Red}$, ..., $o_8 = \text{Yellow}$. Figure 3 gives an insight into the process. The problem now turns into a simple expression: the multiplication of all the terms:
$$P(O|\lambda) = P(3)\,P(3|3)^{2}\,P(1|3)^{2}\,P(2|1)^{2}\,P(3|2) = \pi_3\, t_{33}^{2}\, t_{31}^{2}\, t_{12}^{2}\, t_{23} = 0.5 \times 0.4^{2} \times 0.2^{2} \times 0.3^{2} \times 0.3 \approx 8.6\times 10^{-5}$$
Q2) Given that the system is in state 2, what is the probability of staying in state 2 for 9 consecutive turns?
A2) This is the probability of an observation O given λ and the initial state $q_1 = 2$, where $O = \{o_1, o_2, \ldots, o_{10}\}$ with $o_1 = o_2 = \ldots = o_9 = \text{state 2}$ and $o_{10} \ne \text{state 2}$. $P(O|\lambda, q_1 = 2)$ is our target. The difference from the first case is that the first state is given in the question itself; it is stated clearly in the question.

$$P(O|\lambda, q_1 = 2) = \frac{P(O, q_1 = 2|\lambda)}{P(q_1 = 2)} = t_{22}^{\,8}\,(1 - t_{22}) = 0.2^{8}\,(1 - 0.2) \approx 2.048\times 10^{-6}$$
$$p_i(d) = t_{ii}^{\,d-1}\,(1 - t_{ii}) \qquad [4]$$

Using equation 4 we can obtain the average number of turns the system stays in a certain state i:

d        : 1, 2, ..., n
p_i(d)   : p_i(1), p_i(2), ..., p_i(n)   (in state i for 1, 2, ..., n turns)

$$\bar{d}_i = \sum_{d=1}^{\infty} d\; p_i(d) = \sum_{d=1}^{\infty} d\; t_{ii}^{\,d-1}(1 - t_{ii}) = \frac{1}{1 - t_{ii}}$$
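As a quick numeric check using the transition matrix of equation 1: with $t_{22} = 0.2$, the expected number of consecutive turns spent in state 2 is $\sum_{d} d\,(0.2)^{d-1}(0.8) = 1/(1 - 0.2) = 1.25$ turns.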
Now let us change the experiment a little. In the above experiment the observations are the same as the states. What happens if the states become hidden? In other words, the outputs (observations) are not the states themselves; rather, they are events emitted by the different states. Suppose there are 3 music machines exactly like the one in figure 1, and a man goes into the room that contains the 3 machines. The man presses a button randomly on any one of the 3 machines. The observer resides in another room; he receives the observation by asking the man who performs the experiment for the color, and has no information about which machine was used to generate that color. Now the output is the colors, but the state represents the machine: the machines are hidden. Figure 4 gives a representation of the experiment. The actor performs the experiment in the room and then brings the
result to the observer. The observer cannot see which machine is used to generate the sound. Let us explore the information in this experiment:
1- The observations are not the states.
2- Each state represents a machine, which is hidden in the room.
3- The output at each state is discrete: there are only 3 symbols {Green, Yellow, Red}.
4- The model includes a new entity that represents the symbol-emitting probability at each state. This is the matrix B:

$$B = \begin{bmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \\ b_{31} & b_{32} & b_{33} \end{bmatrix}$$
Each column of the B matrix represents the symbol probabilities in the corresponding state. For example, the probability of emitting Red in state 3 is the entry in the Red row of the third column, while the probability of emitting Green in state 2 is the entry in the Green row of the second column.
HMM model
π: the matrix that gives the probability of being in a certain state at time t = 1. For our example a state represents a machine.
T: the matrix that gives the state transition probabilities. It is an N×N matrix, where N is the number of states; for our case N = 3.
B: the matrix that gives the symbol probabilities in each state. It is an M×N matrix, where M is the number of symbols and N is the number of states; for our case M = 3 and N = 3.

Now let us consider the same state diagram of figure 2 and the same model with B as:

$$B = \begin{bmatrix} 0.3 & 0.4 & 0.2 \\ 0.2 & 0.3 & 0.4 \\ 0.5 & 0.3 & 0.4 \end{bmatrix}$$

Let us write B in probabilistic form: $b_j(k) = P(o_t = k \mid q_t = j)$. For example $b_2(\text{Yellow}) = 0.3$, $b_1(\text{Yellow}) = 0.2$ and $b_3(\text{Green}) = 0.2$.
Now let us ask the following questions:
Q1) Given the model λ, what is the probability of the following observation? {Red, Red, Red, Green, Yellow, Red, Green, Yellow}
Q2) Given the model λ, what is the state sequence associated with the observation mentioned in Q1?
Q3) Given a training set, what is the model that best describes the training set? In other words, how can you adapt the model parameters to best fit the given training set?
Speech production model
Let us recall the system in figure 4 and relate it to the speech production model. The closed room is the brain, the actor is the speech production mechanism, and the observer is the listener. In figure 4 the system has only 3 states; this is equivalent to a brain that has only 3 different phones with which to express any word, a hypothetical 3-phone language used to illustrate how a language model is expressed using HMM. Also, there are 3 different sounds for each state; this is equivalent to the variations in the speech properties that express the phone, modeling the practical situation of having more than one pronunciation for each single phone. Figure 2 expresses the grammar that relates the phones to one another. One may think that if there are M words in the language, it is better to build an HMM for each word; then, when an unknown observation arrives, we can score it against each model and the decision goes to the maximum score. Yes, this is true.
To simplify the notation, let us enumerate the outputs as Green = 1, Yellow = 2 and Red = 3. These should not be confused with the state symbols. Hence: O = {3, 3, 3, 1, 2, 3, 1, 2}.
To evaluate the probability of that sequence given the model λ, let us go through the example. The observation starts with Red, or symbol 3. According to the model there are 3 possibilities:
1- Red is produced by machine 1, with probability $b_1(\text{Red}) = 0.5$,
2- or Red is produced by machine 2, with probability $b_2(\text{Red}) = 0.3$,
3- or Red is produced by machine 3, with probability $b_3(\text{Red}) = 0.4$.
So the probability of observing Red at t = 1 is

$$P(\text{Red at } t = 1) = 0.3\times0.5 + 0.2\times0.3 + 0.5\times0.4 = 0.41$$
Forward probability
With the model parameters

$$T = \begin{bmatrix} 0.3 & 0.3 & 0.4 \\ 0.5 & 0.2 & 0.3 \\ 0.2 & 0.4 & 0.4 \end{bmatrix}, \qquad B = \begin{bmatrix} 0.3 & 0.4 & 0.2 \\ 0.2 & 0.3 & 0.4 \\ 0.5 & 0.3 & 0.4 \end{bmatrix}, \qquad \pi = \begin{bmatrix} 0.3 \\ 0.2 \\ 0.5 \end{bmatrix}$$

let us define a new term, the forward probability:

$$\alpha_t(i) = P(o_1, o_2, \ldots, o_t,\; q_t = i \mid \lambda)$$

To get the probability of a certain observation of length T, we calculate the forward probability for all states at time T and then sum. For our example T = 8. Hence:

$$P(O|\lambda) = \sum_{i=1}^{3} \alpha_8(i) \qquad [5]$$

To calculate $\alpha_8(i)$, consider figure 5. It is calculated by induction: we start by calculating $\alpha_1(i) = \pi_i\, b_i(o_1)$ for $i = 1 \ldots 3$, and follow the same direction for the next time indices.
$$\alpha_2(1) = \left[\sum_{j=1}^{3} \alpha_1(j)\, t_{j1}\right] b_1(o_2) = (0.15\times0.3 + 0.06\times0.5 + 0.2\times0.2)\times 0.5 = 0.0575$$

Hence, following the same direction to calculate $\alpha_2(2)$ and $\alpha_2(3)$:

$$\begin{bmatrix} \alpha_2(1) \\ \alpha_2(2) \\ \alpha_2(3) \end{bmatrix} = \begin{bmatrix} 0.0575 \\ 0.0411 \\ 0.0632 \end{bmatrix} \qquad [6]$$
We repeat the same process until we end with $\alpha_8(1)$, $\alpha_8(2)$ and $\alpha_8(3)$. At that point we can calculate $P(O|\lambda)$ as in equation 5.
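The induction just described can be written compactly in MATLAB (my own sketch using the π, T and B values of this example, with B indexed as B(symbol, state) as in the notes):

% Sketch: forward algorithm P(O|lambda) for the 3-state music-box example.
Pi = [0.3; 0.2; 0.5];           % initial state probabilities
T  = [0.3 0.3 0.4;              % transition matrix (row = from, column = to)
      0.5 0.2 0.3;
      0.2 0.4 0.4];
B  = [0.3 0.4 0.2;              % B(symbol, state): Green = 1, Yellow = 2, Red = 3
      0.2 0.3 0.4;
      0.5 0.3 0.4];
O  = [3 3 3 1 2 3 1 2];         % Red Red Red Green Yellow Red Green Yellow
alpha = Pi .* B(O(1), :)';      % alpha_1(i) = pi_i * b_i(o_1)
for t = 2:length(O)
    alpha = (T' * alpha) .* B(O(t), :)';   % induction step of equation 6
end
P = sum(alpha);                 % P(O|lambda), equation 5
fprintf('P(O|lambda) = %g\n', P);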
Exercise 1
Write a C# function that calculates P(O|λ). The function takes two parameters (O, L): O is an object that describes the observations and L is an object that describes the HMM model. The function returns the score of the observation O against the model L.
Exercise 2
Write a C# function that evaluates the class of a given observation. Assume that each class is described by an HMM model, and that the function of exercise 1 is available. The function takes two parameters (O, List): O is an object that describes the observations and List is an object list whose elements are HMM models as in exercise 1. The function returns the index of the winning class in the given list.
Given the model λ, what is the state sequence associated with the observation mentioned in Q1? Let us consider the same observation:

$$O = \{3,3,3,1,2,3,1,2\} \;\equiv\; \{\text{Red, Red, Red, Green, Yellow, Red, Green, Yellow}\} \qquad [7]$$
To answer this question, we seek for the best path in figure 5. Let us try to think about the difference between question 1 and 2. In question 1, we are trying to find the probability of certain observation against the given model . Recalling figure 5, in each time index we evaluate all possible occurrences of the given observation. But in our case now, we try to find the
most probable path that gives the given observation. Let us work it out on figure 5, given the observation in (7). Let us define the variables $\delta_t(i)$ and $\psi_t(i)$ as follows:
The Viterbi algorithm:
All possible state sequences that end in state $i$ at time index $t$ are considered; the best (highest) probability for the given partial observation is assigned to $\delta_t(i)$:

$$\delta_t(i) = \max_{q_1,\ldots,q_{t-1}} P(q_1, \ldots, q_{t-1}, q_t = i, o_1, \ldots, o_t \mid \lambda) \qquad [8]$$

Hence:

$$\delta_{t+1}(j) = \left[\max_{1\le i\le N} \delta_t(i)\, t_{ij}\right] b_j(o_{t+1}) \qquad [9]$$

$$\psi_{t+1}(j) = \arg\max_{1\le i\le N} \left[\delta_t(i)\, t_{ij}\right] \qquad [10]$$

where N is the number of states. The termination and back-tracking steps are:

$$P^{*} = \max_{1\le i\le N} \delta_T(i), \qquad q_T^{*} = \arg\max_{1\le i\le N} \delta_T(i) \qquad [11]$$

$$q_t^{*} = \psi_{t+1}(q_{t+1}^{*}) \qquad [12]$$
Using the model parameters

$$T = \begin{bmatrix} 0.3 & 0.3 & 0.4 \\ 0.5 & 0.2 & 0.3 \\ 0.2 & 0.4 & 0.4 \end{bmatrix}, \qquad B = \begin{bmatrix} 0.3 & 0.4 & 0.2 \\ 0.2 & 0.3 & 0.4 \\ 0.5 & 0.3 & 0.4 \end{bmatrix}, \qquad \pi = \begin{bmatrix} 0.3 \\ 0.2 \\ 0.5 \end{bmatrix}$$

the initialization (t = 1, o1 = Red) gives

$$\delta_1(1) = \pi_1 b_1(\text{Red}) = 0.3\times0.5 = 0.15, \quad \delta_1(2) = 0.2\times0.3 = 0.06, \quad \delta_1(3) = 0.5\times0.4 = 0.2$$

and the recursion at t = 2 (o2 = Red) gives

$$\delta_2(1) = \max\{0.15\times0.3,\; 0.06\times0.5,\; 0.2\times0.2\}\times 0.5 = 0.0450\times0.5 = 0.0225$$
$$\delta_2(2) = \max\{0.15\times0.3,\; 0.06\times0.2,\; 0.2\times0.4\}\times 0.3 = 0.08\times0.3 = 0.024$$
$$\delta_2(3) = \max\{0.15\times0.4,\; 0.06\times0.3,\; 0.2\times0.4\}\times 0.4 = 0.08\times0.4 = 0.032$$

The same recursion is applied for t = 3 ... 8, recording $\psi_t(i)$ at every step; back-tracking through ψ then yields the most probable state sequence for the observation of equation 7. The full trace table of $\delta_t(i)$, $\psi_t(i)$, the observation $o_t$ and the decoded state $q_t$ for t = 1 ... 8 is produced by the code below.
Code
// Initialization for (int i = 0; i < StateCount; i++) { epsi[0,i] = 0; dlta[0, i] = pi[i] * B[o[0],i]; } //-- Induction for (int t = 1; t < ObservationLength; t++) { for (int i = 0; i < StateCount; i++) { max = dlta[t - 1, 0] * T[0,i] ; arg_max = 0; for (int j = 1; j < StateCount; j++) { double p = dlta[t - 1, j] * T[j, i]; if (p > max) { max = p; arg_max = j; } } dlta[t, i] = max* B[o[t], i]; epsi[t, i] = arg_max; } }
Amr M. Gody
Page
15
Version 7
2010
//-- Termination max = dlta [ObservationLength -1, 0]; arg_max = 0; for(int i=1;i<StateCount ;i++) { if(dlta [ObservationLength -1,i] > max) { max = dlta [ObservationLength -1,i]; arg_max = i; } } // -- State Sequence int[] q = new int[ObservationLength]; q[ObservationLength - 1] = arg_max; for(int t = ObservationLength -2;t>=0;t--) { q[t] = epsi[t+1, q[t + 1]]; } // Storing the results into a file System.IO.StreamWriter sr = new System.IO.StreamWriter("sq_results.txt", true); sr.WriteLine(); sr.WriteLine("//////////////////////////////"); sr.WriteLine(DateTime.Now.ToString()); sr.WriteLine("----------------------------"); sr.WriteLine("State Sequence finder module"); sr.WriteLine("-----------------------------"); sr.WriteLine("HMM parameters"); sr.WriteLine("[PI]"); for (int i = 0; i < StateCount; i++) sr.WriteLine(pi[i].ToString()); sr.WriteLine(); sr.WriteLine("[T]"); for (int i = 0; i < StateCount; i++) { for (int j = 0; j < StateCount; j++) sr.Write(T[i, j].ToString() + "\t"); sr.WriteLine(); } sr.WriteLine(); sr.WriteLine("[B]"); for (int i = 0; i < SymbolCount; i++) { for (int j = 0; j < StateCount; j++) sr.Write(B[i, j].ToString() + "\t"); sr.WriteLine(); } sr.WriteLine("--------------------------------"); sr.Write("t\t"); for(int i=0;i<StateCount ;i++) sr.Write ("dlta"+string .Format ("_{0}\t\t",i+1)); for (int i = 0; i < StateCount; i++) sr.Write("Epsi" + string.Format("_{0}\t\t", i + 1)); sr.Write("O\tq"); sr.WriteLine(); sr.WriteLine("------------------------------------------------------------------------------"); for (int t = 0; t < ObservationLength ; t++) { sr.Write (string .Format ("{0}\t",t+1)); for(int i=0;i<StateCount ;i++) sr.Write (string .Format ("{0}\t\t",System .Math .Round ( dlta [t,i],5))); for (int i = 0; i < StateCount; i++) sr.Write(string.Format("{0}\t\t", epsi [t,i])); sr.Write(string.Format ("{0}\t{1}",symbols [o[t]],q[t]+1)); sr.WriteLine();
} sr.WriteLine("------------------------------------------------------------------------------"); sr.Close(); } } }
using System; using System.Collections.Generic; using System.Text; namespace sq { public class ConfigReader { private System.Data.DataSet m_objDataSet; private int m_nNumberOfStates; private int m_nNumberOfSymbolsPerStates; private int m_nLength; /// <summary> /// Number of symbols being recognized /// </summary> private int m_nSymbolsCount; public ConfigReader() { try { string file = Environment.CurrentDirectory + "\\sq.xml"; if (!System.IO.File.Exists(file)) throw new Exception("Configuration file is not exist"); m_objDataSet = new System.Data.DataSet(); m_objDataSet.ReadXml(file); m_nNumberOfStates = m_objDataSet.Tables["pi"].Rows.Count; m_nNumberOfSymbolsPerStates = m_objDataSet.Tables["B"].Rows.Count; m_nLength = m_objDataSet.Tables["o"].Rows.Count; m_nSymbolsCount = m_objDataSet.Tables["symbol"].Rows.Count; } catch (Exception e) { throw e; } } public double[,] B { get { double[,] temp = new double[m_nNumberOfSymbolsPerStates , m_nNumberOfStates ]; for (int i = 0; i < temp.GetLength(0); i++) for (int j = 0; j < temp.GetLength(1); j++) temp[i,j] =Convert .ToDouble ( m_objDataSet.Tables["B"].Rows[i][j]); return temp; } set { } } public double[] PI { get {
double [] pi = new double [m_nNumberOfStates ]; for (int i = 0; i < m_nNumberOfStates; i++) pi[i] = Convert .ToDouble ( m_objDataSet.Tables["PI"].Rows[i][0]); return pi; } set { } } /// <summary> /// Transition Matrix /// </summary> public double[,] T { get { double[,] temp = new double[m_nNumberOfStates, m_nNumberOfStates]; for (int i = 0; i < temp.GetLength(0); i++) for (int j = 0; j < temp.GetLength(1); j++) temp[i, j] = Convert.ToDouble(m_objDataSet.Tables["T"].Rows[i][j]); return temp; } set { } } /// <summary> /// Number of states /// </summary> public int n { get { return m_nNumberOfStates ; } set { } } /// <summary> /// Number of symbols per state /// </summary> public int m { get { return m_nNumberOfSymbolsPerStates ; } set { } } /// <summary> /// Observation /// </summary> public int[] o { get { int [] temp = new int [m_nLength ]; for (int i = 0; i < m_nLength ; i++) temp [i] = Convert.ToInt32 (GetIndex ( m_objDataSet.Tables["o"].Rows[i][0].ToString ()));
return temp ; } set { } } /// <summary> /// Observation length /// </summary> public int Length { get { return m_nLength; } set { } } /// <summary> /// Symbols array. /// </summary> public string[] Symbols { get { string [] temp = new string [m_nSymbolsCount ]; for (int i = 0; i < m_nSymbolsCount ; i++) temp[i] = Convert.ToString(m_objDataSet.Tables["symbol"].Rows[i][0]); return temp; } set { } } public int GetIndex(string m) { for (int i = 0; i < m_nSymbolsCount; i++) if (m.CompareTo(m_objDataSet.Tables["symbol"].Rows[i][0].ToString()) == 0) return i; return -1; } } }
2. Case Study
The problem of speech recognition is illustrated here by building a complete speech recognition system. HTK 1 will be utilized to build and evaluate the system.
a. Problem Definition
An Arabic dialer system is to be built using HTK. There are two methods of dialing:
1- An utterance like "Dial number 1, number 2, number 3, ... number N", spoken as continuous speech.
2- Recalling the name from the phone book.
b. Procedure
Step 1 is to build the dictionary. The dictionary illustrates how each word is spoken, in terms of the basic speech units of the target language; for our case it gives the pronunciation in Arabic phonemes. For example, the word "one" (in English) is pronounced "w a2 ~h i d". All words used in the system should have at least one entry in the dictionary.
Step 2 is to build the grammar. The grammar is the guidance that will be followed to detect the tokens. The tokens are the words to be recognized or detected.
Hidden Markov toolkit. HTK consists of a set of library modules and tools available in C source form. The tools provide sophisticated facilities for speech analysis, HMM training, testing and results analysis. The software supports HMMs using both continuous density mixture Gaussians and discrete distributions and can be used to build complex HMM systems. For more information http://htk.eng.cam.ac.uk/
Step 3 is data preparation for training and testing purposes. This is a very important step. The data should provide a sufficient number of samples to train all possible phone combinations for triphone recognition. The database will be transcribed and annotated using SFS 2. Step 4 is to extract the features from the training data. This is a key step for successful recognition; selecting good, highly discriminating features is very important. In this project Mel cepstrum features will be used. Step 5 is to build and initialize the basic HMM models for the monophone recognition process. The available training database and the pronunciation dictionary will be used to initialize the basic models. Step 6 is updating the silence and pause models. The silence model should allow for long durations of silence, which is not the case for a monophone model. Step 7 is modifying the models to consider triphones instead of single isolated monophones. This is a more practical situation that better fits the transition periods between different phones. Step 8 is system evaluation. In this step the testing database will be used to evaluate system performance.
c. Dictionary
This is the first step in building the recognition system. In this phase we should account for all words that will be used in the process and store them all in a text file. This file will be used later on to locate the proper pronunciation of each spoken word. The file name "dict" is chosen for this example to store the dictionary words. The phone symbols should be kept consistent throughout all the subsequent processes in the recognizer. The alphabet used in this tutorial is illustrated in Table 1.
2 Speech Filing System (SFS). It performs standard operations such as acquisition, replay, display and labeling, spectrographic and formant analysis and fundamental frequency estimation. It comes with a large body of ready-made tools for signal processing, synthesis and recognition, as well as support for your own software development. http://www.phon.ucl.ac.uk/resource/sfs/
Table 1: Arabic phone alphabet symbols. The original table maps each Arabic character to its phone symbol, with a context symbol and comments where applicable. The phone symbols used in this tutorial are:
/@/, /b/, /t/, /t_h/, /d_j/, /~h/, /x/, /d/, /~z/, /r/, /z/, /s/, /s_h/, /Sa/, /Da/, /Ta/, /T_Ha/, /@~/, /g_h/, /f/, /q/, /k/, /l/, /m/, /n/, /h/, /w/, /y/, /a/, /a2/, /u/, /u2/, /i/, /i2/, /sp/, /sil/.
Code
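The dictionary listing itself is not reproduced here. Below is a minimal sketch of what the "dict" file might look like, assuming the usual HTK dictionary layout (one word per line followed by its pronunciation in the phone symbols of Table 1). Only the pronunciations of "one" and "zero" are taken from this tutorial; the remaining words of the task would be added in the same way, and mapping SENT-START and SENT-END to sil with an empty output symbol is a common convention, not something prescribed by this text.

SENT-END    []  sil
SENT-START  []  sil
one         w a2 ~h i d
zero        Sa i f r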
Note
The words in the dictionary are written in English letters while the pronunciations are in Arabic phone symbols. This should not confuse the reader, as the system responds to the Arabic pronunciations; the words only indicate the output text produced when a pronunciation is identified. In our case the system will output the text in English letters when it detects the Arabic pronunciation associated with it in the dictionary. For example, when the Arabic pronunciation "Sa i f r" is detected, the system will output the text "Zero" to announce this recognized pronunciation.
d. Grammar
I consider this step as the system outline design. The grammar is the guidance of what the speaker should say to provide the information to the system. It is just like designing a dialog screen in a software application to pull information from the user. In our case we are targeting a dialer application, so we need to design some acceptable dialogues to capture the user input. The dialog contains some tags to be used for catching the required tokens. The tags act just like the labels in a dialog box that guide the user to what information he should provide in the associated area. Figure 7 will be used to illustrate the analogy between a software application dialog and a speech dialog. As shown in figure 7, there are two ways of dialing the number: either by selecting the name from the list, or by dialing the number through the push buttons and then pressing the dial button. In this case the prompts are the text written to guide the user on what he should do. In this dialog we have two prompts: the first one is "Select from phone book", the second one is the label on the dial button. Let us get back to the speech dialog that we need to design to perform the same function as the one provided in figure 7. We have the following dialogs:
Figure 7: Sample GUI illustrating different tags and tokens in dialler application.
Dialog 1:
Say the word "dial" followed by the digits of the number. In this case the Tag = dial and the Token = the digit sequence.

Dialog 2:
Just pronounce the name directly. In this case the Tag = NONE and the Token = Name.
In this case it is assumed that the system will retrieve the number later on from a database that contains the number for this phone book entry. To formulate this grammar using HTK, it needs to be written in a text file. Figure 8 illustrates the word network that describes the grammar of this dialer system.
Figure 8: Word network that describes the possible grammar for the dialer recognizer.
Code
Below is the list of "gram" file contents.

$digit = zero | one | two | three | four | five | six | seven | eight | nine;
$name = Amr | Osama | Salwa;
( SENT-START ( dial <$digit> | $name ) SENT-END )

The above text will be stored into a file. You may choose any name for this text file. The file name "gram" is chosen. Then the following command will be invoked to create the word net file:
Command line
HParse gram wdnet

The HParse command will parse the file "gram" and generate the word network file "wdnet".
Code
J=20   S=6    E=7
J=21   S=16   E=7
J=22   S=6    E=8
J=23   S=16   E=8
J=24   S=6    E=9
J=25   S=16   E=9
J=26   S=6    E=10
J=27   S=16   E=10
J=28   S=6    E=11
J=29   S=16   E=11
J=30   S=6    E=12
J=31   S=16   E=12
J=32   S=6    E=13
J=33   S=16   E=13
J=34   S=6    E=14
J=35   S=16   E=14
J=36   S=6    E=15
J=37   S=16   E=15
J=38   S=17   E=16
J=39   S=19   E=17
J=40   S=0    E=18
This is the word network file in lattice format. The file indicates that there are 20 nodes (N = 20) and 41 links (L = 41). The first node (I = 0) represents the word "SENT-END" (W=SENT-END). Figure 9 illustrates part of the word network associated with our grammar. The words are enclosed by oval shapes, the word ID is attached to each oval and enclosed by a circle as shown in the figure, and the links are labeled by the link numbers listed in the lattice file.
Figure 9: Part of the word network expressed by the lattice format file "wdnet"
e. Feature Extraction
The database is prepared using the SFS program. Database preparation means:
1- Recording: the data will be recorded and stored in a suitable audio file format, for example WAV files.
2- Annotation: the recorded samples will be annotated and transcribed. This allows the recognizer to find the suitable audio segments to train the models.
SFS may be used for both of the above tasks; it is an all-in-one package. We can later export the annotation file and the WAV file from the created SFS file.
Command line
sfs2wav -o wavFile sfsFile
The command line sfs2wav is used to export the WAV file from the container sfsFile. wavFile and sfsFile are file names; they may be replaced by your own file names. The annotation file is exported using the following SFS command line.
Command line
anlist -h -o labFile sfsFile
You should repeat the two command lines above for all database files to export the WAV file and the associated annotation file in HTK format. The following is part of a C# code that illustrates this process.
Code
string path = @"C:\Database\TrainingSet";
string[] fileList = System.IO.Directory.GetFiles(path, "*.sfs");
foreach (string file in fileList)
{
    string sfsFile = file;
    string labFile = file.Split('.')[0] + ".lab";
    string wavFile = file.Split('.')[0] + ".wav";
    // Export the waveform and the HTK-format label file from the SFS container.
    string cmnd = "sfs2wav -o " + wavFile + " " + sfsFile;
    string res = Exec(cmnd);
    cmnd = "anlist -h -o " + labFile + " " + sfsFile;
    res = Exec(cmnd);
}
Code
The below code is for the function Exec. This function is used to run the command line from within a C# program.
// Requires: using System.Diagnostics; (for the Process class)
static string Exec(string CommandLine)
{
    string buffer;
    string[] splitters = { " " };
    string cmd = CommandLine.Split(splitters, StringSplitOptions.RemoveEmptyEntries)[0];
    string args = CommandLine.Substring(cmd.Length + 1);
    Process a = new Process();
    a.StartInfo.FileName = cmd;
    a.StartInfo.Arguments = args;
    a.StartInfo.RedirectStandardOutput = true;
    a.StartInfo.UseShellExecute = false;
    a.StartInfo.WindowStyle = ProcessWindowStyle.Hidden;
    a.Start();
    buffer = a.StandardOutput.ReadToEnd();
    a.WaitForExit();
    return buffer;
}
Now we should have the whole training database in a suitable format for further processing with the HTK tool set. Each file is stored as a standard WAV file and has a label file in the standard HTK format.
Code
An HTK-formatted label file is shown below. The times are in units of 100 (ns); for example, 462100 in the file below corresponds to 0.0462 (sec).
462100 1270800 w
1270800 3442800 a2
3442800 4829200 ~h
4829200 5961500 i
5961500 7297000 d
Before starting to use any HTK tools we should prepare the configurations that will be used throughout this tutorial. HTK accepts the configuration in a simple text file. Let us create a file with the name "Config" to store the common configurations. The file is shown below.
Code
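The configuration listing itself is not reproduced here; a sketch assembled from the parameters discussed below might look as follows. The parameter names NUMCHANS and NUMCEPS are assumed from standard HTK usage, with the values 26 and 12 taken from the text below.

# Coding parameters (file "Config")
TARGETKIND     = MFCC_0_D_A
TARGETRATE     = 100000.0
SAVECOMPRESSED = T
SAVEWITHCRC    = T
WINDOWSIZE     = 250000.0
USEHAMMING     = T
PREEMCOEF      = 0.97
NUMCHANS       = 26
NUMCEPS        = 12
ENORMALISE     = F
SOURCEFORMAT   = WAV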
The configuration file is used by all HTK tools to minimize the number of parameters passed on the command line. There are many parameters that can be configured through the configuration file; for a full reference you should refer to the HTK manual. The above are the common configurations. The configuration parameters used here are:
TARGETKIND = MFCC_0_D_A
This indicates the kind of features that will be used: Mel cepstrum, with C0 used as the energy component. The delta and acceleration coefficients are computed and appended to the static MFCCs.
TARGETRATE = 100000.0
Gives the frame period in 100 (ns) units. This evaluates to 10 (ms) in our case.
SAVECOMPRESSED = T SAVEWITHCRC = T
The output should be saved in compressed format, and a CRC checksum should be added.
WINDOWSIZE = 250000.0 USEHAMMING = T PREEMCOEF = 0.97
A Hamming window of length 25 (ms) and pre-emphasis with coefficient 0.97 are used.
This configures the filter bank and liftering stages. The number of channels in the filter bank is 26, and the number of cepstral coefficients is 12.
ENORMALISE = F
SOURCEFORMAT = WAV
Here the per-frame energy normalisation is disabled, and the source of the speech signal is the standard WAV file format.
We will need to create a text file that contains a list of all database files and the associated output feature files. Part of this file is illustrated below.
Code
This is the script file that stores the list of database files to be processed for feature extraction. Let us name this file "codetr.scp".
1015356936.wav 1015356936.mfc
106707170.wav 106707170.mfc
1091955181.wav 1091955181.mfc
1135829975.wav 1135829975.mfc
1144342368.wav 1144342368.mfc
Now we can apply the following command line to extract the features from the database files.
Command line
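The command itself is not listed in the source. Feature coding in HTK is normally performed with the HCopy tool, so a sketch using the configuration and script files named above would be (the -T 1 trace option is optional):

HCopy -T 1 -C config -S codetr.scp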
f. HMM models design
Now we need to design the models that will be used in the recognition process. A 3-state left-to-right model will be used at the phone level. In this case we assume that a phone consists of three parts: two transition parts at the phone boundaries and a middle part. We need to build a model and initialize it for each phone under test. We have a labeled database and the associated feature files; those labeled files will be used to train each corresponding phone model under test. So we need to prepare the following files to build the models:
1- A Master Label File (MLF) that contains the cross-reference between the available training database and its phonetic content.
2- A single HMM prototype file. This file will be used as the initial model for building a separate monophone HMM for each phone.
Code
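The MLF listing is not reproduced here. A minimal sketch of one entry in "phones0.mlf", assuming the standard HTK MLF layout, is shown below; the file name and the phone sequence (the word "one" wrapped in silence) are illustrative only.

#!MLF!#
"*/1015356936.lab"
sil
w
a2
~h
i
d
sil
.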
At this stage, drop the short pause (sp) labels from the MLF. We will not build a model for sp now; we will use the silence model later on to build the SP model.
Code
The initial prototype for 3 emitting states and 2 Gaussian mixtures per state is presented below. Let us name it "proto".
~h "proto" <BeginHMM> <VecSize> 39 <MFCC_0_D_A> <NumStates> 5 <State> 2 <NumMixes> 2 <Mixture> 1 0.5 <Mean> 39 0 0 0 0 0 0 <Variance> 39 1 1 1 1 1 1 <Mixture> 2 0.5 <Mean> 39 0 0 0 0 0 0 <Variance> 39 1 1 1 1 1 1 <State> 3 <NumMixes> 2 <Mixture> 1 0.5 <Mean> 39 0 0 0 0 0 0 <Variance> 39 1 1 1 1 1 1 <Mixture> 2 0.5 <Mean> 39 0 0 0 0 0 0 <Variance> 39 1 1 1 1 1 1 <State> 4 <NumMixes> 2 <Mixture> 1 0.5 <Mean> 39 0 0 0 0 0 0 <Variance> 39 1 1 1 1 1 1 <Mixture> 2 0.5 <Mean> 39 0 0 0 0 0 0 <Variance> 39 1 1 1 1 1 1 <TransP> 5 0 0.16 0.84 0 0 0 0.04 0.47 0.49 0 0 0 0.26 0.38 0.36 0 0 0 0.22 0.78 0 0 0 0 0 <EndHMM>
Command line
The below command line is used to initialize the prototype model using the available database.
HCompV -C config -f 0.01 -m -S train.scp proto
Code
The file train.scp contains the list of the available training files. Below is part of the file "train.scp".
F:\TEMP\temp1\TrainSet\1015356936.wav
F:\TEMP\temp1\TrainSet\106707170.wav
F:\TEMP\temp1\TrainSet\1135829975.wav
F:\TEMP\temp1\TrainSet\1197356831.wav
F:\TEMP\temp1\TrainSet\1236274632.wav
F:\TEMP\temp1\TrainSet\1243860769.wav
F:\TEMP\temp1\TrainSet\1265385611.wav
F:\TEMP\temp1\TrainSet\1299215089.wav
F:\TEMP\temp1\TrainSet\131557568.wav
F:\TEMP\temp1\TrainSet\1316301303.wav
F:\TEMP\temp1\TrainSet\1363171868.wav
F:\TEMP\temp1\TrainSet\1421050189.wav
Note
It is assumed that the associated label files are available in the same location as the corresponding WAV files. The above command line will generate two files:
1- The modified file proto, after initializing the parameters using the database.
2- The variance floor macro file named "vFloors". This macro file is generated due to the -f option in the command line. The variance floor in this case is equal to 0.01 times the global variance. This is a vector of values which will be used to set a floor on the variances estimated in the subsequent steps.
Code
Now we need to use the prototype file to create a separate initial HMM model for each phone under test. It is better to merge all models into a single file that contains them all. We will create two new files that will be used in all subsequent processes:
1- The "hmmdefs" file, which contains all HMM definitions.
2- The "macros" file, which contains the macros common to all HMM definitions.
Code

Part of the file "hmmdefs" is shown below (the listing continues in the same pattern for every phone in the list).

<GCONST> 9.240792e+001
<MIXTURE> 2 5.000000e-001
<MEAN> 39
-7.067261e+000 1.175784e+000 2.454413e+000
<VARIANCE> 39
3.558010e+001 2.060277e+001 1.728483e+001
<GCONST> 9.240792e+001
<STATE> 3
<NUMMIXES> 2
<MIXTURE> 1 5.000000e-001
<MEAN> 39
-7.067261e+000 1.175784e+000 2.454413e+000
<VARIANCE> 39
3.558010e+001 2.060277e+001 1.728483e+001
<GCONST> 9.240792e+001
<MIXTURE> 2 5.000000e-001
<MEAN> 39
-7.067261e+000 1.175784e+000 2.454413e+000
<VARIANCE> 39
3.558010e+001 2.060277e+001 1.728483e+001
<GCONST> 9.240792e+001
<STATE> 4
<NUMMIXES> 2
<MIXTURE> 1 5.000000e-001
<MEAN> 39
-7.067261e+000 1.175784e+000 2.454413e+000
<VARIANCE> 39
3.558010e+001 2.060277e+001 1.728483e+001
<GCONST> 9.240792e+001
<MIXTURE> 2 5.000000e-001
<MEAN> 39
-7.067261e+000 1.175784e+000 2.454413e+000
<VARIANCE> 39
3.558010e+001 2.060277e+001 1.728483e+001
<GCONST> 9.240792e+001
<TRANSP> 5
0.000000e+000 1.600000e-001 8.400000e-001 0.000000e+000 0.000000e+000
0.000000e+000 4.000000e-002 4.700000e-001 4.900000e-001 0.000000e+000
0.000000e+000 0.000000e+000 2.600000e-001 3.800000e-001 3.600000e-001
0.000000e+000 0.000000e+000 0.000000e+000 2.200000e-001 7.800000e-001
0.000000e+000 0.000000e+000 0.000000e+000 0.000000e+000 0.000000e+000
<ENDHMM>
~h "i"
<BEGINHMM>
<NUMSTATES> 5
<STATE> 2
Figure 10: Master Macro Files (MMF) for HMM.
To create the Master Macro File (MMF), simply copy the text from the prototype into the MMF file and repeat it for each phone; you just have to rename each model with one of the phones under test, as indicated in figure 10. Then, to create the macros file, copy the first 3 lines of the prototype and the contents of the vFloors file mentioned shortly before into a single file. This is the macros file. See figure 10 for more details. A scripted version of this step is sketched below.
Code
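Since this copy-and-rename step is mechanical, it can also be scripted. Below is a minimal C# sketch of the idea; the file names proto, vFloors, monophones0, hmmdefs and macros follow the text above, and the assumption that the first three lines of proto are its global option lines is taken from the procedure just described.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class MmfBuilder
{
    static void Main()
    {
        string[] proto  = File.ReadAllLines("proto");
        string[] phones = File.ReadAllLines("monophones0")
                              .Where(p => p.Trim().Length > 0).ToArray();

        // "macros": the first three (global option) lines of proto plus the vFloors file.
        File.WriteAllLines("macros", proto.Take(3).Concat(File.ReadAllLines("vFloors")));

        // "hmmdefs": one copy of the <BEGINHMM>..<ENDHMM> body per phone,
        // each renamed through the ~h macro.
        int begin = Array.FindIndex(proto, l => l.ToUpper().Contains("<BEGINHMM>"));
        string body = string.Join(Environment.NewLine, proto.Skip(begin));
        var defs = new List<string>();
        foreach (string phone in phones)
        {
            defs.Add("~h \"" + phone + "\"");
            defs.Add(body);
        }
        File.WriteAllLines("hmmdefs", defs);
    }
}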
As is clear in the files above, some lines start with the symbol ~. This symbol indicates that the next character is a macro identification symbol. Each macro has a unique meaning in HTK; for a complete reference of the macros you should refer to the HTK manual. The macro ~h holds the HMM name, so each model starts with a macro that gives the name of the model. Let us navigate through the listed tags.

~h "i"

This macro names the model; here it is the model for the phone "i".

<NUMSTATES> 5

This tag indicates that the model has 5 states, including the non-emitting states. It is always assumed that the first and the last states are non-emitting; they are used to mark the entry and exit of the model.
<STATE> 2
This tag announces the beginning of the definition of state number 2. You may notice that there is no definition of state number 1. State number 1 and state number 5 are non emitting states so there is nothing to be defined for them.
<NUMMIXES> 2
This tag defines the number of Gaussian mixtures used to define the emitting probability distribution function for state 2.
<MIXTURE> 1 5.000000e-001
This tag announces the beginning of the definition of Mixture number 1. It also defines the mix ratio. It is 0.5 for this mixture.
<MEAN> 39
This tag defines the mean array. It has 39 elements, which must equal the length of the feature vector. The mean array then follows; here are its first three elements.
-7.067261e+000 1.175784e+000 2.454413e+000
<VARIANCE> 39
This tag announces the start of the variance array. In general, the covariance matrix gives the correlation between the feature vector parameters. If there is no correlation between the parameters, it reduces to a diagonal N x N matrix whose diagonal elements are the variances of the individual feature vector elements. In this case there is no need to store the zero elements of the N x N matrix; it is sufficient to store only the N diagonal elements, as shown below. Below are the first 3 of the 39 elements.
3.558010e+001 2.060277e+001 1.728483e+001
<GCONST> 9.240792e+001
The <GCONST> value is a constant that HTK pre-computes for each Gaussian (it depends on the determinant of the covariance) to speed up the likelihood calculation. The above tags are then repeated for each state in the model. After that, the definition of the transition matrix begins with the following tag.
<TRANSP> 5
This is to announce the beginning of the transition matrix. It is 5 x 5 elements here. Below is the elements array
0.0e+000 1.6e-001 8.4e-001 0.0e+000 0.0e+000
0.0e+000 4.0e-002 4.7e-001 4.9e-001 0.0e+000
0.0e+000 0.0e+000 2.6e-001 3.8e-001 3.6e-001
0.0e+000 0.0e+000 0.0e+000 2.2e-001 7.8e-001
0.0e+000 0.0e+000 0.0e+000 0.0e+000 0.0e+000
Note that the non-emitting states are included in the transition matrix. The probability of staying in state 1 is 0; this is indicated by the element a11 = 0. From state 1 we can only move to state 2 or state 3, which is expressed by the non-zero probabilities a12 = 0.16 and a13 = 0.84. Going through the transition matrix we can figure out that the model is the one given by figure 11.
Starter Model
Create a subfolder named hmm0 and copy the files hmmdefs and macros into it. These will be the first model definitions. Now we can re-estimate the parameters using the available database.
Re-estimating the model parameters
HERest -C config -I phones0.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm0\macros -H hmm0\hmmdefs -M hmm1 monophones0
The content of the file monophones0 which contains the list of all monophones under test is shown below.
Sa
i
f
r
sil
w
a2
~h
d
@
t
n
i2
n2
a
l
h
b
~@
x
m
s
u
Ta
The tool HERest will fetch the training data defined in the file train.scp for each phone listed in the file monophones0. It will use the file phones0.mlf to locate the files that contain the proper speech segments for training each phone. The initial parameters are read from the macro files hmmdefs and macros stored in the subfolder hmm0, and the final parameter estimates are stored in similar macro files in the subfolder hmm1. During training the token propagation method is utilized, and pruning is applied to keep only well-scoring tokens. Let us discuss the token propagation method. Consider figure 12. A token is a record object that is propagated through all available networks defined in this system. Assume a word of length T monophones.
To find the best path, many tokens are launched to navigate every possible path of length T monophones. During the propagation, the probability is accumulated along with the monophone symbols. At the end, the path that evaluates to the maximum probability is considered the winning path. In figure 12, 2 tokens are launched to navigate the 2 paths shown in the figure.
To minimize the calculation time in large networks, a pruning method is applied. It is assumed that there is a bandwidth for the propagated tokens; the bandwidth is a margin of probability relative to the maximum probability at time t. Figure 13 illustrates how the pruned tokens are chosen. The pruning process is controlled on the HERest command line by the -t parameter.
-t 250.0 150.0 1000
This parameter indicates that the pruning bandwidth starts at 250.0 during the re-estimation process. If a training file cannot be processed within this bandwidth, the bandwidth is increased by increments of 150.0. The process is repeated until the file is processed or the bandwidth exceeds 1000.0, in which case an error message is generated indicating a problem in the training file.
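Code

For illustration only, the sketch below shows the core idea of token passing with beam pruning in C#. This is not HTK code; the Token class, the Propagate function, its parameters and the way the emission score is supplied are all illustrative.

using System;
using System.Collections.Generic;
using System.Linq;

class TokenPassingSketch
{
    class Token
    {
        public double LogProb;                          // accumulated log probability
        public List<string> Path = new List<string>();  // symbols collected so far
    }

    // One time step of token propagation over n states.
    // logA[i, j]  : log transition probability from state i to state j.
    // logEmission : log emission probability of the current observation in state j.
    // beam        : tokens falling more than `beam` below the best score are pruned.
    static Token[] Propagate(Token[] tokens, double[,] logA,
                             Func<int, double> logEmission, double beam)
    {
        int n = tokens.Length;
        var next = new Token[n];
        for (int j = 0; j < n; j++)
            for (int i = 0; i < n; i++)
            {
                if (tokens[i] == null) continue;                 // no surviving token in state i
                double score = tokens[i].LogProb + logA[i, j] + logEmission(j);
                if (next[j] == null || score > next[j].LogProb)  // keep only the best incoming token
                    next[j] = new Token { LogProb = score, Path = new List<string>(tokens[i].Path) };
            }

        // Beam pruning: discard tokens outside the probability margin (the "bandwidth").
        var alive = next.Where(t => t != null).ToList();
        if (alive.Count == 0) return next;
        double best = alive.Max(t => t.LogProb);
        for (int j = 0; j < n; j++)
            if (next[j] != null && next[j].LogProb < best - beam)
                next[j] = null;
        return next;
    }
}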
Figure 14 illustrates how pruning is implemented in the token propagation method for a large network. Now we need to modify the SIL model and to include the short pause (SP) model in the starter model. It is much better to modify the SIL model so that it consumes the environment noise and can extend much longer than the normal phones. This is achieved by adding a transition from state 4 to state 2; in this case the time spent occupying this model may be extended, because the token is not forced to propagate to state 5 (the end state) as in the regular phones but may propagate back to state 2. We also add a transition from state 2 to state 4 so that the token may exit the model quickly if it needs to transfer to the next phone.
Although we manually add some transition parameters to the transition matrix, the parameters will later be estimated from the available training data. We just add the track for the token to travel; the training data will then strengthen or weaken the initial probability of any track. Regarding SP, it represents a short pause. It is very similar to SIL but covers a very short period. We can make a model for SP and append it manually to
the MMF (the last hmmdefs). It is a 3-state model. Its second state (the only emitting one) will be shared with the middle state of the SIL model. This is to make use of the available SIL database and to avoid under-training due to the lack of data: SP appears very rarely during the speech, so it is better to make use of the similar SIL data in training the SP model.
SIL model (state 3 is shared with the SP model)
SP model
Figure 15: SIL and SP models sharing the middle state.
We will need to add the initial definition of the SP model into the MMF file. This may be added manually.
Code
The initial model definition of SP model is shown below. This part should be appended into the MMF file (hmmdefs).
<BeginHMM>
  <VecSize> 39 <MFCC_0_D_A>
  <NumStates> 3
  <State> 2 <NumMixes> 1
    <Mixture> 1 1
      <Mean> 39
        0 0 0 0 0 0 0 0 0
      <Variance> 39
        0 0 0 0 0 0 0 0 0
  <TransP> 3
    0 0.26 0.74
    0 0.05 0.95
    0 0 0
<EndHMM>
SP model
The parameter values may be anything; they will be modified in the re-estimation process using the training data.
Command line
The command line below edits the HMM models for SIL and SP so that they match the design in figure 15.
HHEd -H hmm2\macros -H hmm2\hmmdefs -M hmm3 sil.hed monophones1
Code
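The edit script "sil.hed" itself is not listed in the source. A sketch of what it might contain, assuming the standard HHEd AT (add transition) and TI (tie) commands, the design of figure 15 (extra sil transitions from state 2 to 4 and from state 4 to 2, and the sp emitting state tied to the centre state of sil), and models named sil and sp, is shown below; the transition probabilities and the macro name silst are illustrative.

AT 2 4 0.2 {sil.transP}
AT 4 2 0.2 {sil.transP}
TI silst {sil.state[3],sp.state[2]}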
Now the SP model is included in the Master Macro File. We also need to modify the model list file by adding sp to it; it is better to rename the updated file monophones1.
Code
Sa
i
f
r
sil
w
a2
~h
d
@
t
n
i2
n2
a
l
h
b
~@
x
m
s
u
Ta
sp
Now we may apply a single re-estimation pass on the last MMF files using the available training data. The last HMM macro files are available in the subfolder hmm3.
HERest -C config -I phones1.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm3\macros -H hmm3\hmmdefs -M hmm4 monophones1
MLF with SP
The master label file phones1.mlf is the same as phones0.mlf with the exception of adding the SP label.
What we have now
File            Location   Description
Phones1.mlf     Root       Master label file; the same as phones0.mlf with the sp label added.
wdnet           Root       Word network in lattice format.
Dict            Root       Dictionary which describes how the words are spoken in terms of the symbols used in the recognition process; the symbols here are Arabic monophones.
Train.scp       Root       Script file containing the list of training files.
WAV files       TrainSet   All WAV files needed for training the models.
Feature files   TrainSet   All feature files associated with the WAV files; same names as the WAV files with the file extension .mfc.
Label files     TrainSet   All label files for the associated WAV files; same names with the file extension .LAB.
hmmdefs         Hmm4       Master Macro File that contains the model definitions of all symbols included in the recognition.
Macros          Hmm4       Master Macro File that includes the common macro definitions for all symbols.
g. Evaluating the recognition process
The above HMM models need to be evaluated using a testing database. The testing database should be prepared in the same way as the training database so that the correct answers can be counted. So we should prepare some data as indicated before and assign it to the testing process.
Code
The file test.scp will be prepared. This file lists all samples prepared for the testing process. Part of this file is shown below.
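The listing itself is not reproduced in the source; the file simply lists the test data files, one per line, in the same way as train.scp. The file names and the folder below are hypothetical placeholders.

F:\TEMP\temp1\TestSet\2015356936.wav
F:\TEMP\temp1\TestSet\206707170.wav
F:\TEMP\temp1\TestSet\2091955181.wav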
Below, the recognizer is tested using the files listed in test.scp.

HVite -C Config -H hmm4\macros -H hmm4\hmmdefs -S test.scp -i recout.mlf -w wdnet -p 100.0 -s 5.0 dict monophones1
This command will produce a master label file called recout.mlf.
Code
To evaluate the results, we prepare the file testref.mlf. This file contains the label information of all files used in the test.
Code
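The scoring command itself is not shown in the source. With HTK, recognition results are normally scored with the HResults tool, so a sketch using the files named above would be as follows; "wordlist" is a hypothetical file listing the words used in the task (the label list argument required by HResults).

HResults -I testref.mlf wordlist recout.mlf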
3. References
[1] L.R. Rabiner, R.W. Schafer, "Digital Processing of Speech Signals", Prentice-Hall, ISBN 0-13-213603-1.
[2] Thomas W. Parsons, "Voice and Speech Processing", McGraw-Hill, Inc., 1987.
[3] Alessia Paglialonga, "Speech Processing for Cochlear Implants with the Discrete Wavelet Transform: Feasibility Study and Performance Evaluation", Proceedings of the 28th IEEE EMBS Annual International Conference, New York City, USA, Aug 30 - Sept 3, 2006.
[4] Mel scale, http://en.wikipedia.org/wiki/Mel_scale
[5] HTK manual, http://htk.eng.cam.ac.uk/ftp/software/htkbook.pdf.zip