Subject Related Assignment No.1

Note:
1. Attempt all questions.
2. Start the answer to each new question on a fresh page.
3. Assume suitable data if necessary.

Q1. Explain the four broad classes of phonemes. Explain voiced and unvoiced fricatives in detail.
Q2. Explain the LPC model of speech recognition.
Q3. Explain an isolated digit recognition system in detail.
Q4. Explain, with a block diagram, a continuous/connected digit speech recognition system.
Q5. A speech signal is sampled at a rate of 20,000 samples per second (Fs = 20 kHz). A 20 msec
window is used for short-time spectral analysis, and the window is moved by 10 msec in
consecutive analysis frames. Assume that a radix-2 FFT is used to compute DFTs.
1. How many speech samples are used in each segment?
2. What is the frame rate of the short-time spectral analysis?
3. What size DFT and FFT are required to guarantee that no time aliasing will occur?
4. What is the resulting frequency resolution between adjacent spectral samples?

(Dr. M. U. Nemade)
Supervisor
Q1) Explain the four broad classes of phonemes. Explain voiced and unvoiced fricatives in detail.
Ans. The main point to understand about speech is that the sounds generated by a human are filtered
by the shape of the vocal tract including tongue, teeth etc. This shape determines what sound comes
out. If we can determine the shape accurately, this should give us an accurate representation of
the phoneme being produced.
Phonemes are the basic units of which speech sounds are composed. The waveform of each
phoneme is characterized by a small set of distinctive features, where a distinctive feature is a
minimal unit that distinguishes between two maximally close but linguistically distinct speech
sounds. These acoustic features should not be affected by the different vocal tract sizes and shapes of
speakers or by changes in voice quality.
Phonemes are classified into categories such as vowels, nasals, plosives, fricatives,
approximants, and silence. Each of these classes has discriminative features by which it can be
easily identified. For example, vowels can be recognized by their higher amplitude, whereas
fricatives are marked by their high zero-crossing rate.
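To make these cues concrete, here is a minimal sketch (a NumPy-based illustration, not part of the original text) that computes short-time energy and zero-crossing rate per frame; the vowel/fricative decision rule at the end is an illustrative assumption that would need tuning on real data.

    import numpy as np

    def frame_features(x, frame_len=400, hop=200):
        """Short-time energy and zero-crossing rate for each analysis frame."""
        feats = []
        for start in range(0, len(x) - frame_len + 1, hop):
            frame = x[start:start + frame_len].astype(float)
            energy = np.sum(frame ** 2)                          # loudness cue (vowels: high)
            zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2   # sign changes per sample (fricatives: high)
            feats.append((energy, zcr))
        return np.array(feats)

    # Illustrative rule (thresholds are assumptions):
    # high energy + low ZCR  -> vowel-like frame
    # low energy + high ZCR  -> fricative-like frame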
The voiced dental fricative is a type of consonantal sound used in some spoken languages. It
is familiar to English speakers as the 'th' sound in father. The symbol in the International Phonetic
Alphabet that represents this sound is 'eth' (ð). It was taken from the Old English letter eth,
which could stand for either a voiced or an unvoiced interdental non-sibilant fricative.
This symbol is also sometimes used to represent the dental approximant, a similar sound not known
to contrast with a dental non-sibilant fricative in any language, though the approximant is more clearly
written with the lowering diacritic. The dental non-sibilant fricatives are often called "interdental"
because they are often produced with the tongue between the upper and lower teeth, and not just
against the back of the upper teeth, as with other dental consonants.
This sound and its unvoiced counterpart are rare phonemes. The great majority of the languages of
Europe and Asia, such as German, French, Persian, Japanese, and Mandarin, lack this sound. Native
speakers of languages in which the sound is not present often have difficulty enunciating or
distinguishing it, and replace it with a voiced alveolar sibilant, a voiced dental or alveolar stop,
or a voiced labiodental fricative, known respectively as th-alveolarization, th-stopping, and th-fronting.
As for Europe, there seems to be a great arc where this sound (and/or its unvoiced variant) is present.
Most of mainland Europe lacks the sound. However, some "periphery" languages
such as Gascon, Welsh, English, Elfdalian, Northern Sami, Mari, Greek, Albanian, Sardinian, some
dialects of Basque and most speakers of Spanish have this sound in their consonant inventories, as
phonemes or allophones.
Q2) Explain the LPC model of speech recognition.
Ans. A simple LPC synthesis model is shown below:

u(n) --> [ G ] --> [ 1/A(z) ] --> s(n)

An excitation u(n), scaled by a gain G, drives an all-pole vocal tract filter 1/A(z) to produce the speech signal s(n).
The basic idea behind the LPC model is that a given speech sample at time n, s(n), can be
approximated as a linear combination of the past p speech samples:

s(n) ≈ a1 s(n-1) + a2 s(n-2) + ... + ap s(n-p)

Linear predictive coding (LPC) is a tool used mostly in audio signal processing and speech
processing for representing the spectral envelope of a digital speech signal in compressed form,
using the information of a linear predictive model.[1] It is one of the most powerful speech analysis
techniques and one of the most useful methods for encoding good-quality speech at a low bit rate,
and it provides extremely accurate estimates of speech parameters.
LPC starts with the assumption that a speech signal is produced by a buzzer at the end of a tube
(voiced sounds), with occasional added hissing and popping sounds (sibilants and plosive sounds).
Although apparently crude, this model is actually a close approximation of the reality of speech
production. The glottis (the space between the vocal folds) produces the buzz, which is characterized
by its intensity (loudness) and frequency (pitch). The vocal tract (the throat and mouth) forms the
tube, which is characterized by its resonances, which give rise to formants, or enhanced frequency
bands in the sound produced. Hisses and pops are generated by the action of the tongue, lips and
throat during sibilants and plosives.
LPC analyzes the speech signal by estimating the formants, removing their effects from the
speech signal, and estimating the intensity and frequency of the remaining buzz. The process of
removing the formants is called inverse filtering, and the remaining signal after the subtraction of the
filtered modeled signal is called the residue. LPC is frequently used for transmitting spectral
envelope information, and as such it has to be tolerant of transmission errors. Transmission of the
filter coefficients directly (see linear prediction for definition of coefficients) is undesirable, since
they are very sensitive to errors. In other words, a very small error can distort the whole spectrum, or
worse, a small error might make the prediction filter unstable.
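As a hedged illustration of this analysis step, the sketch below (assuming NumPy/SciPy; not the original author's code) estimates p = 12 LPC coefficients from a windowed frame via the autocorrelation method and Levinson-Durbin recursion, then obtains the residue by inverse filtering with A(z).

    import numpy as np
    from scipy.signal import lfilter

    def lpc_analysis(frame, p=12):
        """Estimate LPC coefficients a[0..p] (a[0] = 1) and the residual signal."""
        # Autocorrelation lags r[0..p]
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + p]
        a = np.zeros(p + 1)
        a[0] = 1.0
        err = r[0]
        for i in range(1, p + 1):                # Levinson-Durbin recursion
            k = -np.dot(a[:i], r[i:0:-1]) / err  # reflection coefficient
            a[:i + 1] += k * a[i::-1]            # update predictor coefficients
            err *= 1.0 - k * k                   # update prediction error power
        residual = lfilter(a, [1.0], frame)      # inverse filtering by A(z)
        return a, residual

The residual approximates the excitation: its energy gives the gain, and its periodicity (for voiced frames) the pitch.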
There are more advanced representations such as log area ratios (LAR), line spectral pairs
(LSP) decomposition, and reflection coefficients. Of these, LSP decomposition in particular has
gained popularity, since it ensures stability of the predictor and spectral errors remain local for
small coefficient deviations.
Applications:
LPC is generally used for speech analysis and resynthesis. It is used as a form of voice compression
by phone companies, for example in the GSM standard. It is also used for secure wireless, where
voice must be digitized, encrypted and sent over a narrow voice channel; an early example of this is
the US government's Navajo I. LPC synthesis can be used to construct vocoders where musical
instruments are used as the excitation signal for the time-varying filter estimated from a singer's
speech. This is somewhat popular in electronic music. Paul Lansky made the well-known computer
music piece not-just-more-idle-chatter using linear predictive coding. LPC predictors are used in
Shorten, MPEG-4 ALS, FLAC, the SILK audio codec, and other lossless audio codecs. LPC is receiving some
attention as a tool for use in the tonal analysis of violins and other stringed musical instruments.

Q3. Explain an isolated digit recognition system in detail.


Ans.
[Figure: overall process of isolated digit recognition]
[Figure: detailed flow of how an isolated digit is recognized]

This process consists of two phases: a training phase and a verification phase.
The training phase accepts speech samples from different people and trains the system to create
acoustic models for each word in the vocabulary. The training phase goes through two stages: data
preparation and data recording.
The verification phase displays some random numbers and then checks the pronounced numbers.
The system may also include speech processing for digit boundary detection and recognition, which
uses zero-crossing and energy techniques. Mel Frequency Cepstral Coefficient (MFCC) vectors are
used to provide an estimate of the vocal tract filter, while dynamic time warping (DTW) is used
to find the nearest recorded voice, as sketched below.
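A minimal DTW sketch (an assumed implementation, not from the source) that computes the alignment cost between a test utterance and a stored template, each represented as a sequence of feature vectors:

    import numpy as np

    def dtw_distance(A, B):
        """DTW cost between two feature sequences of shape (frames, dims)."""
        n, m = len(A), len(B)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = np.linalg.norm(A[i - 1] - B[j - 1])   # local frame distance
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]

    # Recognition: the template with the smallest DTW distance wins, e.g.
    # best = min(templates, key=lambda d: dtw_distance(test_feats, templates[d]))

Here 'templates' is a hypothetical name for a dictionary mapping each digit to its reference feature sequence.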
The general methodology of audio classification involves extracting discriminatory features from the
audio data and feeding them to a pattern classifier. Different approaches and various kinds of audio
features have been proposed, with varying success rates. The features can be extracted either directly
from the time-domain signal or from a transform domain, depending upon the choice of signal
analysis approach. Among the audio features that have been successfully used for audio
classification are Mel Frequency Cepstral Coefficients (MFCC).
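For example, MFCC vectors can be extracted with the third-party librosa library (a sketch; the filename "digit.wav" is a hypothetical recording):

    import librosa

    y, sr = librosa.load("digit.wav", sr=None)           # hypothetical recording
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, n_frames)
    feats = mfcc.T   # one 13-dimensional vector per frame, ready for DTW or an HMM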

Q4. Explain, with a block diagram, a continuous/connected digit speech recognition system.
Ans. Connected digit speech recognition is important in many applications such as voice dialing,
automated banking systems, automatic data entry, PIN entry, etc.
The main problem in continuous speech recognition, compared with isolated speech, is the detection
of word boundaries, which are often fuzzy because of sound articulation. Moreover, the speaking
rate of each speaker is less constant, and pronunciation is less careful.
The following diagram shows connected digit speech recognition using HMMs:

PRM --> PVQ --> Isolated HMM --> Connected Recognition

● PRM: Manually segmented and labeled speech (digits) from which LPC cepstral coefficients are
calculated. A liftering filter is also applied.
● PVQ: Vector quantizer, using N vectors as the codebook.
● Isolated HMM: HMMs trained with segmented digits.
● Connected Recognition: Synthesis of an HMM capable of recognizing connected digits.
PRM – As an individual block:

Frames --> Hamming window --> LPC analysis --> Cepstrum --> Liftering filter

● The main envelope information is found in the first cepstral coefficients; 12 are used as a starting
point.
● The cepstrum is an ideal representation, since distance between cepstral coefficients is strongly
related to differences in perception.
● A liftering filter is used to help whiten the resulting vector.
PVQ – Choosing the appropriate size of the vector quantizer codebook is essential given the small
size of the training and test database. An inadequate codebook size could cause problems (see the
sketch after this list):
● If too large, it would give a poor general representation but a very good fit to the training
database (overfitting).
● If too small, the quantization would cause an important loss of information that is probably useful
for recognition.
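A codebook-training sketch using SciPy's classic k-means vector quantizer (the codebook size N = 64 and the random stand-in data are assumptions):

    import numpy as np
    from scipy.cluster.vq import kmeans, vq

    feats = np.random.randn(5000, 13)        # stand-in for pooled cepstral vectors
    N = 64                                   # assumed codebook size
    codebook, distortion = kmeans(feats, N)  # train the VQ codebook
    symbols, _ = vq(feats, codebook)         # nearest-codeword index per frame
    # 'symbols' is the discrete observation sequence fed to the HMMs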
Isolated HMM (a training sketch follows this list):
● An HMM for each digit is trained separately.
● Each HMM has five states.
● Since the system is originally designed to identify isolated digits, besides the usual training, an
HMM representing silence is also trained.
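A training/recognition sketch using the third-party hmmlearn library (an assumption; for brevity it fits Gaussian-emission HMMs on the cepstral vectors directly, rather than the discrete VQ symbols described above):

    import numpy as np
    from hmmlearn.hmm import GaussianHMM   # pip install hmmlearn

    def train_digit_model(utterances):
        """Train a 5-state HMM from a list of (frames, dims) feature arrays."""
        X = np.vstack(utterances)               # stack all training utterances
        lengths = [len(u) for u in utterances]  # frame count per utterance
        model = GaussianHMM(n_components=5)
        model.fit(X, lengths)
        return model

    def recognize(frames, models):
        """Pick the digit whose HMM assigns the highest log-likelihood."""
        return max(models, key=lambda d: models[d].score(frames))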
Connected Recognition – A conceptual overview:
● An HMM can be synthesized by connecting smaller HMMs, which represent its elemental units.
● This composite HMM can be used as a point of departure for a general HMM. In this case, if it is
not necessary to modify each state's p.d.f., only the transition probabilities between states are
re-estimated.
● In fact, this point of departure is used in the Extended Baum-Welch algorithm.

Q5. A speech signal is sampled at a rate of 20,000 samples per second (Fs = 20 kHz). A 20 msec
window is used for short-time spectral analysis, and the window is moved by 10 msec in
consecutive analysis frames. Assume that a radix-2 FFT is used to compute DFTs.
1. How many speech samples are used in each segment?
2. What is the frame rate of the short-time spectral analysis?
3. What size DFT and FFT are required to guarantee that no time aliasing will occur?
4. What is the resulting frequency resolution between adjacent spectral samples?

Ans.

1) 20 ms of speech at a rate of 20,000 samples per second gives

20 × 10^-3 × 20,000 = 400 samples

Each segment is 400 samples in duration.
2) Frame rate = 1 / frame shift = 1 / (10 × 10^-3) = 100 frames/sec
3) For the short-time Fourier transform, to avoid time aliasing the DFT size must be at least as large
as the analysis frame. Hence we require at least a 400-point DFT, but since a radix-2 FFT is used,
the next power of 2 is 512 points. The 400 speech samples are used and the remaining 112 points
are set to zero, i.e., zero padding.
Since the speech signal is real, we can alternatively compute the 512-point DFT with a 256-point
complex FFT.

4) Frequency resolution = sampling rate / DFT size = 20,000 Hz / 512 ≈ 39 Hz
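The four results can be checked with a few lines of arithmetic (a sketch using only the numbers given in the question):

    fs = 20_000                                  # sampling rate, Hz
    n_samples = int(fs * 0.020)                  # (1) 400 samples per 20 ms window
    frame_rate = 1000 / 10                       # (2) 10 ms shift -> 100 frames/sec
    nfft = 1 << (n_samples - 1).bit_length()     # (3) next power of 2 -> 512
    resolution = fs / nfft                       # (4) 20000 / 512 ≈ 39.06 Hz
    print(n_samples, frame_rate, nfft, round(resolution, 2))   # 400 100.0 512 39.06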
