Chapter 2 - Speech Signal Processing

(Feature Extraction)
Ear Physiology

Pinna (the visible outer ear)
Tympanic (relating to the eardrum)
Eardrum
Ossicles (the three small middle-ear bones)
Malleus (hammer)
Incus (anvil)
Stapes (stirrup)
Cochlea (snail-shaped / spiral)
Vestibular (canal)
Interesting Aspects of Perception

• Audible sound range is from 20Hz to 20kHz


• Ear is not equally sensitive to all
frequencies
• Perceived loudness is a function of both
the frequency and the amplitude of the
sound wave
Loudness of Pure Tones
• Contours of “equal loudness” can be estimated

• The ear is relatively insensitive to low-frequency
sounds of moderate to low intensity

• Maximum sensitivity of the ear is at around
4 kHz (the ear hears the same loudness at the
minimum intensity on a given contour). There is
a secondary local maximum near 13 kHz due to
the first two resonances of the ear canal

[Figure: equal-loudness contours — maximum sensitivity at 4 kHz,
secondary maximum in sensitivity near 13 kHz]
Critical Bands
• The cochlea converts pressure waves to
neural firings:
– Vibrations induce traveling waves down the basilar
membrane
– Traveling waves induce peak responses at
frequency-specific locations on the basilar membrane

• Frequency is perceived within “critical bands”
– They act like band-pass filters
– They define the “frequency resolution” of the auditory system
– There are about 24 critical bands along the basilar membrane
– Each critical band is about 1.3 mm long and contains
about 1300 neurons
Mel Scale
• Linear below 1 kHz and logarithmic above 1 kHz

• A common approximation: Mel(f) = 2595 · log10(1 + f/700)
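The Mel mapping can be sketched in code. The slide's own formula is not shown, so the common O'Shaughnessy form 2595 · log10(1 + f/700) below is an assumption:

```python
import math

def hz_to_mel(f):
    """Mel value for frequency f in Hz (O'Shaughnessy approximation)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping: frequency in Hz for Mel value m."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

Note that `mel_to_hz` inverts the mapping, which is useful later when placing filterbank edges at evenly spaced Mel points.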
Feature Extraction for Speech
Recognition
• Frame-Based Signal Processing

• Linear Prediction Analysis

• Cepstral Representations
– Linear Prediction Cepstral Coefficients (LPCC)

– Mel-Frequency Cepstral Coefficients (MFCC)

– Perceptual Linear Prediction (PLP)


Goals of Feature Extraction
• Compactness

• Discrimination Power

• Low Computational Complexity

• Reliability

• Robustness
Discrete Representation of
Speech
Digital Representation of Speech
• Sampling Rates
– 16,000 Hz (samples/second) for microphone
speech
– 8,000 Hz (samples/second) for telephone
speech
• Storage formats:
– Pulse Code Modulation (PCM)
• 16-bit (2 bytes) per sample
• Values in the range −32768 to +32767
• Stored as “short” integers
• Microsoft “wav” files
Signal Pre-emphasis
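Pre-emphasis is typically a first-order high-pass filter, y[n] = x[n] − a·x[n−1], which boosts the high frequencies before analysis. The coefficient a = 0.97 below is a conventional choice, not taken from the slide:

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """Apply y[n] = x[n] - alpha * x[n-1]; the first sample passes through."""
    x = np.asarray(x, dtype=float)
    return np.append(x[0], x[1:] - alpha * x[:-1])
```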
Frame Blocking
• Process the speech signal in small chunks
over which the signal is assumed to have
stationary spectral characteristics
• Typical analysis window is 25 msec
– 400 samples for 16 kHz audio
• Typical frame rate is 10 msec
– Analysis advances by 160 samples for
16 kHz audio
• Consecutive frames generally overlap by at least 50% in time
– Results in 100 “frames” of audio per second
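The framing parameters above (25 msec windows, 10 msec hop at 16 kHz) can be sketched as:

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Slice a signal into overlapping frames (assumes len(signal) >= window)."""
    signal = np.asarray(signal)
    frame_len = sample_rate * frame_ms // 1000   # 400 samples at 16 kHz
    hop = sample_rate * hop_ms // 1000           # 160 samples at 16 kHz
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])
```

One second of 16 kHz audio yields 98 full frames with these settings; the "100 frames per second" figure on the slide ignores the edge effect of the final partial windows.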
Input Representation

• Audio is represented by a sequence of feature vectors
X: x1, x2, x3, …
where each xi is a 39-dimensional MFCC vector
Frame-based Processing Example:
Speech Detection
• Accurate detection of speech in the presence of
background noise is important to limit the
amount of processing that is needed for
recognition

• Endpoint-detection algorithms must take into
account difficult situations such as:
– Utterances that contain low-energy events at
beginning/end (e.g., weak fricatives)
– Utterances ending in unvoiced stops (e.g., ‘p’, ‘t’, ‘k’)
– Utterances ending in nasals (e.g., ‘m’,’n’).
– Breath noises at the end of an utterance
End-Point Detection
• End-point detection algorithms mainly assume
the entire utterance is known and must search for
the beginning and end of speech

• Rabiner and Sambur, "An Algorithm for


Determining the Endpoints of Isolated
Utterances". The Bell System Technical Journal,
Vol. 54, No. 2, February 1975, pp. 297- 315

• This end-point detection algorithm is based on:


– ITU - Upper energy threshold.
– ITL - Lower energy threshold.
– IZCT - Zero crossings rate threshold.
Energy and Zero-Crossing Rate
• Log-Frame Energy
– log of the sum of the squared signal samples:

  log energy = log10( Σ_{i=1}^{N} x[i]² )

• Zero-Crossing Rate
– rate at which the signal crosses the zero axis

  sign(s(i)) = +1 if s(i) ≥ 0
             = −1 if s(i) < 0
Implementation

NOTE: N is the number of samples in the frame
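A minimal sketch of the two measurements, following the definitions above (the 1e-12 floor inside the log is an added safeguard against log10(0), not something from the slide):

```python
import numpy as np

def log_energy(frame):
    """log10 of the sum of squared samples over the frame of N samples."""
    frame = np.asarray(frame, dtype=float)
    return np.log10(np.sum(frame ** 2) + 1e-12)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose sign differs."""
    s = np.where(np.asarray(frame) >= 0, 1, -1)   # sign(s(i)) per the slide
    return np.sum(np.abs(np.diff(s))) / (2.0 * len(s))
```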


Idea of the Rabiner / Sambur Algorithm
• Begin-Point:
– Search for the first time the signal exceeds the upper
energy threshold (ITU).
– Step backwards from that point until the energy drops
below the lower energy threshold (ITL).
– Consider the previous 250 msec of zero-crossing rate. If the ZCR
exceeds the IZCT threshold 3 or more times, set the begin point to
the first frame at which that threshold is exceeded

• End-Point:
– Similar to begin-point algorithm but takes place in the
reverse direction.
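The begin-point steps above can be sketched as follows. The function operates on per-frame energy and ZCR arrays, and the parameter names `itu`, `itl`, `izct` mirror the slide's threshold notation; this is a simplified sketch, not the full Rabiner-Sambur algorithm:

```python
import numpy as np

def find_begin_point(energy, zcr, itu, itl, izct, frames_250ms=25):
    """Simplified begin-point search over per-frame energy and ZCR."""
    # 1) first frame whose energy exceeds the upper threshold ITU
    above = np.where(energy > itu)[0]
    if len(above) == 0:
        return None          # no speech found
    b = int(above[0])
    # 2) step backwards until energy drops below the lower threshold ITL
    while b > 0 and energy[b - 1] > itl:
        b -= 1
    # 3) look at the previous 250 ms of ZCR; if it exceeds IZCT 3+ times,
    #    move the begin point to the first such crossing
    start = max(0, b - frames_250ms)
    hits = [i for i in range(start, b) if zcr[i] > izct]
    if len(hits) >= 3:
        b = hits[0]
    return b
```

The end-point search would apply the same logic scanning from the end of the utterance backwards.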
Linear Prediction (LP) Model
Samples from a windowed frame of speech can be
predicted as a linear combination of the P previous
samples and an excitation term u(n):

  s(n) = Σ_{i=1}^{P} a_i · s(n−i) + G · u(n)

u(n) is an excitation source and G is the gain of the
excitation. The a_i terms are the LP coefficients and
P is the order of the model.
Linear Prediction (LP) Model

u(n) assumed to be an impulse train for voiced


speech and random white noise for unvoiced
speech.
Computing LP Parameters
• The model parameters are found by setting
the partial derivatives of the mean-squared
prediction error (MSE) with respect to the
model parameters to zero.

• Can be shown that the parameters can be


solved quite efficiently by computing the
autocorrelation coefficients from the
speech frame and then applying what is
known as the Levinson-Durbin recursion.
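The autocorrelation method with the Levinson-Durbin recursion described above can be sketched as:

```python
import numpy as np

def autocorr(x, order):
    """Autocorrelation coefficients r[0..order] of a windowed frame."""
    x = np.asarray(x, dtype=float)
    return np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(order + 1)])

def levinson_durbin(r, order):
    """Solve the LP normal equations; returns A(z) coefficients and error."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # reflection coefficient from the current prediction residual
        k = -np.dot(a[:i], r[i:0:-1]) / err
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err
```

Here `a` holds the polynomial A(z) = 1 + a[1]z⁻¹ + … + a[P]z⁻ᴾ, so the predictor coefficients of the slide's model are −a[1:]. For a long AR(1) signal s(n) = 0.9·s(n−1) + w(n), the recovered a[1] comes out close to −0.9.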
LP Parameter Estimation
Feature Extraction Software

Internet Institute for Speech and Hearing

DEMO
This demo illustrates different techniques used for speech spectrum analysis, such as:
1- Normal spectrum analysis
2- Linear filter bank analysis
3- Auditory filter bank analysis
4- Linear prediction spectrum analysis
5- Cepstrum
6- Autocorrelation

ESECTION: speech spectral cross-sections.
Mel-Frequency Cepstral Coefficients
(MFCC)
• Davis & Mermelstein (1980)

• Computes signal energy from a bank of filters
that are linearly spaced at frequencies below
1 kHz and logarithmically spaced above 1 kHz.

• Filters are equally spaced along the Mel scale.
[Figure: Mel-scale filterbank H1(k), …, HP(k) applied to the speech
spectrum Ft(k); P is the number of filters]
Mel-Scale Filterbank Implementation
• 20-24 triangular-shaped filters spaced evenly
along the Mel frequency scale with 50% overlap

• The energy from each filter p is computed (N = DFT
size, P = number of filters) at time t:

  Ft(p) = Σ_{k=0}^{N/2} Hp(k) · |Xt(k)|²
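A triangular Mel filterbank along these lines can be sketched as follows. The FFT size, the 2595·log10(1 + f/700) Mel formula, and the bin-placement details below are illustrative assumptions, not specifics from the slide:

```python
import numpy as np

def mel_filterbank(n_filters=24, n_fft=512, sample_rate=16000):
    """Triangular filters spaced evenly on the Mel scale, 50% overlap."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # P+2 evenly spaced Mel points give P overlapping triangles
    mel_pts = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    H = np.zeros((n_filters, n_fft // 2 + 1))
    for p in range(1, n_filters + 1):
        left, center, right = bins[p - 1], bins[p], bins[p + 1]
        for k in range(left, center):            # rising edge
            H[p - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):           # falling edge
            H[p - 1, k] = (right - k) / max(right - center, 1)
    return H

def filterbank_energies(frame, H):
    """Per-filter energy: weighted sum of the frame's power spectrum."""
    spectrum = np.abs(np.fft.rfft(frame, n=2 * (H.shape[1] - 1))) ** 2
    return np.dot(H, spectrum)
```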
Why are MFCC’s still so Popular?
• Efficient (and relatively straightforward) to
compute

• Incorporate a perceptual frequency scale

• Filter banks reduce the impact of excitation


in the final feature sets

• DCT decorrelates the features



Implementation
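The cepstral coefficients are obtained by applying a DCT to the log filterbank energies; a sketch, assuming a vector of P filter energies (the 1e-12 log floor is an added safeguard, not from the slide):

```python
import numpy as np

def mfcc_from_energies(fb_energies, n_ceps=12):
    """DCT-II of the log filterbank energies gives cepstral coefficients."""
    E = np.log(np.asarray(fb_energies, dtype=float) + 1e-12)
    P = len(E)
    n = np.arange(P)
    return np.array([np.sum(E * np.cos(np.pi * q * (n + 0.5) / P))
                     for q in range(1, n_ceps + 1)])
```

A flat energy spectrum produces all-zero cepstral coefficients, which is one way to see that the DCT captures spectral shape rather than overall level.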
Perceptual Linear Prediction (PLP)
• H. Hermansky. “Perceptual linear predictive
(PLP) analysis of speech”. Journal of the
Acoustical Society of America, 87:1738-1752,
1990.

• Incorporates perceptual aspects into the recognizer:

– equal-loudness pre-emphasis
– intensity-to-loudness conversion

• More robust than linear prediction cepstral


coefficients (LPCCs).
Dynamic Cepstral Coefficients
• Cepstral coefficients do not capture temporal
information

• Common to compute velocity and acceleration


of cepstral coefficients. For example, for delta
(velocity) features,
Final Feature Vector for ASR
• A single feature vector,
• 12 cepstral coefficients (PLP, MFCC, …) + 1 normalized energy
• + 13 delta features
• + 13 delta-delta features

• 100 feature vectors per second


• Each vector is 39-dimensional
• Characterizes the spectral shape of the signal for
each time slice
NOTE: Compare the total size of 1 sec. of speech in case of 16 kHz
sampling rate before MFCC (16000 numbers) with its size after MFCC
which results in a matrix of 39 x 100 = 3900 numbers only.