Chapter 2 - Speech Signal Processing

(Feature Extraction)
Ear Physiology

Pinna (the visible outer ear)
Tympanic (relating to the eardrum)
Eardrum
Ossicles (the three small middle-ear bones)
Malleus (hammer)
Incus (anvil)
Stapes (stirrup)
Cochlea (snail-shaped / spiral)
Vestibular (canal)
Interesting Aspects of Perception

• Audible sound range is from 20Hz to 20kHz


• Ear is not equally sensitive to all
frequencies
• Perceived loudness is a function of both
the frequency and the amplitude of the
sound wave
Loudness of Pure Tones
• Contours of “equal loudness” can be estimated

• The ear is relatively insensitive to low-frequency
sounds of moderate to low intensity

• Maximum sensitivity of the ear is at around
4 kHz (the ear hears the same loudness at the
minimum intensity on a given contour). There is
a secondary local maximum near 13 kHz due to
the first two resonances of the ear canal

[Figure: equal-loudness contours — maximum sensitivity at 4 kHz,
secondary maximum in sensitivity near 13 kHz]
Critical Bands
• The cochlea converts pressure waves to
neural firings:
– Vibrations induce traveling waves down the basilar
membrane
– Traveling waves induce peak responses at
frequency-specific locations on the basilar membrane

• Frequency is perceived within “critical bands”
– They act like band-pass filters
– They define the “frequency resolution” of the auditory system
– There are about 24 critical bands along the basilar membrane
– Each critical band is about 1.3 mm long and contains
about 1300 neurons
Mel Scale
• Linear below 1 kHz and logarithmic above 1 kHz

• A common approximation: Mel(f) = 2595 · log10(1 + f/700)
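The Mel mapping can be sketched in code. The slide's own formula is not shown, so the common O'Shaughnessy form 2595 · log10(1 + f/700) below is an assumption:

```python
import math

def hz_to_mel(f):
    """Mel value for frequency f in Hz (O'Shaughnessy approximation)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping: frequency in Hz for Mel value m."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

Note that `mel_to_hz` inverts the mapping, which is useful later when placing filterbank edges at evenly spaced Mel points.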
Feature Extraction for Speech
Recognition
• Frame-Based Signal Processing

• Linear Prediction Analysis

• Cepstral Representations
– Linear Prediction Cepstral Coefficients (LPCC)

– Mel-Frequency Cepstral Coefficients (MFCC)

– Perceptual Linear Prediction (PLP)


Goals of Feature Extraction
• Compactness

• Discrimination Power

• Low Computational Complexity

• Reliability

• Robustness
Discrete Representation of
Speech
Digital Representation of Speech
• Sampling Rates
– 16,000 Hz (samples/second) for microphone
speech
– 8,000 Hz (samples/second) for telephone
speech
• Storage formats:
– Pulse Code Modulation (PCM)
• 16-bit (2 bytes) per sample
• Values in the range −32768 to +32767
• Stored as “short” integers
• Microsoft “wav” files
Signal Pre-emphasis
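Pre-emphasis is typically a first-order high-pass filter, y[n] = x[n] − a·x[n−1], which boosts the high frequencies before analysis. The coefficient a = 0.97 below is a conventional choice, not taken from the slide:

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """Apply y[n] = x[n] - alpha * x[n-1]; the first sample passes through."""
    x = np.asarray(x, dtype=float)
    return np.append(x[0], x[1:] - alpha * x[:-1])
```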
Frame Blocking
• Process the speech signal in small chunks
over which the signal is assumed to have
stationary spectral characteristics
• Typical analysis window is 25 msec
– 400 samples for 16 kHz audio
• Typical frame rate is 10 msec
– Analysis advances by 160 samples for
16 kHz audio
• Consecutive frames generally overlap by at least 50% in time
– Results in 100 “frames” of audio per second
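The framing parameters above (25 msec windows, 10 msec hop at 16 kHz) can be sketched as:

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Slice a signal into overlapping frames (assumes len(signal) >= window)."""
    signal = np.asarray(signal)
    frame_len = sample_rate * frame_ms // 1000   # 400 samples at 16 kHz
    hop = sample_rate * hop_ms // 1000           # 160 samples at 16 kHz
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])
```

One second of 16 kHz audio yields 98 full frames with these settings; the "100 frames per second" figure on the slide ignores the edge effect of the final partial windows.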
Input Representation

• Audio is represented by a sequence of feature vectors
X: x1, x2, x3, …
where each xi is a 39-dimensional MFCC vector
Frame-based Processing Example:
Speech Detection
• Accurate detection of speech in the presence of
background noise is important to limit the
amount of processing that is needed for
recognition

• Endpoint-detection algorithms must take into
account difficult situations such as:
– Utterances that contain low-energy events at
beginning/end (e.g., weak fricatives)
– Utterances ending in unvoiced stops (e.g., ‘p’, ‘t’, ‘k’)
– Utterances ending in nasals (e.g., ‘m’,’n’).
– Breath noises at the end of an utterance
End-Point Detection
• End-point detection algorithms mainly assume
the entire utterance is known and must search for
the beginning and end of speech

• Rabiner and Sambur, "An Algorithm for


Determining the Endpoints of Isolated
Utterances". The Bell System Technical Journal,
Vol. 54, No. 2, February 1975, pp. 297- 315

• This end-point detection algorithm is based on:


– ITU - Upper energy threshold.
– ITL - Lower energy threshold.
– IZCT - Zero crossings rate threshold.
Energy and Zero-Crossing Rate
• Log-Frame Energy
– log of the sum of the squared signal samples:

  log energy = log10( Σ_{i=1}^{N} x[i]² )

• Zero-Crossing Rate
– rate at which the signal crosses the zero axis

  sign(s(i)) = +1 if s(i) ≥ 0
             = −1 if s(i) < 0
Implementation

NOTE: N is the number of samples in the frame
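A minimal sketch of the two measurements, following the definitions above (the 1e-12 floor inside the log is an added safeguard against log10(0), not something from the slide):

```python
import numpy as np

def log_energy(frame):
    """log10 of the sum of squared samples over the frame of N samples."""
    frame = np.asarray(frame, dtype=float)
    return np.log10(np.sum(frame ** 2) + 1e-12)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose sign differs."""
    s = np.where(np.asarray(frame) >= 0, 1, -1)   # sign(s(i)) per the slide
    return np.sum(np.abs(np.diff(s))) / (2.0 * len(s))
```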


Idea of the Rabiner / Sambur Algorithm
• Begin-Point:
– Search for the first time the signal exceeds the upper
energy threshold (ITU).
– Step backwards from that point until the energy drops
below the lower energy threshold (ITL).
– Consider the previous 250 msec of zero-crossing rate. If the ZCR
exceeds the IZCT threshold 3 or more times, set the begin point to
the first frame at which that threshold is exceeded

• End-Point:
– Similar to begin-point algorithm but takes place in the
reverse direction.
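The begin-point steps above can be sketched as follows. The function operates on per-frame energy and ZCR arrays, and the parameter names `itu`, `itl`, `izct` mirror the slide's threshold notation; this is a simplified sketch, not the full Rabiner-Sambur algorithm:

```python
import numpy as np

def find_begin_point(energy, zcr, itu, itl, izct, frames_250ms=25):
    """Simplified begin-point search over per-frame energy and ZCR."""
    # 1) first frame whose energy exceeds the upper threshold ITU
    above = np.where(energy > itu)[0]
    if len(above) == 0:
        return None          # no speech found
    b = int(above[0])
    # 2) step backwards until energy drops below the lower threshold ITL
    while b > 0 and energy[b - 1] > itl:
        b -= 1
    # 3) look at the previous 250 ms of ZCR; if it exceeds IZCT 3+ times,
    #    move the begin point to the first such crossing
    start = max(0, b - frames_250ms)
    hits = [i for i in range(start, b) if zcr[i] > izct]
    if len(hits) >= 3:
        b = hits[0]
    return b
```

The end-point search would apply the same logic scanning from the end of the utterance backwards.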
Linear Prediction (LP) Model
Samples from a windowed frame of speech can be
predicted as a linear combination of the P previous
samples and an excitation term u(n):

  s(n) = Σ_{i=1}^{P} a_i · s(n−i) + G · u(n)

u(n) is an excitation source and G is the gain of the
excitation. The a_i terms are the LP coefficients and
P is the order of the model.
Linear Prediction (LP) Model

u(n) assumed to be an impulse train for voiced


speech and random white noise for unvoiced
speech.
Computing LP Parameters
• The model parameters are found by setting
the partial derivatives of the mean-squared
prediction error (MSE) with respect to the
model parameters to zero.

• Can be shown that the parameters can be


solved quite efficiently by computing the
autocorrelation coefficients from the
speech frame and then applying what is
known as the Levinson-Durbin recursion.
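The autocorrelation method with the Levinson-Durbin recursion described above can be sketched as:

```python
import numpy as np

def autocorr(x, order):
    """Autocorrelation coefficients r[0..order] of a windowed frame."""
    x = np.asarray(x, dtype=float)
    return np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(order + 1)])

def levinson_durbin(r, order):
    """Solve the LP normal equations; returns A(z) coefficients and error."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # reflection coefficient from the current prediction residual
        k = -np.dot(a[:i], r[i:0:-1]) / err
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err
```

Here `a` holds the polynomial A(z) = 1 + a[1]z⁻¹ + … + a[P]z⁻ᴾ, so the predictor coefficients of the slide's model are −a[1:]. For a long AR(1) signal s(n) = 0.9·s(n−1) + w(n), the recovered a[1] comes out close to −0.9.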
LP Parameter Estimation
Feature Extraction Software

Internet Institute for Speech and Hearing

DEMO
This demo illustrates different techniques used for speech spectrum analysis, such as:
1- Normal spectrum analysis
2- Linear filter bank analysis
3- Auditory filter bank analysis
4- Linear prediction spectrum analysis
5- Cepstrum
6- Autocorrelation

ESECTION: speech spectral cross-sections.
Mel-Frequency Cepstral Coefficients
(MFCC)
• Davis & Mermelstein (1980)

• Computes signal energy from a bank of filters
that are linearly spaced at frequencies below
1 kHz and logarithmically spaced above 1 kHz.

• Filters are equally spaced along the Mel scale.
[Figure: Mel-scale filterbank H1(k), …, HP(k) applied to the speech
spectrum Ft(k); P is the number of filters]
Mel-Scale Filterbank Implementation
• 20-24 triangular-shaped filters spaced evenly
along the Mel frequency scale with 50% overlap

• The energy from each filter p is computed (N = DFT
size, P = number of filters) at time t:

  Ft(p) = Σ_{k=0}^{N/2} Hp(k) · |Xt(k)|²
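A triangular Mel filterbank along these lines can be sketched as follows. The FFT size, the 2595·log10(1 + f/700) Mel formula, and the bin-placement details below are illustrative assumptions, not specifics from the slide:

```python
import numpy as np

def mel_filterbank(n_filters=24, n_fft=512, sample_rate=16000):
    """Triangular filters spaced evenly on the Mel scale, 50% overlap."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # P+2 evenly spaced Mel points give P overlapping triangles
    mel_pts = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    H = np.zeros((n_filters, n_fft // 2 + 1))
    for p in range(1, n_filters + 1):
        left, center, right = bins[p - 1], bins[p], bins[p + 1]
        for k in range(left, center):            # rising edge
            H[p - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):           # falling edge
            H[p - 1, k] = (right - k) / max(right - center, 1)
    return H

def filterbank_energies(frame, H):
    """Per-filter energy: weighted sum of the frame's power spectrum."""
    spectrum = np.abs(np.fft.rfft(frame, n=2 * (H.shape[1] - 1))) ** 2
    return np.dot(H, spectrum)
```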
Why are MFCC’s still so Popular?
• Efficient (and relatively straightforward) to
compute

• Incorporate a perceptual frequency scale

• Filter banks reduce the impact of excitation


in the final feature sets

• DCT decorrelates the features



Implementation
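The cepstral coefficients are obtained by applying a DCT to the log filterbank energies; a sketch, assuming a vector of P filter energies (the 1e-12 log floor is an added safeguard, not from the slide):

```python
import numpy as np

def mfcc_from_energies(fb_energies, n_ceps=12):
    """DCT-II of the log filterbank energies gives cepstral coefficients."""
    E = np.log(np.asarray(fb_energies, dtype=float) + 1e-12)
    P = len(E)
    n = np.arange(P)
    return np.array([np.sum(E * np.cos(np.pi * q * (n + 0.5) / P))
                     for q in range(1, n_ceps + 1)])
```

A flat energy spectrum produces all-zero cepstral coefficients, which is one way to see that the DCT captures spectral shape rather than overall level.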
Perceptual Linear Prediction (PLP)
• H. Hermansky. “Perceptual linear predictive
(PLP) analysis of speech”. Journal of the
Acoustical Society of America, 87:1738-1752,
1990.

• Incorporates perceptual aspects into the recognizer:

– equal-loudness pre-emphasis
– intensity-to-loudness conversion

• More robust than linear prediction cepstral


coefficients (LPCCs).
Dynamic Cepstral Coefficients
• Cepstral coefficients do not capture temporal
information

• Common to compute velocity and acceleration


of cepstral coefficients. For example, for delta
(velocity) features,
Final Feature Vector for ASR
• A single feature vector,
• 12 cepstral coefficients (PLP, MFCC, …) + 1 normalized energy
• + 13 delta features
• + 13 delta-delta features

• 100 feature vectors per second


• Each vector is 39-dimensional
• Characterizes the spectral shape of the signal for
each time slice
NOTE: Compare the total size of 1 sec. of speech in case of 16 kHz
sampling rate before MFCC (16000 numbers) with its size after MFCC
which results in a matrix of 39 x 100 = 3900 numbers only.