Chapter 2 - Speech Signal Processing
Approximation,
Feature Extraction for Speech
Recognition
• Frame-Based Signal Processing
• Cepstral Representations
– Linear Prediction Cepstral Coefficients (LPCC)
• Discrimination Power
• Reliable
• Robust
Discrete Representation of
Speech
Digital Representation of Speech
• Sampling Rates
– 16,000 Hz (samples/second) for microphone
speech
– 8,000 Hz (samples/second) for telephone
speech
• Storage formats:
– Pulse Code Modulation (PCM)
• 16-bit (2 bytes) per sample
• Values range from -32768 to +32767
• Stored as “short” integers
• Microsoft “wav” files
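The storage format above can be illustrated with a short round trip. This is a minimal sketch using only Python's standard `wave` and `struct` modules; the helper names `write_pcm16` and `read_pcm16` are hypothetical, and the sketch assumes mono, little-endian 16-bit PCM.

```python
import io
import struct
import wave


def write_pcm16(samples, sample_rate):
    """Store integer samples as mono 16-bit PCM in an in-memory WAV file."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)           # mono (assumption for this sketch)
        w.setsampwidth(2)           # 2 bytes = 16 bits per sample
        w.setframerate(sample_rate)
        # "<h" is a little-endian signed short, range -32768..32767
        w.writeframes(struct.pack("<%dh" % len(samples), *samples))
    return buf.getvalue()


def read_pcm16(wav_bytes):
    """Return (sample_rate, samples) from 16-bit PCM WAV data."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        if w.getsampwidth() != 2:
            raise ValueError("expected 16-bit samples")
        count = w.getnframes() * w.getnchannels()
        rate = w.getframerate()
        raw = w.readframes(w.getnframes())
    return rate, list(struct.unpack("<%dh" % count, raw))


if __name__ == "__main__":
    data = write_pcm16([0, 32767, -32768, 1234], 16000)
    print(read_pcm16(data))
```

Each sample occupies exactly two bytes, so one second of 16,000 Hz microphone speech takes 32,000 bytes of sample data plus the WAV header.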
Signal Pre-emphasis
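Only the slide title survives here, so the following is a sketch of one common form of pre-emphasis rather than the method used in this chapter: a first-order high-pass filter y(n) = x(n) - alpha * x(n-1). The function name `pre_emphasize` and the coefficient value 0.97 are assumptions.

```python
def pre_emphasize(samples, alpha=0.97):
    """Apply y(n) = x(n) - alpha * x(n-1); the first sample is passed through.

    alpha=0.97 is a commonly used value, assumed here for illustration.
    """
    if not samples:
        return []
    out = [float(samples[0])]
    for prev, cur in zip(samples, samples[1:]):
        out.append(cur - alpha * prev)
    return out


if __name__ == "__main__":
    # A constant (zero-frequency) input is almost entirely removed.
    print(pre_emphasize([1.0, 1.0, 1.0, 1.0]))
```

The filter attenuates low frequencies and so boosts the relative level of the higher-frequency part of the spectrum.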
Frame Blocking
• Process the speech signal in small chunks
over which the signal is assumed to have
stationary spectral characteristics
• Typical analysis window is 25 msec
– 400 samples for 16kHz audio
• Typical frame shift is 10 msec
– Analysis pushes forward by 160 samples for
16kHz audio
– Results in 100 “frames” of audio per second
• Consecutive frames overlap by 15 msec
(60% of the 25 msec window)
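The frame arithmetic above can be checked with a small sketch. The function name `frame_signal` is hypothetical, and dropping the trailing partial frame is a design choice made for this sketch (padding is an equally valid alternative).

```python
def frame_signal(samples, sample_rate=16000, window_ms=25, shift_ms=10):
    """Split a signal into overlapping frames; trailing partial frames are dropped."""
    win = sample_rate * window_ms // 1000   # 400 samples at 16 kHz
    hop = sample_rate * shift_ms // 1000    # 160 samples at 16 kHz
    frames = []
    start = 0
    while start + win <= len(samples):
        frames.append(samples[start:start + win])
        start += hop
    return frames


if __name__ == "__main__":
    one_second = list(range(16000))
    frames = frame_signal(one_second)
    print(len(frames), len(frames[0]), frames[1][0])
```

One second of 16kHz audio gives 98 complete frames with this sketch, because the last window must fit entirely inside the signal; over long recordings the rate approaches 100 frames per second.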
Input Representation
X: x1 x2 x3 …… (sequence of 39 dim MFCC frame vectors)
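The slide states only that each frame vector is a 39 dim MFCC. One common layout, assumed here and not stated in the source, is 13 static coefficients plus their first and second time differences (13 + 13 + 13 = 39). The sketch below uses a simple centered difference with edge repetition; the name `add_deltas` is hypothetical, and real front ends often use a regression over several frames instead.

```python
def _delta(frames):
    """Centered difference over time, repeating the first and last frame at the edges."""
    count = len(frames)
    out = []
    for t in range(count):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, count - 1)]
        out.append([(n - p) / 2.0 for p, n in zip(prev, nxt)])
    return out


def add_deltas(static):
    """Append first and second differences to each static feature vector."""
    d1 = _delta(static)
    d2 = _delta(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]


if __name__ == "__main__":
    # Five frames of 13 placeholder "static" coefficients each.
    static = [[float(t)] * 13 for t in range(5)]
    X = add_deltas(static)
    print(len(X), len(X[0]))
```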
Frame-based Processing Example:
Speech Detection
• Accurate detection of speech in the presence of
background noise is important to limit the
amount of processing that is needed for
recognition
Sign(s(i)) = +1 if s(i) ≥0
= -1 if s(i) < 0
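The sign function above is typically used to count zero crossings within a frame, a feature often paired with short-time energy for speech detection. That use is an assumption here, since the surrounding slide text is missing. A minimal sketch, with the hypothetical name `zero_crossings`:

```python
def sign(x):
    """Sign(s(i)) as defined above: +1 if s(i) >= 0, otherwise -1."""
    return 1 if x >= 0 else -1


def zero_crossings(frame):
    """Count sign changes between consecutive samples of one frame."""
    # |sign(b) - sign(a)| is 2 at a sign change and 0 otherwise.
    return sum(abs(sign(b) - sign(a)) // 2 for a, b in zip(frame, frame[1:]))


if __name__ == "__main__":
    print(zero_crossings([1, -1, 1, -1]))
    print(zero_crossings([0, 0, 1, 2]))
```

Because Sign(0) is defined as +1, a sample of exactly zero does not count as a crossing on its own.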
Implementation
• End-Point:
– Similar to begin-point algorithm but takes place in the
reverse direction.
Linear Prediction (LP) Model
Samples from a windowed frame of speech can be
predicted as a linear combination of P previous
samples and error u(n):
s(n) = a1 s(n-1) + a2 s(n-2) + … + aP s(n-P) + u(n)
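As an illustration, the predictor coefficients can be estimated from one frame with the autocorrelation method and the Levinson-Durbin recursion. This is a sketch of one standard estimator, not necessarily the one intended by the slides; it assumes the model s(n) = a1 s(n-1) + … + aP s(n-P) + u(n), and the names `autocorrelation` and `lpc` are hypothetical.

```python
def autocorrelation(frame, max_lag):
    """r(lag) = sum over i of frame[i] * frame[i + lag], for lag = 0..max_lag."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def lpc(frame, order):
    """Return ([a1..aP], residual energy) via the Levinson-Durbin recursion."""
    r = autocorrelation(frame, order)
    a = [0.0] * (order + 1)          # a[k] for k = 1..order; a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0:                 # silent or degenerate frame: stop early
            break
        # Reflection coefficient for stage i.
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err


if __name__ == "__main__":
    # Synthetic check: s(n) = 0.9 * s(n-1), so a1 should be close to 0.9.
    signal = [0.9 ** n for n in range(200)]
    coeffs, err = lpc(signal, 2)
    print(coeffs, err)
```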
[Figure labels: Speech, Ft, Ft(k)]
Why are MFCCs still so Popular?
• Efficient (and relatively straightforward) to
compute
Implementation
Perceptual Linear Prediction (PLP)
• H. Hermansky. “Perceptual linear predictive
(PLP) analysis of speech”. Journal of the
Acoustical Society of America, 87:1738-1752,
1990.