Acoustic Feature Analysis For ASR: Instructor: Preethi Jyothi
Lecture 13
CS 753
Instructor: Preethi Jyothi
Speech Signal Analysis
• Generate discrete samples
• Split the samples into frames: frame size 25 ms, frame shift 10 ms
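The framing step can be sketched as follows. This is a minimal NumPy illustration, assuming a 16 kHz sampling rate (not stated on the slide) and a hypothetical helper name `frame_signal`; trailing samples that do not fill a whole frame are simply dropped.

```python
import numpy as np

def frame_signal(samples, sample_rate=16000, frame_ms=25.0, shift_ms=10.0):
    """Split a 1-D signal into overlapping frames (hypothetical helper).

    sample_rate=16000 is an assumption; the 25 ms frame size and 10 ms
    frame shift are the values given on the slide.
    """
    samples = np.asarray(samples, dtype=float)
    frame_len = int(round(sample_rate * frame_ms / 1000.0))  # samples per frame
    shift = int(round(sample_rate * shift_ms / 1000.0))      # samples per shift
    if len(samples) < frame_len:
        return np.empty((0, frame_len))
    num_frames = 1 + (len(samples) - frame_len) // shift
    # Row i holds samples[i * shift : i * shift + frame_len].
    idx = np.arange(frame_len)[None, :] + shift * np.arange(num_frames)[:, None]
    return samples[idx]

# One second of (made-up) audio at the assumed 16 kHz rate.
frames = frame_signal(np.arange(16000))
print(frames.shape)
```

With these settings each frame holds 400 samples and consecutive frames start 160 samples apart, so neighbouring frames overlap by 15 ms.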
• Per-frame pipeline: Pre-emphasis → Windowing → DFT, energy
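The pre-emphasis, DFT and energy stages named in the pipeline can be sketched as below. The slide gives no formulas for them, so this assumes the usual first-order pre-emphasis filter y[n] = x[n] - αx[n-1] with α = 0.97, a power spectrum taken from the real-input DFT, and energy as the sum of squared samples. All function names are hypothetical.

```python
import numpy as np

def pre_emphasize(samples, alpha=0.97):
    """First-order high-pass filter y[n] = x[n] - alpha * x[n-1].

    alpha=0.97 is a common choice, assumed here; the slide gives no value.
    """
    samples = np.asarray(samples, dtype=float)
    if samples.size == 0:
        return samples
    out = np.empty_like(samples)
    out[0] = samples[0]                           # no previous sample at n = 0
    out[1:] = samples[1:] - alpha * samples[:-1]
    return out

def power_spectrum(frame):
    """Squared magnitude of the DFT of one real-valued frame."""
    return np.abs(np.fft.rfft(np.asarray(frame, dtype=float))) ** 2

def frame_energy(frame):
    """Energy of one frame: sum of squared samples (assumed definition)."""
    frame = np.asarray(frame, dtype=float)
    return float(np.sum(frame ** 2))
```

Pre-emphasis boosts high frequencies relative to low ones, which is why a constant input is reduced to nearly zero after its first sample.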
Hamming: $w[n] = \begin{cases} 0.54 - 0.46\cos\left(\frac{2\pi n}{L}\right) & 0 \le n \le L-1 \\ 0 & \text{otherwise} \end{cases}$
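A direct transcription of the Hamming window into NumPy might look like this. The function name `hamming_window` and the 400-sample stand-in frame are illustrative assumptions. Note that `numpy.hamming` divides by L - 1 rather than L, so its values differ slightly from the slide's definition.

```python
import numpy as np

def hamming_window(L):
    """w[n] = 0.54 - 0.46 * cos(2 * pi * n / L) for 0 <= n <= L - 1."""
    n = np.arange(L)
    return 0.54 - 0.46 * np.cos(2.0 * np.pi * n / L)

# Stand-in for one 25 ms frame at an assumed 16 kHz sampling rate.
frame = np.ones(400)
windowed = frame * hamming_window(len(frame))
```

The window tapers the frame towards 0.08 at its start and peaks at 1.0 in the middle, which reduces the discontinuities at frame boundaries before the DFT.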
Windowing: Illustration
• Warp the DFT output to the mel scale: mel is a unit of pitch
such that sounds which are perceptually equidistant in pitch
are separated by the same number of mels
Mels vs Hertz
Mel filterbank
• Mel frequency can be computed from the raw frequency f as:
$\text{mel}(f) = 1127 \ln\left(1 + \frac{f}{700}\right)$
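The mel formula translates directly into code. The inverse mapping below is not on the slide; it is obtained by solving the formula for f. Both function names are hypothetical.

```python
import math

def hz_to_mel(f_hz):
    """mel(f) = 1127 * ln(1 + f / 700), as on the slide."""
    return 1127.0 * math.log(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel, derived algebraically (not shown on the slide)."""
    return 700.0 * (math.exp(mel / 1127.0) - 1.0)

for f in (100.0, 1000.0, 4000.0):
    print(f, hz_to_mel(f))
```

With these constants 1000 Hz maps to very nearly 1000 mels, while higher frequencies are increasingly compressed.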
• 10 filters spaced linearly below 1 kHz and remaining filters spread logarithmically above 1 kHz
[Figure 9.13: The Mel filter bank, after Davis and Mermelstein (1980). Each triangular filter collects energy from a given frequency range. Filters are spaced linearly below 1000 Hz, and logarithmically above 1000 Hz. Axes: Frequency (Hz), 0 to 4000; Amplitude, 0 to 1.]
Image credit: Jurafsky & Martin, Figure 9.13
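A triangular filterbank of this kind can be sketched as below. Note the swap: instead of the explicit "10 linear filters, then logarithmic" layout described above, this sketch uses the common approximation of spacing the filter edges uniformly on the mel scale, which gives roughly linear spacing at low frequencies and logarithmic spacing at high ones. The filter count (26), FFT size (512) and 8 kHz sampling rate (chosen to match the figure's 0 to 4000 Hz axis) are assumptions, as are all function names.

```python
import numpy as np

def hz_to_mel(f):
    return 1127.0 * np.log(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (np.exp(np.asarray(m, dtype=float) / 1127.0) - 1.0)

def mel_filterbank(num_filters=26, n_fft=512, sample_rate=8000):
    """Triangular filters; returns an array of shape (num_filters, n_fft // 2 + 1).

    Assumption: edges are uniform in mel between 0 Hz and the Nyquist frequency.
    """
    nyquist = sample_rate / 2.0
    # num_filters + 2 edge points: each filter uses three consecutive edges.
    hz_edges = mel_to_hz(np.linspace(0.0, hz_to_mel(nyquist), num_filters + 2))
    bin_freqs = np.linspace(0.0, nyquist, n_fft // 2 + 1)  # frequency of each DFT bin
    bank = np.zeros((num_filters, len(bin_freqs)))
    for i in range(num_filters):
        left, center, right = hz_edges[i], hz_edges[i + 1], hz_edges[i + 2]
        rising = (bin_freqs - left) / (center - left)
        falling = (right - bin_freqs) / (right - center)
        # Triangle: the lower of the two slopes, clipped at zero outside the band.
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

# Apply the filterbank to a made-up flat power spectrum with 257 bins.
power_spectrum = np.ones(257)
mel_energies = mel_filterbank() @ power_spectrum
```

Each row of the returned matrix is one triangular filter, so a single matrix product collects the energy in every band at once.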
Mel filterbank inspired by speech perception
[Figure: Speech Perception, Physiology]
Mel filterbank
• Take log of each mel spectrum value:
1) human sensitivity to signal energy is logarithmic
2) log makes features robust to input variations
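One way to see the robustness point is that scaling the input energy by a constant only shifts the log values by a constant. The sketch below uses made-up mel energies; the `log_mel` helper and its floor value (which guards against log(0)) are assumptions.

```python
import numpy as np

def log_mel(mel_energies, floor=1e-10):
    """Natural log of mel energies; the floor avoids log(0) (assumed value)."""
    return np.log(np.maximum(np.asarray(mel_energies, dtype=float), floor))

mel_energies = np.array([0.5, 2.0, 8.0])   # made-up mel spectrum values
quiet = log_mel(mel_energies)
loud = log_mel(4.0 * mel_energies)         # same frame with 4x the energy
offset = loud - quiet                      # log(4) in every channel
```

A gain change therefore moves every log mel value by the same amount and leaves the shape of the log spectrum unchanged.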
MFCC Extraction
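[Figure 9.14: MFCC extraction pipeline: Pre-emphasis → Windowing → DFT → Mel Filterbank → log → iDFT → Time derivatives, with energy as a separate input. Outputs: $y_t(j), e_t$; $\Delta y_t(j), \Delta e_t$; $\Delta^2 y_t(j), \Delta^2 e_t$. The log and iDFT stages yield the cepstrum.]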
Image credit: Jurafsky & Martin, Figure 9.14
Cepstrum
• For MFCC extraction, we use the first 12 cepstral values
• Add 13 delta features ($\Delta x_t$) and 13 double-delta features ($\Delta^2 x_t$)
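The last two steps, cepstral coefficients and their deltas, can be sketched as follows. Several choices here are assumptions rather than the lecture's method: an unnormalized DCT-II stands in for the iDFT (a common substitution for a real log spectrum), coefficient 0 is skipped because energy is appended separately, deltas use a simple central difference with the first and last frames repeated at the edges, and the input data is random. Function names are hypothetical.

```python
import numpy as np

def cepstral_coefficients(log_mel, num_ceps=12):
    """Unnormalized DCT-II of log mel spectra, keeping coefficients 1..num_ceps.

    Assumptions: a DCT replaces the iDFT named on the slide, and
    coefficient 0 is dropped because energy is added as its own feature.
    """
    log_mel = np.asarray(log_mel, dtype=float)
    num_channels = log_mel.shape[-1]
    j = np.arange(num_channels)
    k = np.arange(1, num_ceps + 1)[:, None]
    basis = np.cos(np.pi * k * (j + 0.5) / num_channels)  # (num_ceps, num_channels)
    return log_mel @ basis.T

def delta(features):
    """Central difference d[t] = (x[t+1] - x[t-1]) / 2 over frames.

    features has shape (num_frames, dim); edge frames are repeated (assumption).
    """
    features = np.asarray(features, dtype=float)
    padded = np.pad(features, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0

rng = np.random.default_rng(0)
log_mel_frames = rng.normal(size=(98, 26))   # made-up log mel spectra, 26 channels
energies = rng.normal(size=(98, 1))          # made-up per-frame energy values

static = np.hstack([cepstral_coefficients(log_mel_frames), energies])  # 12 + 1 = 13
d1 = delta(static)                           # 13 delta features
d2 = delta(d1)                               # 13 double-delta features
full = np.hstack([static, d1, d2])           # 39 features per frame
print(full.shape)
```

Stacking the 13 static features with their deltas and double-deltas gives a 39-dimensional feature vector per frame.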
Recap: MFCCs
• Main advantages:
• Useful for: