Acoustic Feature Analysis For ASR: Instructor: Preethi Jyothi


Acoustic Feature Analysis for ASR

Lecture 13
CS 753
Instructor: Preethi Jyothi
Speech Signal Analysis

(Figure: continuous speech waveform → generate discrete samples → "a frame")

• Need to focus on short segments of speech (speech frames) that more or less correspond to a subphone and are stationary
• Each speech frame is typically 20-50 ms long
• Use overlapping frames with frame shift of around 10 ms
Frame-wise processing

(Figure: overlapping frames with frame size 25 ms and frame shift 10 ms; a framing sketch in code follows below)
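
The framing step above can be written compactly in code. The following is a minimal NumPy sketch (not from the lecture); the 16 kHz sampling rate and the resulting 400-sample frame / 160-sample shift are illustrative assumptions.

```python
import numpy as np

def frame_signal(x, sample_rate, frame_ms=25, shift_ms=10):
    """Split a 1-D signal into overlapping frames (25 ms frames, 10 ms shift)."""
    frame_len = int(sample_rate * frame_ms / 1000)    # e.g. 400 samples at 16 kHz
    frame_shift = int(sample_rate * shift_ms / 1000)  # e.g. 160 samples at 16 kHz
    n_frames = 1 + (len(x) - frame_len) // frame_shift  # assumes len(x) >= frame_len
    return np.stack([x[i * frame_shift : i * frame_shift + frame_len]
                     for i in range(n_frames)])

# Example: 1 second of audio at 16 kHz -> 98 frames of 400 samples each
frames = frame_signal(np.random.randn(16000), 16000)
print(frames.shape)  # (98, 400)
```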
Speech Signal Analysis

(Figure: continuous speech waveform → generate discrete samples → "a frame")

• Need to focus on short segments of speech (speech frames) that more or less correspond to a phoneme and are stationary
• Each speech frame is typically 20-50 ms long
• Use overlapping frames with frame shift of around 10 ms
• Generate acoustic features corresponding to each speech frame
Acoustic feature extraction for ASR

Desirable feature characteristics:

• Capture essential information about underlying phones
• Compress information into compact form
• Factor out information that's not relevant to recognition, e.g. speaker-specific information such as vocal-tract length, channel characteristics, etc.
• Would be desirable to find features that can be well-modelled by known distributions (Gaussian models, for example)
• Feature widely used in ASR: Mel-frequency Cepstral Coefficients (MFCCs)
MFCC Extraction

(Pipeline diagram: sampled speech signal x[j] → pre-emphasis → windowing → DFT → mel filterbank + energy → log → iDFT → time derivatives → features yt(j), et and their first and second derivatives)


Pre-emphasis

• Pre-emphasis increases the amount of energy in the high frequencies compared with lower frequencies (see the sketch below)
• Why? Because of spectral tilt
  ‣ In voiced speech, signal has more energy at low frequencies
  ‣ Attributed to the glottal source
• Boosting high frequency energy improves phone detection accuracy

Image credit: Jurafsky & Martin, Figure 9.9
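
Pre-emphasis is usually implemented as a first-order high-pass filter y[n] = x[n] − αx[n−1]. The slide does not give a coefficient, so the α = 0.97 below is a common but assumed value; this is a minimal NumPy sketch, not the lecture's implementation.

```python
import numpy as np

def pre_emphasize(x, alpha=0.97):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1].
    Boosts high frequencies to counter spectral tilt; alpha ~ 0.95-0.97 is typical."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

x = np.random.randn(16000)   # stand-in for a sampled speech signal
y = pre_emphasize(x)
```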


MFCC Extraction

(Pipeline diagram: sampled speech signal x[j] → pre-emphasis → windowing → DFT → mel filterbank + energy → log → iDFT → time derivatives → features yt(j), et and their first and second derivatives)


Windowing

• Speech signal is modelled as a sequence of frames (assumption: stationary across each frame)
• Windowing: multiply the value of the signal at time n, s[n], by the value of the window at time n, w[n]: y[n] = w[n] s[n] (see the sketch below)

Rectangular: w[n] = 1 for 0 ≤ n ≤ L−1, and 0 otherwise

Hamming: w[n] = 0.54 − 0.46 cos(2πn/L) for 0 ≤ n ≤ L−1, and 0 otherwise
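
A minimal NumPy sketch of the two windows above applied to one frame. The frame length L = 400 (25 ms at an assumed 16 kHz sampling rate) is illustrative; note the slide's formula divides by L, whereas NumPy's built-in np.hamming divides by L−1.

```python
import numpy as np

L = 400  # 25 ms frame at 16 kHz (assumed)

# Hamming window: w[n] = 0.54 - 0.46*cos(2*pi*n / L), 0 <= n <= L-1
n = np.arange(L)
hamming = 0.54 - 0.46 * np.cos(2 * np.pi * n / L)

# Rectangular window is simply all ones
rectangular = np.ones(L)

frame = np.random.randn(L)    # stand-in for one speech frame s[n]
windowed = hamming * frame    # y[n] = w[n] * s[n]
```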
Windowing: Illustration

(Figure: the same speech frame shown with a rectangular window and with a Hamming window)

MFCC Extraction

(Pipeline diagram: sampled speech signal x[j] → pre-emphasis → windowing → DFT → mel filterbank + energy → log → iDFT → time derivatives → features yt(j), et and their first and second derivatives)


Discrete Fourier Transform (DFT)

Extract spectral information from the windowed signal: compute the DFT of the sampled signal (see the sketch below)

X[k] = Σ_{n=0}^{N−1} x[n] e^{−j2πkn/N}

Input: windowed signal x[0], …, x[N−1]
Output: complex number X[k] giving the magnitude/phase for the kth frequency component

Image credit: Jurafsky & Martin, Figure 9.12
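
A sketch of the DFT step using NumPy's FFT. In practice the frame is usually zero-padded to a convenient FFT length; the 512-point FFT and 16 kHz sampling rate below are assumptions, not values given in the lecture.

```python
import numpy as np

frame = np.random.randn(400)                 # windowed speech frame
X = np.fft.rfft(frame, n=512)                # DFT of a real signal -> one-sided spectrum
magnitude = np.abs(X)                        # |X[k]|: energy in each frequency bin
phase = np.angle(X)                          # phase of each frequency component
freqs = np.fft.rfftfreq(512, d=1.0 / 16000)  # centre frequency of each bin in Hz
```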


MFCC Extraction

(Pipeline diagram: sampled speech signal x[j] → pre-emphasis → windowing → DFT → mel filterbank + energy → log → iDFT → time derivatives → features yt(j), et and their first and second derivatives)


Mel Filter Bank

• DFT gives energy at each frequency band
• However, human hearing is not equally sensitive at all frequencies: less sensitive at higher frequencies
• Warp the DFT output to the mel scale: the mel is a unit of pitch such that sounds which are perceptually equidistant in pitch are separated by the same number of mels
Mels vs Hertz

(Figure: mel value as a function of frequency in Hz)
Mel filterbank

• Mel frequency can be computed from the raw frequency f as:

  mel(f) = 1127 ln(1 + f/700)

• 10 filters spaced linearly below 1 kHz and remaining filters spread logarithmically above 1 kHz

(Figure 9.13: The mel filter bank, after Davis and Mermelstein (1980). Each triangular filter collects energy from a given frequency range. Filters are spaced linearly below 1000 Hz, and logarithmically above 1000 Hz.)

Image credit: Jurafsky & Martin, Figure 9.13

Speech Perception (Physiology): mel filterbank inspired by speech perception
Mel filterbank

• Mel frequency can be computed from the raw frequency f as:

  mel(f) = 1127 ln(1 + f/700)

• 10 filters spaced linearly below 1 kHz and remaining filters spread logarithmically above 1 kHz (see the filterbank sketch below)

(Figure 9.13: The mel filter bank, after Davis and Mermelstein (1980). Each triangular filter collects energy from a given frequency range. Filters are spaced linearly below 1000 Hz, and logarithmically above 1000 Hz.)

• Take log of each mel spectrum value: 1) human sensitivity to signal energy is logarithmic, 2) log makes features robust to input variations

Image credit: Jurafsky & Martin, Figure 9.13
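
A rough NumPy sketch of a triangular mel filterbank. One simplification relative to the slide: instead of 10 linearly spaced filters below 1 kHz plus logarithmically spaced filters above, this sketch spaces all filters uniformly on the mel scale (a common alternative recipe). The number of filters (26), FFT size (512) and sampling rate (16 kHz) are assumptions.

```python
import numpy as np

def hz_to_mel(f):
    return 1127.0 * np.log(1.0 + f / 700.0)      # mel(f) = 1127 ln(1 + f/700)

def mel_to_hz(m):
    return 700.0 * (np.exp(m / 1127.0) - 1.0)

def mel_filterbank(n_filters=26, n_fft=512, sample_rate=16000):
    """Triangular filters spaced uniformly on the mel scale between 0 Hz and Nyquist."""
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):                      # rising edge of the triangle
            fbank[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):                     # falling edge of the triangle
            fbank[i - 1, k] = (right - k) / max(right - centre, 1)
    return fbank

# Apply the filterbank to a power spectrum and take logs
power_spectrum = np.abs(np.fft.rfft(np.random.randn(400), n=512)) ** 2
log_mel_energies = np.log(mel_filterbank() @ power_spectrum + 1e-10)
```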
MFCC Extraction

(Pipeline diagram: sampled speech signal x[j] → pre-emphasis → windowing → DFT → mel filterbank + energy → log → iDFT → time derivatives → features yt(j), et and their first and second derivatives)


Cepstrum: Inverse DFT

• Recall speech signals are created when a glottal source of a particular fundamental frequency passes through the vocal tract
• Most useful information for phone detection is the vocal tract filter (and not the glottal source)
• How do we deconvolve the source and filter to retrieve information about the vocal tract filter? Cepstrum
Cepstrum

• Cepstrum: spectrum of the log of the spectrum

(Figure: magnitude spectrum → log magnitude spectrum → cepstrum)

Image credit: Jurafsky & Martin, Figure 9.14
Cepstrum

• For MFCC extraction, we use the first 12 cepstral values
• The different cepstral coefficients tend to be uncorrelated
  ‣ Useful property when modelling using GMMs in the acoustic model: diagonal covariance matrices will suffice
• Cepstrum is formally defined as the inverse DFT of the log magnitude of the DFT of a signal (see the sketch below):

  c[n] = Σ_{k=0}^{N−1} log | Σ_{m=0}^{N−1} x[m] e^{−j2πkm/N} | e^{j2πkn/N}

  (the inverse DFT is conventionally normalized by N)
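
A minimal NumPy sketch of the definition above: inverse DFT of the log magnitude of the DFT. The frame length and window are illustrative; for MFCCs the log mel-filterbank energies from the previous step are used in place of the raw log spectrum.

```python
import numpy as np

# Cepstrum: inverse DFT of the log magnitude of the DFT of the signal
frame = np.random.randn(400) * np.hamming(400)     # stand-in for a windowed speech frame
spectrum = np.fft.fft(frame)
log_magnitude = np.log(np.abs(spectrum) + 1e-10)   # log |X[k]|
cepstrum = np.fft.ifft(log_magnitude).real         # c[n]; real-valued for a real input

# For MFCCs, only the first few coefficients are kept (12 in this lecture's recipe)
first_12 = cepstrum[:12]
```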
MFCC Extraction

(Pipeline diagram: sampled speech signal x[j] → pre-emphasis → windowing → DFT → mel filterbank + energy → log → DCT, shown here in place of the iDFT → time derivatives → features yt(j), et and their first and second derivatives)


Deltas and double-deltas

• From the cepstrum, use 12 cepstral coefficients for each frame
• 13th feature represents energy from the frame, computed as the sum of the power of the samples in the frame
• Also add features related to change in cepstral features over time to capture speech dynamics:

  Δxt = xt+τ − xt−τ  (if xt is the feature vector at time t)

• Typical value for τ is 1 or 2
• Add 13 delta features (Δxt) and 13 double-delta features (Δ²xt) (see the sketch below)
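
A sketch of the simple difference formula Δxt = xt+τ − xt−τ, applied once for deltas and again for double-deltas. Edge padding at the utterance boundaries is an assumption, not something specified in the lecture.

```python
import numpy as np

def add_deltas(features, tau=2):
    """Append delta and double-delta features: dx_t = x_{t+tau} - x_{t-tau}."""
    padded = np.pad(features, ((tau, tau), (0, 0)), mode='edge')
    delta = padded[2 * tau:] - padded[:-2 * tau]            # same number of frames as input
    padded_d = np.pad(delta, ((tau, tau), (0, 0)), mode='edge')
    delta2 = padded_d[2 * tau:] - padded_d[:-2 * tau]       # delta of the deltas
    return np.hstack([features, delta, delta2])

# 13 static features (12 cepstral + energy) per frame -> 39-dimensional vectors
static = np.random.randn(100, 13)
full = add_deltas(static)
print(full.shape)  # (100, 39)
```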
Recap: MFCCs

• Motivated by human speech perception and speech production
• For each speech frame:

  ‣ Compute frequency spectrum and apply Mel binning
  ‣ Compute cepstrum using inverse DFT on the log of the mel-warped spectrum
  ‣ 39-dimensional MFCC feature vector: first 12 cepstral coefficients + energy + 13 delta + 13 double-delta coefficients (a one-call toolkit sketch follows below)
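
For reference, toolkits bundle the whole pipeline into a few calls. A sketch using librosa, assuming a 16 kHz mono recording in a hypothetical file utterance.wav; note that librosa estimates deltas with a smoothed derivative rather than the simple difference above, and its defaults (DCT type, number of mel filters, etc.) may differ from this lecture's recipe.

```python
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)         # hypothetical file name
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)   # ~25 ms frames, 10 ms shift
delta = librosa.feature.delta(mfcc)                      # delta features
delta2 = librosa.feature.delta(mfcc, order=2)            # double-delta features
features = np.vstack([mfcc, delta, delta2])              # 39 x num_frames
```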
Other features

• Perceptual Linear Prediction (PLP) features
• Mel filter-bank features (used with DNNs)
• Neural network-based "bottleneck features" (covered in lecture 8)
  ‣ Train deep NN using conventional acoustic features
  ‣ Introduce a narrow hidden layer (e.g. 40 hidden units) referred to as the bottleneck layer, forcing the neural network to encode relevant information in this layer
  ‣ Use hidden unit activations in the bottleneck layer as features (see the sketch below)
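
A schematic PyTorch sketch of the bottleneck idea above: a classifier with a narrow 40-unit hidden layer whose activations are reused as features after training. The layer sizes, the 351-dimensional spliced MFCC input (9 frames × 39 dims) and the number of output states are illustrative assumptions, not values from the lecture.

```python
import torch
import torch.nn as nn

class BottleneckNet(nn.Module):
    def __init__(self, in_dim=351, bottleneck_dim=40, n_states=2000):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, bottleneck_dim),          # narrow bottleneck layer
        )
        self.classifier = nn.Sequential(
            nn.ReLU(), nn.Linear(bottleneck_dim, n_states)
        )

    def forward(self, x):
        # Trained to predict e.g. context-dependent HMM states
        return self.classifier(self.encoder(x))

    def bottleneck_features(self, x):
        # After training, the bottleneck activations serve as acoustic features
        return self.encoder(x)

net = BottleneckNet()
feats = net.bottleneck_features(torch.randn(8, 351))   # (batch, 40)
```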


Features used for speaker recognition

• E.g. from a recent speaker identification (VoxCeleb) task
• Input features, F: spectrograms generated in a sliding-window fashion using a Hamming window of width 25 ms and step 10 ms
• F used as input to a CNN architecture
• Mean and variance normalisation performed on every frequency bin of the spectrum (crucial for performance!) (see the sketch below)

Table 7 (from the paper): Results for identification on VoxCeleb (higher is better)

  System                     Top-1 (%)   Top-5 (%)
  I-vectors + SVM               49.0        56.6
  I-vectors + PLDA + SVM        60.8        75.6
  CNN-fc-3s no var. norm.       63.5        80.3
  CNN-fc-3s                     72.4        87.4
  CNN                           80.5        92.1

Nagrani et al., "VoxCeleb: a large-scale speaker identification dataset", Interspeech 2017
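
A rough NumPy sketch of the input-feature recipe described above: a sliding-window spectrogram (25 ms Hamming window, 10 ms step) with mean and variance normalisation of every frequency bin. The FFT size and the use of raw magnitudes are assumptions; consult the VoxCeleb paper for the exact pre-processing.

```python
import numpy as np

def normalized_spectrogram(x, sample_rate=16000, frame_ms=25, shift_ms=10, n_fft=512):
    """Sliding-window spectrogram with per-frequency-bin mean/variance normalization."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    window = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // shift           # assumes len(x) >= frame_len
    frames = np.stack([x[i * shift : i * shift + frame_len] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))    # (frames, frequency bins)
    mean = spec.mean(axis=0, keepdims=True)
    std = spec.std(axis=0, keepdims=True) + 1e-10
    return (spec - mean) / std                             # normalize each frequency bin
```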
About pronunciations

• There exist a number of different alphabets to transcribe phonetic sounds
  ‣ E.g. ARPAbet (used in CMUdict)
  ‣ International Phonetic Alphabet (IPA) for all languages
Pronunciation Dictionary/Lexicon

• Pronunciation model/dictionary/lexicon: lists one or more pronunciations for a word (toy example below)
• Typically derived from language experts: sequence of phones written down for each word
• Dictionary construction involves:

  1. Selecting what words to include in the dictionary
  2. Pronunciation of each word (also, check for multiple pronunciations)
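
As a toy illustration of a lexicon as a data structure, the dictionary below maps each word to one or more ARPAbet-style phone sequences (stress markers omitted). The entries are illustrative rather than copied from CMUdict.

```python
# Toy pronunciation lexicon: word -> list of possible phone sequences
lexicon = {
    "speech":   [["S", "P", "IY", "CH"]],
    "the":      [["DH", "AH"], ["DH", "IY"]],                 # multiple pronunciations
    "acoustic": [["AH", "K", "UW", "S", "T", "IH", "K"]],
}

def pronunciations(word):
    """Return the list of phone sequences for a word, or [] if out of vocabulary."""
    return lexicon.get(word.lower(), [])

print(pronunciations("the"))  # [['DH', 'AH'], ['DH', 'IY']]
```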
Graphemes vs. Phonemes

• Instead of a pronunciation dictionary, one could represent a pronunciation as a sequence of graphemes (or letters). That is, model at the grapheme level.
• Useful technique for low-resourced/under-resourced languages
• Main advantages:

  1. Avoid the need for phone-based pronunciations
  2. Avoid the need for a phone alphabet
  3. Works pretty well for languages with a systematic relationship between graphemes (letters) and phonemes (sounds)
Grapheme-based ASR

WER (%) for phonetic vs. graphemic systems, Babel Full Language Pack (FLP), Tandem-SAT:

  Language (ID)            System      Vit    CN     CNC
  Kurmanji Kurdish (205)   Phonetic    67.6   65.8   64.1
                           Graphemic   67.0   65.3
  Tok Pisin (207)          Phonetic    41.8   40.6   39.4
                           Graphemic   42.1   41.1
  Cebuano (301)            Phonetic    55.5   54.0   52.6
                           Graphemic   55.5   54.2
  Kazakh (302)             Phonetic    54.9   53.5   51.5
                           Graphemic   54.0   52.7
  Telugu (303)             Phonetic    70.6   69.1   67.5
                           Graphemic   70.9   69.5
  Lithuanian (304)         Phonetic    51.5   50.2   48.3
                           Graphemic   50.9   49.5

Table 4: Babel FLP Tandem-SAT performance. Vit: Viterbi decoding, CN: confusion network decoding, CNC: CN-combination.
Image from: Gales et al., "Unicode-based graphemic systems for limited resource languages", ICASSP 2015
Grapheme to phoneme (G2P) conversion

• Produce a pronunciation (phoneme sequence) given a written word (grapheme sequence)
• Learn G2P mappings from a pronunciation dictionary
• Useful for:
  ‣ ASR systems in languages with no pre-built lexicons
  ‣ Speech synthesis systems
  ‣ Deriving pronunciations for out-of-vocabulary (OOV) words
G2P Conversion

• One popular paradigm: joint sequence models [BN12]
  ‣ Grapheme and phoneme sequences are first aligned using an EM-based algorithm
  ‣ Results in a sequence of graphones (joint G-P tokens)
  ‣ N-gram models trained on these graphone sequences (toy scoring sketch below)
  ‣ WFST-based implementation of such a joint graphone model [Phonetisaurus]

[BN12]: Bisani & Ney, "Joint sequence models for grapheme-to-phoneme conversion", Specom 2012
[Phonetisaurus]: J. Novak, Phonetisaurus Toolkit
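
A toy sketch of the graphone idea: a word is decomposed into joint grapheme-phoneme tokens and the sequence is scored by an n-gram model over those tokens. The alignment and the bigram probabilities below are made up purely for illustration; real systems (e.g. [BN12], Phonetisaurus) learn both from a pronunciation dictionary with EM.

```python
import math

# Each graphone pairs a grapheme chunk with a phone chunk, e.g. ("ph", "F").
# Hypothetical decomposition of "phone" -> F OW N:
graphones = [("ph", "F"), ("o", "OW"), ("ne", "N")]

# Made-up bigram log-probabilities over graphone tokens (with sentence boundaries)
bigram_logprob = {
    ("<s>", ("ph", "F")): math.log(0.2),
    (("ph", "F"), ("o", "OW")): math.log(0.5),
    (("o", "OW"), ("ne", "N")): math.log(0.4),
    (("ne", "N"), "</s>"): math.log(0.6),
}

def score(graphone_seq):
    """Log-probability of a graphone sequence under the toy bigram model."""
    tokens = ["<s>"] + graphone_seq + ["</s>"]
    return sum(bigram_logprob[(a, b)] for a, b in zip(tokens, tokens[1:]))

# Higher-scoring graphone sequences correspond to more likely pronunciations
print(score(graphones))
```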
