Acoustic Feature Analysis For ASR: Instructor: Preethi Jyothi


Acoustic Feature Analysis for ASR

Lecture 13
CS 753
Instructor: Preethi Jyothi
Speech Signal Analysis

(Figure: continuous speech waveform → generate discrete samples → "a frame")

• Need to focus on short segments of speech (speech frames) that more or less correspond to a subphone and are stationary
• Each speech frame is typically 20-50 ms long
• Use overlapping frames with frame shift of around 10 ms
Frame-wise processing

(Figure: overlapping frames with frame size 25 ms and frame shift 10 ms; a framing sketch in code follows below)
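
The framing step above can be written compactly in code. The following is a minimal NumPy sketch (not from the lecture); the 16 kHz sampling rate and the resulting 400-sample frame / 160-sample shift are illustrative assumptions.

```python
import numpy as np

def frame_signal(x, sample_rate, frame_ms=25, shift_ms=10):
    """Split a 1-D signal into overlapping frames (25 ms frames, 10 ms shift)."""
    frame_len = int(sample_rate * frame_ms / 1000)    # e.g. 400 samples at 16 kHz
    frame_shift = int(sample_rate * shift_ms / 1000)  # e.g. 160 samples at 16 kHz
    n_frames = 1 + (len(x) - frame_len) // frame_shift  # assumes len(x) >= frame_len
    return np.stack([x[i * frame_shift : i * frame_shift + frame_len]
                     for i in range(n_frames)])

# Example: 1 second of audio at 16 kHz -> 98 frames of 400 samples each
frames = frame_signal(np.random.randn(16000), 16000)
print(frames.shape)  # (98, 400)
```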
Speech Signal Analysis

(Figure: continuous speech waveform → generate discrete samples → "a frame")

• Need to focus on short segments of speech (speech frames) that more or less correspond to a phoneme and are stationary
• Each speech frame is typically 20-50 ms long
• Use overlapping frames with frame shift of around 10 ms
• Generate acoustic features corresponding to each speech frame
Acoustic feature extraction for ASR

Desirable feature characteristics:

• Capture essential information about underlying phones
• Compress information into compact form
• Factor out information that's not relevant to recognition, e.g. speaker-specific information such as vocal-tract length, channel characteristics, etc.
• Would be desirable to find features that can be well-modelled by known distributions (Gaussian models, for example)
• Feature widely used in ASR: Mel-frequency Cepstral Coefficients (MFCCs)
MFCC Extraction

(Pipeline diagram: sampled speech signal x[j] → pre-emphasis → windowing → DFT → mel filterbank + energy → log → iDFT → time derivatives → features yt(j), et and their first and second derivatives)


Pre-emphasis

• Pre-emphasis increases the amount of energy in the high frequencies compared with lower frequencies (see the sketch below)
• Why? Because of spectral tilt
  ‣ In voiced speech, signal has more energy at low frequencies
  ‣ Attributed to the glottal source
• Boosting high frequency energy improves phone detection accuracy

Image credit: Jurafsky & Martin, Figure 9.9
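
Pre-emphasis is usually implemented as a first-order high-pass filter y[n] = x[n] − αx[n−1]. The slide does not give a coefficient, so the α = 0.97 below is a common but assumed value; this is a minimal NumPy sketch, not the lecture's implementation.

```python
import numpy as np

def pre_emphasize(x, alpha=0.97):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1].
    Boosts high frequencies to counter spectral tilt; alpha ~ 0.95-0.97 is typical."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

x = np.random.randn(16000)   # stand-in for a sampled speech signal
y = pre_emphasize(x)
```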


MFCC Extraction

(Pipeline diagram: sampled speech signal x[j] → pre-emphasis → windowing → DFT → mel filterbank + energy → log → iDFT → time derivatives → features yt(j), et and their first and second derivatives)


Windowing

• Speech signal is modelled as a sequence of frames (assumption: stationary across each frame)
• Windowing: multiply the value of the signal at time n, s[n], by the value of the window at time n, w[n]: y[n] = w[n] s[n] (see the sketch below)

Rectangular: w[n] = 1 for 0 ≤ n ≤ L−1, and 0 otherwise

Hamming: w[n] = 0.54 − 0.46 cos(2πn/L) for 0 ≤ n ≤ L−1, and 0 otherwise
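
A minimal NumPy sketch of the two windows above applied to one frame. The frame length L = 400 (25 ms at an assumed 16 kHz sampling rate) is illustrative; note the slide's formula divides by L, whereas NumPy's built-in np.hamming divides by L−1.

```python
import numpy as np

L = 400  # 25 ms frame at 16 kHz (assumed)

# Hamming window: w[n] = 0.54 - 0.46*cos(2*pi*n / L), 0 <= n <= L-1
n = np.arange(L)
hamming = 0.54 - 0.46 * np.cos(2 * np.pi * n / L)

# Rectangular window is simply all ones
rectangular = np.ones(L)

frame = np.random.randn(L)    # stand-in for one speech frame s[n]
windowed = hamming * frame    # y[n] = w[n] * s[n]
```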
Windowing: Illustration

(Figure: the same speech frame shown with a rectangular window and with a Hamming window)

MFCC Extraction

(Pipeline diagram: sampled speech signal x[j] → pre-emphasis → windowing → DFT → mel filterbank + energy → log → iDFT → time derivatives → features yt(j), et and their first and second derivatives)


Discrete Fourier Transform (DFT)

Extract spectral information from the windowed signal: compute the DFT of the sampled signal (see the sketch below)

X[k] = Σ_{n=0}^{N−1} x[n] e^{−j2πkn/N}

Input: windowed signal x[0], …, x[N−1]
Output: complex number X[k] giving the magnitude/phase for the kth frequency component

Image credit: Jurafsky & Martin, Figure 9.12
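
A sketch of the DFT step using NumPy's FFT. In practice the frame is usually zero-padded to a convenient FFT length; the 512-point FFT and 16 kHz sampling rate below are assumptions, not values given in the lecture.

```python
import numpy as np

frame = np.random.randn(400)                 # windowed speech frame
X = np.fft.rfft(frame, n=512)                # DFT of a real signal -> one-sided spectrum
magnitude = np.abs(X)                        # |X[k]|: energy in each frequency bin
phase = np.angle(X)                          # phase of each frequency component
freqs = np.fft.rfftfreq(512, d=1.0 / 16000)  # centre frequency of each bin in Hz
```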


MFCC Extraction

(Pipeline diagram: sampled speech signal x[j] → pre-emphasis → windowing → DFT → mel filterbank + energy → log → iDFT → time derivatives → features yt(j), et and their first and second derivatives)


Mel Filter Bank

• DFT gives energy at each frequency band
• However, human hearing is not equally sensitive at all frequencies: less sensitive at higher frequencies
• Warp the DFT output to the mel scale: the mel is a unit of pitch such that sounds which are perceptually equidistant in pitch are separated by the same number of mels
Mels vs Hertz

(Figure: mel value as a function of frequency in Hz)
Mel filterbank

• Mel frequency can be computed from the raw frequency f as:

  mel(f) = 1127 ln(1 + f/700)

• 10 filters spaced linearly below 1 kHz and remaining filters spread logarithmically above 1 kHz

(Figure 9.13: The mel filter bank, after Davis and Mermelstein (1980). Each triangular filter collects energy from a given frequency range. Filters are spaced linearly below 1000 Hz, and logarithmically above 1000 Hz.)

Image credit: Jurafsky & Martin, Figure 9.13

Speech Perception (Physiology): mel filterbank inspired by speech perception
Mel filterbank

• Mel frequency can be computed from the raw frequency f as:

  mel(f) = 1127 ln(1 + f/700)

• 10 filters spaced linearly below 1 kHz and remaining filters spread logarithmically above 1 kHz (see the filterbank sketch below)

(Figure 9.13: The mel filter bank, after Davis and Mermelstein (1980). Each triangular filter collects energy from a given frequency range. Filters are spaced linearly below 1000 Hz, and logarithmically above 1000 Hz.)

• Take log of each mel spectrum value: 1) human sensitivity to signal energy is logarithmic, 2) log makes features robust to input variations

Image credit: Jurafsky & Martin, Figure 9.13
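
A rough NumPy sketch of a triangular mel filterbank. One simplification relative to the slide: instead of 10 linearly spaced filters below 1 kHz plus logarithmically spaced filters above, this sketch spaces all filters uniformly on the mel scale (a common alternative recipe). The number of filters (26), FFT size (512) and sampling rate (16 kHz) are assumptions.

```python
import numpy as np

def hz_to_mel(f):
    return 1127.0 * np.log(1.0 + f / 700.0)      # mel(f) = 1127 ln(1 + f/700)

def mel_to_hz(m):
    return 700.0 * (np.exp(m / 1127.0) - 1.0)

def mel_filterbank(n_filters=26, n_fft=512, sample_rate=16000):
    """Triangular filters spaced uniformly on the mel scale between 0 Hz and Nyquist."""
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):                      # rising edge of the triangle
            fbank[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):                     # falling edge of the triangle
            fbank[i - 1, k] = (right - k) / max(right - centre, 1)
    return fbank

# Apply the filterbank to a power spectrum and take logs
power_spectrum = np.abs(np.fft.rfft(np.random.randn(400), n=512)) ** 2
log_mel_energies = np.log(mel_filterbank() @ power_spectrum + 1e-10)
```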
MFCC Extraction

(Pipeline diagram: sampled speech signal x[j] → pre-emphasis → windowing → DFT → mel filterbank + energy → log → iDFT → time derivatives → features yt(j), et and their first and second derivatives)


Cepstrum: Inverse DFT

• Recall speech signals are created when a glottal source of a particular fundamental frequency passes through the vocal tract
• Most useful information for phone detection is the vocal tract filter (and not the glottal source)
• How do we deconvolve the source and filter to retrieve information about the vocal tract filter? Cepstrum
Cepstrum

• Cepstrum: spectrum of the log of the spectrum

(Figure: magnitude spectrum → log magnitude spectrum → cepstrum)

Image credit: Jurafsky & Martin, Figure 9.14
Cepstrum

• For MFCC extraction, we use the first 12 cepstral values
• The different cepstral coefficients tend to be uncorrelated
  ‣ Useful property when modelling using GMMs in the acoustic model: diagonal covariance matrices will suffice
• Cepstrum is formally defined as the inverse DFT of the log magnitude of the DFT of a signal (see the sketch below):

  c[n] = Σ_{k=0}^{N−1} log | Σ_{m=0}^{N−1} x[m] e^{−j2πkm/N} | e^{j2πkn/N}

  (the inverse DFT is conventionally normalized by N)
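
A minimal NumPy sketch of the definition above: inverse DFT of the log magnitude of the DFT. The frame length and window are illustrative; for MFCCs the log mel-filterbank energies from the previous step are used in place of the raw log spectrum.

```python
import numpy as np

# Cepstrum: inverse DFT of the log magnitude of the DFT of the signal
frame = np.random.randn(400) * np.hamming(400)     # stand-in for a windowed speech frame
spectrum = np.fft.fft(frame)
log_magnitude = np.log(np.abs(spectrum) + 1e-10)   # log |X[k]|
cepstrum = np.fft.ifft(log_magnitude).real         # c[n]; real-valued for a real input

# For MFCCs, only the first few coefficients are kept (12 in this lecture's recipe)
first_12 = cepstrum[:12]
```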
MFCC Extraction

(Pipeline diagram: sampled speech signal x[j] → pre-emphasis → windowing → DFT → mel filterbank + energy → log → DCT, shown here in place of the iDFT → time derivatives → features yt(j), et and their first and second derivatives)


Deltas and double-deltas

• From the cepstrum, use 12 cepstral coefficients for each frame
• 13th feature represents energy from the frame, computed as the sum of the power of the samples in the frame
• Also add features related to change in cepstral features over time to capture speech dynamics:

  Δxt = xt+τ − xt−τ  (if xt is the feature vector at time t)

• Typical value for τ is 1 or 2
• Add 13 delta features (Δxt) and 13 double-delta features (Δ²xt) (see the sketch below)
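
A sketch of the simple difference formula Δxt = xt+τ − xt−τ, applied once for deltas and again for double-deltas. Edge padding at the utterance boundaries is an assumption, not something specified in the lecture.

```python
import numpy as np

def add_deltas(features, tau=2):
    """Append delta and double-delta features: dx_t = x_{t+tau} - x_{t-tau}."""
    padded = np.pad(features, ((tau, tau), (0, 0)), mode='edge')
    delta = padded[2 * tau:] - padded[:-2 * tau]            # same number of frames as input
    padded_d = np.pad(delta, ((tau, tau), (0, 0)), mode='edge')
    delta2 = padded_d[2 * tau:] - padded_d[:-2 * tau]       # delta of the deltas
    return np.hstack([features, delta, delta2])

# 13 static features (12 cepstral + energy) per frame -> 39-dimensional vectors
static = np.random.randn(100, 13)
full = add_deltas(static)
print(full.shape)  # (100, 39)
```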
Recap: MFCCs

• Motivated by human speech perception and speech production
• For each speech frame:

  ‣ Compute frequency spectrum and apply Mel binning
  ‣ Compute cepstrum using inverse DFT on the log of the mel-warped spectrum
  ‣ 39-dimensional MFCC feature vector: first 12 cepstral coefficients + energy + 13 delta + 13 double-delta coefficients (a one-call toolkit sketch follows below)
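
For reference, toolkits bundle the whole pipeline into a few calls. A sketch using librosa, assuming a 16 kHz mono recording in a hypothetical file utterance.wav; note that librosa estimates deltas with a smoothed derivative rather than the simple difference above, and its defaults (DCT type, number of mel filters, etc.) may differ from this lecture's recipe.

```python
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)         # hypothetical file name
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)   # ~25 ms frames, 10 ms shift
delta = librosa.feature.delta(mfcc)                      # delta features
delta2 = librosa.feature.delta(mfcc, order=2)            # double-delta features
features = np.vstack([mfcc, delta, delta2])              # 39 x num_frames
```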
Other features

• Perceptual Linear Prediction (PLP) features
• Mel filter-bank features (used with DNNs)
• Neural network-based "bottleneck features" (covered in lecture 8)
  ‣ Train deep NN using conventional acoustic features
  ‣ Introduce a narrow hidden layer (e.g. 40 hidden units) referred to as the bottleneck layer, forcing the neural network to encode relevant information in this layer
  ‣ Use hidden unit activations in the bottleneck layer as features (see the sketch below)
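
A schematic PyTorch sketch of the bottleneck idea above: a classifier with a narrow 40-unit hidden layer whose activations are reused as features after training. The layer sizes, the 351-dimensional spliced MFCC input (9 frames × 39 dims) and the number of output states are illustrative assumptions, not values from the lecture.

```python
import torch
import torch.nn as nn

class BottleneckNet(nn.Module):
    def __init__(self, in_dim=351, bottleneck_dim=40, n_states=2000):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, bottleneck_dim),          # narrow bottleneck layer
        )
        self.classifier = nn.Sequential(
            nn.ReLU(), nn.Linear(bottleneck_dim, n_states)
        )

    def forward(self, x):
        # Trained to predict e.g. context-dependent HMM states
        return self.classifier(self.encoder(x))

    def bottleneck_features(self, x):
        # After training, the bottleneck activations serve as acoustic features
        return self.encoder(x)

net = BottleneckNet()
feats = net.bottleneck_features(torch.randn(8, 351))   # (batch, 40)
```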


Features used for speaker recognition

• E.g. from a recent speaker identification (VoxCeleb) task
• Input features, F: spectrograms generated in a sliding-window fashion using a Hamming window of width 25 ms and step 10 ms
• F used as input to a CNN architecture
• Mean and variance normalisation performed on every frequency bin of the spectrum (crucial for performance!) (see the sketch below)

Table 7 (from the paper): Results for identification on VoxCeleb (higher is better)

  System                     Top-1 (%)   Top-5 (%)
  I-vectors + SVM               49.0        56.6
  I-vectors + PLDA + SVM        60.8        75.6
  CNN-fc-3s no var. norm.       63.5        80.3
  CNN-fc-3s                     72.4        87.4
  CNN                           80.5        92.1

Nagrani et al., "VoxCeleb: a large-scale speaker identification dataset", Interspeech 2017
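
A rough NumPy sketch of the input-feature recipe described above: a sliding-window spectrogram (25 ms Hamming window, 10 ms step) with mean and variance normalisation of every frequency bin. The FFT size and the use of raw magnitudes are assumptions; consult the VoxCeleb paper for the exact pre-processing.

```python
import numpy as np

def normalized_spectrogram(x, sample_rate=16000, frame_ms=25, shift_ms=10, n_fft=512):
    """Sliding-window spectrogram with per-frequency-bin mean/variance normalization."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    window = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // shift           # assumes len(x) >= frame_len
    frames = np.stack([x[i * shift : i * shift + frame_len] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))    # (frames, frequency bins)
    mean = spec.mean(axis=0, keepdims=True)
    std = spec.std(axis=0, keepdims=True) + 1e-10
    return (spec - mean) / std                             # normalize each frequency bin
```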
About pronunciations

• There exist a number of different alphabets to transcribe phonetic sounds
  ‣ E.g. ARPAbet (used in CMUdict)
  ‣ International Phonetic Alphabet (IPA) for all languages
Pronunciation Dictionary/Lexicon

• Pronunciation model/dictionary/lexicon: lists one or more pronunciations for a word (toy example below)
• Typically derived from language experts: sequence of phones written down for each word
• Dictionary construction involves:

  1. Selecting what words to include in the dictionary
  2. Pronunciation of each word (also, check for multiple pronunciations)
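
As a toy illustration of a lexicon as a data structure, the dictionary below maps each word to one or more ARPAbet-style phone sequences (stress markers omitted). The entries are illustrative rather than copied from CMUdict.

```python
# Toy pronunciation lexicon: word -> list of possible phone sequences
lexicon = {
    "speech":   [["S", "P", "IY", "CH"]],
    "the":      [["DH", "AH"], ["DH", "IY"]],                 # multiple pronunciations
    "acoustic": [["AH", "K", "UW", "S", "T", "IH", "K"]],
}

def pronunciations(word):
    """Return the list of phone sequences for a word, or [] if out of vocabulary."""
    return lexicon.get(word.lower(), [])

print(pronunciations("the"))  # [['DH', 'AH'], ['DH', 'IY']]
```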
Graphemes vs. Phonemes

• Instead of a pronunciation dictionary, one could represent a pronunciation as a sequence of graphemes (or letters). That is, model at the grapheme level.
• Useful technique for low-resourced/under-resourced languages
• Main advantages:

  1. Avoid the need for phone-based pronunciations
  2. Avoid the need for a phone alphabet
  3. Works pretty well for languages with a systematic relationship between graphemes (letters) and phonemes (sounds)
Grapheme-based ASR

WER (%) for phonetic vs. graphemic systems, Babel Full Language Pack (FLP), Tandem-SAT:

  Language (ID)            System      Vit    CN     CNC
  Kurmanji Kurdish (205)   Phonetic    67.6   65.8   64.1
                           Graphemic   67.0   65.3
  Tok Pisin (207)          Phonetic    41.8   40.6   39.4
                           Graphemic   42.1   41.1
  Cebuano (301)            Phonetic    55.5   54.0   52.6
                           Graphemic   55.5   54.2
  Kazakh (302)             Phonetic    54.9   53.5   51.5
                           Graphemic   54.0   52.7
  Telugu (303)             Phonetic    70.6   69.1   67.5
                           Graphemic   70.9   69.5
  Lithuanian (304)         Phonetic    51.5   50.2   48.3
                           Graphemic   50.9   49.5

Table 4: Babel FLP Tandem-SAT performance. Vit: Viterbi decoding, CN: confusion network decoding, CNC: CN-combination.
Image from: Gales et al., "Unicode-based graphemic systems for limited resource languages", ICASSP 2015
Grapheme to phoneme (G2P) conversion

• Produce a pronunciation (phoneme sequence) given a written word (grapheme sequence)
• Learn G2P mappings from a pronunciation dictionary
• Useful for:
  ‣ ASR systems in languages with no pre-built lexicons
  ‣ Speech synthesis systems
  ‣ Deriving pronunciations for out-of-vocabulary (OOV) words
G2P Conversion

• One popular paradigm: joint sequence models [BN12]
  ‣ Grapheme and phoneme sequences are first aligned using an EM-based algorithm
  ‣ Results in a sequence of graphones (joint G-P tokens)
  ‣ N-gram models trained on these graphone sequences (toy scoring sketch below)
  ‣ WFST-based implementation of such a joint graphone model [Phonetisaurus]

[BN12]: Bisani & Ney, "Joint sequence models for grapheme-to-phoneme conversion", Specom 2012
[Phonetisaurus]: J. Novak, Phonetisaurus Toolkit
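
A toy sketch of the graphone idea: a word is decomposed into joint grapheme-phoneme tokens and the sequence is scored by an n-gram model over those tokens. The alignment and the bigram probabilities below are made up purely for illustration; real systems (e.g. [BN12], Phonetisaurus) learn both from a pronunciation dictionary with EM.

```python
import math

# Each graphone pairs a grapheme chunk with a phone chunk, e.g. ("ph", "F").
# Hypothetical decomposition of "phone" -> F OW N:
graphones = [("ph", "F"), ("o", "OW"), ("ne", "N")]

# Made-up bigram log-probabilities over graphone tokens (with sentence boundaries)
bigram_logprob = {
    ("<s>", ("ph", "F")): math.log(0.2),
    (("ph", "F"), ("o", "OW")): math.log(0.5),
    (("o", "OW"), ("ne", "N")): math.log(0.4),
    (("ne", "N"), "</s>"): math.log(0.6),
}

def score(graphone_seq):
    """Log-probability of a graphone sequence under the toy bigram model."""
    tokens = ["<s>"] + graphone_seq + ["</s>"]
    return sum(bigram_logprob[(a, b)] for a, b in zip(tokens, tokens[1:]))

# Higher-scoring graphone sequences correspond to more likely pronunciations
print(score(graphones))
```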
