
AUDIO PROCESSING
ARIHARASUDHAN
INTRODUCTION
Hello World! Recognizing speech is an art, and a fascinating one to learn. It's Ari from The South, here to dig deeper into the concepts of speech recognition. It can be simply defined as "the process of enabling a MODEL to recognize the text in a speech or audio sample". In other words, it is a way of communicating with the computer through speech. It involves processing the audio, which we will discuss in detail in a little while.
CHALLENGES
Yet, we have to overcome a lot of challenges. What if we have to recognize the speech of the following person? (Not that of Rowan Atkinson, but MR. BEAN!) Natural speech is highly variable due to differences in accents, speaking rates, and individual styles. This variability poses a significant challenge for speech recognition systems. Environmental noise can also degrade their performance, so robust algorithms are required to filter out background noise and focus on the relevant speech signal. Speech recognition systems further need to handle input from different speakers, each with their own unique characteristics; training models to be speaker-independent is crucial for widespread applicability. Finally, the system's ability to recognize a wide vocabulary is essential.
SOUND: WHAT IS IT

A sound signal is produced by variations in air pressure. The height of the waveform shows the intensity of the sound and is known as the amplitude. The time taken for the signal to complete one full wave is the period. The number of waves made by the signal in one second is called the frequency (the reciprocal of the period). The majority of sounds may not follow such regular periodic patterns.
SOUND: HOW TO REPRESENT IT
To convert a sound wave into
numbers, we need to measure the
amplitude at fixed intervals of time.
Each such measurement is called a
sample, and the sample rate is the
number of samples per second.
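For instance, here is a minimal sketch of what "measuring the amplitude at fixed intervals" looks like in code; the 16 kHz sample rate and the pure 440 Hz tone are just assumed values for illustration.

import numpy as np

# "Sampling" a pure 440 Hz tone: measure the amplitude 16000 times per second.
sample_rate = 16000                                     # assumed sample rate
t = np.linspace(0, 1.0, sample_rate, endpoint=False)    # one second of time points
samples = 0.5 * np.sin(2 * np.pi * 440 * t)             # amplitude at each instant
print(samples.shape)                                    # (16000,) -> one number per sample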
SOUND: IN DEEP LEARNING
What if we convert audio into images and then use a standard CNN architecture to process those images? Converting audio into images! So, how do we do it? If there is an audio clip that says "CAT", can we convert the audio into a CAT image and process it with a CNN? This sounds like a STRANGE ALIEN ENTERED TIRUNELVELI. But we don't (like to) do this! What we do instead is represent the audio as a spectrogram image. The spectrogram plots all of the frequencies present in the signal along with the amplitude of each frequency.
DOMAINS
If a waveform shows Amplitude against Time (x-axis showing the range of time values of the signal), the signal is in the Time Domain. If a waveform shows Amplitude against Frequency (x-axis showing the range of frequencies of the signal), the signal is in the Frequency Domain. A spectrogram is a Frequency Domain representation of a waveform. It is obtained by applying the Fourier Transform to decompose the waveform into its frequency components.
STEPS IN “SPECTROGRAMS & CNN”
1. Use a library like librosa or Pydub to load the audio file.
2. If needed, resample the audio to a common sampling rate (some models expect audio at a fixed sampling rate).
3. Scale or normalize the audio data to a common range, typically between -1 and 1.
4. Use a Short-Time Fourier Transform (STFT) to convert the audio signal into a spectrogram: a 2D representation of the audio signal where time is on the x-axis, frequency is on the y-axis, and intensity is represented by color.
5. Normalize the spectrogram values.
6. Resize or crop the spectrogram to a fixed size (the input size for the CNN).
7. Split the data into training and testing sets, and label the data according to the target classes.
8. Design a CNN architecture suitable for the task and choose an appropriate loss function (e.g., categorical cross-entropy for classification tasks).
9. Feed the spectrogram data into the CNN model and train it on the training dataset using the compiled settings.
10. Assess the model's performance on the test set.
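As a rough, hedged sketch of the preprocessing half of this pipeline with librosa (the file name audio.wav and every parameter value below are placeholders, not prescriptions):

import numpy as np
import librosa

# A rough sketch of steps 1-6 above.
TARGET_SR = 16000                                     # common sampling rate

y, sr = librosa.load("audio.wav", sr=TARGET_SR)       # load + resample
y = y / (np.max(np.abs(y)) + 1e-9)                    # scale roughly to [-1, 1]

stft = librosa.stft(y, n_fft=1024, hop_length=512)    # Short-Time Fourier Transform
spec = np.abs(stft)                                   # magnitude spectrogram
spec = librosa.amplitude_to_db(spec, ref=np.max)      # move to a log (dB) scale

spec = (spec - spec.mean()) / (spec.std() + 1e-9)     # normalize the values
spec = spec[:, :128]                                  # naive fixed-width crop

# From here, `spec` (with a channel axis added) can be fed to an image-style CNN
# trained with a loss such as categorical cross-entropy.
x = spec[np.newaxis, ..., np.newaxis]                 # (batch, freq, time, channel)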
PLAY WITH AUDIO
Audio files are stored in different formats (.wav, .mp3, .flac, .aac, etc.) based on the way they are compressed. WAV is an uncompressed format. MP3 employs lossy compression, sacrificing some audio data for significantly smaller files. FLAC, on the other hand, uses lossless compression, reducing file sizes without compromising audio quality, providing a middle ground between WAV's fidelity and MP3's efficiency. There are many libraries such as torchaudio, librosa and scipy to play with audio files. Let's read an audio file using all three.
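For example, assuming a hypothetical file sample.wav, the same clip can be read with each library roughly as follows.

import torchaudio
import librosa
from scipy.io import wavfile

path = "sample.wav"                        # a hypothetical WAV file

# torchaudio: returns a (channels, samples) tensor plus the sample rate
waveform, sr_ta = torchaudio.load(path)

# librosa: returns a 1D float array (mono by default) plus the sample rate
y, sr_lr = librosa.load(path, sr=None)     # sr=None keeps the original rate

# scipy: returns the sample rate plus a NumPy array of raw samples
sr_sp, data = wavfile.read(path)

print(waveform.shape, y.shape, data.shape)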
To visualize the sound wave, we can
do the following.
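One possible way, using librosa's plotting helper (waveshow in recent librosa versions; the file name is again a placeholder):

import matplotlib.pyplot as plt
import librosa
import librosa.display

# Plot amplitude against time for the whole clip.
y, sr = librosa.load("sample.wav", sr=None)
plt.figure(figsize=(10, 3))
librosa.display.waveshow(y, sr=sr)
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.show()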
It is also really easy to listen to the audio in a Jupyter Notebook.
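For example, with IPython's display utilities:

import IPython.display as ipd

# Renders an inline audio player in the notebook.
ipd.Audio("sample.wav")          # from a file
# ipd.Audio(y, rate=sr)          # or directly from an array of samples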
AUDIO TENSOR
Audio is represented as a time series
of numbers, representing the
amplitude at each timestep. For
instance, if the sample rate is n, a
one-second clip of that audio would
have n numbers. [ For m sec, m*n ]
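As a quick illustration (the file name, the 22050 Hz rate, and the 3-second duration are arbitrary choices):

import librosa

# A 3-second clip at 22050 Hz should give 3 * 22050 = 66150 samples
# (assuming the underlying file is at least 3 seconds long).
y, sr = librosa.load("sample.wav", sr=22050, duration=3.0)
print(sr, y.shape)    # 22050 (66150,)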
BIT DEPTH & AMPLITUDE LEVELS
Quantization means representing or expressing audio data by assigning it to specific discrete values or levels. The bit depth determines the number of amplitude levels that can be used to quantize the audio signal. A 16-bit audio signal can represent 2^16 (65,536) different amplitude levels, while a 24-bit signal can represent 2^24 (16,777,216) levels. The higher the bit depth, the more finely the amplitude levels can be represented.
PLOT SPECTROGRAM
To plot a spectrogram for an audio file, the first step is to chop up the duration of the sound signal into smaller time segments and then apply the Fourier Transform to each segment to determine the frequencies contained in that segment. Then, the Fourier Transforms for all those segments are combined into a single plot. The plot is a Frequency (y-axis) vs Time (x-axis) representation which uses different colors to indicate the Amplitude of each frequency. The brighter the color, the higher the energy of the signal.
We can display the spectrogram as
shown below.
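A minimal sketch with librosa, assuming a placeholder file sample.wav and illustrative STFT settings:

import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# A plain spectrogram: linear frequency axis, linear amplitude for the colors.
y, sr = librosa.load("sample.wav", sr=None)
spec = np.abs(librosa.stft(y, n_fft=1024, hop_length=512))

plt.figure(figsize=(10, 4))
librosa.display.specshow(spec, sr=sr, hop_length=512,
                         x_axis="time", y_axis="linear")
plt.colorbar(label="Amplitude")
plt.show()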

Unfortunately, when we display this spectrogram, there isn't much information for us to see. What happened to all those colorful spectrograms we used to see? This happens because of the way humans perceive sound. Most of what we are able to hear is concentrated in a narrow range of frequencies and amplitudes. The human ear is capable of detecting a wide range of frequencies, typically from 20 Hz to 20,000 Hz. Different regions of the inner ear (cochlea) are sensitive to different frequency ranges. High-frequency sounds stimulate portions of the cochlea near its entrance, while low-frequency sounds stimulate portions near the end.
WE HEAR ON LOGARITHMIC SCALE
In our perception of sound, the term
pitch refers to how high or low a
sound seems to be. High-pitched
sounds have a higher frequency,
meaning the sound waves oscillate
more rapidly. Low-pitched sounds
have a lower frequency, meaning the
sound waves oscillate more slowly.
This sensation is linked to the
frequency of the sound. However, our
ears don't interpret frequencies in a
straightforward, linear manner.
Consider the following pairs of
sounds:
- 100Hz and 200Hz
- 1000Hz and 1100Hz
- 10000Hz and 10100Hz
Although each pair has the same
actual frequency difference of
100Hz, we'll likely perceive the
"distance" between them
differently. The pair at 100Hz and
200Hz may sound more distinct than
the pair at 1000Hz and 1100Hz, and
you might find it challenging to
distinguish between the pair at
10000Hz and 10100Hz. This
perception aligns with the fact that
doubling the frequency (as in 100Hz
to 200Hz) may seem like a more
significant change to our ears
compared to a 1% increase in
frequency (as in 10000Hz to
10100Hz). Humans hear frequencies
on a logarithmic scale, not a linear
one.
To account for this in our audio data
analysis, especially in tasks involving
audio, we often use techniques like
logarithmic scaling or converting
frequencies to the mel scale. These
approaches better align with how our
ears perceive sound.
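As a small illustration, librosa's hz_to_mel conversion shows how the pairs above, each 100 Hz apart, end up very differently spaced on the mel scale:

import librosa

# Pairs that are all 100 Hz apart are spaced very differently on the mel scale,
# mirroring how differently we perceive them.
for f1, f2 in [(100, 200), (1000, 1100), (10000, 10100)]:
    gap = librosa.hz_to_mel(f2) - librosa.hz_to_mel(f1)
    print(f"{f1} Hz vs {f2} Hz : {gap:.1f} mel apart")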
LOUDNESS - LOGARITHMIC SCALE
The human perception of the amplitude of a sound is its loudness. And similar to frequency, we hear loudness logarithmically rather than linearly. We account for this using the Decibel scale. On this scale, 0 dB corresponds to near-total silence (the threshold of hearing). From there, the scale grows exponentially: 10 dB is ten times more intense than 0 dB, 20 dB is a hundred times more intense, and 30 dB is a thousand times more intense. On this scale, a sound above 100 dB starts to become unbearably loud. We can see that, to deal with sound in a realistic manner, it is important for us to use logarithmic scales: the Mel Scale for frequency and the Decibel Scale for loudness.
MEL SPECTROGRAM
A Mel Spectrogram makes two
important changes relative to a
regular Spectrogram that plots
Amplitude in a Time vs Frequency
graph. It uses the Mel Scale instead of
Frequency on the y-axis. It uses the
Decibel Scale instead of Amplitude to
indicate colors. For deep learning
models, we usually use this rather
than a simple Spectrogram. Let’s
modify our Spectrogram code to use
the Mel Scale in place of Frequency.
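A possible version of that change, using librosa's melspectrogram (the file name and parameter values are just placeholders):

import librosa
import librosa.display
import matplotlib.pyplot as plt

# Mel Spectrogram: mel-scaled y-axis, power still on a linear scale.
y, sr = librosa.load("sample.wav", sr=None)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=512, n_mels=64)

plt.figure(figsize=(10, 4))
librosa.display.specshow(mel, sr=sr, hop_length=512,
                         x_axis="time", y_axis="mel")
plt.colorbar(label="Power")
plt.show()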
This is better than before, but most
of the spectrogram is still dark and
not carrying enough useful
information. So let’s modify it to use
the Decibel Scale instead of
Amplitude.
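The same sketch, now converting the power values to decibels with power_to_db:

import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Mel Spectrogram with the color scale in decibels rather than raw power.
y, sr = librosa.load("sample.wav", sr=None)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=512, n_mels=64)
mel_db = librosa.power_to_db(mel, ref=np.max)

plt.figure(figsize=(10, 4))
librosa.display.specshow(mel_db, sr=sr, hop_length=512,
                         x_axis="time", y_axis="mel")
plt.colorbar(format="%+2.0f dB")
plt.show()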
We have now seen how we pre-
process audio data and prepare Mel
Spectrograms. But before we can
input them into deep learning
models, we have to optimize them to
obtain the best performance.
SPECTROGRAM OPTIMIZATION
To optimize the Mel Spectrograms for
our specific problem in deep learning,
we need to adjust various
hyperparameters. Understanding the
construction of Spectrograms
involves concepts such as Fast
Fourier Transform (FFT) and Short-
Time Fourier Transform (STFT).
Discrete Fourier Transform (DFT) is
a technique for computing Fourier
Transforms, but it can be
computationally expensive. In
practice, we use the more efficient
Fast Fourier Transform (FFT)
algorithm. However, FFT provides
frequency components for the entire
audio time series without detailing
how these frequencies change over
time. You will not be able to see, for
example, that the first part of the
audio had high frequencies while
the second part had low
frequencies, and so on. To capture
frequency variations over time, we
turn to the Short-time Fourier
Transform (STFT) algorithm. STFT
breaks the audio signal into smaller
sections using a sliding time
window, applies FFT to each
section, and then combines them.
This approach allows us to observe
how frequency components change
within the audio signal over time,
providing a more detailed and
granular perspective.
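A quick sketch of the difference (file name and window settings are placeholders): the FFT of the whole clip yields a single spectrum, while the STFT yields one spectrum per sliding window.

import numpy as np
import librosa

y, sr = librosa.load("sample.wav", sr=None)

# FFT of the whole clip: one spectrum for the entire signal, no time information.
whole_spectrum = np.abs(np.fft.rfft(y))                    # shape: (n_freqs,)

# STFT: one spectrum per sliding window, i.e. frequency content over time.
stft = np.abs(librosa.stft(y, n_fft=1024, hop_length=512))
print(whole_spectrum.shape, stft.shape)                    # (n_freqs,) vs (513, n_frames)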
Building on the STFT, the Mel Spectrogram algorithm splits the signal into sections along the Time axis. Secondly, it also splits the signal into sections along the Frequency axis: it takes the full range of frequencies and divides it up into equally spaced bands (in the Mel scale). Then, for each section of time, it calculates the Amplitude or energy for each frequency band.
Let’s make this clear with an example.
We have a 1-minute audio clip that
contains frequencies between 0Hz
and 10000 Hz (in the Mel scale). Let’s
say that the Mel Spectrogram
algorithm:
> Chooses windows such that it splits
our audio signal into 30 time-sections.
> Decides to split our frequency
range into 10 bands (ie. 0–1000Hz,
1000–2000Hz, … 9000–10000Hz). The
final output of the algorithm is a 2D
Numpy array of shape (10, 30) where
each of the 30 columns represents
the FFT for one time-section; each of
the 10 rows represents Amplitude
values for a frequency band. Let’s
take the first column, which is the FFT
for the first time section. It has 10
rows. The first row is the Amplitude
for the first frequency band between
0–1000 Hz. The second row is the
Amplitude for the second frequency
band between 1000–2000 Hz. Each
column in the array becomes a
column in our Mel Spectrogram
image.
HYPER-PARAMETERS
There are some hyperparameters for tuning our Mel Spectrogram.
Frequency Bands
> fmin - minimum frequency to display
> fmax - maximum frequency to display
> n_mels - number of frequency bands (i.e., Mel bins); this is the height of the Spectrogram
Time Sections
> n_fft - window length for each time section
> hop_length - number of samples by which to slide the window at each step; hence, the width of the Spectrogram = total number of samples / hop_length
Based on the nature of the dataset, these should be adjusted, as in the sketch below.
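Here is a hedged sketch of how those hyperparameters map onto librosa's melspectrogram, set up to roughly reproduce the (10, 30) layout from the earlier example; the file name and the exact values are illustrative only.

import librosa

# Roughly reproduce the earlier example: 10 mel bands between 0 and 10000 Hz,
# and a hop of 2 seconds so a 60-second clip yields about 30 columns.
y, sr = librosa.load("one_minute_clip.wav", sr=22050, duration=60.0)
mel = librosa.feature.melspectrogram(
    y=y, sr=sr,
    n_fft=2048,          # window length for each time section
    hop_length=sr * 2,   # slide by 2 seconds of samples -> ~30 time-sections
    n_mels=10,           # number of mel bands -> the height of the Spectrogram
    fmin=0,              # lowest frequency to keep
    fmax=10000,          # highest frequency to keep
)
print(mel.shape)         # roughly (10, 30): 10 frequency bands x ~30 time-sections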
MFCC
It sometimes helps to take one additional step and convert the Mel Spectrogram into MFCCs (Mel Frequency Cepstral Coefficients). MFCCs produce a compressed representation of the Mel Spectrogram by extracting only the most essential frequency coefficients. But what are they really? They are descriptors! In simple words, they are like embeddings of audio, capturing its essence, and they can be applied to various audio processing tasks. The mel frequency cepstral coefficients (MFCCs) of a signal are a small set of features (usually about 10-20) which concisely describe the overall shape of a spectral envelope. In MIR (Music Information Retrieval), they are often used to describe timbre.
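A typical way to compute them with librosa (13 coefficients is a common, but not mandatory, choice; the file name is a placeholder):

import librosa

# MFCCs: a small set of coefficients per time frame, summarizing the spectral envelope.
y, sr = librosa.load("sample.wav", sr=None)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfcc.shape)        # (13, n_frames)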
DATA AUGMENTATION
To increase the diversity of the given dataset (when we don't have enough data), augmenting our data is a good choice. We do this by modifying the existing data samples slightly. There are several techniques to augment audio data.
1. One way is to augment the spectrogram by MASKING, where we can mask out some time range or frequency range.
2. Secondly, we can augment the audio itself by some of the methods listed below.
> Time Shift : shifting the audio to the left or the right by a random amount
> Pitch Shift : modifying the frequency randomly
> Time Stretch : randomly slowing down or speeding up the audio
> Noise Addition : adding random noise to the signal
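A rough sketch of these augmentations, mixing torchaudio's masking transforms with librosa's effects; the file name and all parameter values are illustrative.

import numpy as np
import torch
import torchaudio
import librosa

y, sr = librosa.load("sample.wav", sr=None)

# 1. Spectrogram masking (SpecAugment-style) with torchaudio transforms
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
mel_t = torch.tensor(mel).unsqueeze(0)                  # (1, n_mels, time)
mel_t = torchaudio.transforms.FrequencyMasking(freq_mask_param=8)(mel_t)
mel_t = torchaudio.transforms.TimeMasking(time_mask_param=20)(mel_t)

# 2. Augmenting the raw audio itself
shifted   = np.roll(y, np.random.randint(-sr // 2, sr // 2))   # time shift
pitched   = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)   # pitch shift
stretched = librosa.effects.time_stretch(y, rate=1.2)          # time stretch
noisy     = y + 0.005 * np.random.randn(len(y))                # noise addition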


NANDRI (Thank You!)
