rangements of the tongue and jaw, which result in resonance chambers within the vocal tract. For a given vowel, these chambers produce frequencies known as formants, whose relationship determines the actual sound. Vowels are the most commonly used phoneme type in the English language, making up approximately 38% of all phonemes [35]. Fricatives (e.g., "/s/" in sun) are generated by turbulent flow caused by a constriction in the airway, while stops (e.g., "/g/" in gate) are created by briefly halting and then quickly releasing the airflow in the vocal tract. Affricatives (e.g., "/tS/" in church) are a concatenation of a fricative with a stop. Nasals (e.g., "/n/" in nice) are created by forcing air through the nasal cavity and tend to be at a lower amplitude than the other phonemes. Glides (e.g., "/l/" in lie) act as a transition between different phonemes, and diphthongs (e.g., "/eI/" in wait) refer to the vowel sound that comes from the lips and tongue transitioning between two different vowel positions.

Phonemes alone do not encapsulate how humans speak. The transitions between two phonemes are also important for speech since it is a continuous process. Breaking speech down into pairs of phonemes (i.e., bigrams) preserves the individual information of each phoneme as well as the transitions between them. These bigrams generate a more accurate depiction of the vocal tract dynamics during the speech process.

3.2 Organic Speech

Human speech production results from the interactions between different anatomical components, such as the lungs, larynx (i.e., the vocal cords), and the articulators (e.g., the tongue, cheeks, lips), that work in conjunction to produce sound. The production of sound¹ starts with the lungs forcing air through the vocal cords, which induces an acoustic resonance that contains the fundamental (lowest) frequency of a speaker's voice. The resonating air then moves through the vocal cords and into the vocal tract (Figure 1). At this point, different configurations of the articulators (e.g., where the tongue is placed, how large the mouth is) shape the path for the air to flow, which creates constructive/destructive interference that produces the unique sounds of each phoneme.

¹This process is similar to how trumpets create a sound as air flows through various pipe configurations.

Figure 2: Deepfake generation has several stages to create a fake audio sample. The encoder generates an embedding of the speaker, the synthesizer creates a spectrogram for a targeted phrase using the speaker embedding, and the vocoder converts the spectrogram into the synthetic waveform.

3.3 Deepfake Audio

Deepfakes are digitally produced speech samples that are intended to sound like a specific individual. Currently, deepfakes are produced via the use of machine learning (ML) algorithms. While there are numerous deepfake ML algorithms in existence, the overall framework the techniques are built on is similar. As shown in Figure 2, the framework is comprised of three stages: encoder, synthesizer, and vocoder.

Encoder: The encoder learns the unique representation of the speaker's voice, known as the speaker embedding. These embeddings can be learned using a model architecture similar to that of speaker verification systems [36]. The embedding is derived from a short utterance of the target speaker's voice. The accuracy of the embedding can be increased by giving the encoder more utterances, with diminishing returns. The output embedding from the encoder stage is passed as an input into the following synthesizer stage.

Synthesizer: A synthesizer generates a Mel spectrogram
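To make the three-stage pipeline concrete, the sketch below wires an encoder, synthesizer, and vocoder together in the order shown in Figure 2. It is a structural sketch only: the class and method names are hypothetical placeholders, not the API of RTVC or any other specific tool.

```python
import numpy as np

# Hypothetical stand-ins for the three stages described above; the internals
# of each stage depend entirely on the specific deepfake system being used.

class SpeakerEncoder:
    def embed(self, reference_audio: np.ndarray) -> np.ndarray:
        """Map a short utterance of the target speaker to a fixed-size
        speaker embedding."""
        raise NotImplementedError

class Synthesizer:
    def spectrogram(self, text: str, speaker_embedding: np.ndarray) -> np.ndarray:
        """Produce a spectrogram of `text` conditioned on the speaker embedding."""
        raise NotImplementedError

class Vocoder:
    def waveform(self, spectrogram: np.ndarray) -> np.ndarray:
        """Convert the spectrogram into a time-domain waveform."""
        raise NotImplementedError

def generate_deepfake(reference_audio: np.ndarray, target_phrase: str,
                      encoder: SpeakerEncoder, synthesizer: Synthesizer,
                      vocoder: Vocoder) -> np.ndarray:
    embedding = encoder.embed(reference_audio)                 # stage 1: encoder
    spec = synthesizer.spectrogram(target_phrase, embedding)   # stage 2: synthesizer
    return vocoder.waveform(spec)                              # stage 3: vocoder
```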
Figure 3: The sound produced by a phoneme is highly dependent on the structure of the vocal tract. Constriction made by tongue movement or jaw angle filters different frequencies. (The panels show the vocal tract position during "who" (/hu/) and "has" (/hæz/), with annotations A, B, and C referenced in Section 6.1.)
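As a toy illustration of the frequency-filtering effect described in the caption above, the sketch below pulls the strongest spectral peaks out of a short audio window. It is illustrative only (a windowed FFT plus a local-maximum search over an assumed synthetic signal), not the formant analysis or feature extraction used by our detector.

```python
import numpy as np

def rough_spectral_peaks(window: np.ndarray, sample_rate: int, n_peaks: int = 3):
    """Return the frequencies (Hz) of the strongest local maxima in the
    magnitude spectrum of a short audio window. A rough stand-in for
    formant-like resonances, not a real formant tracker."""
    spectrum = np.abs(np.fft.rfft(window * np.hanning(len(window))))
    freqs = np.fft.rfftfreq(len(window), d=1.0 / sample_rate)
    # Local maxima: bins larger than both of their neighbours.
    peaks = np.where((spectrum[1:-1] > spectrum[:-2]) &
                     (spectrum[1:-1] > spectrum[2:]))[0] + 1
    strongest = peaks[np.argsort(spectrum[peaks])[::-1][:n_peaks]]
    return np.sort(freqs[strongest])

# Toy example: a synthetic "vowel" built from three resonant frequencies.
fs = 16_000
t = np.arange(0, 0.03, 1 / fs)                       # 30 ms window
toy = sum(np.sin(2 * np.pi * f * t) for f in (300, 2300, 3000))
print(rough_spectral_peaks(toy, fs))                 # approximately [ 300. 2300. 3000.]
```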
6.1 Reader Participation

Before we go further into the details of these two steps, we would like to help the reader develop a deeper intuition of phonemes and speech generation.

For speech, air must move from the lungs to the mouth while passing through various components of the vocal tract. To understand the intuition behind our technique, we invite the reader to speak out loud the words "who" (phonetically spelled "/hu/") and "has" (phonetically spelled "/hæz/") while paying close attention to how the mouth is positioned during the pronunciation of each vowel phoneme (i.e., "/u/" in "who" and "/æ/" in "has").

Figure 3 shows how the components are arranged during the pronunciation of the vowel phonemes for each word mentioned above. Notice that during the pronunciation of the phoneme "/u/" in "who" the tongue compresses to the back of the mouth (i.e., away from the teeth) (A); at the same time, the lower jaw is held predominately closed. The closed jaw position lifts the tongue so that it is closer to the roof of the mouth (B). Both of these movements create a specific pathway through which the air must flow as it leaves the mouth. Conversely, the vowel phoneme "/æ/" in "has" elongates the tongue into a more forward position (A) while the lower jaw distends, causing there to be more space between the tongue and the roof of the mouth. This tongue position results in a different path for the air to flow through, and thus creates a different sound. In addition to tongue and jaw movements, the position of the lips also differs for both phonemes. For "/u/", the lips round to create a smaller, more circular opening (C). Alternatively, "/æ/" has the lips unrounded, leaving a larger, more elliptical opening. Just like the tongue and jaw position, the shape of the lips also impacts the sound created.

One additional component that affects the sound of a phoneme is the other phonemes that are adjacent to it. For example, take the words "ball" (phonetically spelled "/bOl/") and "thought" (phonetically spelled "/TOt/"). Both words contain the phoneme "/O/"; however, the "/O/" in "thought" is affected by the adjacent phonemes differently than how the "/O/" in "ball" is. In particular, "thought" ends with the plosive "/t/", which requires a break in airflow, thus causing the speaker to abruptly end the "/O/" phoneme. In contrast, the "/O/" in "ball" is followed by the lateral approximant "/l/", which does not require a break in airflow, leading the speaker to gradually transition between the two phonemes.

6.2 Vocal Tract Feature Estimator

Based on the intuition built in the previous subsection, our modeling technique needs to be able to extract the shape of the vocal tract present during the articulation of a specific bigram. To do this, we use a fluid dynamic concatenated tube model to estimate the speaker's vocal tract, similar to Rabiner et al.'s technique [27]. Before we go into the details of this model, it is important to discuss the assumptions the model makes.

• Lossless Model: Our model ignores energy losses that result from the fluid viscosity (i.e., the friction losses between molecules of the air), the elastic nature of the vocal tract (i.e., the cross-sectional area changing due to a change in internal pressure), and friction between the fluid and the walls of the vocal tract. Ignoring these energy losses will result in our model having acoustic dampening, causing the lower formant frequencies to increase in value³ and an increase in the bandwidth of all formant frequency spikes⁴. Additionally, we assume the walls of the vocal tract have an infinitely high acoustic impedance (i.e., sound can only exit the speaker from their mouth), which will result in our model missing trace amounts of low bass frequencies. Overall, these assumptions simplify the modeling process while decreasing the accuracy of our technique by a marginal amount, and they are consistent with prior work [27].

³This effect is mainly caused by the elastic nature of the vocal tract walls.
⁴The viscosity and friction losses predominately affect frequencies above 4 kHz [27].
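Before describing the estimator itself, it helps to pin down what "the articulation of a specific bigram" corresponds to in the data: a time window spanning two consecutive phonemes. The sketch below pairs consecutive phoneme intervals into such windows; the (phoneme, start, end) tuple format is an assumed simplification of the output of a forced aligner such as Gentle.

```python
from typing import List, Tuple

Phone = Tuple[str, float, float]   # (phoneme label, start seconds, end seconds)

def phoneme_bigrams(phones: List[Phone]) -> List[Tuple[str, float, float]]:
    """Pair consecutive phonemes into bigram windows. Each bigram keeps the
    label "p1-p2" and spans from the start of the first phoneme to the end
    of the second, preserving the transition between the two articulations."""
    bigrams = []
    for (p1, s1, _e1), (p2, _s2, e2) in zip(phones, phones[1:]):
        bigrams.append((f"{p1}-{p2}", s1, e2))
    return bigrams

# Toy alignment of "who has" (/hu/ /haez/)
alignment = [("hh", 0.00, 0.08), ("uw", 0.08, 0.24),
             ("hh", 0.30, 0.36), ("ae", 0.36, 0.52), ("z", 0.52, 0.61)]
print(phoneme_bigrams(alignment))
# [('hh-uw', 0.0, 0.24), ('uw-hh', 0.08, 0.36), ('hh-ae', 0.3, 0.52), ('ae-z', 0.36, 0.61)]
```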
[Figure: the concatenated tube model. Tubes with cross-sectional areas A1-A6 and lengths L1-L6 are joined at junctions where the forward and negative volumetric flows u0 and u1 are scaled by (1 + rk), (1 - rk), and ±rk.]
5 kHz. It is also the reason why cellular codecs, such as those used in GSM networks, filter out noise in higher frequencies [39].
Figure 6: High-level overview of how the vocal tract feature estimator works. A speaker's audio sample (a single window from a bigram) has its frequency data extracted using an FFT, the output of which is used as the target for our transfer function to reproduce. The transfer function is run twice over a range of frequencies ω0, ..., ωN. The first application of the transfer function uses the current reflection coefficients r0, ..., rN with a step size offset added to a single coefficient. The second application instead subtracts the step size offset from the same single coefficient. The estimated frequency response curve calculated for both series is subtracted from the target curve. Whichever reflection coefficient results in a lower area under the resulting curve will be selected for the next iteration. This process continues (applying the step size offset to all of the reflection coefficients) until the area under the subtracted curves approaches zero, indicating that we have found a reflection coefficient series that approximately replicates the original speaker's vocal tract.
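The procedure in Figure 6 amounts to a coordinate search over the reflection coefficients. The sketch below follows that description; tube_transfer_function is a placeholder standing in for the transfer function of Equations 4 and 5 (not reproduced here), and the step size, iteration limit, and convergence tolerance are illustrative choices rather than tuned values. For simplicity the sketch leaves every coefficient free, whereas the boundary coefficients rG and rAtm are fixed to 1 as discussed below.

```python
import numpy as np

def tube_transfer_function(reflection_coeffs, freqs_hz):
    """Placeholder for the concatenated-tube transfer function of
    Equations 4 and 5; returns the estimated |V(ω)| at each frequency."""
    raise NotImplementedError

def estimate_reflection_coeffs(window, sample_rate, n_coeffs=16,
                               step=0.01, max_iters=500, tol=1e-3):
    # Target frequency response of the speaker's single-window sample (FFT).
    target = np.abs(np.fft.rfft(window))
    freqs = np.fft.rfftfreq(len(window), d=1.0 / sample_rate)

    def area_error(coeffs):
        estimate = tube_transfer_function(coeffs, freqs)
        # Area under the |estimate - target| curve (uniform frequency spacing).
        return np.sum(np.abs(estimate - target)) * (freqs[1] - freqs[0])

    r = np.zeros(n_coeffs)                        # current reflection coefficients
    for _ in range(max_iters):
        for k in range(n_coeffs):
            plus, minus = r.copy(), r.copy()
            plus[k] += step                       # first pass: coefficient + step
            minus[k] -= step                      # second pass: coefficient - step
            err_plus, err_minus = area_error(plus), area_error(minus)
            r = plus if err_plus < err_minus else minus   # keep the lower-error choice
        if min(err_plus, err_minus) < tol:        # area approaches zero -> converged
            break
    return r
```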
where rG is the reflection coefficient at the glottis, r1, ..., rN are the reflection coefficients for every consecutive tube pair in the series, rAtm is the reflection coefficient at the mouth, L is the length of each tube, C is the speed of sound (34,300 cm/s), j is the imaginary constant, and ω is the frequency of the waveform in rad/s. V(ω) is the volumetric flow rate at the lips during the pronunciation of a certain frequency, which is directly related to acoustic pressure (i.e., the amplitude of the voice at frequency ω). We separate the denominator of Equation 4 out into Equation 5 for increased readability.

These equations together are a simplified representation of a system of 2N equations (N instances of Equation 1 and N of Equation 2) that represents a series of N connected tube intersections (Figure 6). Since the volumetric flow rate through every tube within this series must be equal, we can simplify the 2N equations to Equations 4 and 5. We refer the reader to Rabiner et al.'s work for a full derivation of these equations [27].

It is important to note that this differential equation lacks a closed-form solution and thus we must specify several boundary conditions before solving the equation. Specifically, we must fix the number of tubes used in the series (N) and the reflection coefficients at both the beginning (rG) and end of the series (rAtm). This helps to more closely bind our equation to the physical anatomy from which it is modeled.

We can determine the number of tubes necessary for our model by taking the average human vocal tract length (approximately 15.5 cm [40]) and dividing by the length of each tube. This length, L, can be determined by the following equation (the derivation of this equation can be found in Section 3.4.1 of Rabiner et al.'s work [27]):

L = TC / 2    (6)

where T is the period between samples in our audio recordings. In our study, all of our audio samples had a sampling rate of 16 kHz. This sampling rate was selected since it captures the most important frequencies for speech comprehension and is also the most commonly found sampling rate for voice-based systems [41]. By sampling at 16 kHz, our vocal tract model will be made up of 15 distinct pipe resonators.

Next, we can use our understanding of human anatomy to fix the first reflection coefficient in the series (rG in Equation 5). This reflection coefficient represents the fluid reflection that occurs at the speaker's glottis. During large portions of speech (e.g., during vowels) the glottis is actively engaged. This means that the vocal folds are actively vibrating and thus preventing fluid flow in the reverse direction. With this in mind, we can set rG to 1, symbolizing fluid flow only in the forward direction. Finally, the last reflection coefficient rAtm represents the behavior of the flow at the opening of the mouth. Here, once again, we can expect to see predominately positive flow. This is because, during speech, the vocal tract is raised to a higher than atmospheric pressure, preventing flow from moving from the atmosphere back into the vocal tract. We can, therefore, set the last reflection coefficient rAtm equal to 1.

With these boundary conditions, we now have a solvable differential equation that describes the acoustic behavior of our concatenated tube model. Using this equation we are now able to accurately estimate the amplitude of a certain frequency ω during a bigram for a known speaker (that has
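Plugging in the numbers above gives a quick check on the 15-tube figure. This is a minimal worked computation using only the values quoted in the text (16 kHz sampling, C = 34,300 cm/s, a 15.5 cm average vocal tract) together with the two boundary conditions.

```python
C = 34_300           # speed of sound in cm/s
FS = 16_000          # sampling rate in Hz
T = 1.0 / FS         # period between samples in seconds

L = T * C / 2        # Equation 6: tube length in cm
N = 15.5 / L         # average vocal tract length divided by tube length

R_G = 1.0            # boundary condition at the glottis (forward flow only)
R_ATM = 1.0          # boundary condition at the mouth opening

print(f"L = {L:.3f} cm, N = {N:.1f}")   # L = 1.072 cm, N = 14.5 -> about 15 tubes
```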
6.4.2 Optimized Detector

Finally, we construct an optimized detector that only computes and analyzes bigram-features that have been shown to act as strong indicators (i.e., our ideal feature set). This detector will follow the same initial operation as our whole sample detector, decorating the audio data with its corresponding timing and phoneme data. However, unlike the whole sample detector, our optimized detector will only check the bigram-features that best indicate whether the audio sample is organic or deepfake. More specifically, we extract every feature from the sample that exists in both the sample itself and the ideal feature set. For every one of these features, we compare the previously found threshold from the ideal feature set with the value found in the current sample. We count the number of times the values from the test audio sample cross the threshold. If more bigram-feature values cross the threshold than do not, we label the audio sample as a deepfake (i.e., majority voting).
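A minimal sketch of this majority-vote decision rule follows. It assumes the bigram-feature values have already been extracted from the sample, and that each ideal-feature entry carries a direction indicating which side of its threshold signals a deepfake; this 4-tuple layout is an illustrative simplification, not the exact structure produced by the ideal feature extractor of Section 6.3.1.

```python
from typing import Dict, List, Tuple

# Each ideal-feature entry: (bigram, feature name, threshold, direction),
# where direction is "above" or "below". This layout is an assumption made
# for illustration purposes only.
IdealFeature = Tuple[str, str, float, str]

def is_deepfake(sample_features: Dict[Tuple[str, str], float],
                ideal_set: List[IdealFeature]) -> bool:
    """Majority vote over the bigram-features shared by the test sample
    and the ideal feature set."""
    votes_fake = votes_organic = 0
    for bigram, feature, threshold, direction in ideal_set:
        value = sample_features.get((bigram, feature))
        if value is None:                  # bigram-feature not present in the sample
            continue
        crossed = value > threshold if direction == "above" else value < threshold
        if crossed:
            votes_fake += 1
        else:
            votes_organic += 1
    return votes_fake > votes_organic      # more crossings than not -> deepfake
```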
7 Datasets

This section describes the datasets our technique was evaluated against as well as the process that was performed in generating our samples from the TIMIT dataset [44]. These can be found in Appendix A.
Deepfake Audio We derived our own set of synthetic TIMIT audio samples using the open-source Real-Time-Voice-Cloning (RTVC) tool from Jemine [7, 49], the most widely used publicly available deepfake generation tool. RTVC is an implementation of Tacotron 2 by Liu et al., which uses Tacotron as the synthesizer and WaveNet as the vocoder [50]. For each of our 300 TIMIT speakers, we trained an RTVC model on a concatenation of all 10 TIMIT audio recordings (approximately 30 seconds). Each RTVC model was then used to create a deepfake version of every TIMIT sentence spoken by each speaker. In total, this creates 2,986 usable synthetic audio samples of our 300 original speakers. The 14 missing audio samples were too noisy for Gentle to process and were thus unable to be used in our experiments.

Additionally, we contacted several commercial companies with deepfake generation tools in an attempt to test our technique against other systems. Most of these companies never returned our requests to use their products in our research. The few companies that responded would only give us extremely limited access to their product after purchase. Their restrictions would have limited us to at most 5 different speakers, compared to the 300 speakers present in the TIMIT deepfakes we generated. We, therefore, took the largest available of such datasets, published by Lyrebird [51], and evaluated it. We note that the generation of these samples is black box and represents a reasonable test against unknown models.

Feature Extraction and Evaluation Sets To evaluate and test our technique, we subdivided both the organic and deepfake TIMIT samples into a feature extraction set (51 speakers) and an evaluation set (249 speakers). The feature extraction set is used to determine the ideal bigram-feature pairs and their corresponding thresholds k using the ideal feature extractor outlined in Section 6.3.1. Conversely, the evaluation set is used to evaluate the efficacy of our technique. Both datasets contain all of the organic and deepfake audio samples for their respective speakers. Our security model (Section 5) assumes that no knowledge of a speaker is known to the defender. As such, both sets were selected so that they did not share any speakers. This demonstrates that our technique is extracting useful features that are inherent to deepfake audio as a class, rather than features specific to the deepfake of an individual speaker. This captures a stronger threat model, as we do not have any information about the speaker who will be impersonated.

The feature extraction set contains 1,020 audio files from 51 speakers, which contain a total of 702 unique bigrams. Of these, 510 audio files are deepfake samples and 510 audio files are organic. The evaluation set consists of 4,966 audio files from 249 speakers, which contain 835 unique bigrams. Of these, 2,476 audio files are deepfake samples and 2,490 audio files are organic. It is important to note that our evaluation set is five times as large as our feature extraction set. We used a smaller feature extraction set and a larger evaluation set to showcase the efficiency of our technique. Traditional ML models require large datasets, orders of magnitude larger than what is used here, to learn from and capture data intricacies that improve the model's generalization [6]. Generating large datasets of deepfakes can be difficult, inherently limiting the effectiveness of an ML-based detector. In contrast, our technique does not require a large dataset to learn from since we leverage the knowledge of human anatomy. As a result, our technique requires a significantly smaller dataset to learn from while still being able to generalize over a much larger evaluation set.

8 Evaluation

In this section, we discuss the performance of our deepfake detection technique and explain the results.

8.1 Detector Performance

We first need to find the ideal feature set using the process detailed in Section 6.3.1. The feature extraction dataset was used to find the set of ideal features, which consisted of 865 bigram-feature-threshold triples.

To evaluate the performance of our detector, we classified all the audio samples in the evaluation dataset. To do this, we concatenated all the sentences for each speaker together to form a single audio sample. We then ran each audio sample through our whole sample detection phase outlined in Section 6.3.2. Overall, we extracted and compared 12,525 bigram-feature pairs to the values found in our ideal feature set. Finally, our detector was able to achieve 99.9% precision, 99.5% recall, and a false-positive rate (FPR) of 2.5% using our ideal feature set.

8.2 Bigram Frequency Analysis

We now explain why the detector performed so well by analyzing the bigram results. The 865 bigram-feature pairs of the ideal feature set came from 253 distinct bigrams that had 3.4 features on average within the set. These bigrams make up approximately 30.9% of the 820 unique bigrams present in the TIMIT dataset we tested. Since TIMIT is a phonetically balanced dataset, it accurately represents the distribution of phonemes in spoken English. In Figure 8, we show the 30 most common bigrams in both the TIMIT dataset and our ideal feature set. While most of the bigrams in the ideal feature set are not in the top 30 bigrams, the total ideal feature set
[Figure 10, panel a) plot: estimated cross-sectional area (cm^2) across the tube dimensions 0-14.]

Figure 10: a) The cross-sectional area estimates output by the transfer function for bigram "/d – oU/" (pronounced "doh"). b) The approximate vocal tracts used to create each of the datasets. c) An anatomical approximation of a deepfaked model (bottom), which no longer represents a regular human vocal tract (top) and instead is approximately the dimensions of a drinking straw. This inconsistency is prevalent across more than 350 observed bigrams.
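As a toy illustration of the inconsistency shown in Figure 10, the check below flags estimated cross-sections that fall outside a plausible human range. The bounds and tolerance are illustrative placeholders chosen for this example; they are not thresholds used by our detector.

```python
import numpy as np

def anatomically_plausible(areas_cm2, lower=0.5, upper=20.0, max_violations=0.2):
    """Flag a vocal tract estimate whose cross-sectional areas (cm^2) look
    more like a narrow uniform pipe (e.g., a drinking straw) than a human
    vocal tract. Bounds and tolerance are illustrative only."""
    areas = np.asarray(areas_cm2, dtype=float)
    outside = np.mean((areas < lower) | (areas > upper))
    return outside <= max_violations

# Example: a human-like area profile versus a straw-like profile.
human_like = [2.1, 3.5, 6.0, 9.5, 12.0, 8.0, 5.5, 4.0,
              3.0, 2.5, 2.0, 1.8, 1.5, 1.2, 1.0]
straw_like = [0.3] * 15
print(anatomically_plausible(human_like), anatomically_plausible(straw_like))  # True False
```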
audio sample generated by Lyrebird by using an ideal feature set that was sensitive to the Lyrebird model. We believe that this indicates that the RTVC and Lyrebird deepfake generation models are failing to mimic human acoustics in different ways. It, therefore, appears that the ideal features extracted from one deepfake generation model will not necessarily apply to other models. However, the lack of overlapping ideal features between models can potentially be circumvented by having a defender check all bigrams within an audio sample. This would allow a defender to practically check all possible ideal feature sets simultaneously. However, the thoroughness provided by checking all bigrams will result in a considerable increase in the processing time for the detector and would potentially require a different process for whole sample detection than what was presented in Section 6.3.2. We leave the further exploration of the concept of an all-bigram processing method to future work. To conclude, a defender who is concerned about detecting previously unknown deepfake generation models will likely not be able to benefit from the performance increases provided by the creation of an ideal feature set. Furthermore, they will likely need to rely on a different metric than majority voting when evaluating the set of all bigrams within the sample.

9 Discussion

9.1 Limitations

Acoustic Model While our acoustic modeling can process all phonemes for a given speech sample, the pipe series are only anatomically correct for the vocal tract while the speaker is pronouncing a vowel. This means that our technique is less accurate when processing non-vowel phonemes. That being said, vowels make up 38% of all phonemes, meaning most bigrams should contain at least one vowel phoneme. Therefore, our use of bigrams also helps to minimize the number of samples that must be processed.

Preprocessing During the preprocessing stage of our pipeline, we use Gentle to automatically timestamp the audio files according to their words and phonemes. Gentle requires sample transcriptions, which we generate using the Google Speech API. Thus the accuracy of the timestamps (and the following stages of the pipeline) is directly tied to the accuracy of Gentle and the Google Speech API. While some phonemes are only a few milliseconds long, Gentle's precision is to the nearest hundredth of a second. This forces Gentle to overestimate the timestamps for short phonemes, which introduces rounding errors. The use of bigrams helped to mitigate this problem, since using pairs gave us more appropriate target lengths for Gentle's precision levels.

The noisiness of synthetically generated audio can also cause mistranscriptions in the Google Speech API. However, the mistranscriptions are usually phonetically similar to the correct ones. As a result, Gentle's timestamps will contain little error. This limits any major impact that a mistranscription could have on our results.

Data Access There do not exist many large publicly available corpora of deepfake audio samples or generation tools. While we would have liked to test our technique against a larger variety of samples, this was not possible. Our dataset is limited to the data and tools that are currently publicly available.
[Figure panels: e) Audio Processing/Metadata Generation, f) Vocal Tract Estimation, g) Average Error across VT from Organic, h) Pre-Calculated Organic VT Estimation.]
A ASVspoof Dataset
We explored the potential use of the ASVspoof 2019 dataset to evaluate our deepfake detection technique. The ASVspoof 2019 dataset
contains a collection of synthetically modified audio samples, none
of which are actual deepfakes. Instead, these audio samples are used
for speaker verification tasks, such as voice authentication. While
our algorithm can still detect these audio samples, they should not
be used for evaluating deepfake detection algorithms. This dataset
was not designed for this task.
We ran the full dataset against our approach, which required over
1,400 hours of processing time. However, we noticed that such tests
produced a very high word error rate (WER) of 0.45, meaning that nearly half of all words were transcribed incorrectly. Upon manual listening tests, we found these audio samples sounded very robotic, resulting in poor transcriptions. The lower quality of the audio was therefore the source of these failures, and the high WER itself served as an effective filter. Further investigation revealed that, contrary to popular belief, ASVspoof 2019 is not a deepfake audio dataset:
the maintainers of this challenge note that deepfake detection is
a separate challenge, and have identified it as such in their yet-to-
be-released 2021 dataset. Even though our preprocessing stage can
detect these audio samples as being abnormal due to the high WER,
they were never intended to be used for deepfake detection and
instead target the related problem of automatic speaker verification
(ASV) (e.g., authentication), hence the name of the challenge.
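The WER screening described above can be sketched as a standard word-level edit distance between the reference sentence and the returned transcript, with samples above a cutoff flagged as too degraded to process. The 0.4 cutoff below is illustrative; the 0.45 figure quoted above is the rate we observed on ASVspoof 2019, not a tuned threshold.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions + insertions + deletions)
    divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

def too_degraded(reference: str, transcript: str, threshold: float = 0.4) -> bool:
    """Illustrative filter: reject samples whose transcription quality is too
    poor for reliable phoneme alignment."""
    return word_error_rate(reference, transcript) > threshold
```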
B Phrases
The TIMIT phrases that were converted into our deepfake dataset:
1. Cattle which died from them winter storms were referred to as
the winter
2. The odor here was more powerful than that which surrounded
the town aborigines. (si1077)
3. No, they could kill him just as easy right now. (si1691)
4. Yet it exists and has an objective reality which can be experi-
enced and known. (si654)
5. I took her word for it, but is she really going with you? (sx395)