
Introduction to Audio Deepfake Generation: Academic Insights for Non-Experts

Jeong-Eun Choi ([email protected]), Karla Schäfer ([email protected]), Sascha Zmudzinski ([email protected])
Fraunhofer SIT | ATHENE, Darmstadt, Germany

ABSTRACT
With the advancement of artificial intelligence, the methods for generating audio deepfakes have improved, but the technology behind them has become more complex. Despite this, non-expert users are able to generate audio deepfakes due to the increased accessibility of the latest technologies. These technologies can be used to support content creators, singers, and businesses such as the advertisement or entertainment industries. However, they can also be misused to create disinformation, CEO fraud, and voice scams. Therefore, with the increasing demand for countermeasures against their misuse, continuous interdisciplinary exchange is required. This work introduces recent techniques for generating audio deepfakes, with a focus on Text-to-Speech Synthesis and Voice Conversion for non-experts. It covers background knowledge, the latest trends and models, as well as open-source and closed-source software to explore both technological and practical aspects of audio deepfakes.

CCS CONCEPTS
• Security and privacy → Spoofing attacks; • Computing methodologies → Machine learning; • General and reference → Surveys and overviews.

KEYWORDS
Audio Deepfakes, Attacks, Disinformation, Voice Conversion, Text-to-Speech Synthesis

ACM Reference Format:
Jeong-Eun Choi, Karla Schäfer, and Sascha Zmudzinski. 2024. Introduction to Audio Deepfake Generation: Academic Insights for Non-Experts. In 3rd ACM International Workshop on Multimedia AI against Disinformation (MAD '24), June 10–14, 2024, Phuket, Thailand. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3643491.3660286

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike International 4.0 License. MAD '24, June 10–14, 2024, Phuket, Thailand. © 2024 Copyright held by the owner/author(s). ACM ISBN 979-8-4007-0552-6/24/06. https://doi.org/10.1145/3643491.3660286

1 INTRODUCTION
The quality of audio deepfakes is improving rapidly due to advancements in Artificial Intelligence (AI) and machine learning. This has led to the mass production of highly convincing speeches that are indistinguishable from real recordings. Unfortunately, these manipulated or synthesized speeches are increasingly being used for malicious purposes, such as producing disinformation materials that impersonate the voice of a US president1. This can, for example, have the effect of discouraging people from voting.
Audio deepfakes are a growing concern in different sectors, and interest in these technologies has increased due to recent advancements. However, their complexity makes them difficult for non-experts to understand, leading to uncertainties when interpreting audio materials. Furthermore, in order to effectively combat disinformation and use audio deepfake detectors appropriately, practitioners require a fundamental understanding of the technology and awareness of the current state of the art.
This paper aims to explain the technical advances in audio deepfake generation for non-experts. It begins with a background section that provides a brief explanation of relevant terminologies and fundamental concepts related to audio processing and audio deepfakes. It then provides a comprehensive overview of the trends in academia and state-of-the-art models for generating audio deepfakes by observing two main approaches: text-to-speech synthesis (TTS) and voice conversion (VC), seeking to equip readers with a deeper understanding of the underlying principles driving the generation of audio deepfakes. Additionally, this paper explores open-source tools and closed-source software for audio deepfake generation.

1 Biden Robocall: https://edition.cnn.com/2024/02/07/politics/biden-robocall-texas-strip-mall-invs/index.html

2 BACKGROUND INFORMATION
Audio deepfakes are audio clips that have been modified or entirely generated using deep learning. They are often used to replicate the voice of a specific individual, which can be misused for impersonation. This paper exclusively focuses on audio deepfakes created using deep learning technology. It is worth noting that similar concepts exist, such as "synthetic speech" for entirely synthesized speech and "spoofed audio" for both modified and synthesized speech, with or without the use of deep learning.

2.1 Advances in Audio Processing
With the advances in deep learning, particularly the rise of neural network architectures such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs), significant strides have been made in the realm of audio generation. One of the breakthroughs in audio deepfake technology is the application of generative adversarial networks (GANs). GANs enable the training of generative models through a competitive process between a generator and a discriminator, resulting in the creation of synthetic data.


In the context of audio deepfakes, GANs have been used to produce speech with remarkable similarity to the target speaker, enabling the replication of their voice characteristics, intonation, and mannerisms [67].
In addition to the various neural network architectures, there have been significant advancements in the representation of audio data. The challenge is to identify an efficient and concise representation that can be utilised for deep learning. Typically, the audio data is first transformed into visual data, such as mel-spectrograms, which are then used as input for training a model, specifically a neural network. As information may be lost during this process, there are still approaches and models that use the raw audio data as input [46].
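As an illustration of this representation step, the following sketch computes a mel-spectrogram from a waveform with the librosa library; the file name and parameter values are only examples.

```python
import librosa
import numpy as np

# Load a speech recording (path is a placeholder) and resample to 22.05 kHz.
waveform, sr = librosa.load("speech_sample.wav", sr=22050)

# Short-time Fourier transform followed by a mel filter bank:
# the result is a 2D array (mel bands x time frames) that can be
# fed to a neural acoustic model much like an image.
mel = librosa.feature.melspectrogram(
    y=waveform, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)

# Networks are usually trained on the log-compressed spectrogram.
log_mel = librosa.power_to_db(mel, ref=np.max)
print(log_mel.shape)  # e.g. (80, number_of_frames)
```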

2.2 Vocoders
Independent of whether VC or TTS is used to fake a person's identity in audio, a vocoder is mostly used in the generation process. Here, the vocoder converts a speech/feature representation into its waveform. Vocoders can be divided into two different categories: traditional non-AI models (e.g., Griffin-Lim [17], WORLD [45]) and neural vocoders, which can be further divided into autoregressive models (e.g., WaveNet, WaveRNN [20]), GAN-based models (e.g., MelGAN [30], Parallel WaveGAN) and diffusion-based models (e.g., WaveGrad [10], DiffWave [29]).
WaveNet [48] is a well-known autoregressive end-to-end vocoder that leverages convolutions to generate the waveform autoregressively. An important aspect of its architecture is that dilated convolution is applied to reflect long dependencies between prior samples when predicting the next one. With this technique, input samples are skipped so that a large area of input is covered without increasing the number of parameters. However, WaveNet suffers from slow inference speed and, as a result, alternative vocoders such as WaveGlow [55] and Parallel WaveGAN [80] have been developed to achieve faster inference. WaveGlow utilizes flow-based generative modeling, while Parallel WaveGAN combines WaveNet as a generator in a non-autoregressive framework.
Another vocoder frequently used in recent models is HiFi-GAN [26]. It has a multi-receptive field fusion generator that can process different patterns of various lengths in parallel, and multiple discriminators that capture structures from different parts of an input audio at different periods.
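To make the role of a vocoder concrete, the following sketch inverts a mel-spectrogram back to a waveform with librosa's classical Griffin-Lim implementation; it assumes the `log_mel` variable from the previous sketch and uses illustrative parameter values. Neural vocoders replace exactly this step with a learned model.

```python
import librosa
import soundfile as sf

# Undo the log compression to recover a power mel-spectrogram.
mel = librosa.db_to_power(log_mel)

# librosa first maps the mel-spectrogram back to a linear spectrogram
# and then runs the iterative Griffin-Lim phase reconstruction.
waveform_hat = librosa.feature.inverse.mel_to_audio(
    mel, sr=22050, n_fft=1024, hop_length=256, n_iter=60
)

sf.write("reconstructed.wav", waveform_hat, 22050)
```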
2.3 Evaluation Metrics for Synthetic Speech
The evaluation metrics for synthetic speech can be divided into objective metrics (OM) and subjective metrics (SM). The objective evaluation metrics are inexpensive, but do not match human perception well [72]. Therefore, there are approaches like RAMP (Retrieval-Augmented MOS Prediction) [75], which try to predict subjective metrics such as the Mean Opinion Score (MOS) automatically.
The Mean Opinion Score (MOS) is an SM that captures signal quality by presenting the synthetic recording to subjects. However, it is questionable whether evaluators' perceptions of naturalness (MOS) or speaker similarity (SMOS) are consistent or comparable. Several studies discuss different aspects of MOS. Camp et al. [7] showed that most papers on speech synthesis do not mention details of the subjective evaluation, such as the geographic and linguistic backgrounds of the evaluators or other details about the MOS test settings, which can impact the resulting MOS. Cooper and Yamagishi [16] observed that MOS tests are influenced by the range-equalizing bias, which means that listeners tend to use the entire range of scoring options available to them. In other words, if two systems are close in quality, it is necessary to keep the overall range of quality in the test small in order to observe significant differences in MOS.
Word Error Rate (WER) and Character Error Rate (CER) are OMs commonly used in automatic speech recognition (ASR). In TTS and VC, the text is extracted from the synthesised speech by an ASR system and compared with the original or source text. The scores represent the intelligibility of the synthesised speech.
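As a concrete illustration, the following sketch computes WER as the word-level edit distance between a reference transcript and an ASR transcript of the synthesised speech, normalised by the reference length; the two example sentences are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: the ASR system misrecognised one of four words -> WER = 0.25.
print(word_error_rate("the voice sounds natural", "the voice sounds neutral"))
```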

2.4 Datasets
Although datasets are important for developing VC or TTS models, this paper does not extensively cover the discussion of datasets due to the limited space. Instead, it primarily focuses on providing a brief description of the CSTR VCTK Corpus2 as one of the most used datasets for training VC and TTS models. For a more detailed discussion on datasets, it is advisable to refer to other overview papers that explore corresponding datasets [6, 41, 66].
The CSTR VCTK dataset is monolingual, containing 110 English speakers talking in various accents. Each speaker was asked to read out about 400 sentences, selected from newspapers, the rainbow passage and an elicitation paragraph from the speech accent archive. As a result, the dataset consists of 44,242 voice samples (44 hours of utterances) from 110 speakers, of whom 47 are male, 61 are female, and 2 speakers did not specify their gender. On average, voice recordings of 24 minutes duration are provided for each speaker, in which each of them reads a different set of sentences [58]. Along with the audio recordings, transcriptions of the spoken text were also made available.

2 CSTR VCTK: https://datashare.ed.ac.uk/handle/10283/3443
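For readers who want to inspect the corpus themselves, a common way to obtain it is through the torchaudio dataset wrapper sketched below; the class name, return tuple and download path reflect our understanding of recent torchaudio releases and may differ between versions.

```python
import torchaudio

# Downloads VCTK 0.92 to ./data on first use; this assumes the
# torchaudio.datasets.VCTK_092 wrapper is available in your version.
vctk = torchaudio.datasets.VCTK_092(root="./data", download=True)

# Each item is assumed to be a tuple of
# (waveform, sample_rate, transcript, speaker_id, utterance_id).
waveform, sample_rate, transcript, speaker_id, utterance_id = vctk[0]
print(speaker_id, sample_rate, transcript)
```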

3 AUDIO DEEPFAKE GENERATION
Methods for generating audio deepfakes can be categorised into two types: Text-to-Speech (TTS) and Voice Conversion (VC). TTS uses a written text as input and generates speech in the target voice. VC, on the other hand, uses a recording of one speaker's voice as input and reproduces the content in the voice of another, the target speaker. Both TTS and VC produce speech, but differ in their input during the generation and training process. However, there are technical overlaps between TTS and VC, such as the use of a vocoder for audio processing. This section will briefly present the general frameworks found in VC and TTS.

3.1 Text-To-Speech Synthesis (TTS)
TTS uses written text and audio recordings of a target speaker as input. The content of the text is then reproduced in the target speaker's voice. In addition to naturalness, expressiveness is an important factor that distinguishes synthetic speech from human speech. Many factors influence the expressiveness of a synthetic voice, including content, timbre, phonation, style, emotion, and others. Expressive TTS requires a mapping that matches voice variants to a selection of text in terms of pitch, volume, time, and speaker accent [39].

Figure 1: Overview of Text-to-Speech Processes (based on [67]).

In Figure 1, different synthesis processes of TTS models are presented. In the first step, the text analysis, the text is broken down into its linguistic features, such as characters and phonemes. These linguistic features are then converted into acoustic features, such as spectrograms, by an acoustic model, such as Tacotron 2 [64] or Deep Voice [1], that has been trained on one or more target speakers. The acoustic features are then converted into a waveform by a vocoder. Fully end-to-end TTS models, such as VITS [24], can generate speech waveforms from characters or phoneme sequences directly [67].
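The pipeline just described can be summarised as three functions applied in sequence. The sketch below is purely schematic: the three stage functions are hypothetical placeholders standing in for a grapheme-to-phoneme front end, a trained acoustic model such as Tacotron 2, and a neural vocoder such as HiFi-GAN.

```python
from typing import List
import numpy as np

def text_analysis(text: str) -> List[str]:
    """Stage 1 (placeholder): text normalisation and grapheme-to-phoneme
    conversion, producing a phoneme sequence."""
    return text.lower().split()  # real systems use a G2P model here

def acoustic_model(phonemes: List[str]) -> np.ndarray:
    """Stage 2 (placeholder): a trained network (e.g. Tacotron 2) mapping
    phonemes to a mel-spectrogram in the target speaker's voice."""
    n_frames = 20 * len(phonemes)
    return np.zeros((80, n_frames))  # 80 mel bands x time frames

def vocoder(mel: np.ndarray) -> np.ndarray:
    """Stage 3 (placeholder): a neural vocoder (e.g. HiFi-GAN) turning
    the mel-spectrogram into a waveform."""
    return np.zeros(mel.shape[1] * 256)  # hop length of 256 samples

# Two-stage TTS: text -> phonemes -> mel-spectrogram -> waveform.
# End-to-end models such as VITS collapse stages 2 and 3 into one network.
speech = vocoder(acoustic_model(text_analysis("audio deepfakes are generated speech")))
print(speech.shape)
```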
TTS model types can be distinguished by the amount of training data required from the target speaker. One example is multi-speaker TTS, which can create the voices of several target speakers. Another type is zero-shot multi-speaker TTS, where each target voice can be imitated using only a few minutes, sometimes just a few seconds, of audio recordings.

3.1.1 Trends in TTS. In neural network-based TTS, models can be divided into two categories: autoregressive models (AR) or non-autoregressive models (NAR). The strength of autoregressive models is that they implicitly learn alignments between characters/phonemes and mel-spectrograms through attention. However, attention led to word skipping and repeating or attention collapse, which are caused by mismatched alignments. Therefore, attention is now being replaced by duration prediction, which aims to improve the performance of AR models. However, AR models face further problems such as slow training and inference speed, and bias and error propagation during autoregressive generation. To solve these issues, NAR models have been proposed, which can handle error propagation as well as training and inference speed better than AR models. Current NAR models apply alignment either through attention or through duration prediction, and further research is required to find a good pre-training task for audio models.
One limitation of NAR models is the over-smoothing effect. This means that the predicted mel-spectrograms are averaged, resulting in a voice that sounds flat and monotone. To overcome this problem, Ren et al. [61] proposed that the over-smoothness is closely related to the gap between the complexity of the data distribution and the capability of the modelling methods. Moreover, they pointed out that TTS is a multimodal task where the data points are dependent on each other. Since TTS is trying to map text to a speech sequence, they suggested simplifying the data distribution to a conditional distribution, which is also used in autoregressive models, where the next mel-spectrogram is predicted given the text and the current frame.
In the context of impersonation attacks, it is important to consider studies on expressive TTS and adaptive TTS. Expressive TTS emphasizes the need to synthesize intelligible and natural speech with a large variety. Therefore, TTS should be able to take a single text and transform it into different variations of synthesized speech. From this perspective, TTS is a one-to-many task that requires disentangling, controlling, and transferring variations of information. While some approaches rely on explicit information labelled by humans, more recent approaches use implicit embeddings that can contain various types of information or that can model such variance information using techniques such as variational autoencoders.
Adaptive TTS emphasizes the capability to synthesize the voices of different speakers. This task of changing the voice of synthesized speech is also called voice adaptation, voice cloning or custom voice [68]. Thus, adaptive TTS requires some kind of generalization where new speakers or styles can be adapted, while it also requires an efficient adaptation method. For the former, conditional modelling [25] or increasing the diversity of training data, such as multilingual data [82] or data augmentation [15], is proposed to prevent overfitting so that the model can easily adapt to an unseen speaker. For the latter, the aim is to adapt a new style of voice with few data (low resource) using few-shot or zero-shot [74, 77] approaches as well as untranscribed data [25], thus speech data without a transcript. In one-shot TTS, only one reference utterance of the target's voice is given, and the target speaker is unseen during training. Zero-shot can be considered here as an intensified form of one-shot, where only a few seconds of speech from a target speaker are given.

3.1.2 Recent TTS models. This section lists and discusses some of the well-known models and the latest models for TTS, which also reveals the recent research trend in TTS. An overview of the models presented below can be found in Table 1.


Table 1: Overview of TTS Models from Academia (all from the year 2023)

Type | Name (Vocoder) | Structure | Dataset | MOS | Models compared (MOS)
AR-TTS, few-shot | Adaptermix (WaveNet) [40] | TransformerTTS + Adapters | LibriTTS, VCTK | 3.33 (VCTK, 1 minute of target speech) | Finetuned TransformerTTS (VCTK: 3.45); Single Adapter (VCTK: 2.82)
NAR-TTS | VITS2 (HiFi-GAN) [28] | VITS | LJSpeech, VCTK | 4.47 (LJSpeech), 3.99 (VCTK) | VITS (LJSpeech: 4.38, VCTK: 3.79)
NAR-TTS | Eden-TTS (HiFi-GAN) [38] | new architecture | LJSpeech | 4.32 (LJSpeech) | Tacotron2 (LJSpeech: 4.47); Glow-TTS (LJSpeech: 4.09); EF-TTS (LJSpeech: 4.02)
NAR-TTS | VITS + XPhoneBERT (HiFi-GAN) [69] | modified VITS | LJSpeech, Viet TTS dataset (V-TTS) | 4.14 (LJSpeech), 3.89 (V-TTS) | VITS (LJSpeech: 4.00, V-TTS: 3.74)
diffusion model, few-shot | UnitSpeech (HiFi-GAN) [22] | modified Grad-TTS [53] + HuBERT [19] | LibriTTS, VoxCeleb2 | 4.13 (LJSpeech), 4.26 (VoxCeleb2) | Guided-TTS 2, zero-shot (LJSpeech: 4.10); YourTTS (LJSpeech: 3.57); DiffV (LJSpeech: 3.97); YourTTS (VoxCeleb2: 3.88)

Demos: VITS2: https://vits-2.github.io/demo/; Eden-TTS: https://edenynm.github.io/edentts-demo/; UnitSpeech: https://unitspeech.github.io/

Tacotron 2 [65] is one of the earliest AR models with attention and uses a modified WaveNet as vocoder. This model is also frequently applied within VC models. FastSpeech [60] uses a transformer architecture to generate mel-spectrograms and synthesize speech from mel-spectrograms in parallel, and is thus NAR. It is trained for phoneme duration prediction and enabled faster inference compared to AR models. FastSpeech 2 [59] improves several weaknesses of the previous model by removing the teacher-student distillation pipeline due to its complexity and by introducing more varying information of speech as conditional inputs to improve the resulting speech quality. Glow-TTS [23] is another NAR model that removed the external aligner and finds the alignment between text and the latent representation of speech on its own. VITS [24] is an end-to-end NAR TTS model that uses a variational autoencoder to connect the acoustic model and the vocoder. VITS is frequently used as a baseline for evaluation, and other VC models adopted the VITS framework within their models (e.g., FreeVC [31]). YourTTS [8] is a model that builds on VITS with the capability of zero-shot multi-speaker TTS as well as zero-shot VC.
Recent studies on TTS focus on expressivity and adaptivity of TTS models. For example, some papers on expressive TTS aim to incorporate appropriate emotions and natural laughter into dialogue systems [43, 63, 79]. For adaptive TTS, there is a strong focus on efficiently adapting to low-resource data of various accents or languages [3, 14], including zero-shot [52, 70] approaches. Additionally, there is a growing interest in training TTS models with restricted data, such as synthesizing speech by using data of a different language than the target language [81] or by using untranscribed data, thus only speech without a transcript [22]. This has led to the exploration of multilingual models in the field of adaptive TTS.
Another emerging topic in TTS is the controllability and multimodality of speech synthesis. For example, Liu et al. [34] propose using text to control style transfer in TTS, and Luong and Yamagishi [37] use labels to control human vocalization generation. As for multimodality, there are studies that aim to edit or reconstruct missing speech in video by using visual information extracted from the video as guidance [13, 42].
To enhance speaker adaptation in TTS or other downstream tasks like speaker verification or recognition, several papers proposed to apply adapters for improving the efficiency of learning new speakers, such as Adaptermix [40]. By using adapters, the entire model does not need to be finetuned and only the adapter parameters require adjustment. Adaptermix uses a mixture of adapters to capture multiple types of information about the speaker that can be used for a large number of speakers, instead of training an adapter for each speaker as proposed by [44]. Moreover, Adaptermix requires only one minute of data for a new speaker and finetunes the adapter parameters to reach MOS scores comparable to a fully finetuned model.
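To illustrate the adapter idea in general terms, the sketch below shows a typical bottleneck adapter block in PyTorch: a small down-/up-projection with a residual connection inserted into a frozen TTS backbone, so that only the few adapter parameters are trained for a new speaker. This is a generic illustration, not the exact Adaptermix architecture.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Small residual adapter inserted after a frozen transformer layer."""

    def __init__(self, hidden_dim: int = 256, bottleneck_dim: int = 32):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection: the frozen backbone output is only nudged.
        return x + self.up(self.act(self.down(x)))

# Freeze the backbone and train only the adapter parameters.
backbone = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
for p in backbone.parameters():
    p.requires_grad = False

adapter = BottleneckAdapter(hidden_dim=256)
optimizer = torch.optim.Adam(adapter.parameters(), lr=1e-3)

hidden = torch.randn(2, 100, 256)       # (batch, frames, features)
adapted = adapter(backbone(hidden))      # only the adapter learns
print(adapted.shape)
```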
VITS2 [28] improved VITS by applying a stochastic duration predictor trained through adversarial learning to increase the efficiency of duration prediction. A small transformer block is also added to capture not only adjacent but also long-term dependencies. A speaker-conditioned text encoder allows the model to handle multiple speakers, which improved speaker similarity in multi-speaker settings.
Eden-TTS [38] represents an improvement over EfficientTTS, and it jointly learns duration prediction, text-speech alignment, and speech generation within a single architecture. The model utilizes a simple non-autoregressive design, which facilitates faster training and faster inference speed when compared to EfficientTTS.
UnitSpeech [22] proposes a method of finetuning a diffusion-based TTS with a small amount (from 7 to 32 seconds) of untranscribed speech. This is achieved by training a unit encoder, which enhances speaker adaptation without the need to retrain the entire model. This approach is not limited to TTS but can also be applied to VC settings.

3.2 Voice Conversion (VC)
The human voice consists of two components, the speaker-dependent and the speaker-independent component. VC attempts to separate and recombine these components in such a way that the speaker-independent component is adopted and only the speaker-dependent component is converted [66, 71]. As a result, the quality of the converted speech relies on (1) the disentanglement ability and (2) the reconstruction ability of the VC model [31]. The human voice has three attributes: language content, spectral pattern, and prosody. The spectral and prosodic features indicate the characteristics of the speaker such as the timbre, intonation (tones of syllables and speaker accent), pitch and the length of sound [78]. A source speaker's voice is converted to a target's voice by modifying its timbre but retaining the spoken content and prosodic features [76].
In Figure 2, the typical flow of a voice conversion process is displayed. This process can be divided into (1) analysis and feature extraction, (2) the generation or mapping of the source and target speakers' features through a conversion model, and (3) the reconstruction. During voice conversion, linguistic content and speaker identity features are extracted from the input samples (1); these features are then used to generate a combined feature representation, such as a spectrogram (2), which is finally converted into a waveform using a vocoder (3). Vocoders such as HiFi-GAN [27] are typically used for analysis and reconstruction.


Figure 2: Overview of the Voice Conversion Process [73]
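The sketch below mirrors this three-stage structure in PyTorch-style code: a content encoder and a speaker encoder produce embeddings that a decoder combines into a spectrogram, which a vocoder then turns into a waveform. All module classes here are hypothetical placeholders for whatever networks a concrete VC model uses.

```python
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Placeholder: maps a waveform to frame-level content features."""
    def forward(self, wav):                       # (batch, samples)
        frames = wav.unfold(1, 400, 320)          # naive 25 ms frames, 20 ms hop
        return frames.mean(dim=-1, keepdim=True).expand(-1, -1, 256)

class SpeakerEncoder(nn.Module):
    """Placeholder: maps a reference waveform to one speaker embedding."""
    def forward(self, wav):
        return torch.randn(wav.shape[0], 64)      # (batch, speaker_dim)

class Decoder(nn.Module):
    """Placeholder: combines content and speaker features into a mel-spectrogram."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(256 + 64, 80)
    def forward(self, content, speaker):
        speaker = speaker.unsqueeze(1).expand(-1, content.shape[1], -1)
        return self.proj(torch.cat([content, speaker], dim=-1))  # (batch, frames, 80)

def vocoder(mel):
    """Placeholder for a neural vocoder such as HiFi-GAN."""
    return torch.zeros(mel.shape[0], mel.shape[1] * 320)

source_wav = torch.randn(1, 32000)   # 2 s of source speech at 16 kHz
target_ref = torch.randn(1, 48000)   # 3 s reference of the target speaker

content = ContentEncoder()(source_wav)            # (1) analysis / feature extraction
speaker = SpeakerEncoder()(target_ref)
mel = Decoder()(content, speaker)                 # (2) conversion / mapping
converted = vocoder(mel)                          # (3) reconstruction
print(converted.shape)
```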

In the second part of the pipeline, during generation, the modification of voice features from the source to the target speaker is performed.
Depending on the amount of data available about the source and target speaker, VC processes can be divided into different categories. If just one set of data about a specific source and target speaker is passed to the model, only this source speaker can be mapped to the target speaker, which is called one-to-one VC. Another type of VC deals with many-to-many conversion tasks, also called multi-speaker conversion. In this case, conversions are performed only on speakers that are seen in the training corpus. Another form is any-to-many VC, which is the conversion of any speaker's source speech signal to a target speaker that is present in the training set. Any-to-one VC allows each source speaker to be converted to a single target speaker. Since training a model for a specific voice is very time- and computation-intensive and requires a lot of data from the target speaker, many practical applications choose to use models that have already been pre-trained on different voices and are also capable of generating new, previously unseen voices in one pass. For these models, only one reference utterance of the target's and source's voice is required, and the target speaker is absent from the training data. In the literature, such models are referred to as one-shot, zero-shot or any-to-any VC models [54].

3.2.1 Trends in VC. When deep learning models are used for VC, the steps, namely (1) analysis and feature extraction, (2) generation, and (3) reconstruction, are partially performed in a single step. In real-world scenarios, the availability of hours-long audio recordings is scarce. Therefore, the following discussion focuses on many-to-many or any-to-any VC models that take advantage of deep learning. Hence, these models require only a few recordings of the target speaker for successful conversion. Many-to-many VC models learn speaker-specific representations based on examples of different speakers. Sometimes, depending on the model, these approaches can also be extended to an any-to-any application.
A major problem in VC is speech representation disentanglement (SRD). The way a VC system disentangles content information can be divided into text-based and text-free VC approaches. For text-based approaches, an automatic speech recognition (ASR) model can be used to extract a content representation. Another method is the use of shared linguistic knowledge from a TTS model [31]. But these require a huge amount of annotated data for training the ASR or TTS. Therefore, text-free methods were introduced. Text-free methods learn to extract content information without the guidance of text annotation. Common text-free VC approaches use, for example, an information bottleneck, vector quantization and instance normalization for SRD [31]. However, their performance is, in general, lower in comparison to text-based approaches.
Another way to differentiate between VC models is their structure. Inspired by image style transfer in computer vision, variational autoencoders (VAEs) and generative adversarial networks (GANs) gained popularity in VC. Autoencoders are special cases of encoder-decoder models in which the input and output are the same. In VC applications, the VAE is a neural network that separates speech into two parts: speaker identity and language content. It then uses this information to perform the conversion. The VAE concatenates the target speaker's identity embedding and the source speaker's content embedding to deliver the desired sentence [73]. The converted speech quality depends on how much linguistic information can be retrieved from the latent space [32]. Autoencoders need to enforce a proper information bottleneck to reduce speaker leakage from the encoder, often suffering from low intelligibility at the cost of acceptable target speaker similarity [56, 58]. Autoencoder-based models do not need text or language-specific information at inference time, which makes them valuable for language-independent VC models [56].
A GAN consists of a generator and a discriminator that decides whether a piece of audio is genuine or fake, i.e., synthesized by the generator. Hereby, the discriminator teaches the decoder to generate speech that sounds like the target speaker [32]. GANs come with the justification that the generated data should match the distribution of the true data. However, there is no guarantee that the discriminator learns meaningful features. GAN-based VC models therefore often suffer from problems such as dissimilarity between converted and target speech or distortions in the voice of the generated speech. Moreover, GANs are very difficult to train, and they may not converge [58]. As opposed to GANs, VAEs are easier to train and require less training data. They perform self-reconstruction and maximize a variational lower bound of the output probability. However, VAEs suffer from over-smoothing. GAN-based methods address this problem by using a discriminator that amplifies this artefact in the loss function [57]. Another point is the complexity of human speech. VAE-based algorithms often require careful adjustment of the bottleneck. With a carefully adjusted bottleneck, redundant information is transferred to the general signal in a controlled manner so that the over-smoothing effect can be prevented. On the other hand, GANs themselves cannot always build a sufficiently generalized distribution that allows the generation of samples beyond the training data range. For example, StarGANv2-VC [32] does not include any-to-any conversion, and FreeVC [31] has a poor similarity MOS (SMOS) for unseen data [73]. The developers of FreeVC and HiFi-VC [21] implemented a combination of encoder-decoder and GAN models, obtaining excellent results in many-to-many conversion but worse results in SMOS for any-to-any conversion [73]. Diffusion-based models such as DiffVC [54] define another type of VC model, but are less common in the literature.


3.2.2 Recent VC models. In the following, different VC models from academia are presented. The models were selected mainly for their ability to work with little data from the target speaker, focusing on many-to-many or any-to-any conversions, high-quality audio output using MOS/SMOS as the defining metric, and/or real-time capability. Table 2 gives an overview of the models.

Table 2: Overview of VC Models from Academia. Abbreviations: M-M: many-to-many; A-A: any-to-any; A-M: any-to-many

Type | Name (Vocoder) | Year | Structure | Training data | MOS (nat/sim) | Test data
M-M | StarGANv2-VC [32] (Parallel WaveGAN) | 2021 | GAN, encoder-decoder | VCTK, JVS, ESD (ENG) | 4.02/3.86 | split of training data
M-M | DYGAN-VC [9] (Parallel WaveGAN) | 2022 | GAN, SSL | VCC2020 (ENG) | 3.83/3.92 | split of training data
A-A | DiffVC [54] (HiFi-GAN) | 2022 | autoencoder, Diff | VCTK (ENG); LibriTTS (ENG) | 4.02/3.39 | split of training data
A-A | HiFi-VC [21] (HiFi-GAN) | 2022 | GAN, encoder-decoder | VCTK (ENG) | 4.09/3.34 | split of training data
A-A | FreeVC [31] (HiFi-GAN) | 2023 | CVAE with GAN training, SSL | VCTK (ENG) | 4.06/2.83 | LibriTTS
A-A | TriAAN-VC [51] (Parallel WaveGAN) | 2023 | encoder-decoder | VCTK (ENG) | 3.45/4.16 | split of training data
A-A | kNN-VC [2] (HiFi-GAN) | 2023 | encoder-converter-vocoder, SSL | LibriSpeech (only vocoder) | 4.03/2.91 | LibriSpeech test-clean
A-M | QuickVC [18] (HiFi-GAN) | 2023 | GAN, encoder-decoder | VCTK (ENG) | 4.28/3.58 | LibriSpeech, LJ Speech

Demos: StarGANv2-VC: https://starganv2-vc.github.io/; DYGAN-VC: https://mingjiechen.github.io/dygan-vc/; DiffVC: https://diffvc-fast-ml-solver.github.io/; HiFi-VC: https://paint-kitten-d96.notion.site/HiFi-VC-demo-samples-2fbe30b894a64f7fa8bccb96f8d09540; FreeVC: https://olawod.github.io/FreeVC-demo/; TriAAN-VC: https://winddori2002.github.io/vc-demo.github.io/; kNN-VC: https://bshall.github.io/knn-vc/; QuickVC: https://quickvc.github.io/quickvc-demo/

StarGANv2-VC [32] is a GAN-based many-to-many VC model trained on 20 speakers from the VCTK dataset, 10 speakers of the JVS dataset and 10 English speakers from the Emotional Speech Dataset (ESD). The F0 model was trained on the LibriSpeech dataset. From the source speaker, the fundamental frequency is extracted from the input mel-spectrogram by an F0 network. A style encoder extracts the style code from the target speaker. Then, the encoded input of the source speaker is combined with the result of the F0 network, which is then passed to the decoder together with the style code extracted from the style encoder. The resulting mel-spectrogram is passed to the discriminator. The discriminator consists of two classifiers, one distinguishing between real/fake samples and the other classifying the domain. Each speaker is considered as a separate domain. According to the paper, the model generalizes well on any-to-many and cross-language conversion tasks, even though it was trained only on monolingual speech data. Trained with diverse speech styles, the model can convert a plain reading voice into an emotive acting voice.
DYGAN-VC [9] uses vector quantized embeddings obtained from a speech self-supervised learning (SSL) model and performs many-to-many conversion. SSL has recently achieved success in tasks such as speech recognition, speaker verification and speech conversion, showing the performance of SSL features over traditional acoustic features such as mel-spectrograms [31]. For encoding the speech to features, the VQWav2vec [4] model was used in DYGAN-VC. VQWav2vec aims to learn unsupervised speech representations that can be used for finetuning models for multiple downstream tasks. Besides VQWav2vec, other SSL models like Wav2Vec 2.0 [5] and WavLM [11] exist, which are also used in other VC models. Wav2Vec 2.0 was used, for example, in FragmentVC [33] as a component for extracting the latent phonetic structure of the utterance from the source speaker. FragmentVC is not considered further here, as the audio quality was comparatively poor (MOS similarity: 3.32; naturalness: 3.26 [33]) and no information was given on the real-time capability. WavLM, the newest SSL model of the three, was used in the VC model kNN-VC [2], which is also presented in the following.
kNN-VC [2], performing any-to-any conversion, is another VC model using an SSL model for feature representation. kNN-VC stands for k-nearest neighbours voice conversion, which is a text-free SRD approach and consists of an encoder-converter-vocoder structure. It extracts self-supervised representations of the source and reference speech using WavLM [11]. To convert the source utterance to the target speaker, each frame of the source representation is replaced with its nearest neighbour in the reference, whereby the average of its k-nearest neighbours in the matching set is calculated. Since certain self-supervised representations capture phonetic similarity, the idea is that the matched target frames would have the same content as the source. Finally, a pretrained vocoder synthesizes audio from the converted representation. As vocoder, HiFi-GAN is used and adapted to take self-supervised features as input. For this, HiFi-GAN is trained on the LibriSpeech train-clean-100 dataset [50], which consists of 40 English speakers. A MOS of 4.03 for naturalness and a similarity MOS (SMOS) of 2.91 was achieved. A major benefit of kNN-VC is that it does not require an explicit speaker embedding model. Therefore, conversions to unseen languages also achieve good results. In their cross-lingual demo, they converted German source speech to a Japanese target speaker. The conversion is intelligible despite the system having only seen English during its design and training. Baas et al. [2] also stated that with as little as five seconds of target speaker audio, they can still retain moderate intelligibility and speaker similarity.
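The core conversion step of kNN-VC can be written in a few lines: for every source frame, find its k nearest frames in the target speaker's matching set and average them. The sketch below uses random tensors in place of real WavLM features and plain Euclidean distance for simplicity, so it only illustrates the matching operation, not the exact distance measure of the original work.

```python
import torch

def knn_convert(source_feats: torch.Tensor,
                matching_set: torch.Tensor,
                k: int = 4) -> torch.Tensor:
    """Replace each source frame by the mean of its k nearest target frames.

    source_feats: (n_source_frames, feature_dim), e.g. WavLM features of the source.
    matching_set: (n_target_frames, feature_dim), features of the target speaker.
    """
    # Pairwise distances between source and target frames.
    dists = torch.cdist(source_feats, matching_set)           # (n_src, n_tgt)
    # Indices of the k closest target frames for every source frame.
    knn_idx = dists.topk(k, dim=1, largest=False).indices      # (n_src, k)
    # Average the selected target frames -> converted representation.
    return matching_set[knn_idx].mean(dim=1)                   # (n_src, feature_dim)

source = torch.randn(200, 1024)     # stand-in for WavLM features of the source
target = torch.randn(1500, 1024)    # matching set from the target speaker
converted = knn_convert(source, target, k=4)
print(converted.shape)  # a vocoder would now synthesize audio from this
```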
FreeVC [31] adopted the end-to-end framework of the TTS method VITS [24], a conditional VAE augmented with GAN training, for high-quality waveform reconstruction [73], but learns to disentangle content information without the need for text annotation. Performing any-to-any conversion, FreeVC uses the pretrained WavLM [11] model to extract linguistic features from the waveform and disentangles content information by imposing an information bottleneck on the WavLM features, extracting the content information and removing speaker information (text-free SRD). Furthermore, spectrogram-resize-based (SR) data augmentation was proposed, which distorts speaker information without changing content information, to strengthen the disentanglement ability of the model. For the distortion, the source speaker's mel-spectrogram is resized vertically and then padded or cut to the original shape. A pre-trained speaker encoder (speaker verification model) adopted from [36] was used for the target speaker's encoding, and compared to a non-pre-trained speaker encoder (LSTM-based) jointly trained with the rest of the model from scratch. During training, a discriminator is used to classify the created speech recordings as real or fake. HiFi-GAN was used as the vocoder. FreeVC creates slightly better results than kNN-VC [2]. Without the pre-trained speaker encoder, still a similarity MOS (unseen-to-unseen) of 2.78 and a naturalness of 4.02 was achieved.
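The SR augmentation can be approximated in a few lines: squeeze or stretch the mel-spectrogram along the frequency axis and then pad or crop it back to its original height, which warps speaker timbre while leaving the content largely intact. The following is a rough sketch of that idea, not FreeVC's exact implementation.

```python
import torch
import torch.nn.functional as F

def spectrogram_resize_augment(mel: torch.Tensor, ratio: float) -> torch.Tensor:
    """Vertically resize a mel-spectrogram and restore its original shape.

    mel:   (n_mels, n_frames)
    ratio: vertical scaling factor, e.g. sampled from [0.85, 1.15].
    """
    n_mels, n_frames = mel.shape
    scaled_bins = max(1, int(round(n_mels * ratio)))

    # Interpolate along the frequency axis only (bilinear over a 4D tensor).
    resized = F.interpolate(mel[None, None], size=(scaled_bins, n_frames),
                            mode="bilinear", align_corners=False)[0, 0]

    if scaled_bins >= n_mels:          # stretched -> cut back to n_mels
        return resized[:n_mels]
    # squeezed -> pad the missing top bins (here simply with the lowest value)
    pad = mel.min().expand(n_mels - scaled_bins, n_frames)
    return torch.cat([resized, pad], dim=0)

mel = torch.randn(80, 400)                       # dummy mel-spectrogram
augmented = spectrogram_resize_augment(mel, 0.9)
print(augmented.shape)                            # torch.Size([80, 400])
```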
QuickVC [18] is another VC model based on VITS, but performs any-to-many conversion. Unlike the original VITS model, QuickVC uses the inverse short-time Fourier transform (iSTFT) as part of the decoder to speed up the inference, and HuBERT-Soft [19] as part of the prior encoder to extract content information features, eliminating the need for text transcription. Speaker embeddings are extracted by a speaker encoder that is trained from scratch with the rest of the model, enabling multi-speaker VC. In addition to the prior encoder and the speaker encoder, QuickVC also consists of a posterior encoder, a decoder, and a discriminator, with the architecture of the posterior encoder and discriminator coming from VITS. As with FreeVC, QuickVC performs spectrogram-resize-based (SR) data augmentation on the speech in the training dataset so that the content encoder learns to better extract the content information. QuickVC achieved an SMOS of 3.58 and a naturalness of 4.28 with SR-based data augmentation; without data augmentation, a lower result was achieved with an SMOS of 3.21 and a naturalness of 4.03. However, according to the authors, in the case of zero-shot VC, the speaker similarity of the created voice recordings decreased significantly.
HiFi-VC [21] performs any-to-any conversion and is based on the popular HiFi-GAN vocoder, but consists of a new conditional GAN architecture capable of directly predicting a waveform from intermediate encoded features, avoiding the intermediate mel-spectrogram prediction and a separate vocoder. For this, the HiFi-GAN vocoder was adapted for a general decoding task by combining ASR-based content encoding with GAN generation. HiFi-VC uses ASR features (text-based SRD), pitch tracking, and a state-of-the-art waveform prediction model. For the target speaker, a speaker encoder using a similar architecture as in NVC-Net [47] creates speaker embeddings. The result of the speaker encoder, together with the results of the linguistic encoder and the F0 encoder of the source speech, is given to the decoder. The linguistic encoder extracts speaker-agnostic content information from the source speaker using the Conformer ASR model pretrained by NVIDIA16. As ASR is trained to extract linguistic information, it is not very accurate at capturing prosody. To overcome this limitation, an F0 encoder capturing the fundamental frequency, based on the model by BNE-Seq2seqMoL [36], is used. In HiFi-VC, the decoder and vocoder are combined into a single module, the conditional HiFi-GAN, with conditions obtained from the speaker encoder. The ASR bottleneck and F0 features directly serve as GAN input. During training, the decoder is connected to the discriminators, similar to the structure of StarGANv2-VC, but with only one type of discriminator and without the intermediate representation as mel-spectrogram. In any-to-any conversion, an MOS quality of 4.06 (similarity: 2.7) was achieved for male-to-male conversion and 4.09 (similarity: 3.34) for male-to-female conversion. With this, HiFi-VC outperformed NVC-Net [47], from which HiFi-VC was inspired, and AutoVC [58].
DiffVC [54] is a diffusion-based model with an autoencoder structure, performing any-to-any conversion. A forward diffusion gradually adding Gaussian noise to the data can be regarded as the encoder, while a reverse diffusion trying to remove this noise acts as the decoder. DiffVC is trained to minimize the distance between the trajectories of the forward and reverse diffusion processes; thus, speaking from the perspective of autoencoders, it minimizes the reconstruction error. As vocoder, HiFi-GAN is used. Based on its MOS, DiffVC outperformed AGAIN-VC [12] (naturalness: 1.87/similarity: 1.75) and FragmentVC [33] (naturalness: 1.91/similarity: 1.93). The DiffVC model trained on VCTK data with 30 reverse diffusion iterations achieved a MOS in naturalness of 3.44 and similarity (SMOS) of 2.71. DiffVC trained with the LibriTTS dataset (again 30 steps) even reached a MOS of 4.02 in naturalness and 3.39 in similarity. According to their evaluation results, taking 6 reverse diffusion iterations resulted in a real-time factor (RTF) of around 0.1 on GPU, while taking 30 steps resulted in an RTF of around 0.5.
TriAAN-VC [51] is an encoder-decoder based model that achieves any-to-any conversion while minimizing the loss of the source content with a siamese loss. Additionally, the reconstruction loss is calculated via the L1 loss between the ground-truth mel-spectrogram and the predicted mel-spectrogram. The siamese loss is calculated between the prediction obtained from the original features extracted from the raw audio and the prediction obtained when the input features are augmented by time masking. With the additional siamese loss, the robustness and consistency of the model can be improved. In particular, since time masking removes content information during training, the loss with the siamese branch makes the model robust for maintaining content information. TriAAN-VC uses Contrastive Predictive Coding (CPC) [49] features extracted through a pre-trained model as input for the model. Furthermore, to represent the pitch information of the source speaker, the log frequency (f0) is extracted in the feature extraction step (text-free SRD). The two encoders for extracting content and speaker information are connected with the decoder via a bottleneck layer. Riviere et al. [62] showed that a slight modification of the CPC pre-training extracts features that transfer well to other languages, showing the potential of unsupervised methods for languages with few linguistic resources and the possible potential of TriAAN-VC in a cross-lingual setting. TriAAN-VC achieved a MOS in naturalness of 3.45 and similarity (SMOS) of 4.16.

16 Conformer: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_conformer_ctc_large_ls
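The sketch below illustrates the general idea behind such a siamese consistency term: the same (placeholder) conversion model is run on the original content features and on a time-masked copy, and an L1 penalty pulls the two predictions together. This is a generic illustration of the training trick, not TriAAN-VC's exact loss.

```python
import torch
import torch.nn as nn

def time_mask(features: torch.Tensor, max_width: int = 20) -> torch.Tensor:
    """Zero out a random span of frames. features: (batch, frames, dim)."""
    masked = features.clone()
    width = torch.randint(1, max_width + 1, (1,)).item()
    start = torch.randint(0, features.shape[1] - width, (1,)).item()
    masked[:, start:start + width] = 0.0
    return masked

# Placeholder conversion model: content features -> mel-spectrogram frames.
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 80))

content = torch.randn(4, 200, 256)      # stand-in for CPC content features
target_mel = torch.randn(4, 200, 80)    # ground-truth mel-spectrogram

pred = model(content)
pred_masked = model(time_mask(content))

reconstruction_loss = nn.functional.l1_loss(pred, target_mel)
siamese_loss = nn.functional.l1_loss(pred, pred_masked)    # consistency term
loss = reconstruction_loss + siamese_loss
loss.backward()
print(float(loss))
```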

3.3 Open-Source Tools for TTS and VC
Besides official research in academia, open-source code can be found online that deals with the creation of audio manipulations. The voice conversion systems So-Vits-SVC and RVC are heavily discussed on the Discord server AIHub17. AIHub has 317k members as of March 2024 and is one of many online presences where people meet and discuss novel methods for creating audio manipulations. Originally, the server started with So-Vits-SVC18 (SoftVC VITS Singing Voice Conversion), which performs, according to AIHub, consistently well as long as one has a high-quality dataset. So-Vits-SVC recommends the content encoder of SoftVC [71] to extract source audio speech features. These feature vectors are fed into VITS. As vocoder, NSF HiFiGAN, from DiffSinger [35], is used. Other encoders, like ContentVec or WavLM, are also available. The target speaker must be seen during training, and at inference the speaker ID must be given for the conversion. SoftVC is an any-to-one VC model created by the same authors as kNN-VC and uses, like kNN-VC, SSL for speaker representation. With SoftVC, soft speech units were introduced, extracted by the content encoder also used in So-Vits-SVC. Discrete representations effectively remove speaker information but discard linguistic content. With the use of soft speech units, which predict a distribution over the discrete units, they try to retain more content information and improve the intelligibility and naturalness of the converted speech [71]. CPC and HuBERT [19], which is also used by QuickVC, were used to create the soft speech units.

17 AIHub Discord: https://discord.com/invite/aihub
18 So-Vits-SVC GitHub: https://github.com/svc-develop-team/so-vits-svc
According to AIHub, RVC19 is a newer AI approach which usually produces audio recordings with the same or higher quality than So-Vits-SVC, and trains to maximum quality in only a few hours. Retrieval-based Voice Conversion (RVC) is a technique that uses a deep neural network to transform the voice of a speaker into another voice. The model is based on the VITS model, an end-to-end text-to-speech system. Around 10 minutes to 1 hour of high-quality clear voice recordings (no background noise or instrumental parts) are needed for transformation. One can train an own voice model or use a pretrained one. Pre-trained models can be found on rvc-models.com (e.g. Donald Trump, Joe Biden, Vladimir Putin), AIHub Discord and Huggingface.

19 RVC Github: https://github.com/Mangio621/Mangio-RVC-Fork
ESPNet20 is an end-to-end speech processing toolkit, including text-to-speech implementations. Similarly, TortoiseTTS21 is a text-to-speech program with multi-voice capabilities. Another TTS library is coquiTTS22, containing pre-trained models in 1100 languages and tools for training new models and fine-tuning existing models, including Tacotron, Tacotron 2 and Glow-TTS, and vocoders like MelGAN, Parallel WaveGAN and WaveRNN. Nvidia NeMo is a toolkit that supports several Automatic Speech Recognition (ASR)23 models and numerous TTS models, including well-known TTS models and vocoders such as FastSpeech 2, Glow-TTS, and HiFi-GAN. Sprocket24 is a software for traditional VC systems based on a Gaussian mixture model (GMM) and a vocoder-free VC system based on a differential GMM (DIFFGMM) using a parallel dataset of the source and target speakers.

20 ESPNet GitHub: https://github.com/espnet/espnet
21 TortoiseTTS GitHub: https://github.com/neonbjb/tortoise-tts
22 coquiTTS: https://github.com/coqui-ai/tts
23 NeMo ASR: https://catalog.ngc.nvidia.com/orgs/nvidia/collections/nemo_asr
24 sprocket GitHub: https://github.com/k2kobayashi/sprocket
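As an impression of how accessible these toolkits are, the snippet below sketches voice cloning with the coquiTTS Python API as documented for its multilingual XTTS v2 model at the time of writing; the model name, reference file and call details are examples and may change between releases.

```python
# pip install TTS   (the coquiTTS package)
from TTS.api import TTS

# Load a multilingual, zero-shot voice-cloning model (downloads on first use).
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Synthesize the given text in the voice of the speaker in `reference.wav`.
tts.tts_to_file(
    text="This sentence was never spoken by the person you hear.",
    speaker_wav="reference.wav",   # a few seconds of the target voice
    language="en",
    file_path="cloned_output.wav",
)
```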
3.4 Closed-Source Software for TTS and VC
In addition, commercial software exists that can be used for the generation of audio deepfakes. One of the possibly most well-known commercial software products for audio deepfake generation to date comes from the US start-up company ElevenLabs25. With this software, TTS and VC can be performed. Additionally, at the end of April 2023, the company released a multilingual model, supporting seven new languages: French, German, Hindi, Italian, Polish, Portuguese, and Spanish. The Tagesschau26, the Spiegel27 and also some Reddit groups, like AIVoiceMemes28, have already addressed ElevenLabs and how easily voices can be cloned with it.
ResembleAI is another voice generator29, with which one can add emotions to one's own voice without new data, transform the voice to a target voice in real time and transform one's own voice into 60 different languages. Also offered by ResembleAI is an AI-based watermarking solution that can be used to mark whether a particular audio clip was generated by ResembleAI30. Other closed-source software for audio deepfake generation includes Uberduck31, performing VC and TTS, and VoiceAI32, focusing on streamers, gamers, and meeting participants.
Companies such as Amazon (Amazon AWS Polly33), Baidu (Baidu TTS34) and Google (Google Cloud TTS35) have also developed models for TTS, focusing on application in companies for customer service. There are also other, more recent closed-source TTS and VC applications, such as play.ht, murf.ai, or listnr.tec. It is very likely that more of such closed-source applications will appear in the near future. Please note that these tools were selected based on their frequency of mention in the media. We have no interests or advantages in presenting them within the context of this paper.

25 ElevenLabs: https://elevenlabs.io/
26 Cloning Voices: https://www.tagesschau.de/investigativ/swr/ki-kuenstliche-intelligenz-voice-cloning-100.html
27 Emma Watson: https://www.spiegel.de/netzwelt/web/elevenlabs-stimmengenerator-online-trolle-lassen-emma-watson-mein-kampf-vorlesen-a-780f1457-5a1c-40e0-b909-57835f89125d
28 AIVoiceMemes Tutorial: https://www.reddit.com/r/AIVoiceMemes/wiki/tutorial/
29 ResembleAI: https://www.resemble.ai/
30 PerTH – ResembleAI's Watermarker: https://www.resemble.ai/watermarker/
31 Uberduck: https://uberduck.ai/
32 VoiceAI: https://voice.ai/
33 Amazon AWS Polly
34 Baidu TTS: https://intl.cloud.baidu.com/product/speech.html
35 Google Cloud TTS: https://cloud.google.com/text-to-speech#section-11

4 CONCLUSION
This paper presents the technical and practical aspects of audio deepfakes for non-experts. It discusses the basics of audio processing, TTS, and VC, and introduces various TTS and VC models from research. The paper also presents newly published TTS models from 2023, which can be used as a starting point not only for further research to improve the adaptivity, expressivity, and efficiency of TTS models but also for interdisciplinary discussions.
The paper explores the possibilities of audio manipulation using VC and selects methods based on their real-time capability, output quality using MOS, and the smallest possible amount of data required from the target speaker. Additionally, open-source and closed-source tools are presented, as the accessibility of these tools has a great impact on generating audio deepfakes with malicious intentions.

ACKNOWLEDGMENTS
This research work was supported by the National Research Center for Applied Cybersecurity ATHENE as well as within the joint project SecMedID with the Federal Office for Information Security (BSI).


REFERENCES
[1] Sercan Ö Arık, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, Yongguo Kang, Xian Li, John Miller, Andrew Ng, Jonathan Raiman, et al. 2017. Deep voice: Real-time neural text-to-speech. In International conference on machine learning. PMLR, 195–204.
[2] Matthew Baas, Benjamin van Niekerk, and Herman Kamper. 2023. Voice Conversion With Just Nearest Neighbors. In Interspeech.
[3] Rohan Badlani, Rafael Valle, Kevin J. Shih, João Felipe Santos, Siddharth Gururani, and Bryan Catanzaro. 2023. RAD-MMM: Multilingual Multiaccented Multispeaker Text To Speech. In INTERSPEECH 2023. ISCA, 626–630. https://doi.org/10.21437/Interspeech.2023-2330
[4] Alexei Baevski, Steffen Schneider, and Michael Auli. 2019. vq-wav2vec: Self-supervised learning of discrete speech representations. arXiv preprint arXiv:1910.05453 (2019).
[5] Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems 33 (2020), 12449–12460.
[6] Anders R. Bargum, Stefania Serafin, and Cumhur Erkut. 2023. Reimagining Speech: A Scoping Review of Deep Learning-Powered Voice Conversion. arXiv:2311.08104 [cs.SD]
[7] Joshua Camp, Tom Kenter, Lev Finkelstein, and Rob Clark. 2023. MOS vs. AB: Evaluating Text-to-Speech Systems Reliably Using Clustered Standard Errors. In INTERSPEECH 2023. ISCA, 1090–1094. https://doi.org/10.21437/Interspeech.2023-2014
[8] Edresson Casanova, Julian Weber, Christopher Shulby, Arnaldo Candido Junior, Eren Gölge, and Moacir Antonelli Ponti. 2023. YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone. http://arxiv.org/abs/2112.02418 arXiv:2112.02418 [cs, eess].
[9] Mingjie Chen, Yanghao Zhou, Heyan Huang, and Thomas Hain. 2022. Efficient non-autoregressive gan voice conversion using vqwav2vec features and dynamic convolution. arXiv preprint arXiv:2203.17172 (2022).
[10] Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, and William Chan. 2020. WaveGrad: Estimating Gradients for Waveform Generation. http://arxiv.org/abs/2009.00713 arXiv:2009.00713 [cs, eess, stat].
[11] Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al. 2022. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing 16, 6 (2022), 1505–1518.
[12] Yen-Hao Chen, Da-Yi Wu, Tsung-Han Wu, and Hung-yi Lee. 2021. Again-vc: A one-shot voice conversion using activation guidance and adaptive instance normalization. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 5954–5958.
[13] Jeongsoo Choi, Minsu Kim, and Yong Man Ro. 2023. Intelligible Lip-to-Speech Synthesis with Speech Units. In INTERSPEECH 2023. ISCA, 4349–4353. https://doi.org/10.21437/Interspeech.2023-194
[14] Giulia Comini, Sam Ribeiro, Fan Yang, Heereen Shim, and Jaime Lorenzo-Trueba. 2023. Multilingual context-based pronunciation learning for Text-to-Speech. In INTERSPEECH 2023. ISCA, 631–635. https://doi.org/10.21437/Interspeech.2023-861
[15] Erica Cooper, Cheng-I Lai, Yusuke Yasuda, and Junichi Yamagishi. 2020. Can Speaker Augmentation Improve Multi-Speaker End-to-End TTS?. In Interspeech 2020. ISCA, 3979–3983. https://doi.org/10.21437/Interspeech.2020-1229
[16] Erica Cooper and Junichi Yamagishi. 2023. Investigating Range-Equalizing Bias in Mean Opinion Score Ratings of Synthesized Speech. In INTERSPEECH 2023. ISCA, 1104–1108. https://doi.org/10.21437/Interspeech.2023-1076
[17] D. Griffin and Jae Lim. 1984. Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing 32, 2 (April 1984), 236–243. https://doi.org/10.1109/TASSP.1984.1164317
[18] Houjian Guo, Chaoran Liu, Carlos Toshinori Ishi, and Hiroshi Ishiguro. 2023. QuickVC: Any-to-many Voice Conversion Using Inverse Short-time Fourier Transform for Faster Conversion. arXiv:2302.08296 [cs.SD]
[19] Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021), 3451–3460.
[20] Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. 2018. Efficient Neural Audio Synthesis. http://arxiv.org/abs/1802.08435 arXiv:1802.08435 [cs, eess].
[21] Anton Kashkin, Ivan Karpukhin, and Svyatoslav Shishkin. 2022. Hifi-vc: High quality asr-based voice conversion. arXiv preprint arXiv:2203.16937 (2022).
[24] Jaehyeon Kim, Jungil Kong, and Juhee Son. 2021. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In International Conference on Machine Learning. PMLR, 5530–5540.
[25] Sungwon Kim, Heeseung Kim, and Sungroh Yoon. 2022. Guided-TTS 2: A Diffusion Model for High-quality Adaptive Text-to-Speech with Untranscribed Data. http://arxiv.org/abs/2205.15370 arXiv:2205.15370 [cs, eess].
[26] Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. 2020. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. Advances in Neural Information Processing Systems 33 (2020), 17022–17033.
[27] Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. 2020. HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 17022–17033. https://proceedings.neurips.cc/paper_files/paper/2020/file/c5d736809766d46260d816d8dbc9eb44-Paper.pdf
[28] Jungil Kong, Jihoon Park, Beomjeong Kim, Jeongmin Kim, Dohee Kong, and Sangjin Kim. 2023. VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design. In INTERSPEECH 2023. ISCA, 4374–4378. https://doi.org/10.21437/Interspeech.2023-534
[29] Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. 2021. DiffWave: A Versatile Diffusion Model for Audio Synthesis. http://arxiv.org/abs/2009.09761 arXiv:2009.09761 [cs, eess, stat].
[30] Kundan Kumar, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brebisson, Yoshua Bengio, and Aaron Courville. 2019. MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis. http://arxiv.org/abs/1910.06711 arXiv:1910.06711 [cs, eess].
[31] Jingyi Li, Weiping Tu, and Li Xiao. 2023. Freevc: Towards High-Quality Text-Free One-Shot Voice Conversion. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1–5.
[32] Yinghao Aaron Li, Ali Zare, and Nima Mesgarani. 2021. Starganv2-vc: A diverse, unsupervised, non-parallel framework for natural-sounding voice conversion. arXiv preprint arXiv:2107.10394 (2021).
[33] Yist Y Lin, Chung-Ming Chien, Jheng-Hao Lin, Hung-yi Lee, and Lin-shan Lee. 2021. Fragmentvc: Any-to-any voice conversion by end-to-end extracting and fusing fine-grained voice fragments with attention. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 5939–5943.
[34] Guanghou Liu, Yongmao Zhang, Yi Lei, Yunlin Chen, Rui Wang, Lei Xie, and Zhifei Li. 2023. PromptStyle: Controllable Style Transfer for Text-to-Speech with Natural Language Descriptions. In INTERSPEECH 2023. ISCA, 4888–4892. https://doi.org/10.21437/Interspeech.2023-1779
[35] Jinglin Liu, Chengxi Li, Yi Ren, Feiyang Chen, and Zhou Zhao. 2022. Diffsinger: Singing voice synthesis via shallow diffusion mechanism. In Proceedings of the AAAI conference on artificial intelligence, Vol. 36. 11020–11028.
[36] Songxiang Liu, Yuewen Cao, Disong Wang, Xixin Wu, Xunying Liu, and Helen Meng. 2021. Any-to-many voice conversion with location-relative sequence-to-sequence modeling. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021), 1717–1728.
[37] Hieu-Thi Luong and Junichi Yamagishi. 2023. Controlling Multi-Class Human Vocalization Generation via a Simple Segment-based Labeling Scheme. In INTERSPEECH 2023. ISCA, 4379–4383. https://doi.org/10.21437/Interspeech.2023-1175
[38] Youneng Ma, Junyi He, Meimei Wu, Guangyue Hu, and Haojun Fei. 2023. EdenTTS: A Simple and Efficient Parallel Text-to-speech Architecture with Collaborative Duration-alignment Learning. In INTERSPEECH 2023. ISCA, 4449–4453. https://doi.org/10.21437/Interspeech.2023-700
[39] Momina Masood, Mariam Nawaz, Khalid Mahmood Malik, Ali Javed, Aun Irtaza, and Hafiz Malik. 2023. Deepfakes Generation and Detection: State-of-the-art, open challenges, countermeasures, and way forward. Applied Intelligence 53, 4 (2023), 3974–4026.
[40] Ambuj Mehrish, Abhinav Ramesh Kashyap, Li Yingting, Navonil Majumder, and Soujanya Poria. 2023. ADAPTERMIX: Exploring the Efficacy of Mixture of Adapters for Low-Resource TTS Adaptation. arXiv preprint arXiv:2305.18028 (2023).
[41] Ambuj Mehrish, Navonil Majumder, Rishabh Bhardwaj, Rada Mihalcea, and Soujanya Poria. 2023. A Review of Deep Learning Techniques for Speech Processing. arXiv:2305.00359 [eess.AS]
[42] Juan Felipe Montesinos, Daniel Michelsanti, Gloria Haro, Zheng-Hua Tan, and Jesper Jensen. 2023. Speech inpainting: Context-based speech synthesis guided by video. In INTERSPEECH 2023. ISCA, 4459–4463. https://doi.org/10.21437/Interspeech.2023-1020
[43] Hiroki Mori and Shunya Kimura. 2023. A Generative Framework for Conversa-
[22] Heeseung Kim, Sungwon Kim, Jiheum Yeom, and Sungroh Yoon. 2023. Unit- tional Laughter: Its ’Language Model’ and Laughter Sound Synthesis. In INTER-
Speech: Speaker-adaptive Speech Synthesis with Untranscribed Data. In INTER- SPEECH 2023. ISCA, 3372–3376. https://doi.org/10.21437/Interspeech.2023-2453
SPEECH 2023. ISCA, 3038–3042. https://doi.org/10.21437/Interspeech.2023-2326 [44] Nobuyuki Morioka, Heiga Zen, Nanxin Chen, Yu Zhang, and Yifan Ding. 2022.
[23] Jaehyeon Kim, Sungwon Kim, Jungil Kong, and Sungroh Yoon. 2020. Glow- Residual Adapters for Few-Shot Text-to-Speech Speaker Adaptation. http:
TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search. //arxiv.org/abs/2210.15868 arXiv:2210.15868 [cs, eess].
http://arxiv.org/abs/2005.11129 arXiv:2005.11129 [cs, eess]. [45] Masanori Morise, Fumiya Yokomori, and Kenji Ozawa. 2016. WORLD: A Vocoder-
Based High-Quality Speech Synthesis System for Real-Time Applications. IEICE
