
CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang
Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, Zhijie Yan
Speech Lab, Alibaba Group, China
{neo.dzh,sly.zsl,h.lu}@alibaba-inc.com

Abstract

Recent years have witnessed a trend in which large language model (LLM) based text-to-speech (TTS) has moved into the mainstream due to its high naturalness and zero-shot capacity. In this paradigm, speech signals are discretized into token sequences, which are modeled by an LLM with text as prompts and reconstructed by a token-based vocoder into waveforms. Obviously, speech tokens play a critical role in LLM-based TTS models. Current speech tokens are learned in an unsupervised manner, which lack explicit semantic information and alignment to the text. In this paper, we propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder. Based on these tokens, we further propose a Codec-based synthesizer for Voice generation, CosyVoice¹, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis. Experimental results show that supervised semantic tokens significantly outperform existing unsupervised tokens in terms of content consistency and speaker similarity for zero-shot voice cloning. Moreover, we find that utilizing large-scale data further improves the synthesis performance, indicating the scalable capacity of CosyVoice. To the best of our knowledge, this is the first attempt to involve supervised speech tokens in TTS models.

¹Models and code are released at https://github.com/FunAudioLLM/CosyVoice. Demos can be found at https://fun-audio-llm.github.io

1 Introduction

Text-to-Speech (TTS) technology has made remarkable strides in recent years, transitioning from robotic-sounding speech to producing voices that are nearly indistinguishable from human speakers. At the forefront of this advancement are Large Language Models (LLMs), which have been increasingly utilized in TTS systems to generate speech with a higher degree of naturalness and the ability to synthesize voices in a zero-shot fashion (Betker, 2023; Wang et al., 2023; Lajszczak et al., 2024). These LLM-based TTS models function by converting speech signals into sequences of tokens, with the LLM utilizing text as a condition to model these token sequences. A token vocoder is then employed to reconstruct the raw waveforms from the tokenized speech (Kong et al., 2020; Défossez et al., 2022).

A critical aspect of the TTS process is the representation of speech tokens. Traditionally, tokens are acquired through unsupervised learning, which may not capture explicit semantic information or align well with the corresponding text (Hsu et al., 2021; Défossez et al., 2022). Recognizing this gap, our work introduces supervised semantic tokens extracted from a multilingual speech recognition model, Whisper (Radford et al., 2023), by integrating vector quantization into the encoder. This innovation allows for more accurate semantic representation and alignment with text. Early studies have shown that quantizers with an auxiliary automatic speech recognition (ASR) loss outperform k-means clustering on the universal speech model (USM) for speech-to-text translation and ASR tasks, as demonstrated in Rubenstein et al. (2023). Additionally, Ye et al. (2024) employed Gumbel-Softmax vector quantization to extract discrete speech representations that prioritize ASR-relevant information for ASR tasks. However, the impact of these approaches on text-to-speech (TTS) remains unclear.

Furthermore, leveraging these supervised tokens, we propose CosyVoice, a scalable and efficient zero-shot TTS synthesizer. CosyVoice is comprised of an LLM for converting text into semantic token sequences and a conditional flow matching model for the subsequent synthesis of speech from these tokens.
In contrast to prior systems like TorToise TTS (Betker, 2023), which employs an LLM in conjunction with a denoising diffusion probabilistic model (DDPM) (Ho et al., 2020), CosyVoice utilizes a conditional flow matching approach, as it has been demonstrated to accelerate both training and inference compared to traditional diffusion models (Le et al., 2024). While existing methods incorporate flow matching in TTS (Le et al., 2024; Guo et al., 2023; Mehta et al., 2023; Guan et al., 2023), they often rely on phoneme duration prediction, necessitating the use of supplementary phonemizers and forced aligners. CosyVoice, however, bypasses these dependencies, offering a more direct and efficient pathway from text to speech.

Our research contributes to the field of speech generation in several novel ways:

• We are the first to integrate supervised speech tokens into TTS models, enhancing content consistency and speaker similarity in zero-shot voice cloning.
• We propose CosyVoice, a scalable zero-shot TTS synthesis system that combines an LLM for text-to-token generation with a conditional flow matching model for token-to-speech synthesis, forsaking the need for additional phonemizers and forced aligners.
• To further refine the quality of generated speech, we incorporate the x-vector (Snyder et al., 2018) into the LLM to separate the modeling of speech into semantic, speaker, and prosody components. The LLM models the semantic content and prosody, while the conditional flow matching model captures timbre and environmental information. We optimize the flow matching process with techniques such as classifier-free guidance (Ho and Salimans, 2022a), a cosine scheduler, and masked conditions.

Our experimental results demonstrate the superiority of supervised semantic tokens over unsupervised counterparts. Additionally, the scalability of CosyVoice is evidenced by improved synthesis performance when utilizing large-scale data. This work, therefore, represents a significant step forward in the development of natural-sounding, versatile TTS systems.

2 CosyVoice: A Scalable TTS Model using Supervised Semantic Tokens

As shown in Figure 1(b), our CosyVoice consists of four components, namely the text encoder, the speech tokenizer, the large language model and the conditional flow matching model. Specifically, the text encoder is used to align the semantic spaces of text and speech tokens, while the speech tokenizer is utilized to extract semantic tokens, as illustrated in Figure 1(a). We employ a large language model to learn the whole sequence of text encodings and speech tokens, reformulating TTS as an auto-regressive sequence generation problem given text as prompts. Then, as shown in Figure 1(c), a conditional flow matching model is utilized to convert speech tokens into a Mel spectrogram via a denoising process on the optimal path. To obtain a perceptible signal, the HifiGAN vocoder (Kong et al., 2020) is used to synthesize a waveform with the generated Mel spectrogram as input.
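Put together, inference follows the chain described above: text → text encoder and LLM → semantic speech tokens → flow matching → Mel spectrogram → HiFi-GAN → waveform. The sketch below only illustrates that call order; every callable name is a placeholder standing in for the corresponding component, not the released API.

```python
def cosyvoice_infer(text, prompt_wav, text_encoder, speech_llm, flow_matching,
                    hifigan, speaker_encoder):
    """End-to-end flow of Figure 1(b)/(c); all callables are placeholder stand-ins."""
    v = speaker_encoder(prompt_wav)          # x-vector style speaker embedding
    text_enc = text_encoder(text)            # align text with the speech-token space
    speech_tokens = speech_llm(text_enc, v)  # autoregressive text-to-token LM
    mel = flow_matching(speech_tokens, v)    # OT-CFM denoising to a Mel spectrogram
    return hifigan(mel)                      # vocoder reconstructs the waveform

# toy stand-ins so the sketch executes end to end
wav = cosyvoice_infer("Hope is a good thing.", b"prompt-audio",
                      text_encoder=lambda t: t.split(),
                      speech_llm=lambda enc, v: list(range(len(enc))),
                      flow_matching=lambda toks, v: [[0.0] * 80 for _ in toks],
                      hifigan=lambda mel: [0.0] * (len(mel) * 256),
                      speaker_encoder=lambda w: [0.0] * 192)
print(len(wav))
```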

2.1 Supervised Semantic Tokens for Speech

In CosyVoice, a supervised automatic speech recognition (ASR) model is employed to derive the supervised semantic speech (S³) tokenizer. The model is a finetuned version of our proprietary SenseVoice ASR model. It is trained on multilingual audio data and possesses rich audio content understanding capabilities. Different from the original ASR model, we split the encoder into two parts and insert a vector quantization layer between them. Given a Mel spectrogram X as input, it undergoes positional encoding and Encoder₁ to obtain context-aware representations H:

H = Encoder₁(PosEnc(X))   (1)

Then, a vector quantizer (VQ) is involved to obtain discrete tokens. For the hidden representation h_l at frame l, the index of the nearest embedding in the codebook C is treated as the speech token µ_l at this timestep:

µ_l = VQ(h_l, C) = arg min_{c_n ∈ C} ||h_l − c_n||₂   (2)

where ||·||₂ denotes the L2 norm. At the training stage, codebook embeddings are updated via an exponential moving average (EMA):

c_{µ_l} := α c_{µ_l} + (1 − α) h_l   (3)

where α is a pre-defined decay coefficient. The corresponding codebook embeddings of the speech tokens are used as the quantized hidden representations H̄ = {c_{µ_1}, c_{µ_2}, . . . , c_{µ_L}} and passed through the remaining encoder layers Encoder₂:

H̃ = Encoder₂(PosEnc(H̄))   (4)
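To make Eqs. (2) and (3) concrete, here is a minimal sketch of single-codebook nearest-neighbour quantization with an EMA codebook update. The shapes, the decay value and the function name are illustrative assumptions, not the released implementation.

```python
import numpy as np

def quantize_and_update(h, codebook, alpha=0.99, training=True):
    """Nearest-neighbour VQ over encoder frames (Eq. 2) with an EMA codebook
    update (Eq. 3). `h` is (L, D); `codebook` is (N, D). alpha is an assumed decay."""
    # squared L2 distance between every frame and every codebook entry
    dists = ((h[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    tokens = dists.argmin(axis=1)                 # speech tokens mu_l
    if training:
        for l, idx in enumerate(tokens):          # EMA update of the used codes
            codebook[idx] = alpha * codebook[idx] + (1 - alpha) * h[l]
    h_bar = codebook[tokens]                      # quantized representations H-bar
    return tokens, h_bar

rng = np.random.default_rng(0)
h = rng.normal(size=(200, 256))           # 200 frames of Encoder1 output (assumed dims)
codebook = rng.normal(size=(4096, 256))   # single codebook with 4,096 codes
tokens, h_bar = quantize_and_update(h, codebook)
print(tokens[:10], h_bar.shape)
```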
Figure 1: An overview of the proposed CosyVoice model. (a) demonstrates the S³ tokenizer, where dashed modules are only used at the training stage. (b) is a schematic diagram of CosyVoice, consisting of a text-to-token LLM and a token-to-speech flow matching model. S, E and T denote the "start of sequence", "end of sequence" and "turn of speech" tokens. Dashed lines indicate the autoregressive decoding at the inference stage. (c) provides an enlarged view of our flow matching model, conditioning on a speaker embedding v, semantic tokens µ, masked speech features X̃₁ and an intermediate state X_t at timestep t on the probabilistic density path.

Note that, before the remaining encoder layers, we add an extra positional encoding to enhance the temporal information. After Encoder₂, a transformer-based ASR decoder follows, predicting the posterior probability of the text labels:

P(Y|X) = ASRDecoder(H̃, Y^{Z−1})   (5)

where Y^{Z−1} represents the left-shifted text labels in the teacher-forcing training scheme.

2.2 Large Language Model for TTS

In this section, we formulate the TTS task as an auto-regressive speech token generation problem with a large language model (LLM). For the LLM, the sequence construction is the most important matter; it is constructed as follows:

[S, v, {ȳ_u}_{u∈[1:U]}, T, {µ_l}_{l∈[1:L]}, E]   (6)

S and E denote the start and end of the sequence, respectively. v is a speaker embedding vector extracted from the speech X with a pre-trained voiceprint model². The text encodings Ȳ = {ȳ_u}_{u∈[1:U]} are obtained by passing the text through a Byte Pair Encoding (BPE) tokenizer and the text encoder:

Ȳ = TextEncoder(BPE(Y))   (7)

Since text and speech tokens lie at different semantic levels, the text encoder is used to align their semantic spaces and benefit LLM modeling. A start identifier T is inserted between the text encodings and the speech tokens {µ_l}_{l∈[1:L]}, which are extracted with the supervised semantic tokenizer described in Section 2.1. At the training stage, we employ the teacher-forcing scheme, in which the left-shifted sequence is employed as the model inputs and the original sequence serves as the expected outputs. Note that only the cross-entropy losses of the speech tokens and E are considered during training:

L_LM = −(1/(L+1)) Σ_{l=1}^{L+1} log q(µ_l)   (8)

where µ_{L+1} is the "end of sequence" token E. q(µ_l) denotes the posterior probability of µ_l, which is predicted by the softmax layer following the LLM.

²Available at https://github.com/alibaba-damo-academy/3D-Speaker/tree/main/egs/3dspeaker/sv-cam++
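A minimal sketch of the sequence layout of Eq. (6) and the masked cross-entropy of Eq. (8): only the speech-token positions and the final E token contribute to the loss, while the speaker embedding, text encodings and T act purely as a prompt. Token ids, vocabulary size and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

V = 4097        # 4,096 speech tokens + 1 "end of sequence" id (assumed layout)
EOS = 4096
IGNORE = -100   # positions excluded from the loss

def build_lm_targets(num_text, speech_tokens):
    """Targets aligned with [S, v, text encodings, T, speech tokens, E] (Eq. 6).
    Prompt positions (S, speaker embedding, text, T) are masked out so that only
    the speech tokens and E contribute to the cross entropy (Eq. 8)."""
    prompt_len = 1 + 1 + num_text + 1   # S, v, text encodings, T
    targets = torch.full((prompt_len + len(speech_tokens) + 1,), IGNORE)
    targets[prompt_len:prompt_len + len(speech_tokens)] = torch.tensor(speech_tokens)
    targets[-1] = EOS
    return targets

speech = [12, 873, 99, 4001]
targets = build_lm_targets(num_text=6, speech_tokens=speech)
logits = torch.randn(len(targets), V)   # stand-in for the LLM's softmax inputs
# teacher forcing: logits at position i predict targets[i]; masked positions are ignored
loss = F.cross_entropy(logits, targets, ignore_index=IGNORE)
print(targets.tolist(), float(loss))
```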
2.3 Optimal-transport Conditional Flow Matching

In CosyVoice, an optimal-transport conditional flow matching model (OT-CFM) is employed to learn the distribution of Mel spectrograms and to generate samples from it with the generated speech tokens as conditions. OT-CFM can achieve better performance than diffusion probabilistic models (DPMs), with simpler gradients, easier training and faster generation (Lipman et al., 2023; Tong et al., 2023; Mehta et al., 2023). In continuous-time normalizing flows (CNFs), a probability density path is constructed from a prior distribution p_0(X) to the data distribution of the Mel spectrogram q(X). The probability density path is defined by a time-dependent vector field ν_t(X) : [0, 1] × R^{L*D} → R^{L*D}, which generates the flow φ_t through the following ordinary differential equation (ODE):

d/dt φ_t(X) = ν_t(φ_t(X), t)
φ_0(X) ∼ p_0(X) = N(X; 0, I)   (9)
φ_1(X) ∼ p_1(X)

where t ∈ [0, 1]. By solving the initial value problem in Eq. (9), we can approximate the speech distribution q(X) with p_1(X) and sample from it. To learn the vector field ν_t(X), we define the optimal-transport (OT) flow and force a neural network to match it by minimizing the following loss:

L_{OT-CFM} = E_{t, p_0(X_0), q(X_1)} | ω_t(φ_t^{OT}(X_0, X_1) | X_1) − ν_t(φ_t^{OT}(X_0, X_1) | θ) |   (10)

where

φ_t^{OT}(X_0, X_1) = (1 − (1 − σ)t) X_0 + t X_1
ω_t(φ_t^{OT}(X_0, X_1) | X_1) = X_1 − (1 − σ) X_0   (11)

The speaker embedding v, the speech tokens {µ_l}_{1:L}, and the masked Mel spectrogram X̃_1 are also fed into the neural network to match the vector field with learnable parameters θ:

ν_t(φ_t^{OT}(X_0, X_1) | θ) = NN_θ(φ_t^{OT}(X_0, X_1), t; v, {µ_l}_{1:L}, X̃_1)   (12)

X̃_1 is a masked version of X_1, obtained by setting continuous frames to zero from a random start point to the end. Considering that the generation process is harder at the beginning than at later steps, we involve a cosine scheduler for the timestep t:

t := 1 − cos(t π / 2)   (13)

Under the scheduled flow, there are more generation steps at the beginning.

Classifier-free guidance (CFG) has been proven to improve the generation quality of diffusion probabilistic models (Ho and Salimans, 2022b; Nichol and Dhariwal, 2021; Le et al., 2024). Therefore, we propose to adapt CFG to conditional flow matching models. At the training stage, we randomly drop the conditions Ψ = {v, {µ_l}_{1:L}, X̃_1} with a fixed probability of 0.2. In this manner, the model can learn both conditional and unconditional flows. During generation, the vector field is modified as follows:

ν̃_t(φ_t^{OT}(X_0, X_1) | θ; Ψ) = (1 + β) · ν_t(φ_t^{OT}(X_0, X_1) | θ; Ψ) − β · ν_t(φ_t^{OT}(X_0, X_1) | θ)   (14)

where β is the guidance strength, set to 0.7.
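The sketch below ties Eqs. (11)–(14) together: the cosine-warped timestep, the OT interpolation and its target vector field, random condition dropping during training, and the guided combination used at generation time. The tiny stand-in network and the σ value are placeholders for illustration only, not the paper's implementation.

```python
import math
import torch

sigma = 1e-4   # small OT variance; illustrative value
beta = 0.7     # guidance strength from the paper

def cosine_timestep(u):
    """Cosine schedule of Eq. (13): more generation steps near t = 0."""
    return 1 - torch.cos(0.5 * math.pi * u)

def ot_flow_and_target(x0, x1, t):
    """Eq. (11): OT path phi_t and its target vector field omega_t."""
    t = t.view(-1, 1, 1)
    phi_t = (1 - (1 - sigma) * t) * x0 + t * x1
    omega_t = x1 - (1 - sigma) * x0
    return phi_t, omega_t

def training_step(net, x1, cond, drop_prob=0.2):
    """One OT-CFM step: sample noise, build the path, optionally drop the
    conditions (classifier-free training), and regress the vector field (Eq. 10)."""
    x0 = torch.randn_like(x1)
    t = cosine_timestep(torch.rand(x1.shape[0]))
    phi_t, omega_t = ot_flow_and_target(x0, x1, t)
    if torch.rand(()) < drop_prob:
        cond = torch.zeros_like(cond)   # unconditional branch
    pred = net(phi_t, t, cond)
    return (pred - omega_t).abs().mean()

def guided_field(net, phi_t, t, cond):
    """Eq. (14): classifier-free guidance at generation time."""
    return (1 + beta) * net(phi_t, t, cond) - beta * net(phi_t, t, torch.zeros_like(cond))

# toy usage with a stand-in network over (batch, frames, mel-bins) tensors
net = lambda x, t, c: x + c   # placeholder for NN_theta
x1 = torch.randn(2, 50, 80)
cond = torch.randn(2, 50, 80)
print(float(training_step(net, x1, cond)))
```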
Figure 2: Sequence construction for (a) zero-shot in-context learning and (b) cross-lingual voice cloning. LID represents the language identifier.

2.3.1 Zero-shot In-context Learning

CosyVoice models exhibit zero-shot in-context learning capabilities, allowing for the replication of an arbitrary voice with only a brief reference speech sample. This process entails the careful construction of input sequences for the token language model (LM), depicted in Figure 2. For prompt speech and input text in the same language, we merge them to form a unified input, treating the prompt speech tokens as pre-generated. With this input sequence, the autoregressive LM iteratively predicts subsequent tokens until it encounters the "end of sequence" token E. However, when the prompt speech and input text differ linguistically, we omit the text and tokens associated with the prompt to prevent prosodic characteristics of the original language from influencing the target language. It is important to note that the prompt text, which corresponds to the prompt speech's content, can be transcribed either through human annotation or ASR models, such as SenseVoice. Similar to the prompt text, the prompt tokens are extracted from the prompt speech with the S³ tokenizer.

After generating the speech tokens, they are appended after the prompt tokens, forming a composite condition for the flow-matching model. Additionally, the speaker embedding and the Mel spectrogram of the prompt speech are incorporated to further enhance timbre and environmental consistency.
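A sketch of the two prompt layouts from Figure 2: for same-language cloning the prompt text and prompt speech tokens are kept and treated as already generated, while for cross-lingual cloning they are dropped and a language identifier is used instead. The marker symbols, helper names, and the exact position of the LID marker are assumptions for illustration.

```python
S, T, E = "<sos>", "<turn>", "<eos>"   # symbolic stand-ins for the special tokens

def same_language_prompt(spk_emb, prompt_text, input_text, prompt_speech_tokens):
    """Figure 2(a): prompt text/tokens are merged with the input text and treated
    as already generated; the LM keeps decoding speech tokens until it emits E."""
    return [S, spk_emb, *prompt_text, *input_text, T, *prompt_speech_tokens]

def cross_lingual_prompt(spk_emb, input_text, lid):
    """Figure 2(b): prompt text and prompt speech tokens are omitted so the prompt
    language's prosody does not leak; a language identifier (LID) is used instead.
    Placing LID right before the generated tokens is an assumption read off the figure."""
    return [S, spk_emb, *input_text, lid]

prefix = same_language_prompt("spk_vec", ["hope", "is"], ["a", "good", "thing"], [12, 87, 93])
print(prefix)   # the autoregressive LM appends new speech tokens after this prefix
```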
Speaker Identity
1. Selene 'Moonshade', is a mysterious, elegant dancer with a connection to the night. Her movements are both mesmerizing and deadly.<endofprompt>Hope is a good thing.
2. Theo 'Crimson', is a fiery, passionate rebel leader. Fights with fervor for justice, but struggles with impulsiveness.<endofprompt>You don't know about real loss.
Speaking Style
1. A happy girl with high tone and quick speech.<endofprompt>The sun is shining brightly today.
2. A sad woman with normal tone and slow speaking speed.<endofprompt>I failed my important exam.
Fine-grained Paralinguistics
1. Well that's kind of scary [laughter].
2. I don't think I over eat yeah [breath] and um I do exercise regularly.
3. Well that pretty much covers <laughter>the subject</laughter> well thanks for calling me.
4. The team's <strong>unity</strong> and <strong>resilience</strong> helped them win the championship.

Table 1: Examples of speaker identity, speaking style, and fine-grained paralinguistics.

2.4 Rich Generation with Instruction

To enable further controllability of CosyVoice, we experiment with integrating additional instruction fine-tuning (Ji et al., 2023). CosyVoice-instruct extends CosyVoice-base with enhanced instruction-following capabilities. Specifically, it supports controllability over various aspects such as speaker identity (i.e., a speaker's characteristics), speaking style (including emotion, gender, speaking rate, and pitch), and fine-grained paralinguistic features. These features include the ability to insert laughter and breaths, to speak while laughing, and to emphasize certain words.

We fine-tuned CosyVoice-base using this training data without incorporating the speaker embedding in the autoregressive language model. Table 1 shows some examples of speaker identity, speaking style, and fine-grained paralinguistic features.

Language  Duration (hr)
ZH        130,000
EN        30,000
Yue       5,000
JP        4,600
KO        2,200

Table 2: Hours of CosyVoice training data across languages in the large-scale experiments.

Type                          Duration (hr)
Speaker Identity              101
Speaking Style                407
Fine-grained Paralinguistics  48

Table 3: Duration statistics of instruction training data by type.
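CosyVoice-instruct receives its control signal as plain text placed before an <endofprompt> tag, as in the Table 1 examples; fine-grained tags such as [laughter] or <strong> stay inside the content text itself. A minimal sketch of assembling such inputs follows; the function name is an assumption.

```python
def build_instructed_input(instruction, content):
    """Prepend a natural-language instruction (speaker identity or speaking style)
    to the content text, separated by the <endofprompt> tag."""
    return f"{instruction}<endofprompt>{content}"

print(build_instructed_input("A sad woman with normal tone and slow speaking speed.",
                             "I failed my important exam."))
print(build_instructed_input("Happy.", "The sun is shining brightly today."))
```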
3 Dataset

3.1 Small-scale Single-lingual Dataset

We conduct experiments on the LibriTTS (Zen et al., 2019) corpus, which contains 585 hours from 2,456 English speakers. We follow the official data partition, where "train-clean-100", "train-clean-360" and "train-other-500" are merged for training and "dev-clean" is used for model selection. "test-clean" is used to construct the evaluation set as described in (Du et al., 2024).

3.2 Large-scale Multi-lingual Dataset

To train the CosyVoice models, we have amassed a considerable dataset comprising multiple languages. Throughout the collection process, we utilize specialized in-house tools for speech detection, signal-to-noise ratio (SNR) estimation, speaker diarization, and separation. Subsequently, pseudo text labels are generated using SenseVoice-Large and Paraformer. These labels undergo a refinement process with the aid of force-alignment (FA) models, which helps eliminate low-quality data and enhances the accuracy of punctuation. A comprehensive breakdown of the training data's duration across the various languages is presented in Table 2. Table 3 presents the duration of the training data for different types of instructions.

4 Experimental Settings

4.1 Supervised Semantic Speech Tokenizer

For the small-scale single-lingual dataset, we employ the ESPNet Conformer ASR model as the backbone and insert the vector quantizer after the first six encoder layers. There is a single codebook with 4,096 codes. The first six encoder layers and the vector quantizer are employed as the speech tokenizer.
As for the text tokenizer, a word sentence-piece model is trained on the training text, with a vocabulary size of 4,000. We train the quantizer-augmented ASR model on the Librispeech (Panayotov et al., 2015) corpus for 50 epochs from scratch.

For the large-scale multi-lingual dataset, we employ the SenseVoice-Large rich recognition model (TongyiSpeech, 2024) as the backbone. Similar to the small-scale dataset, we still insert the vector quantizer after the first six encoder layers, with a single codebook of 4,096 codes. More hyper-parameter selections, such as the quantizer-inserted layer and the number of codes, are left for future work. Different from the single-lingual experiments, we use the pre-trained checkpoint to initialize the SenseVoice-Large model rather than training it from scratch. After inserting the quantizer, we further fine-tune all parameters for 210,000 training steps on eight A800 GPUs.

4.2 CosyVoice Model Settings

We train tiny and normal size models in the single-lingual and multi-lingual experiments. Details of the model architecture settings are shown in Table 4. The tiny model is trained on the LibriTTS training set for 50 epochs with four V100-32M GPUs, while the multi-lingual model is trained on our internal dataset for 800,000 steps with 64 V100-32M GPUs. The tiny and normal models are trained with learning rates of 10^-3 and 10^-4, respectively. The warmup step is set to 10,000.

Settings          Tiny   Normal
Text Encoder
  Layers          6      6
  Attention Dim.  512    1,024
  Attention Heads 8      16
  Linear Units    2,048  4,096
Language Model
  Layers          12     14
  Attention Dim.  512    1,024
  Attention Heads 8      16
  Linear Units    2,048  4,096

Table 4: Details of model architecture settings in the tiny and normal CosyVoice models.

5 Experimental Results

5.1 Evaluation on S³ Tokenizer

In Table 5, we demonstrate how the vector quantization affects the recognition performance on the LibriTTS test sets. From the table, we can see that inserting a vector quantizer into the ASR encoder only affects the recognition performance slightly. As a result, the VQ-inserted Conformer ASR model achieves comparable WERs of 3.18% and 7.56% on the "test-clean" and "test-other" sets, respectively. This indicates that tokenizers trained in a supervised manner can maintain sufficient semantic information and alignment to the text.

Model         dev_clean  test_clean  test_other
Conformer     2.62       2.89        6.57
Conformer-VQ  3.13       3.18        7.56

Table 5: Impact of inserting vector quantization on speech recognition in terms of word error rate (%).

To assess the multi-lingual S³ tokenizer's ability to preserve semantic information, we compared the recognition performance of the quantizer-augmented SenseVoice-L against its original version and the Whisper-Large V3 model. The models underwent evaluation using the Common Voice zh-CN and en benchmarks, with the findings detailed in Table 6. From the table, we can see that our S³ tokens demonstrate robust recognition performance on both the Chinese and English test sets. Notably, on the common_voice_zh-CN set, S³ tokens surpass the performance of the Whisper-Large V3 model (TongyiSpeech, 2024), achieving a 4.14% relative reduction in error rate. This suggests a substantial correlation between S³ tokens and semantic content. It is worth noting that there is only a single codebook in the S³ tokenizer, with a dictionary size of 4,096 entries.

          Whisper-L-V3      SenseVoice-L      S³ tokens
Test set  w/o lid  w/ lid   w/o lid  w/ lid   w/o lid  w/ lid
zh-CN     12.82    12.55    8.76     8.68     12.24    12.06
en        13.55    9.39     9.79     9.77     15.43    15.38

Table 6: The evaluation of S³ tokens' capability to preserve semantic information. We employ word and character error rates for the zh-CN and en languages on the Common Voice benchmarks.

5.2 Comparison with Baselines

We compare the proposed CosyVoice models with other TTS systems in terms of content consistency and speaker similarity. For content consistency, an ASR model is employed to recognize the generated utterances. We report the word error rate (WER) and the numbers of insertion, deletion and substitution errors. As for speaker similarity, we employ the ERes2Net model (Chen et al., 2023) to extract speaker embeddings of the prompt and generated utterances, and their raw cosine similarity is treated as the speaker similarity. Experimental results are shown in Table 7.
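A sketch of how the two metrics in Table 7 can be computed: insertion/deletion/substitution counts from an edit-distance alignment between the ASR transcript and the reference text, and speaker similarity as the raw cosine between prompt and generated speaker embeddings. The alignment routine here is a generic Levenshtein implementation, not the exact evaluation toolkit used in the paper.

```python
import numpy as np

def edit_ops(ref, hyp):
    """Levenshtein alignment returning (substitutions, insertions, deletions)."""
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1, d[i - 1, j - 1] + cost)
    # backtrack to count operation types
    i, j, sub, ins, dele = len(ref), len(hyp), 0, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i, j] == d[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1]):
            sub += ref[i - 1] != hyp[j - 1]; i -= 1; j -= 1
        elif j > 0 and d[i, j] == d[i, j - 1] + 1:
            ins += 1; j -= 1
        else:
            dele += 1; i -= 1
    return sub, ins, dele

def speaker_similarity(e_prompt, e_gen):
    """Raw cosine similarity between speaker embeddings."""
    return float(np.dot(e_prompt, e_gen) /
                 (np.linalg.norm(e_prompt) * np.linalg.norm(e_gen)))

ref = "hope is a good thing".split()
hyp = "hope is the good thing maybe".split()
s, i, d = edit_ops(ref, hyp)
print(f"WER = {(s + i + d) / len(ref):.2%}, #INS+DEL = {i + d}, #SUB = {s}")
print(speaker_similarity(np.ones(192), np.ones(192)))
```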
Model                               Text token  Speech Token  WER (%)  #INS+DEL  #SUB  SS
Original                            -           -             3.01     66        200   69.67
VALL-E (Wang et al., 2023)          Phone       Encodec       18.70    342       1312  53.19
UniAudio (Yang et al., 2023)        Phone       Encodec       8.74     254       519   47.56
SpearTTS (Kharitonov et al., 2023)  Phone       Hubert        6.14     133       410   51.71
Exp-1-LibriTTS                      Phone       Hubert        7.41     325       409   67.85
Exp-2-LibriTTS                      Phone       S³_en         5.05     122       325   67.85
Exp-3-LibriTTS                      BPE_en      S³_en         3.93     108       239   67.85
Exp-4-LibriTTS                      BPE         S³            4.76     134       287   65.94
Exp-4-Large-scale                   BPE         S³            3.17     96        184   69.49

Table 7: Comparison with other TTS models on the LibriTTS test-clean set in terms of content consistency and speaker similarity (SS). A non-autoregressive ASR model, Paraformer-en, is employed for fast evaluation.

Compared with other TTS models, the proposed CosyVoice framework achieves comparable content consistency and higher speaker similarity, even when using the same text and speech tokenizers. Comparing Exp-1, Exp-2 and Exp-3, we can see that both the text and speech tokenizers are critical for content consistency but have a negligible effect on speaker similarity. In the Exp-4 experiments, we replace the single-lingual text and speech tokenizers with the multi-lingual ones. Only using the LibriTTS corpus to train the model degrades both content consistency and speaker similarity. By involving the internal large-scale dataset, the performance is significantly improved, achieving human-parity quality.

5.3 Evaluation on Generation Quality of CosyVoice

We evaluate the quality of CosyVoice's speech synthesis by examining content consistency and speaker similarity. The "test-clean" subset of LibriTTS (Zen et al., 2019) and the test set of AISHELL-3 (Shi et al., 2021) are employed to construct evaluation sets for English and Chinese, respectively. For each text in these sets, we randomly select a prompt speech. Content consistency was evaluated using Whisper-Large V3 (Radford et al., 2023) for English and Paraformer (Gao et al., 2022) for Chinese recognition. Speaker similarity was quantified by calculating the cosine similarity between the speaker embeddings of the generated and prompt speeches, extracted using ERes2Net (Chen et al., 2023).

Similar to other autoregressive language models, we employ a random sampling decoding strategy for our token LM and assess the synthesis process using five different random seed values: 0, 7, 42, 123, and 1,337. The resulting evaluation metrics were averaged to determine the mean and standard deviation. Additionally, we conducted an ASR re-ranking to demonstrate the potential performance improvements in offline mode.

Model             WER (%)    #Ins.&Del.  SS
Original          2.66       92          69.67
ChatTTS           8.32       441         -
CosyVoice         2.89±0.18  88.60±3.88  74.30±0.15
 + 5× re-ranking  1.51       47          74.30

Table 8: The comparison of original and CosyVoice-generated speeches on the LibriTTS test-clean set in terms of word error rate (WER) and speaker similarity (SS). "±" joins the mean and standard deviation for each evaluation metric. Whisper-Large V3 is employed as the ASR model.
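A sketch of the multi-seed evaluation and the offline ASR re-ranking reported in Tables 8 and 9: synthesis is run with each of the five seeds, and re-ranking keeps, per utterance, the candidate whose ASR transcript scores the lowest WER. `synthesize`, `transcribe` and `wer` are placeholders for the actual TTS, ASR and scoring calls.

```python
import statistics

SEEDS = [0, 7, 42, 123, 1337]   # seeds used in the paper's evaluation

def evaluate_with_seeds(texts, synthesize, transcribe, wer):
    """Per-seed corpus WERs (for mean/std) plus a 5x re-ranked WER, where the
    best of the five candidates is kept for every utterance."""
    per_seed, best_per_utt = [], [float("inf")] * len(texts)
    for seed in SEEDS:
        utt_wers = []
        for k, text in enumerate(texts):
            audio = synthesize(text, seed=seed)        # placeholder TTS call
            w = wer(text, transcribe(audio))           # placeholder ASR + scoring
            utt_wers.append(w)
            best_per_utt[k] = min(best_per_utt[k], w)  # offline re-ranking
        per_seed.append(sum(utt_wers) / len(utt_wers))
    return (statistics.mean(per_seed), statistics.stdev(per_seed),
            sum(best_per_utt) / len(best_per_utt))

# toy stand-ins so the sketch runs end to end
mean, std, reranked = evaluate_with_seeds(
    ["hello there", "good morning"],
    synthesize=lambda text, seed: (text, seed),
    transcribe=lambda audio: audio[0] if audio[1] != 42 else audio[0] + " um",
    wer=lambda ref, hyp: abs(len(hyp.split()) - len(ref.split())) / len(ref.split()))
print(mean, std, reranked)
```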
Tables 8 and 9 present the results for English and Chinese, respectively. On the English dataset, CosyVoice attained human-level performance with similar content recognition and higher speaker similarity. ASR re-ranking notably enhanced content consistency, yielding a reduced word error rate (WER) of 1.51%. CosyVoice outperformed ChatTTS in WER and in the number of insertion and deletion errors, indicating superior content consistency. We did not assess speaker similarity for ChatTTS, as it does not release voice cloning capabilities.

Model             CER (%)    #Ins.&Del.  SS
Original          2.52       25          74.15
ChatTTS           3.87       111         -
CosyVoice         3.82±0.24  24.4±2.24   81.58±0.16
 + 5× re-ranking  1.84       11          81.58

Table 9: The comparison of original and CosyVoice-generated speeches on the AISHELL-3 test set in terms of character error rate (CER) and speaker similarity (SS). Paraformer-zh is employed as the ASR model.

As for the results in Chinese, the generated utterances of CosyVoice achieve a CER, as well as insertion and deletion errors, comparable to those of the original utterances. It seems that ChatTTS has a better generation ability in Chinese than in English in terms of CER. Although ChatTTS and CosyVoice achieve a similar CER, ChatTTS produces more insertion and deletion errors. This is due to the problem of speaker leaking, where modal particles of another speaker are generated unexpectedly. On the contrary, CosyVoice does not suffer from this problem, producing far fewer insertion and deletion errors. With ASR re-ranking, CosyVoice reached a remarkably low CER of 1.84%. As seen with English, CosyVoice also exhibited greater speaker similarity than the original utterances, showcasing its effective voice-cloning proficiency.

5.4 Emotion Controllability of CosyVoice

To verify the emotion controllability, we use the public speech emotion recognition model emo2vec³ (Ma et al., 2023). We generated and evaluated 100 English utterances for each of the six emotions: happy, angry, sad, surprised, fearful, and disgusted. The content of the synthesized text is designed to match the target emotion. We then measure the accuracy of the predicted emotions from the synthesized speech for each emotion.

Table 10 shows the comparison of emotion control accuracy between CosyVoice-base and CosyVoice-instruct. For CosyVoice-instruct, the input consists of the content text accompanied by a speaking style instruction (e.g., "Happy.<endofprompt>Content Text"). In contrast, CosyVoice-base only receives the content text as input. The results indicate that CosyVoice-instruct with emotional instructions demonstrates a significant improvement over both CosyVoice-base and CosyVoice-instruct without emotional instructions.

³https://modelscope.cn/models/iic/emotion2vec_base_finetuned

5.5 CosyVoice as a Data Generator

A straightforward application of CosyVoice is as a data generator to augment the training data of other tasks, such as ASR and speech-to-speech translation (S2ST). Taking the ASR task as an example, we conduct an experiment on the Librispeech corpus to evaluate CosyVoice's capability in generating high-quality data. The experimental results are shown in Table 11, where "Librispeech" denotes the original 960-hour data. "Syn on LS text" and "Syn on LS, MLS text" denote the data generated from the text of the Librispeech and MLS training sets, respectively. From the table, we can see that, even when training only on the synthesized data, the ASR model can achieve a result comparable to training on the original Librispeech training set. Upon integrating them, a notable enhancement in recognition accuracy is observed. An interesting finding is that involving the data synthesized from the MLS text significantly improves the recognition performance. This may indicate that text diversity is more critical for the ASR task than the duration of the speech itself. This improvement can be attributed to the varied linguistic content introduced by the CosyVoice-synthesized samples. The findings from our evaluation underscore the high quality of the samples generated by CosyVoice.

6 Conclusion

In this paper, we introduce CosyVoice, a scalable multi-lingual speech generation model, which supports zero-shot in-context learning, cross-lingual voice cloning, instructed generation and fine-grained control of emotion and paralinguistic features. Experimental results show that the system architecture of CosyVoice is important for speaker similarity, while the text and speech tokenizers strongly affect content consistency. Besides, we find that scaling up the model size and data volume can improve the performance significantly. As a result, CosyVoice achieves human-parity generation quality.

References

James Betker. 2023. Better speech synthesis through scaling. CoRR, abs/2305.07243.

Yafeng Chen, Siqi Zheng, Hui Wang, Luyao Cheng, Qian Chen, and Jiajun Qi. 2023. An enhanced res2net with local and global feature fusion for speaker verification. In Interspeech. ISCA.
Model Happy Sad Angry Surprised Fearful Disgusted
CosyVoice-base 1.00±0.00 0.45±0.05 0.59±0.03 0.26±0.02 0.88±0.01 0.46±0.06
CosyVoice-instruct 1.00±0.00 0.98±0.02 0.83±0.04 0.64±0.03 0.87±0.03 0.93±0.02
w/o instruction 0.98±0.01 0.77±0.04 0.49±0.12 0.28±0.06 0.83±0.04 0.45±0.16

Table 10: Comparison of emotion control accuracy between CosyVoice-base-300M and CosyVoice-instruct-300M.
“±” joins the mean and standard deviation for each evaluation metric.

Training Data dev_clean dev_other test_clean test_other


Librispeech 2.77 5.84 2.79 5.97
Syn on LS text 2.79 6.37 3.00 6.59
Librispeech + Syn on LS text 2.44 5.52 2.56 5.68
Librispeech + Syn on LS text ×2 2.51 5.23 2.68 5.26
Librispeech + Syn on LS, MLS text 1.93 4.43 2.04 4.53

Table 11: Evaluation on CosyVoice generation quality by treating it as a data generator. Word error rates (%) on
the human-uttered test sets are employed as the evaluation metrics.

Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. 2022. High fidelity neural audio compression. CoRR, abs/2210.13438.

Chenpeng Du, Yiwei Guo, Feiyu Shen, Zhijun Liu, Zheng Liang, Xie Chen, Shuai Wang, Hui Zhang, and Kai Yu. 2024. Unicats: A unified context-aware text-to-speech framework with contextual vq-diffusion and vocoding. In AAAI, pages 17924–17932. AAAI Press.

Zhifu Gao, Shiliang Zhang, Ian McLoughlin, and Zhijie Yan. 2022. Paraformer: Fast and accurate parallel transformer for non-autoregressive end-to-end speech recognition. In Interspeech, pages 2063–2067. ISCA.

Wenhao Guan, Qi Su, Haodong Zhou, Shiyu Miao, Xingjia Xie, Lin Li, and Qingyang Hong. 2023. Reflow-tts: A rectified flow model for high-fidelity text-to-speech. CoRR, abs/2309.17056.

Yiwei Guo, Chenpeng Du, Ziyang Ma, Xie Chen, and Kai Yu. 2023. Voiceflow: Efficient text-to-speech with rectified flow matching. CoRR, abs/2309.05027.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Jonathan Ho and Tim Salimans. 2022a. Classifier-free diffusion guidance. CoRR, abs/2207.12598.

Jonathan Ho and Tim Salimans. 2022b. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598.

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE ACM Trans. Audio Speech Lang. Process., 29:3451–3460.

Shengpeng Ji, Jialong Zuo, Minghui Fang, Ziyue Jiang, Feiyang Chen, Xinyu Duan, Baoxing Huai, and Zhou Zhao. 2023. Textrolspeech: A text style control speech corpus with codec language text-to-speech models. CoRR, abs/2308.14430.

Eugene Kharitonov, Damien Vincent, Zalán Borsos, Raphaël Marinier, Sertan Girgin, Olivier Pietquin, Matt Sharifi, Marco Tagliasacchi, and Neil Zeghidour. 2023. Speak, read and prompt: High-fidelity text-to-speech with minimal supervision. Trans. Assoc. Comput. Linguistics, 11:1703–1718.

Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. 2020. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Mateusz Lajszczak, Guillermo Cámbara, Yang Li, Fatih Beyhan, Arent van Korlaar, Fan Yang, Arnaud Joly, Álvaro Martín-Cortinas, Ammar Abbas, Adam Michalski, Alexis Moinet, Sri Karlapati, Ewa Muszynska, Haohan Guo, Bartosz Putrycz, Soledad López Gambino, Kayeon Yoo, Elena Sokolova, and Thomas Drugman. 2024. BASE TTS: lessons from building a billion-parameter text-to-speech model on 100k hours of data. CoRR, abs/2402.08093.

Matthew Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, et al. 2024. Voicebox: Text-guided multilingual universal speech generation at scale. Advances in Neural Information Processing Systems, 36.

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. 2023. Flow matching for generative modeling. In ICLR. OpenReview.net.

Ziyang Ma, Zhisheng Zheng, Jiaxin Ye, Jinchao Li, Zhifu Gao, Shiliang Zhang, and Xie Chen. 2023. emotion2vec: Self-supervised pre-training for speech emotion representation. CoRR, abs/2312.15185.

Shivam Mehta, Ruibo Tu, Jonas Beskow, Éva Székely, and Gustav Eje Henter. 2023. Matcha-tts: A fast TTS architecture with conditional flow matching. CoRR, abs/2309.03199.

Alexander Quinn Nichol and Prafulla Dhariwal. 2021. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR.

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. Librispeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210. IEEE.

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 28492–28518. PMLR.

Paul K. Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, Ankur Bapna, Zalán Borsos, Félix de Chaumont Quitry, Peter Chen, Dalia El Badawy, Wei Han, Eugene Kharitonov, Hannah Muckenhirn, Dirk Padfield, James Qin, Danny Rozenberg, Tara N. Sainath, Johan Schalkwyk, Matthew Sharifi, Michelle Tadmor Ramanovich, Marco Tagliasacchi, Alexandru Tudor, Mihajlo Velimirovic, Damien Vincent, Jiahui Yu, Yongqiang Wang, Vicky Zayats, Neil Zeghidour, Yu Zhang, Zhishuai Zhang, Lukas Zilka, and Christian Havnø Frank. 2023. Audiopalm: A large language model that can speak and listen. CoRR, abs/2306.12925.

Yao Shi, Hui Bu, Xin Xu, Shaoji Zhang, and Ming Li. 2021. AISHELL-3: A multi-speaker mandarin TTS corpus. In Interspeech, pages 2756–2760. ISCA.

David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur. 2018. X-vectors: Robust DNN embeddings for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2018, Calgary, AB, Canada, April 15-20, 2018, pages 5329–5333. IEEE.

Alexander Tong, Nikolay Malkin, Guillaume Huguet, Yanlei Zhang, Jarrid Rector-Brooks, Kilian Fatras, Guy Wolf, and Yoshua Bengio. 2023. Improving and generalizing flow-based generative models with minibatch optimal transport. In ICML Workshop on New Frontiers in Learning, Control, and Dynamical Systems.

Team TongyiSpeech. 2024. Funaudiollm: Voice understanding and generation foundation models for natural interaction between humans and llms.

Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, and Furu Wei. 2023. Neural codec language models are zero-shot text to speech synthesizers. CoRR, abs/2301.02111.

Dongchao Yang, Jinchuan Tian, Xu Tan, Rongjie Huang, Songxiang Liu, Xuankai Chang, Jiatong Shi, Sheng Zhao, Jiang Bian, Xixin Wu, Zhou Zhao, Shinji Watanabe, and Helen Meng. 2023. Uniaudio: An audio foundation model toward universal audio generation. CoRR, abs/2310.00704.

Lingxuan Ye, Changfeng Gao, Gaofeng Cheng, Liuping Luo, and Qingwei Zhao. 2024. ASQ: An ultra-low bit rate asr-oriented speech quantization method. IEEE Signal Process. Lett., 31:221–225.

Heiga Zen, Viet Dang, Rob Clark, et al. 2019. Libritts: A corpus derived from librispeech for text-to-speech. arXiv:1904.02882.