CosyVoice v1
Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang
Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, Zhijie Yan
Speech Lab, Alibaba Group, China
{neo.dzh,sly.zsl,h.lu}@alibaba-inc.com
• We are the first to integrate supervised speech tokens into TTS models, enhancing content consistency and speaker similarity in zero-shot voice cloning.
• We propose CosyVoice, a scalable zero-shot TTS synthesis system that combines an LLM for text-to-token generation with a conditional flow matching model for token-to-speech synthesis, forsaking the need for additional phonemizers and forced aligners.
• To further refine the quality of generated speech, we incorporate the x-vector (Snyder et al., 2018) into the LLM to separate the modeling of speech into semantic, speaker, and prosody components. The LLM models the semantic content and prosody, while the conditional flow matching model captures timbre and environmental information. We optimize the flow matching process with techniques such as classifier-free guidance (Ho and Salimans, 2022a), a cosine scheduler, and masked conditions.

Our experimental results demonstrate the superiority of supervised semantic tokens over unsupervised counterparts. Additionally, the scalability of CosyVoice is evidenced by improved synthesis performance when utilizing large-scale data. This work, therefore, represents a significant step forward in the development of natural-sounding, versatile TTS systems.

2 CosyVoice: A Scalable TTS Model Using Supervised Semantic Tokens

As shown in Figure 1(b), our CosyVoice consists of four components, namely the text encoder, the speech tokenizer, the large language model and the conditional flow matching model.

2.1 Supervised Semantic Tokens for Speech

In CosyVoice, a supervised automatic speech recognition (ASR) model is employed to derive the supervised semantic speech (S³) tokenizer. The model is a finetuned version of our proprietary SenseVoice ASR model. It is trained on multilingual audio data and possesses rich audio content understanding capabilities. Different from the original ASR model, we split the encoder into two parts and insert a vector quantization layer between them. Given a Mel spectrogram X as input, it undergoes the positional encoding and Encoder1 to obtain context-aware representations H:

    H = Encoder1(PosEnc(X))    (1)

Then, a vector quantizer (VQ) is involved to obtain discrete tokens. For the hidden representation h_l at frame l, the index of the nearest embedding in the codebook C is treated as the speech token µ_l at this timestep:

    µ_l = VQ(h_l, C) = argmin_{c_n ∈ C} ||h_l − c_n||_2    (2)

where ||·||_2 denotes the L2 norm. At the training stage, codebook embeddings are updated via an exponential moving average (EMA):

    c_{µ_l} := α c_{µ_l} + (1 − α) h_l    (3)

where α is a pre-defined decay coefficient. The corresponding codebook embeddings of the speech tokens are used as the quantized hidden representations H̄ = {c_{µ_1}, c_{µ_2}, ..., c_{µ_L}} and passed through the remaining encoder layers Encoder2:

    H̃ = Encoder2(PosEnc(H̄))    (4)
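As a concrete illustration of Eqs. (2) and (3), the following is a minimal NumPy sketch of the nearest-neighbour lookup and the EMA codebook update. The shapes, the decay value, and the per-frame update loop are illustrative choices, not the paper's implementation.

```python
import numpy as np

def vq_ema_step(h, codebook, alpha=0.99):
    """Quantize frame-level hidden states h (L, D) against a codebook
    (N, D): nearest-neighbour lookup as in Eq. (2), followed by an EMA
    update of the selected codes as in Eq. (3).
    Returns (tokens mu, quantized representations H_bar)."""
    # Squared L2 distance between every frame and every codebook entry.
    dist = ((h[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    mu = dist.argmin(axis=1)          # speech tokens, one per frame
    quantized = codebook[mu].copy()   # H_bar = {c_{mu_l}}
    # EMA: only codes that were actually selected move toward their frames.
    for l, idx in enumerate(mu):
        codebook[idx] = alpha * codebook[idx] + (1 - alpha) * h[l]
    return mu, quantized
```

In the full tokenizer, the quantized sequence H̄ is then passed through the remaining encoder layers, as in Eq. (4).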
Figure 1: An overview of the proposed CosyVoice model. (a) demonstrates the S³ tokenizer, where dashed modules are only used at the training stage. (b) is a schematic diagram of CosyVoice, consisting of a text-to-token LLM and a token-to-speech flow matching model. S, E and T denote the "start of sequence", "end of sequence" and "turn of speech" tokens. Dashed lines indicate the autoregressive decoding at the inference stage. (c) provides an enlarged view of our flow matching model conditioning on a speaker embedding v, semantic tokens µ, masked speech features X̃ and the intermediate state X_t at timestep t on the probabilistic density path.
Note that, before the remaining encoder layers, we add an extra positional encoding to enhance the temporal information. After Encoder2, a transformer-based ASR decoder follows, predicting the posterior probability of the text labels:

    P(Y|X) = ASRDecoder(H̃, Y^{Z−1})    (5)

where Y^{Z−1} represents the left-shifted text labels in the teacher-forcing training scheme.

2.2 Large Language Model for TTS

In this section, we formulate the TTS task as an auto-regressive speech token generation problem with a large language model (LLM). For the LLM, the sequence construction is the most important matter, and it is constructed as follows:

    S, v, {ȳ_u}_{u∈[1:U]}, T, {µ_l}_{l∈[1:L]}, E    (6)

S and E denote the start and end of the sequence, respectively. v is a speaker embedding vector extracted from the speech X with a pre-trained voiceprint model². The text encodings Ȳ = {ȳ_u}_{u∈[1:U]} are obtained by passing the text through a Byte Pair Encoding (BPE) tokenizer and the text encoder:

    Ȳ = TextEncoder(BPE(Y))    (7)

Since text and speech tokens lie at different semantic levels, the text encoder is used to align their semantic spaces and benefit the LLM modeling. A start identifier T is inserted between the text encodings and the speech tokens {µ_l}_{l∈[1:L]}, which are extracted with the supervised semantic tokenizer described in Section 2.1. At the training stage, we employ the teacher-forcing scheme, in which the left-shifted sequence serves as the model inputs and the original sequence serves as the expected outputs. Note that only the cross-entropy losses of the speech tokens and E are considered during training:

    L_LLM = −(1/(L+1)) Σ_{l=1}^{L+1} log q(µ_l)    (8)

where µ_{L+1} is the "end of sequence" token E. q(µ_l) denotes the posterior probability of µ_l, which is predicted by the softmax layer following the LLM.

2.3 Optimal-transport Conditional Flow Matching

In CosyVoice, an optimal-transport conditional flow matching model (OT-CFM) is employed to learn the distribution of Mel spectrograms and to generate samples from it with the generated speech tokens as conditions. OT-CFM can achieve better performance than diffusion probabilistic models (DPMs), with simpler gradients, easier training and faster generation (Lipman et al., 2023; Tong et al., 2023; Mehta et al., 2023). In continuous-time normalizing flows (CNFs), a probability density path is constructed from a prior distribution p_0(X) to the data distribution of the Mel spectrogram q(X). The probability density path is defined by a time-dependent vector field ν_t(X): [0, 1] × R^{L∗D} → R^{L∗D}, which generates the flow φ_t through the ordinary differential equation dφ_t(X)/dt = ν_t(φ_t(X)).

² Available at https://github.com/alibaba-damo-academy/3D-Speaker/tree/main/egs/3dspeaker/sv-cam++
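The sequence construction of Eq. (6) and the masked cross-entropy of Eq. (8) above can be sketched as follows. Integer ids for S, E and T, and the omission of the speaker embedding v, are simplifications for illustration; in CosyVoice these are embedding-level inputs to the LLM.

```python
import numpy as np

S_TOK, E_TOK, T_TOK = 0, 1, 2  # placeholder ids for the special tokens

def build_lm_io(text_ids, speech_ids):
    """Teacher-forcing inputs/targets for the sequence of Eq. (6)
    (speaker embedding v omitted) and a mask selecting the positions
    whose loss is counted in Eq. (8): the L speech tokens plus E."""
    seq = [S_TOK, *text_ids, T_TOK, *speech_ids, E_TOK]
    inputs, targets = seq[:-1], seq[1:]    # left-shifted model inputs
    first_speech = 1 + len(text_ids) + 1   # index of mu_1 in seq: after S, text, T
    mask = [j + 1 >= first_speech for j in range(len(targets))]
    return inputs, targets, mask

def lm_loss(logq, targets, mask):
    """-1/(L+1) * sum of log q(mu_l) over the masked positions (Eq. 8).
    logq: per-position mappings from token id to log-probability."""
    sel = [row[t] for row, t, m in zip(logq, targets, mask) if m]
    return -float(np.mean(sel))
```

Note that exactly L + 1 positions survive the mask, matching the normalization in Eq. (8).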
Table 4: Details of model architecture settings in the tiny and normal CosyVoice models.

Table 6: The evaluation of the S³ tokens' capability to preserve semantic information. We employ word and character error rates for the zh-CN and en languages on the Common Voice benchmarks.

sentence-piece model is trained on the text of training, which has a vocabulary size of 4,000. We train the quantizer-augmented ASR model on the Librispeech (Panayotov et al., 2015) corpus for 50 epochs from scratch.

For the large-scale multi-lingual dataset, we employ the SenseVoice-Large rich recognition model (TongyiSpeech, 2024) as the backbone. Similar to the small-scale dataset, we still insert the vector quantizer after the first six encoder layers, with a single codebook of 4,096 codes. More hyper-parameter selections, such as the quantizer-inserted layer and the number of codes, are left for future work. Different from the single-lingual experiments, we use the pre-trained checkpoint to initialize the SenseVoice-Large model rather than training it from scratch. After inserting the quantizer, we further fine-tune the whole parameters for 210,000 training steps on eight A800 GPUs.

4.2 CosyVoice Model Settings

We train the tiny and normal size models in the single-lingual and multi-lingual experiments. Details of the model architecture settings are shown in Table 4. The tiny model is trained on the LibriTTS training set for 50 epochs with four V100-32G GPUs, while the multi-lingual model is trained on our internal dataset for 800,000 steps with 64 V100-32G GPUs. The tiny and normal models are trained with learning rates of 10⁻³ and 10⁻⁴, respectively. The warmup step is set to 10,000.

5 Experimental Results

5.1 Evaluation on the S³ Tokenizer

In Table 5, we demonstrate how the vector quantization affects the recognition performance on the LibriTTS test sets. From the table, we can see that inserting a vector quantizer into the ASR encoder only affects the recognition performance slightly. As a result, the VQ-inserted Conformer ASR model achieves comparable WERs of 3.18% and 7.56% on the "test-clean" and "test-other" sets, respectively. This indicates that tokenizers trained in a supervised manner can maintain sufficient semantic information and alignment to the text.

To assess the multi-lingual S³ tokenizer's ability to preserve semantic information, we compared the recognition performance of the quantizer-augmented SenseVoice-L against its original version and the Whisper-Large V3 model. The models underwent evaluation using the Common Voice zh-CN and en benchmarks, with the findings detailed in Table 6. From the table, we can see that our S³ tokens demonstrate robust recognition performance on both the Chinese and English test sets. Notably, on the common_voice_zh-CN set, the S³ tokens surpass the performance of the Whisper-Large V3 model (TongyiSpeech, 2024), achieving a 4.14% relative reduction in error rate. This suggests a substantial correlation between the S³ tokens and semantic content. It is worth noting that there is only a single codebook in the S³ tokenizer, with a dictionary size of 4,096 entries.

5.2 Comparison with Baselines

We compare the proposed CosyVoice models with other TTS systems on content consistency and speaker similarity. For content consistency, an ASR model is employed to recognize the generated utterances. We report the word error rate (WER) and the number of insertion, deletion and substitution
Model                              | Text token | Speech token | WER (%) | #INS+DEL | #SUB | SS
Original                           | -          | -            | 3.01    | 66       | 200  | 69.67
VALL-E (Wang et al., 2023)         | Phone      | Encodec      | 18.70   | 342      | 1312 | 53.19
UniAudio (Yang et al., 2023)       | Phone      | Encodec      | 8.74    | 254      | 519  | 47.56
SpearTTS (Kharitonov et al., 2023) | Phone      | Hubert       | 6.14    | 133      | 410  | 51.71
Exp-1-LibriTTS                     | Phone      | Hubert       | 7.41    | 325      | 409  | 67.85
Exp-2-LibriTTS                     | Phone      | S³_en        | 5.05    | 122      | 325  | 67.85
Exp-3-LibriTTS                     | BPE_en     | S³_en        | 3.93    | 108      | 239  | 67.85
Exp-4-LibriTTS                     | BPE        | S³           | 4.76    | 134      | 287  | 65.94
Exp-4-Large-scale                  | BPE        | S³           | 3.17    | 96       | 184  | 69.49

Table 7: Comparison with other TTS models on the LibriTTS test-clean set in terms of content consistency and speaker similarity (SS). A non-autoregressive ASR model, Paraformer-en, is employed for fast evaluation.
errors. As for the speaker similarity, we employ the ERes2Net model (Chen et al., 2023) to extract speaker embeddings of the prompt and generated utterances, and their raw cosine similarity is treated as the speaker similarity. Experimental results are shown in Table 7.

Compared with other TTS models, the proposed CosyVoice framework achieves comparable content consistency and higher speaker similarity even when using the same text and speech tokenizers. Comparing Exp-1, Exp-2 and Exp-3, we can see that both the text and speech tokenizers are critical for content consistency and negligible for speaker similarity. In the Exp-4 experiments, we replace the single-lingual text and speech tokenizers with the multi-lingual ones. Only using the LibriTTS corpus to train the model degrades both the content consistency and the speaker similarity. By involving the internal large-scale dataset, the performance is significantly improved, achieving human-parity quality.

5.3 Evaluation on Generation Quality of CosyVoice

We evaluate the quality of CosyVoice's speech synthesis by examining content consistency and speaker similarity. The "test-clean" subset of LibriTTS (Zen et al., 2019) and the test set of AISHELL-3 (Shi et al., 2021) are employed to construct evaluation sets for English and Chinese, respectively. For each text in these sets, we randomly select a prompt speech. Content consistency was evaluated using Whisper-Large V3 (Radford et al., 2023) for English recognition and Paraformer (Gao et al., 2022) for Chinese recognition. Speaker similarity was quantified by calculating the cosine similarity between the speaker embeddings of the generated and prompt speeches, extracted using ERes2Net (Chen et al., 2023).

Model           | WER (%)   | #Ins.&Del. | SS
Original        | 2.66      | 92         | 69.67
ChatTTS         | 8.32      | 441        | -
CosyVoice       | 2.89±0.18 | 88.60±3.88 | 74.30±0.15
+ 5× re-ranking | 1.51      | 47         | 74.30

Table 8: The comparison of original and CosyVoice-generated speeches on the LibriTTS test-clean set in terms of word error rate (WER) and speaker similarity (SS). "±" joins the mean and standard deviation for each evaluation metric. Whisper-Large V3 is employed as the ASR model.

Similar to other autoregressive language models, we employ a random sampling decoding strategy for our token LM and assess the synthesis process using five different random seed values: 0, 7, 42, 123, and 1,337. The resultant evaluation metrics are averaged to determine the mean and standard deviation. Additionally, we conduct an ASR re-ranking to demonstrate potential performance improvements in offline mode.

Tables 8 and 9 present the results for English and Chinese, respectively. On the English dataset, CosyVoice attained human-level performance with similar content recognition and higher speaker similarity. ASR re-ranking notably enhanced content consistency, yielding a reduced word error rate (WER) of 1.51%. CosyVoice outperformed ChatTTS in WER and in the number of insertion and deletion errors, indicating superior content consistency. We did not assess speaker similarity for ChatTTS, as it does not release voice cloning capabilities.

Model           | CER (%)   | #Ins.&Del. | SS
Original        | 2.52      | 25         | 74.15
ChatTTS         | 3.87      | 111        | -
CosyVoice       | 3.82±0.24 | 24.4±2.24  | 81.58±0.16
+ 5× re-ranking | 1.84      | 11         | 81.58

Table 9: The comparison of original and CosyVoice-generated speeches on the AISHELL-3 test set in terms of character error rate (CER) and speaker similarity (SS). Paraformer-zh is employed as the ASR model.

As for the results in Chinese, the generated utterances of CosyVoice achieve a comparable CER, as well as comparable insertion and deletion errors, compared with the original utterances. It seems that ChatTTS has a better generation ability in Chinese than in English in terms of CER. Although ChatTTS and CosyVoice achieve a similar CER, ChatTTS produces more insertion and deletion errors. This is due to the problem of speaker leaking, where modal particles of another speaker are generated unexpectedly. On the contrary, CosyVoice does not suffer from this problem, producing much fewer insertion and deletion errors. With ASR re-ranking, CosyVoice reached a remarkably low CER of 1.84%. As seen with English, CosyVoice also exhibited greater speaker similarity than the original utterances, showcasing its effective voice-cloning proficiency.

5.5 CosyVoice as a Data Generator

A straightforward application of CosyVoice is as a data generator to augment the training data of other tasks, such as ASR and speech-to-speech translation (S2ST). Taking the ASR task as an example, we conduct an experiment on the Librispeech corpus to evaluate CosyVoice's capability in generating high-quality data. The experimental results are shown in Table 11, where "Librispeech" denotes the original 960-hour data, and "Syn on LS text" and "Syn on MLS text" denote the data generated with the text from the Librispeech and MLS training sets, respectively. From the table, we can see that, training only on the synthesized data, the ASR model can achieve a result comparable to training on the original Librispeech set. Upon integrating them, a notable enhancement in recognition accuracy is observed. An interesting finding is that involving the synthesized data on the MLS text significantly improves the recognition performance. This may indicate that text diversity is more critical for the ASR task than the duration of the speech itself. This improvement can be attributed to the varied linguistic content introduced by the CosyVoice-synthesized samples. The findings from our evaluation underscore the high quality of the samples generated by CosyVoice.
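The evaluation protocol used throughout Section 5 — WER for content consistency, raw cosine similarity for SS, and offline ASR re-ranking over sampled candidates — can be sketched as follows. The `asr` callable is a stand-in for a recognizer such as Whisper-Large V3 or Paraformer; this is an illustrative sketch, not the paper's evaluation code.

```python
import numpy as np

def wer(ref: str, hyp: str) -> float:
    """Word error rate via Levenshtein distance over words."""
    r, h = ref.split(), hyp.split()
    d = np.zeros((len(r) + 1, len(h) + 1), dtype=int)
    d[:, 0] = np.arange(len(r) + 1)
    d[0, :] = np.arange(len(h) + 1)
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i, j] = min(d[i - 1, j - 1] + (r[i - 1] != h[j - 1]),  # substitution
                          d[i - 1, j] + 1,                           # deletion
                          d[i, j - 1] + 1)                           # insertion
    return d[len(r), len(h)] / max(len(r), 1)

def speaker_similarity(a, b) -> float:
    """Raw cosine similarity between two speaker embeddings (the SS metric)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rerank(text, candidates, asr):
    """Offline ASR re-ranking: among several sampled utterances, keep the
    one whose transcript is closest to the input text."""
    return min(candidates, key=lambda wav: wer(text, asr(wav)))
```

With five candidates sampled under different seeds, `rerank` corresponds to the "5× re-ranking" rows of Tables 8 and 9.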
5.4 Emotion Controllability of CosyVoice

Table 10: Comparison of emotion control accuracy between CosyVoice-base-300M and CosyVoice-instruct-300M. "±" joins the mean and standard deviation for each evaluation metric.

Table 11: Evaluation on CosyVoice generation quality by treating it as a data generator. Word error rates (%) on the human-uttered test sets are employed as the evaluation metrics.
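The data-generation recipe behind Table 11 amounts to pairing text-only data with synthesized audio. A minimal sketch is given below; `tts` and `prompt_voices` are placeholders for a CosyVoice-style synthesizer and a pool of prompt speakers, not actual APIs.

```python
import random

def synthesize_asr_corpus(texts, tts, prompt_voices, seed=0):
    """Turn text-only data (e.g. Librispeech or MLS transcripts) into
    (audio, transcript) training pairs for an ASR model."""
    rng = random.Random(seed)
    corpus = []
    for text in texts:
        voice = rng.choice(prompt_voices)  # vary prompt speakers for coverage
        corpus.append((tts(text, voice), text))
    return corpus
```

Mixing such pairs into the real training set mirrors the "Librispeech + Syn" conditions of Table 11, where text diversity proved more important than sheer speech duration.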
Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. 2022. High fidelity neural audio compression. CoRR, abs/2210.13438.

Chenpeng Du, Yiwei Guo, Feiyu Shen, Zhijun Liu, Zheng Liang, Xie Chen, Shuai Wang, Hui Zhang, and Kai Yu. 2024. Unicats: A unified context-aware text-to-speech framework with contextual vq-diffusion and vocoding. In AAAI, pages 17924–17932. AAAI Press.

Zhifu Gao, Shiliang Zhang, Ian McLoughlin, and Zhijie Yan. 2022. Paraformer: Fast and accurate parallel transformer for non-autoregressive end-to-end speech recognition. In Interspeech, pages 2063–2067. ISCA.

Wenhao Guan, Qi Su, Haodong Zhou, Shiyu Miao, Xingjia Xie, Lin Li, and Qingyang Hong. 2023. Reflow-tts: A rectified flow model for high-fidelity text-to-speech. CoRR, abs/2309.17056.

Yiwei Guo, Chenpeng Du, Ziyang Ma, Xie Chen, and Kai Yu. 2023. Voiceflow: Efficient text-to-speech with rectified flow matching. CoRR, abs/2309.05027.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020).

Jonathan Ho and Tim Salimans. 2022a. Classifier-free diffusion guidance. CoRR, abs/2207.12598.

Jonathan Ho and Tim Salimans. 2022b. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598.

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans. Audio Speech Lang. Process., 29:3451–3460.

Shengpeng Ji, Jialong Zuo, Minghui Fang, Ziyue Jiang, Feiyang Chen, Xinyu Duan, Baoxing Huai, and Zhou Zhao. 2023. Textrolspeech: A text style control speech corpus with codec language text-to-speech models. CoRR, abs/2308.14430.

Eugene Kharitonov, Damien Vincent, Zalán Borsos, Raphaël Marinier, Sertan Girgin, Olivier Pietquin, Matt Sharifi, Marco Tagliasacchi, and Neil Zeghidour. 2023. Speak, read and prompt: High-fidelity text-to-speech with minimal supervision. Trans. Assoc. Comput. Linguistics, 11:1703–1718.

Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. 2020. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020).

Mateusz Lajszczak, Guillermo Cámbara, Yang Li, Fatih Beyhan, Arent van Korlaar, Fan Yang, Arnaud Joly, Álvaro Martín-Cortinas, Ammar Abbas, Adam Michalski, Alexis Moinet, Sri Karlapati, Ewa Muszynska, Haohan Guo, Bartosz Putrycz, Soledad López Gambino, Kayeon Yoo, Elena Sokolova, and Thomas Drugman. 2024. BASE TTS: Lessons from building a billion-parameter text-to-speech model on 100k hours of data. CoRR, abs/2402.08093.

Matthew Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, et al. 2024. Voicebox: Text-guided multilingual universal speech generation at scale. Advances in Neural Information Processing Systems, 36.
Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. 2023. Flow matching for generative modeling. In ICLR. OpenReview.net.

Ziyang Ma, Zhisheng Zheng, Jiaxin Ye, Jinchao Li, Zhifu Gao, Shiliang Zhang, and Xie Chen. 2023. emotion2vec: Self-supervised pre-training for speech emotion representation. CoRR, abs/2312.15185.

Shivam Mehta, Ruibo Tu, Jonas Beskow, Éva Székely, and Gustav Eje Henter. 2023. Matcha-tts: A fast TTS architecture with conditional flow matching. CoRR, abs/2309.03199.

Alexander Quinn Nichol and Prafulla Dhariwal. 2021. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR.

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. Librispeech: An ASR corpus based on public domain audio books. In ICASSP 2015, pages 5206–5210. IEEE.

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak supervision. In ICML 2023, volume 202 of Proceedings of Machine Learning Research, pages 28492–28518. PMLR.

Yao Shi, Hui Bu, Xin Xu, Shaoji Zhang, and Ming Li. 2021. AISHELL-3: A multi-speaker mandarin TTS corpus. In Interspeech, pages 2756–2760. ISCA.

Alexander Tong et al. 2023. Improving and generalizing flow-based generative models with minibatch optimal transport. In ICML Workshop on New Frontiers in Learning, Control, and Dynamical Systems.

Team TongyiSpeech. 2024. Funaudiollm: Voice understanding and generation foundation models for natural interaction between humans and llms.

Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, and Furu Wei. 2023. Neural codec language models are zero-shot text to speech synthesizers. CoRR, abs/2301.02111.

Dongchao Yang, Jinchuan Tian, Xu Tan, Rongjie Huang, Songxiang Liu, Xuankai Chang, Jiatong Shi, Sheng Zhao, Jiang Bian, Xixin Wu, Zhou Zhao, Shinji Watanabe, and Helen Meng. 2023. Uniaudio: An audio foundation model toward universal audio generation. CoRR, abs/2310.00704.

Lingxuan Ye, Changfeng Gao, Gaofeng Cheng, Liuping Luo, and Qingwei Zhao. 2024. ASQ: An ultra-low bit rate ASR-oriented speech quantization method. IEEE Signal Process. Lett., 31:221–225.

Heiga Zen, Viet Dang, Rob Clark, et al. 2019. LibriTTS: A corpus derived from LibriSpeech for text-to-speech. arXiv:1904.02882.