Large-scale Self-Supervised Speech Representation Learning for Automatic Speaker Verification

Chen, Zhengyang; Chen, Sanyuan; Wu, Yu; Qian, Yao; Wang, Chengyi; Liu, Shujie; Qian, Yanmin; Zeng, Michael

arXiv:2110.05777 (cs)

[Submitted on 12 Oct 2021 (v1), last revised 24 Jan 2022 (this version, v2)]

Abstract: Speech representations learned from large-scale unlabeled data have shown better generalizability than those from supervised learning and have therefore attracted considerable interest for a range of downstream tasks. In this paper, we explore the limits of speech representations learned with different self-supervised objectives and datasets for automatic speaker verification (ASV), using a widely recognized state-of-the-art ASV model, ECAPA-TDNN [1], as the downstream model. The representations from all hidden layers of the pre-trained model are first averaged with learnable weights and then fed into the ECAPA-TDNN as input features. Experimental results on the VoxCeleb dataset show that the weighted average representation is significantly superior to FBank, a conventional handcrafted feature for ASV. Our best single system achieves equal error rates (EER) of 0.537%, 0.569%, and 1.180% on the three official trials of VoxCeleb1, respectively. An ensemble of three pre-trained models further reduces the EER to 0.479%, 0.536%, and 1.023%. Of the three evaluation trials, our best system outperforms the winning system [2] of the VoxCeleb Speaker Recognition Challenge 2021 (VoxSRC2021) on the VoxCeleb1-E trial.
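
The layer-weighting step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract says only that the hidden-layer representations are averaged with learnable weights, so the softmax normalization, the function names, and the toy shapes below are all assumptions made for illustration. Only the forward computation is shown; in a real system the per-layer scores would be trained jointly with the downstream model.

```python
import math

def softmax(logits):
    """Turn unconstrained per-layer scores into positive weights that sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_layer_average(hidden_states, layer_logits):
    """Combine per-layer features into one feature sequence.

    hidden_states: nested list indexed as [layer][frame][dim]
    layer_logits:  one score per layer (the learnable parameters;
                   softmax normalization is an assumption of this sketch)
    Returns a nested list indexed as [frame][dim].
    """
    weights = softmax(layer_logits)
    num_frames = len(hidden_states[0])
    dim = len(hidden_states[0][0])
    return [
        [
            sum(w * layer[t][d] for w, layer in zip(weights, hidden_states))
            for d in range(dim)
        ]
        for t in range(num_frames)
    ]

# Toy input (hypothetical sizes): 3 layers, 2 frames, 2 feature dimensions.
hidden_states = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[3.0, 3.0], [3.0, 3.0]],
    [[6.0, 6.0], [6.0, 6.0]],
]
layer_logits = [0.0, 0.0, 0.0]  # equal scores reduce to a plain mean over layers
features = weighted_layer_average(hidden_states, layer_logits)
```

With equal scores the result is the unweighted mean of the layers; training would shift weight toward the layers most useful for speaker verification. The combined `features` sequence takes the place of FBank as the input to the downstream model.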

Comments:	Accepted by ICASSP 2022
Subjects:	Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2110.05777 [cs.SD]
	(or arXiv:2110.05777v2 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2110.05777

Submission history

From: Zhengyang Chen
[v1] Tue, 12 Oct 2021 07:15:21 UTC (1,628 KB)
[v2] Mon, 24 Jan 2022 12:07:23 UTC (1,629 KB)
