Combination of Deep Speaker Embeddings for Diarisation

Sun, Guangzhi; Zhang, Chao; Woodland, Phil

doi:10.1016/j.neunet.2021.04.020

arXiv:2010.12025 (cs)

[Submitted on 22 Oct 2020 (v1), last revised 7 May 2021 (this version, v3)]

Title: Combination of Deep Speaker Embeddings for Diarisation

Authors: Guangzhi Sun, Chao Zhang, Phil Woodland

Abstract: Significant progress has recently been made in speaker diarisation following the introduction of d-vectors, which are speaker embeddings extracted from neural network (NN) speaker classifiers and used to cluster speech segments. To extract better-performing and more robust speaker embeddings, this paper proposes a c-vector method that combines multiple sets of complementary d-vectors derived from systems with different NN components. Three structures are used to implement the c-vectors, namely 2D self-attentive, gated additive, and bilinear pooling structures, which rely on attention mechanisms, a gating mechanism, and a low-rank bilinear pooling mechanism respectively. Furthermore, a neural-based single-pass speaker diarisation pipeline is proposed, which uses NNs to perform voice activity detection, speaker change point detection, and speaker embedding extraction. Experiments and detailed analyses are conducted on the challenging AMI and NIST RT05 datasets, which consist of real meetings with 4-10 speakers and a wide range of acoustic conditions. For systems trained on the AMI training set, relative speaker error rate (SER) reductions of 13% and 29% are obtained by using c-vectors instead of d-vectors on the AMI dev and eval sets respectively, and a relative SER reduction of 15% is observed on RT05, which shows the robustness of the proposed methods. By incorporating VoxCeleb data into the training set, the best c-vector system achieved relative SER reductions of 7%, 17% and 16% compared to the d-vector system on the AMI dev, eval, and RT05 sets respectively.
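
To make the idea of combining complementary d-vectors more concrete, the sketch below shows one generic form of gated additive combination for two embeddings. It is an illustration only, not the paper's formulation: the function name `gated_additive_combine`, the per-dimension sigmoid gate computed from the concatenated inputs, and the `weights` and `bias` parameters are all assumptions introduced here, and in a real system the gate parameters would be learned jointly with the speaker classifier rather than supplied by hand.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_additive_combine(d1, d2, weights, bias):
    """Combine two equal-length embeddings with an element-wise gate.

    Hypothetical sketch, not the paper's exact structure:
      gate[i]     = sigmoid(weights[i] . concat(d1, d2) + bias[i])
      combined[i] = gate[i] * d1[i] + (1 - gate[i]) * d2[i]

    weights: one row per embedding dimension, each row of length 2 * dim
    bias:    one value per embedding dimension
    """
    if len(d1) != len(d2):
        raise ValueError("embeddings must have the same dimension")
    concat = list(d1) + list(d2)
    combined = []
    for i, (a, b) in enumerate(zip(d1, d2)):
        # Gate pre-activation depends on both input embeddings.
        pre = sum(w * x for w, x in zip(weights[i], concat)) + bias[i]
        g = sigmoid(pre)
        # Convex mix of the two embeddings in this dimension.
        combined.append(g * a + (1.0 - g) * b)
    return combined
```

With all-zero gate parameters every gate equals 0.5, so the result is the plain average of the two embeddings; a large positive bias pushes the gate towards 1 and the output towards the first embedding. Because each output dimension is a convex mix, it always lies between the two corresponding input values.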

Comments:	Manuscript accepted by Neural Networks
Subjects:	Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2010.12025 [cs.SD]
	(or arXiv:2010.12025v3 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2010.12025
Related DOI:	https://doi.org/10.1016/j.neunet.2021.04.020

Submission history

From: Guangzhi Sun
[v1] Thu, 22 Oct 2020 20:16:36 UTC (2,633 KB)
[v2] Thu, 6 May 2021 08:49:19 UTC (2,728 KB)
[v3] Fri, 7 May 2021 08:59:17 UTC (2,728 KB)
