Neighborhood Attention Transformer with Progressive Channel Fusion for Speaker Verification

Li, Nian; Wei, Jianguo

arXiv:2405.12031 (cs)

[Submitted on 20 May 2024 (v1), last revised 30 May 2024 (this version, v2)]

Abstract: Transformer-based architectures for speaker verification typically require more training data than ECAPA-TDNN. Therefore, recent work has generally been trained on VoxCeleb1 and VoxCeleb2. We propose a backbone network based on self-attention, which can achieve competitive results when trained on VoxCeleb2 alone. The network alternates between neighborhood attention and global attention to capture local and global features, then aggregates features of different hierarchical levels, and finally performs attentive statistics pooling. Additionally, we employ a progressive channel fusion strategy to expand the receptive field in the channel dimension as the network deepens. We trained the proposed PCF-NAT model on VoxCeleb2 and evaluated it on VoxCeleb1 and the validation sets of VoxSRC. The EER and minDCF of the shallow PCF-NAT are on average more than 20% lower than those of similarly sized ECAPA-TDNN. Deep PCF-NAT achieves an EER lower than 0.5% on VoxCeleb1-O. The code and models are publicly available at this https URL.
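
The alternation between neighborhood attention and global attention described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a 1D sequence of frames, a single head, no learned query/key/value projections, and it omits progressive channel fusion and pooling entirely. The function names (neighborhood_indices, attention, alternate) and the edge-clamping rule are illustrative assumptions, chosen so that every frame attends to exactly k neighbors.

```python
import math


def neighborhood_indices(i, length, k):
    """Indices of a k-frame window around position i (hypothetical helper).

    Assumes k <= length. Near the edges the window is shifted inward so
    that every position still sees exactly k frames.
    """
    start = min(max(i - k // 2, 0), length - k)
    return list(range(start, start + k))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(x, k=None):
    """Single-head scaled dot-product attention over a list of vectors.

    k=None gives global attention (every frame attends to all frames);
    otherwise each frame attends only to its k-frame neighborhood.
    Queries, keys, and values are the inputs themselves (an assumption
    made to keep the sketch short).
    """
    n, d = len(x), len(x[0])
    out = []
    for i in range(n):
        idx = list(range(n)) if k is None else neighborhood_indices(i, n, k)
        scores = [
            sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d) for j in idx
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * x[j][c] for w, j in zip(weights, idx)) for c in range(d)]
        )
    return out


def alternate(x, depth, k):
    """Apply `depth` layers, alternating neighborhood and global attention."""
    for layer in range(depth):
        x = attention(x, k if layer % 2 == 0 else None)
    return x


if __name__ == "__main__":
    frames = [[0.1 * t, 1.0 - 0.1 * t] for t in range(6)]
    mixed = alternate(frames, depth=4, k=3)
    print(len(mixed), len(mixed[0]))
```

With k equal to the sequence length, the neighborhood window covers every frame, so the local layer reduces to the global one; smaller k restricts each frame to nearby context, which is the local/global split the abstract refers to.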

Comments:	8 pages, 2 figures, 3 tables; added github link
Subjects:	Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2405.12031 [cs.SD]
	(or arXiv:2405.12031v2 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2405.12031

Submission history

From: Nian Li
[v1] Mon, 20 May 2024 13:55:19 UTC (706 KB)
[v2] Thu, 30 May 2024 02:37:51 UTC (706 KB)
