UniCon: Unified Context Network for Robust Active Speaker Detection

Zhang, Yuanhang; Liang, Susan; Yang, Shuang; Liu, Xiao; Wu, Zhongqin; Shan, Shiguang; Chen, Xilin

doi:10.1145/3474085.3475275

arXiv:2108.02607 (cs)

[Submitted on 5 Aug 2021]

Abstract: We introduce a new efficient framework, the Unified Context Network (UniCon), for robust active speaker detection (ASD). Traditional methods for ASD usually operate on each candidate's pre-cropped face track separately and do not sufficiently consider the relationships among the candidates. This potentially limits performance, especially in challenging scenarios with low-resolution faces, multiple candidates, etc. Our solution is a novel, unified framework that focuses on jointly modeling multiple types of contextual information: spatial context to indicate the position and scale of each candidate's face, relational context to capture the visual relationships among the candidates and contrast audio-visual affinities with each other, and temporal context to aggregate long-term information and smooth out local uncertainties. Based on such information, our model optimizes all candidates in a unified process for robust and reliable ASD. A thorough ablation study is performed on several challenging ASD benchmarks under different settings. In particular, our method outperforms the state-of-the-art by a large margin of about 15% mean Average Precision (mAP) absolute on two challenging subsets: one with three candidate speakers, and the other with faces smaller than 64 pixels. Together, our UniCon achieves 92.0% mAP on the AVA-ActiveSpeaker validation set, surpassing 90% for the first time on this challenging dataset at the time of submission. Project website: this https URL.
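
As a rough illustration of the three context types named in the abstract, the toy sketch below computes a hand-crafted stand-in for each one. It is not the UniCon model, which learns these representations with neural networks. The names `FaceBox`, `spatial_context`, `relational_context`, and `temporal_context`, and the specific formulas (normalized box coordinates, affinity minus the mean of the other candidates, centered moving average), are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FaceBox:
    """Hypothetical face bounding box in pixel coordinates."""
    x1: float
    y1: float
    x2: float
    y2: float


def spatial_context(box: FaceBox, frame_w: float, frame_h: float) -> Tuple[float, float, float, float]:
    """Assumed encoding: normalized center (cx, cy) and size (w, h) of a face."""
    cx = (box.x1 + box.x2) / 2 / frame_w
    cy = (box.y1 + box.y2) / 2 / frame_h
    w = (box.x2 - box.x1) / frame_w
    h = (box.y2 - box.y1) / frame_h
    return (cx, cy, w, h)


def relational_context(affinities: List[float]) -> List[float]:
    """Assumed contrast: each candidate's audio-visual affinity minus the
    mean affinity of the other candidates in the same frame."""
    n = len(affinities)
    if n == 1:
        return list(affinities)  # nothing to contrast against
    total = sum(affinities)
    return [a - (total - a) / (n - 1) for a in affinities]


def temporal_context(scores: List[float], window: int = 3) -> List[float]:
    """Assumed smoothing: centered moving average over per-frame scores,
    with the window truncated at the sequence boundaries."""
    half = window // 2
    smoothed = []
    for t in range(len(scores)):
        lo = max(0, t - half)
        hi = min(len(scores), t + half + 1)
        smoothed.append(sum(scores[lo:hi]) / (hi - lo))
    return smoothed


if __name__ == "__main__":
    # One face in a 1000x500 frame, three candidates, five frames of scores.
    print(spatial_context(FaceBox(100, 50, 300, 250), 1000, 500))
    print(relational_context([2.0, 0.0, 1.0]))
    print(temporal_context([0.0, 1.0, 0.0, 1.0, 1.0]))
```

In this simplification, the spatial features let a downstream scorer distinguish small or off-center faces, the contrast step expresses each candidate's evidence relative to the others rather than in isolation, and the moving average damps isolated per-frame flips.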

Comments:	10 pages, 6 figures; to appear at ACM Multimedia 2021
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV)
Cite as:	arXiv:2108.02607 [cs.CV]
	(or arXiv:2108.02607v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2108.02607
Related DOI:	https://doi.org/10.1145/3474085.3475275

Submission history

From: Yuanhang Zhang
[v1] Thu, 5 Aug 2021 13:25:44 UTC (2,335 KB)
