Learning to Predict Salient Faces: A Novel Visual-Audio Saliency Model

Liu, Yufan; Qiao, Minglang; Xu, Mai; Li, Bing; Hu, Weiming; Borji, Ali

doi:10.1007/978-3-030-58565-5_25

arXiv:2103.15438 (cs)

[Submitted on 29 Mar 2021]

Abstract: Recently, video streams have come to occupy a large proportion of Internet traffic, and most of them contain human faces. Hence, it is necessary to predict saliency on multiple-face videos, which can provide attention cues for many content-based applications. However, most multiple-face saliency prediction methods consider only visual information and ignore audio, which is inconsistent with naturalistic scenarios. Several behavioral studies have established that sound influences human attention, especially during speech turn-taking in multiple-face videos. In this paper, we thoroughly investigate such influences by establishing a large-scale eye-tracking database of Multiple-face Video in Visual-Audio condition (MVVA). Inspired by the findings of our investigation, we propose a novel multi-modal video saliency model consisting of three branches: visual, audio and face. The visual branch takes the RGB frames as input and encodes them into visual feature maps. The audio and face branches encode the audio signal and multiple cropped faces, respectively. A fusion module is introduced to integrate the information from the three modalities and to generate the final saliency map. Experimental results show that the proposed method outperforms 11 state-of-the-art saliency prediction methods and that its predictions are closer to human multi-modal attention.
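
As a rough illustration of the three-branch idea in the abstract, the sketch below fuses three per-location maps (one each for the visual, audio and face cues) into a single normalized saliency map. It is not the authors' model: the abstract does not describe the internals of the branches or of the fusion module, so the weighted-sum fusion, the function names `normalize`, `fuse` and `peak`, the 2x2 toy grids and the equal weights are all assumptions made purely for illustration.

```python
from typing import List, Tuple

Grid = List[List[float]]  # a tiny stand-in for a per-pixel feature or saliency map


def normalize(grid: Grid) -> Grid:
    """Scale a non-negative grid so that its entries sum to 1."""
    total = sum(sum(row) for row in grid)
    if total == 0:
        # No evidence anywhere: fall back to a uniform map.
        n = len(grid) * len(grid[0])
        return [[1.0 / n for _ in row] for row in grid]
    return [[value / total for value in row] for row in grid]


def fuse(visual: Grid, audio: Grid, face: Grid,
         weights: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3)) -> Grid:
    """Hypothetical fusion: weighted sum of the three maps, then normalization.

    This stands in for the paper's fusion module, whose actual design is
    not given in the abstract.
    """
    w_visual, w_audio, w_face = weights
    fused = [
        [w_visual * v + w_audio * a + w_face * f
         for v, a, f in zip(row_v, row_a, row_f)]
        for row_v, row_a, row_f in zip(visual, audio, face)
    ]
    return normalize(fused)


def peak(grid: Grid) -> Tuple[int, int]:
    """Return the (row, column) index of the largest entry."""
    return max(
        ((r, c) for r in range(len(grid)) for c in range(len(grid[0]))),
        key=lambda rc: grid[rc[0]][rc[1]],
    )


# Toy 2x2 scene (made-up numbers): two faces in the top row,
# with the face at the top right currently speaking.
visual_map = [[1.0, 1.0], [1.0, 1.0]]   # flat visual response
audio_map = [[0.0, 4.0], [0.0, 0.0]]    # sound localized at the top right
face_map = [[2.0, 2.0], [0.0, 0.0]]     # both faces detected in the top row

saliency = fuse(visual_map, audio_map, face_map)
print(saliency)
print(peak(saliency))
```

In this toy setting the face cue alone cannot distinguish the two faces, and the audio cue is what shifts the peak toward the speaker, which mirrors the abstract's motivation for adding audio to multiple-face saliency prediction.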

Comments:	Published as an ECCV2020 paper
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2103.15438 [cs.CV]
	(or arXiv:2103.15438v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2103.15438
Related DOI:	https://doi.org/10.1007/978-3-030-58565-5_25

Submission history

From: Yufan Liu
[v1] Mon, 29 Mar 2021 09:09:39 UTC (9,564 KB)
