Enhancing No-Reference Audio-Visual Quality Assessment via Joint Cross-Attention Fusion
As the consumption of multimedia content continues to rise, audio and video have become central to everyday entertainment and social interactions. This growing reliance amplifies the demand for effective, objective audio-visual quality assessment (AVQA) that captures the interaction between audio and visual elements, ultimately enhancing user satisfaction. However, existing state-of-the-art AVQA methods often rely on simplistic machine learning models or fully connected networks for audio-visual signal fusion, which limits their ability to exploit the complementary nature of the two modalities. To address this gap, we propose a novel no-reference AVQA method that utilizes joint cross-attention fusion of audio-visual perception. Our approach begins with a dual-stream feature extraction process that simultaneously captures long-range spatiotemporal visual features and audio features. The fusion model then dynamically adjusts the contribution of each modality, integrating them into a more comprehensive perceptual representation for quality score prediction. Experimental results on the LIVE-SJTU and UnB-AVC datasets demonstrate that our model outperforms state-of-the-art methods in audio-visual quality assessment.
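To make the fusion step concrete, the following is a minimal PyTorch sketch of a joint cross-attention fusion block in the spirit the abstract describes: each modality attends to the joint (concatenated) audio-visual sequence, and a learned gate dynamically weights each modality's contribution before quality regression. All dimensions, module names, and pooling choices here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of joint cross-attention fusion for no-reference AVQA.
# Dimensions, gating, and pooling are assumed for illustration only.
import torch
import torch.nn as nn


class JointCrossAttentionFusion(nn.Module):
    """Fuses audio and visual feature sequences via cross-attention over
    their joint representation, then regresses a quality score."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Each modality queries the joint audio-visual sequence.
        self.audio_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.visual_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learned gate that adjusts each modality's contribution dynamically.
        self.gate = nn.Sequential(nn.Linear(2 * dim, 2), nn.Softmax(dim=-1))
        self.head = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio:  (B, Ta, dim) per-segment audio embeddings
        # visual: (B, Tv, dim) per-clip spatiotemporal visual embeddings
        joint = torch.cat([audio, visual], dim=1)      # (B, Ta+Tv, dim)
        a, _ = self.audio_attn(audio, joint, joint)    # audio attends to joint
        v, _ = self.visual_attn(visual, joint, joint)  # visual attends to joint
        a, v = a.mean(dim=1), v.mean(dim=1)            # temporal pooling -> (B, dim)
        w = self.gate(torch.cat([a, v], dim=-1))       # (B, 2) modality weights
        fused = w[:, :1] * a + w[:, 1:] * v            # gated weighted fusion
        return self.head(fused).squeeze(-1)            # (B,) predicted quality scores


# Usage: fuse dummy features for a batch of 2 clips.
model = JointCrossAttentionFusion()
scores = model(torch.randn(2, 8, 256), torch.randn(2, 16, 256))
print(scores.shape)  # torch.Size([2])
```

The gate is one simple way to realize the "dynamically adjusts the contributions of features from both modalities" behavior; the paper may instead use attention weights or another weighting scheme.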