Abstractive summarization for video: A revisit in multistage fusion network with forget gate

N Liu, X Sun, H Yu, F Yao, G Xu, K Fu
IEEE Transactions on Multimedia, 2022 - ieeexplore.ieee.org
Multimodal abstractive summarization for videos is an emerging task that aims to generate a summary from multi-source information (i.e., video and audio transcript). The challenge is to merge long multimodal sequences so as to capture rich semantic information without letting noise from one lengthy modality sequence degrade the other modality and thus hurt the entire model. To address this issue, we propose a multistage fusion network with forget gate (MFFG), which selectively integrates multi-source information through cross-fusion in encoding and hierarchical fusion in decoding between modalities, and we design a fusion forget gate module to suppress the flow of potential multimodal noise from the multi-source long sequences. Meanwhile, considering that the source text in this task is lengthy and has the same distribution as the output summary text, we inherit part of the MFFG structure and propose a variant, the single-stage fusion network with forget gate (SFFG), which simplifies the fusion schema and leverages the long source text to enhance the representation of the target summary. Experimental results on the How2 and How2-300 datasets demonstrate the superiority of the two multimodal fusion methods. Further, we provide a version of the How2 dataset with ASR transcriptions to evaluate model performance under noisy scenarios, and experimental results show clear advantages of our proposed models over prior systems.
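The abstract does not give the equations of the fusion forget gate, so the following is only a minimal NumPy sketch of one plausible reading: text features attend over video features, and a sigmoid gate decides how much of the attended video signal is allowed into the fused representation. The names `cross_attention`, `gated_fusion`, `w_gate`, and `b_gate`, the residual addition, and all tensor sizes are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def cross_attention(query, context):
    # Scaled dot-product attention: each query step attends over all context steps.
    d = query.shape[-1]
    scores = query @ context.T / np.sqrt(d)      # (T_query, T_context)
    return softmax(scores, axis=-1) @ context    # (T_query, d)


def gated_fusion(text, video, w_gate, b_gate):
    # Hypothetical gate (an assumption, not the paper's formulation):
    # a sigmoid over [text; attended video] scales the video contribution,
    # so values near 0 suppress it and values near 1 let it through.
    attended = cross_attention(text, video)               # (T_text, d)
    gate_in = np.concatenate([text, attended], axis=-1)   # (T_text, 2d)
    gate = sigmoid(gate_in @ w_gate + b_gate)             # (T_text, d)
    return text + gate * attended, gate


# Toy inputs: 5 text steps and 7 video steps, feature size 8 (arbitrary sizes).
rng = np.random.default_rng(0)
d = 8
text = rng.normal(size=(5, d))
video = rng.normal(size=(7, d))
w_gate = rng.normal(size=(2 * d, d)) * 0.1
b_gate = np.zeros(d)

fused, gate = gated_fusion(text, video, w_gate, b_gate)

# A strongly negative bias drives the gate toward 0, so the fused output
# falls back to the text features alone.
closed, closed_gate = gated_fusion(text, video, w_gate, np.full(d, -50.0))
```

In this sketch the gate is computed per time step and per feature, so noise suppression can be selective rather than all-or-nothing; whether MFFG gates at that granularity is not stated in the abstract.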