@inproceedings{huang-etal-2020-multimodal,
title = "Multimodal Pretraining for Dense Video Captioning",
author = "Huang, Gabriel and
Pang, Bo and
Zhu, Zhenhai and
Rivera, Clara and
Soricut, Radu",
editor = "Wong, Kam-Fai and
Knight, Kevin and
Wu, Hua",
booktitle = "Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing",
month = dec,
year = "2020",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2020.aacl-main.48",
pages = "470--490",
abstract = "Learning specific hands-on skills such as cooking, car maintenance, and home repairs increasingly happens via instructional videos. The user experience with such videos is known to be improved by meta-information such as time-stamped annotations for the main steps involved. Generating such annotations automatically is challenging, and we describe here two relevant contributions. First, we construct and release a new dense video captioning dataset, Video Timeline Tags (ViTT), featuring a variety of instructional videos together with time-stamped annotations. Second, we explore several multimodal sequence-to-sequence pretraining strategies that leverage large unsupervised datasets of videos and caption-like texts. We pretrain and subsequently finetune dense video captioning models using both YouCook2 and ViTT. We show that such models generalize well and are robust over a wide variety of instructional videos.",
}
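As a usage sketch (not part of the original record): the entry above can be dropped into a .bib file and cited by its key, huang-etal-2020-multimodal. The file names sketch.tex and anthology.bib below are assumptions made only for illustration, and natbib is just one common citation package choice.

% sketch.tex -- minimal, hypothetical example; assumes the BibTeX entry above is saved as anthology.bib
\documentclass{article}
\usepackage[round]{natbib}  % any citation package would do; natbib shown here
\begin{document}
Dense video captioning benefits from multimodal pretraining \citep{huang-etal-2020-multimodal}.
\bibliographystyle{plainnat}
\bibliography{anthology}    % resolves the citation key against anthology.bib (hypothetical file name)
\end{document}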
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="huang-etal-2020-multimodal">
<titleInfo>
<title>Multimodal Pretraining for Dense Video Captioning</title>
</titleInfo>
<name type="personal">
<namePart type="given">Gabriel</namePart>
<namePart type="family">Huang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Bo</namePart>
<namePart type="family">Pang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Zhenhai</namePart>
<namePart type="family">Zhu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Clara</namePart>
<namePart type="family">Rivera</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Radu</namePart>
<namePart type="family">Soricut</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2020-12</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing</title>
</titleInfo>
<name type="personal">
<namePart type="given">Kam-Fai</namePart>
<namePart type="family">Wong</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Kevin</namePart>
<namePart type="family">Knight</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Hua</namePart>
<namePart type="family">Wu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Suzhou, China</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
</relatedItem>
<abstract>Learning specific hands-on skills such as cooking, car maintenance, and home repairs increasingly happens via instructional videos. The user experience with such videos is known to be improved by meta-information such as time-stamped annotations for the main steps involved. Generating such annotations automatically is challenging, and we describe here two relevant contributions. First, we construct and release a new dense video captioning dataset, Video Timeline Tags (ViTT), featuring a variety of instructional videos together with time-stamped annotations. Second, we explore several multimodal sequence-to-sequence pretraining strategies that leverage large unsupervised datasets of videos and caption-like texts. We pretrain and subsequently finetune dense video captioning models using both YouCook2 and ViTT. We show that such models generalize well and are robust over a wide variety of instructional videos.</abstract>
<identifier type="citekey">huang-etal-2020-multimodal</identifier>
<location>
<url>https://aclanthology.org/2020.aacl-main.48</url>
</location>
<part>
<date>2020-12</date>
<extent unit="page">
<start>470</start>
<end>490</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Multimodal Pretraining for Dense Video Captioning
%A Huang, Gabriel
%A Pang, Bo
%A Zhu, Zhenhai
%A Rivera, Clara
%A Soricut, Radu
%Y Wong, Kam-Fai
%Y Knight, Kevin
%Y Wu, Hua
%S Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing
%D 2020
%8 December
%I Association for Computational Linguistics
%C Suzhou, China
%F huang-etal-2020-multimodal
%X Learning specific hands-on skills such as cooking, car maintenance, and home repairs increasingly happens via instructional videos. The user experience with such videos is known to be improved by meta-information such as time-stamped annotations for the main steps involved. Generating such annotations automatically is challenging, and we describe here two relevant contributions. First, we construct and release a new dense video captioning dataset, Video Timeline Tags (ViTT), featuring a variety of instructional videos together with time-stamped annotations. Second, we explore several multimodal sequence-to-sequence pretraining strategies that leverage large unsupervised datasets of videos and caption-like texts. We pretrain and subsequently finetune dense video captioning models using both YouCook2 and ViTT. We show that such models generalize well and are robust over a wide variety of instructional videos.
%U https://aclanthology.org/2020.aacl-main.48
%P 470-490
Markdown (Informal)
[Multimodal Pretraining for Dense Video Captioning](https://aclanthology.org/2020.aacl-main.48) (Huang et al., AACL 2020)
ACL
Gabriel Huang, Bo Pang, Zhenhai Zhu, Clara Rivera, and Radu Soricut. 2020. Multimodal Pretraining for Dense Video Captioning. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 470–490, Suzhou, China. Association for Computational Linguistics.