Storytelling from an Image Stream Using Scene Graphs

Ruize Wang; Zhongyu Wei; Piji Li; Qi Zhang; Xuanjing Huang

doi:10.1609/aaai.v34i05.6455

Authors

Ruize Wang, Fudan University
Zhongyu Wei, Fudan University
Piji Li, Tencent AI Lab
Qi Zhang, Fudan University
Xuanjing Huang, Fudan University

DOI:

https://doi.org/10.1609/aaai.v34i05.6455

Abstract

Visual storytelling aims to generate a story from an image stream. Most existing methods represent images directly with extracted high-level features, which is neither intuitive nor easy to interpret. We argue that translating each image into a graph-based semantic representation, i.e., a scene graph, which explicitly encodes the objects and relationships detected within an image, would benefit both representing and describing images. To this end, we propose a novel graph-based architecture for visual storytelling that models two levels of relationships on scene graphs. On the within-image level, we employ a Graph Convolution Network (GCN) to enrich the local, fine-grained region representations of objects on scene graphs. On the cross-image level, to further model the interaction among images, a Temporal Convolution Network (TCN) refines the region representations along the temporal dimension. The resulting relation-aware representations are then fed into a Gated Recurrent Unit (GRU) with an attention mechanism for story generation. Experiments are conducted on the public visual storytelling dataset. Automatic and human evaluation results indicate that our method achieves state-of-the-art performance.
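
The two-level modeling described in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the function names (`gcn_layer`, `mean_pool`, `temporal_conv`), the mean-aggregation rule, the undirected treatment of relationships, the pooling of each image to a single vector, and the toy inputs are all assumptions made for illustration. The GRU decoder with attention is omitted.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(features: Matrix, edges: List[Tuple[int, int]], weight: Matrix) -> Matrix:
    """Within-image step: average each node with its neighbours, apply a linear map, then ReLU."""
    n = len(features)
    neighbours = [{i} for i in range(n)]  # every node starts with a self-loop
    for src, dst in edges:  # assumption: relationships are treated as undirected
        neighbours[src].add(dst)
        neighbours[dst].add(src)
    dim = len(features[0])
    aggregated = [
        [sum(features[j][d] for j in nbrs) / len(nbrs) for d in range(dim)]
        for nbrs in neighbours
    ]
    return [[max(0.0, v) for v in row] for row in matmul(aggregated, weight)]


def mean_pool(features: Matrix) -> Vector:
    """Collapse the node features of one image into a single vector (a simplification)."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]


def temporal_conv(sequence: Matrix, kernel: Vector) -> Matrix:
    """Cross-image step: zero-padded 1D convolution along the image axis,
    sharing one kernel across all feature dimensions."""
    pad = len(kernel) // 2
    dim = len(sequence[0])
    out = []
    for t in range(len(sequence)):
        row = [0.0] * dim
        for k, w in enumerate(kernel):
            s = t + k - pad
            if 0 <= s < len(sequence):
                for d in range(dim):
                    row[d] += w * sequence[s][d]
        out.append(row)
    return out


# Hypothetical toy stream: two images, each with three objects (2-d features)
# and a list of (subject, object) relationship index pairs.
image_graphs = [
    ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [(0, 1), (1, 2)]),
    ([[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]], [(0, 2)]),
]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for a learned weight matrix

per_image = [mean_pool(gcn_layer(f, e, identity)) for f, e in image_graphs]
refined = temporal_conv(per_image, [0.25, 0.5, 0.25])
```

In this sketch `refined` holds one relation-aware vector per image. In the architecture the abstract describes, region-level representations rather than pooled vectors would be passed to the attention-based GRU decoder.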
