Visual relationship attention for image captioning
2019 International Joint Conference on Neural Networks (IJCNN), 2019 · ieeexplore.ieee.org
Visual attention mechanisms have been broadly used by image captioning models to attend to related visual information dynamically, allowing fine-grained image understanding and reasoning. However, they are only designed to discover region-level alignment between visual features and language features. The exploration of higher-level visual relationship information between image regions, which is rarely researched in recent works, is beyond their capabilities. To fill this gap, we propose a novel visual relationship attention model based on a parallel attention mechanism under learnt spatial constraints. It extracts relationship information from visual regions and language and then achieves relationship-level alignment between them. By combining visual relationship attention and visual region attention to attend to related visual relationships and regions, respectively, our image captioning model achieves state-of-the-art performance on the MSCOCO dataset. Both quantitative and qualitative analyses demonstrate that our visual relationship attention model can capture related visual relationships and further improve caption quality.
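To make the idea concrete, the following is a minimal, hypothetical sketch of what attending to regions and to pairwise region relationships in parallel, under a spatial constraint, might look like at one decoder step. The module name, the concatenation of two region features as a pair representation, the binary spatial mask standing in for the learnt spatial constraints, and the way the two contexts are returned are all illustrative assumptions, not the authors' exact formulation.

```python
# Hypothetical sketch: parallel region-level and relationship-level attention
# for one captioning decoder step. Names, shapes, and the pair construction
# are assumptions made for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionAndRelationshipAttention(nn.Module):
    def __init__(self, region_dim, hidden_dim, attn_dim):
        super().__init__()
        # Region-level additive attention.
        self.region_proj = nn.Linear(region_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.region_score = nn.Linear(attn_dim, 1)
        # Relationship-level attention over region pairs; a pair feature is the
        # concatenation of the two region features (an assumption).
        self.pair_proj = nn.Linear(2 * region_dim, attn_dim)
        self.pair_score = nn.Linear(attn_dim, 1)

    def forward(self, regions, hidden, spatial_mask=None):
        # regions:      (B, N, region_dim) detected region features
        # hidden:       (B, hidden_dim)    decoder state (language feature)
        # spatial_mask: (B, N, N) 0/1 mask of spatially plausible region pairs,
        #               standing in for the paper's learnt spatial constraints
        B, N, D = regions.shape
        h = self.hidden_proj(hidden).unsqueeze(1)                     # (B, 1, attn_dim)

        # Region-level alignment between visual features and the language feature.
        region_logits = self.region_score(
            torch.tanh(self.region_proj(regions) + h)).squeeze(-1)    # (B, N)
        region_weights = F.softmax(region_logits, dim=-1)
        region_context = (region_weights.unsqueeze(-1) * regions).sum(1)

        # Relationship-level alignment over all ordered region pairs.
        pairs = torch.cat(
            [regions.unsqueeze(2).expand(B, N, N, D),
             regions.unsqueeze(1).expand(B, N, N, D)], dim=-1)        # (B, N, N, 2D)
        pair_logits = self.pair_score(
            torch.tanh(self.pair_proj(pairs) + h.unsqueeze(1))).squeeze(-1)
        if spatial_mask is not None:
            pair_logits = pair_logits.masked_fill(spatial_mask == 0, float("-inf"))
        pair_weights = F.softmax(pair_logits.view(B, -1), dim=-1).view(B, N, N)
        pair_context = (pair_weights.unsqueeze(-1) * pairs).sum(dim=(1, 2))

        # A decoder would consume both contexts (e.g. concatenated) to predict
        # the next word; how they are fused is another assumption here.
        return region_context, pair_context
```

The key design point this sketch tries to convey is that the two attention branches run in parallel over the same region features and the same language state, with the spatial mask restricting which region pairs can contribute to the relationship context.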