Visual relationship attention for image captioning
2019 International Joint Conference on Neural Networks (IJCNN), 2019 · ieeexplore.ieee.org
Visual attention mechanisms have been broadly used by image captioning models to attend to related visual information dynamically, allowing fine-grained image understanding and reasoning. However, they are only designed to discover region-level alignment between visual features and language features. The exploration of higher-level visual relationship information between image regions, which is rarely researched in recent works, is beyond their capabilities. To fill this gap, we propose a novel visual relationship attention model based on a parallel attention mechanism under learnt spatial constraints. It extracts relationship information from visual regions and language and then achieves relationship-level alignment between them. By combining visual relationship attention and visual region attention to attend to related visual relationships and regions, respectively, our image captioning model achieves state-of-the-art performance on the MSCOCO dataset. Both quantitative and qualitative analyses demonstrate that our visual relationship attention model can capture related visual relationships and further improve caption quality.
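To make the idea concrete, the following is a minimal, hypothetical sketch of what attending to regions and to pairwise region relationships in parallel, under a spatial constraint, might look like at one decoder step. The module name, the concatenation of two region features as a pair representation, the binary spatial mask standing in for the learnt spatial constraints, and the way the two contexts are returned are all illustrative assumptions, not the authors' exact formulation.

```python
# Hypothetical sketch: parallel region-level and relationship-level attention
# for one captioning decoder step. Names, shapes, and the pair construction
# are assumptions made for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionAndRelationshipAttention(nn.Module):
    def __init__(self, region_dim, hidden_dim, attn_dim):
        super().__init__()
        # Region-level additive attention.
        self.region_proj = nn.Linear(region_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.region_score = nn.Linear(attn_dim, 1)
        # Relationship-level attention over region pairs; a pair feature is the
        # concatenation of the two region features (an assumption).
        self.pair_proj = nn.Linear(2 * region_dim, attn_dim)
        self.pair_score = nn.Linear(attn_dim, 1)

    def forward(self, regions, hidden, spatial_mask=None):
        # regions:      (B, N, region_dim) detected region features
        # hidden:       (B, hidden_dim)    decoder state (language feature)
        # spatial_mask: (B, N, N) 0/1 mask of spatially plausible region pairs,
        #               standing in for the paper's learnt spatial constraints
        B, N, D = regions.shape
        h = self.hidden_proj(hidden).unsqueeze(1)                     # (B, 1, attn_dim)

        # Region-level alignment between visual features and the language feature.
        region_logits = self.region_score(
            torch.tanh(self.region_proj(regions) + h)).squeeze(-1)    # (B, N)
        region_weights = F.softmax(region_logits, dim=-1)
        region_context = (region_weights.unsqueeze(-1) * regions).sum(1)

        # Relationship-level alignment over all ordered region pairs.
        pairs = torch.cat(
            [regions.unsqueeze(2).expand(B, N, N, D),
             regions.unsqueeze(1).expand(B, N, N, D)], dim=-1)        # (B, N, N, 2D)
        pair_logits = self.pair_score(
            torch.tanh(self.pair_proj(pairs) + h.unsqueeze(1))).squeeze(-1)
        if spatial_mask is not None:
            pair_logits = pair_logits.masked_fill(spatial_mask == 0, float("-inf"))
        pair_weights = F.softmax(pair_logits.view(B, -1), dim=-1).view(B, N, N)
        pair_context = (pair_weights.unsqueeze(-1) * pairs).sum(dim=(1, 2))

        # A decoder would consume both contexts (e.g. concatenated) to predict
        # the next word; how they are fused is another assumption here.
        return region_context, pair_context
```

The key design point this sketch tries to convey is that the two attention branches run in parallel over the same region features and the same language state, with the spatial mask restricting which region pairs can contribute to the relationship context.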