DSAMT: Dual-Source Aligned Multimodal Transformers for TextCaps

C. Liao, R. Liu, S. Gao - 2021 7th IEEE International Conference on Network Intelligence and Digital Content (IC-NIDC), 2021 - ieeexplore.ieee.org
When generating captions for images, previous captioning methods tend to consider the visual features of the image but ignore the text captured by Optical Character Recognition (OCR), so the generated captions lack the textual information present in the image. By integrating the OCR modality as well as the visual modality into caption prediction, the TextCaps task aims to produce concise sentences that recapitulate both the image and its text. We propose Dual-Source Aligned Multimodal Transformers (DSAMT), which utilize words from two sources (object tags and OCR tokens) as a supplement to the vocabulary. These extra words are used to align the caption embedding and the visual embedding by randomly masking some tokens in the caption and computing a masked-token loss. A new object detection module is used in DSAMT to extract image visual features and object tags on TextCaps. We additionally use BERTScore to evaluate our predictions. We demonstrate that our approach achieves superior results compared to state-of-the-art models on the TextCaps dataset.
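The abstract describes aligning caption and visual embeddings by randomly masking caption tokens and scoring a masked-token loss. The sketch below is a minimal illustration of that general mechanism under our own assumptions, not the authors' implementation; names such as `mask_caption_tokens`, `MaskedTokenHead`, `MASK_ID`, and `mask_prob` are hypothetical.

```python
# Minimal sketch (assumption, not DSAMT's actual code): mask a fraction of
# caption tokens and compute a masked-token cross-entropy loss, so that
# caption representations must be recoverable from the joint context
# (caption + OCR tokens + object tags + visual features).
import torch
import torch.nn as nn

MASK_ID = 103          # hypothetical [MASK] token id
IGNORE_INDEX = -100    # unmasked positions do not contribute to the loss

def mask_caption_tokens(caption_ids: torch.Tensor, mask_prob: float = 0.15):
    """Randomly replace caption tokens with [MASK]; return masked inputs and labels."""
    labels = caption_ids.clone()
    mask = torch.rand_like(caption_ids, dtype=torch.float) < mask_prob
    inputs = caption_ids.clone()
    inputs[mask] = MASK_ID
    labels[~mask] = IGNORE_INDEX          # only masked positions are scored
    return inputs, labels

class MaskedTokenHead(nn.Module):
    """Predict the original token ids at the masked caption positions."""
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.decoder = nn.Linear(hidden_size, vocab_size)
        self.loss_fn = nn.CrossEntropyLoss(ignore_index=IGNORE_INDEX)

    def forward(self, caption_states: torch.Tensor, labels: torch.Tensor):
        # caption_states: (batch, seq_len, hidden) contextual states from the
        # multimodal transformer; labels: (batch, seq_len) from mask_caption_tokens.
        logits = self.decoder(caption_states)
        return self.loss_fn(logits.view(-1, logits.size(-1)), labels.view(-1))
```

In this reading, the masked-token loss acts as an auxiliary training objective alongside caption generation, encouraging the model to ground caption words in the OCR and visual evidence.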