DSAMT: Dual-Source Aligned Multimodal Transformers for TextCaps

C. Liao, R. Liu, S. Gao - 2021 7th IEEE International Conference on Network Intelligence and Digital Content (IC-NIDC), 2021 - ieeexplore.ieee.org
When generating captions for images, previous captioning methods tend to consider the visual features of the image but ignore the text captured by Optical Character Recognition (OCR), so the generated captions lack the textual information present in the image. By integrating the OCR modality as well as the visual modality into caption prediction, the TextCaps task aims to produce concise sentences that recapitulate both the image and its text. We propose Dual-Source Aligned Multimodal Transformers (DSAMT), which utilize words from two sources (object tags and OCR tokens) as a supplement to the vocabulary. These extra words are used to align the caption embedding and the visual embedding by randomly masking some tokens in the caption and computing a masked-token loss. A new object detection module is used in DSAMT to extract image visual features and object tags on TextCaps. We additionally use BERTScore to evaluate our predictions. We demonstrate that our approach achieves superior results compared to state-of-the-art models on the TextCaps dataset.
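The abstract describes aligning caption and visual embeddings by randomly masking caption tokens and scoring a masked-token loss. The sketch below is a minimal illustration of that general mechanism under our own assumptions, not the authors' implementation; names such as `mask_caption_tokens`, `MaskedTokenHead`, `MASK_ID`, and `mask_prob` are hypothetical.

```python
# Minimal sketch (assumption, not DSAMT's actual code): mask a fraction of
# caption tokens and compute a masked-token cross-entropy loss, so that
# caption representations must be recoverable from the joint context
# (caption + OCR tokens + object tags + visual features).
import torch
import torch.nn as nn

MASK_ID = 103          # hypothetical [MASK] token id
IGNORE_INDEX = -100    # unmasked positions do not contribute to the loss

def mask_caption_tokens(caption_ids: torch.Tensor, mask_prob: float = 0.15):
    """Randomly replace caption tokens with [MASK]; return masked inputs and labels."""
    labels = caption_ids.clone()
    mask = torch.rand_like(caption_ids, dtype=torch.float) < mask_prob
    inputs = caption_ids.clone()
    inputs[mask] = MASK_ID
    labels[~mask] = IGNORE_INDEX          # only masked positions are scored
    return inputs, labels

class MaskedTokenHead(nn.Module):
    """Predict the original token ids at the masked caption positions."""
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.decoder = nn.Linear(hidden_size, vocab_size)
        self.loss_fn = nn.CrossEntropyLoss(ignore_index=IGNORE_INDEX)

    def forward(self, caption_states: torch.Tensor, labels: torch.Tensor):
        # caption_states: (batch, seq_len, hidden) contextual states from the
        # multimodal transformer; labels: (batch, seq_len) from mask_caption_tokens.
        logits = self.decoder(caption_states)
        return self.loss_fn(logits.view(-1, logits.size(-1)), labels.view(-1))
```

In this reading, the masked-token loss acts as an auxiliary training objective alongside caption generation, encouraging the model to ground caption words in the OCR and visual evidence.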