Seeing out of the box: End-to-end pre-training for vision-language representation learning
We study the joint learning of Convolutional Neural Networks (CNNs) and Transformers for vision-language pre-training (VLPT), which aims to learn cross-modal alignments from millions of image-text pairs. State-of-the-art approaches extract salient image regions and align regions with words step by step. As region-based representations usually cover only parts of an image, it is challenging for existing models to fully understand the semantics of the paired natural language. In this paper, we propose SOHO to "See Out of tHe bOx", that …
Seeing Out of the Box: End-to-End Pre-training for Vision-Language Representation Learning (Supplementary Material)
4. Discussion

For the image-text retrieval task, traditional approaches [2] first project an image and a text into a common representation space and then correlate their representations by late fusion. For example, a widely used late-fusion method computes cosine similarity via a dot-product operation, which is simple and fast. In contrast, Transformer-based approaches fuse the image and text early through a multi-layer Transformer to obtain a unified representation. The unified representation captures the deep relation between …
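To make the contrast concrete, the following PyTorch sketch illustrates the two fusion styles side by side. The tensor shapes, mean pooling, and the linear scoring head are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
# Minimal sketch (assumed shapes and modules, not SOHO's code) contrasting
# late fusion via cosine similarity with early fusion via a Transformer.
import torch
import torch.nn as nn
import torch.nn.functional as F

B, N, M, D = 4, 49, 12, 256   # batch, image tokens, text tokens, hidden dim

img = torch.randn(B, N, D)    # image features (e.g., CNN grid features)
txt = torch.randn(B, M, D)    # text features (e.g., word embeddings)

# --- Late fusion: pool each modality to one vector, then cosine similarity
# (a dot product between L2-normalized vectors).
img_vec = F.normalize(img.mean(dim=1), dim=-1)
txt_vec = F.normalize(txt.mean(dim=1), dim=-1)
late_score = (img_vec * txt_vec).sum(dim=-1)        # one score per pair

# --- Early fusion: concatenate the token sequences and run a multi-layer
# Transformer so image and text tokens attend to each other.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True),
    num_layers=2,
)
fused = encoder(torch.cat([img, txt], dim=1))       # (B, N + M, D)
unified = fused.mean(dim=1)                         # pooled joint representation
early_score = nn.Linear(D, 1)(unified).squeeze(-1)  # matching score per pair
```

The trade-off matches the discussion above: late fusion lets each modality's embedding be precomputed once and compared cheaply at retrieval time, while early fusion requires a joint forward pass for every image-text pair but can capture deeper cross-modal relations.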