Single-stream multi-level alignment for vision-language pretraining

Z Khan, BG Vijay Kumar, X Yu, S Schulter… - … on Computer Vision, 2022 - Springer
Self-supervised vision-language pretraining from pure images and text with a contrastive
loss is effective, but it ignores fine-grained alignment because the dual-stream
architecture aligns image and text representations only at a global level. Earlier
supervised, non-contrastive methods were capable of finer-grained alignment but
required dense annotations that were not scalable. We propose a single-stream architecture that aligns
images and language at multiple levels: global, fine-grained patch-token, and …
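
The snippet describes two of the alignment levels: a global image-text objective and a fine-grained patch-token objective. The PyTorch sketch below illustrates what such a pair of losses can look like; the function names, the max-over-patches token matching, and all dimensions are illustrative assumptions, not the paper's actual objectives.

import torch
import torch.nn.functional as F

def global_contrastive_loss(img_cls, txt_cls, temperature=0.07):
    # Global level: CLIP-style contrastive loss where matched image/text
    # pairs sit on the diagonal of the similarity matrix.
    img = F.normalize(img_cls, dim=-1)
    txt = F.normalize(txt_cls, dim=-1)
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0))
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

def patch_token_alignment_loss(patches, tokens, temperature=0.07):
    # Fine-grained level: score every text token against every image patch
    # and pull up each token's best-matching patch (max-over-patches is one
    # common choice; the paper's exact formulation may differ).
    p = F.normalize(patches, dim=-1)            # (B, n_patches, D)
    t = F.normalize(tokens, dim=-1)             # (B, n_tokens, D)
    sim = torch.einsum('btd,bpd->btp', t, p)    # token-patch similarities
    best = sim.max(dim=-1).values               # (B, n_tokens)
    return -(best / temperature).mean()

# Toy usage with random features standing in for encoder outputs.
B, P, T, D = 4, 16, 8, 256
loss = (global_contrastive_loss(torch.randn(B, D), torch.randn(B, D))
        + patch_token_alignment_loss(torch.randn(B, P, D), torch.randn(B, T, D)))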

Single stream multi-level alignment for vision-language pretraining

VKB Gopalkrishna, X Yu, S Schulter - US Patent App. 18/175,906, 2023 - Google Patents
A method is provided for pretraining vision and language models that includes receiving
image-text pairs, each including an image and a text describing the image. The method
encodes an image into a set of feature vectors corresponding to input image patches and a
CLS token that represents a global image feature. The method parses the text, using a
text tokenizer, into a set of feature vectors as tokens, one for each word in the text. The method encodes
the CLS token from the NN-based visual encoder and the tokens from the text tokenizer into …
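
The claim outlines a concrete pipeline: a visual encoder yields per-patch feature vectors plus a CLS token for the global image feature, a text tokenizer yields per-word feature vectors, and a joint encoder then consumes the image CLS token together with the text tokens. A minimal PyTorch sketch of that single-stream wiring follows; all module names, dimensions, and the embedding-table stand-in for the tokenizer are assumptions for illustration, not the patent's implementation.

import torch
import torch.nn as nn

class SingleStreamVL(nn.Module):
    def __init__(self, dim=256, heads=4, depth=2, vocab=30522):
        super().__init__()
        self.patch_proj = nn.Linear(768, dim)            # project raw patch features
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))  # learnable global-feature slot
        self.visual = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), depth)
        self.word_embed = nn.Embedding(vocab, dim)       # stand-in for tokenizer output
        self.joint = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), depth)

    def forward(self, patches, token_ids):
        # patches: (B, n_patches, 768) image patch features; token_ids: (B, n_tokens)
        x = self.patch_proj(patches)
        x = torch.cat([self.cls.expand(x.size(0), -1, -1), x], dim=1)
        x = self.visual(x)                   # encode patches + CLS together
        img_cls = x[:, :1]                   # CLS token = global image feature
        txt = self.word_embed(token_ids)     # one feature vector per word
        # Single stream: the image CLS token joins the text tokens in one encoder.
        return self.joint(torch.cat([img_cls, txt], dim=1))

model = SingleStreamVL()
fused = model(torch.randn(2, 16, 768), torch.randint(0, 30522, (2, 8)))  # (2, 9, 256)

Feeding only the CLS token, rather than all patch features, into the joint encoder mirrors the truncated final step of the claim; a variant fusing the full patch sequence would concatenate x instead of img_cls.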