Single-stream multi-level alignment for vision-language pretraining

Z Khan, BG Vijay Kumar, X Yu, S Schulter… - … on Computer Vision, 2022 - Springer
Self-supervised vision-language pretraining from pure images and text with a contrastive
loss is effective, but it ignores fine-grained alignment because the dual-stream
architecture aligns image and text representations only at a global level. Earlier
supervised, non-contrastive methods were capable of finer-grained alignment but
required dense annotations that were not scalable. We propose a single-stream architecture that aligns
images and language at multiple levels: global, fine-grained patch-token, and …
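
The snippet describes two of the alignment levels: a global image-text objective and a fine-grained patch-token objective. The PyTorch sketch below illustrates what such a pair of losses can look like; the function names, the max-over-patches token matching, and all dimensions are illustrative assumptions, not the paper's actual objectives.

import torch
import torch.nn.functional as F

def global_contrastive_loss(img_cls, txt_cls, temperature=0.07):
    # Global level: CLIP-style contrastive loss where matched image/text
    # pairs sit on the diagonal of the similarity matrix.
    img = F.normalize(img_cls, dim=-1)
    txt = F.normalize(txt_cls, dim=-1)
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0))
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

def patch_token_alignment_loss(patches, tokens, temperature=0.07):
    # Fine-grained level: score every text token against every image patch
    # and pull up each token's best-matching patch (max-over-patches is one
    # common choice; the paper's exact formulation may differ).
    p = F.normalize(patches, dim=-1)            # (B, n_patches, D)
    t = F.normalize(tokens, dim=-1)             # (B, n_tokens, D)
    sim = torch.einsum('btd,bpd->btp', t, p)    # token-patch similarities
    best = sim.max(dim=-1).values               # (B, n_tokens)
    return -(best / temperature).mean()

# Toy usage with random features standing in for encoder outputs.
B, P, T, D = 4, 16, 8, 256
loss = (global_contrastive_loss(torch.randn(B, D), torch.randn(B, D))
        + patch_token_alignment_loss(torch.randn(B, P, D), torch.randn(B, T, D)))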

Single stream multi-level alignment for vision-language pretraining

VKB Gopalkrishna, X Yu, S Schulter - US Patent App. 18/175,906, 2023 - Google Patents
A method is provided for pretraining vision and language models that includes receiving
image-text pairs, each including an image and a text describing the image. The method
encodes an image into a set of feature vectors corresponding to input image patches and a
CLS token that represents a global image feature. The method parses the text, using a
text tokenizer, into a set of feature vectors as tokens, one for each word in the text. The method encodes
the CLS token from the NN-based visual encoder and the tokens from the text tokenizer into …
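
The claim outlines a concrete pipeline: a visual encoder yields per-patch feature vectors plus a CLS token for the global image feature, a text tokenizer yields per-word feature vectors, and a joint encoder then consumes the image CLS token together with the text tokens. A minimal PyTorch sketch of that single-stream wiring follows; all module names, dimensions, and the embedding-table stand-in for the tokenizer are assumptions for illustration, not the patent's implementation.

import torch
import torch.nn as nn

class SingleStreamVL(nn.Module):
    def __init__(self, dim=256, heads=4, depth=2, vocab=30522):
        super().__init__()
        self.patch_proj = nn.Linear(768, dim)            # project raw patch features
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))  # learnable global-feature slot
        self.visual = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), depth)
        self.word_embed = nn.Embedding(vocab, dim)       # stand-in for tokenizer output
        self.joint = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), depth)

    def forward(self, patches, token_ids):
        # patches: (B, n_patches, 768) image patch features; token_ids: (B, n_tokens)
        x = self.patch_proj(patches)
        x = torch.cat([self.cls.expand(x.size(0), -1, -1), x], dim=1)
        x = self.visual(x)                   # encode patches + CLS together
        img_cls = x[:, :1]                   # CLS token = global image feature
        txt = self.word_embed(token_ids)     # one feature vector per word
        # Single stream: the image CLS token joins the text tokens in one encoder.
        return self.joint(torch.cat([img_cls, txt], dim=1))

model = SingleStreamVL()
fused = model(torch.randn(2, 16, 768), torch.randint(0, 30522, (2, 8)))  # (2, 9, 256)

Feeding only the CLS token, rather than all patch features, into the joint encoder mirrors the truncated final step of the claim; a variant fusing the full patch sequence would concatenate x instead of img_cls.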