PaliGemma
PaliGemma is a lightweight open vision-language model (VLM) inspired by PaLI-3 and built on open components such as the SigLIP vision model and the Gemma language model. PaliGemma takes both images and text as input and can answer questions about images with detail and context. This means it can perform deeper analysis of images and provide useful outputs, such as captioning for images and short videos, object detection, and reading text embedded within images.
There are two sets of PaliGemma models, a general purpose set and a research-oriented set:
- PaliGemma - General purpose pretrained models that can be fine-tuned on a variety of tasks.
- PaliGemma-FT - Research-oriented models that are fine-tuned on specific research datasets.
Key benefits include:
- Multimodal comprehension - Simultaneously understands both images and text.
- Versatile base model - Can be fine-tuned on a wide range of vision-language tasks.
- Off-the-shelf exploration - Comes with a checkpoint fine-tuned on a mixture of tasks for immediate research use.