Multi-modal multi-task LLM
- [2024/07] We will release the grounding and segmentation weights soon.
- [2024/07] ViT-336 support and MM-Bench, TextVQA, SQA, and GQA evaluation are coming soon (before August).
- [2024/07] Salient-15k is released.
- [2024/07] The work has been accepted to the ECAI 2024 Main Track!
- [2024/01] The code and segmentation weights are released.
- [2023/10] The paper is released.
Structure:
Examples
Demo is coming soon.
Code
- Epoch Quantitative Evaluation
  - Compute metrics
- Mixed Datasets
  - Dataset scale specification (portion)
  - Text, Image-Text, Video-Text
- DeepSpeed
- LoRA
Task
- Visual Understanding
  - Image Captioning
  - Video Captioning
  - Visual Question Answering (VQA)
- Visual Segmentation
  - Referring Expression Segmentation (RES)
  - Salient Object Segmentation
  - Semantic Segmentation
- Visual Grounding
  - Referring Expression Comprehension (REC)
Models | Images/Videos |
---|---|
u-LLaVA | uLLaVA Stage 2 |
Fine-tune | ScienceQA | MM-Bench | Seed-Bench |
---|---|---|---|
u-LLaVA-7B | 87.74 | soon | soon |
zero-shot | Accuracy (Type 3) |
---|---|
Activity-QA | 51.70% |
Run the following commands in terminal:
pip install -r ./shells/requirements.txt
cd ./models/GroundingDINO && ./install.sh && cd ../..
Why are these steps needed?
- Install the requirements:
pip install -r requirements.txt
- Build the CUDA ops for GroundingDINO:
cd ./models/GroundingDINO && ./install.sh && cd ../..
If this step is skipped, the following warning may appear: UserWarning: Failed to load custom C++ ops. Running on CPU mode Only!
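Before running install.sh, it can help to confirm that a CUDA compiler is visible, since the custom ops need one to build. The snippet below is a minimal sketch using only the standard library. The helper name check_cuda_toolchain is hypothetical, and the assumption that the build looks for nvcc on PATH or under CUDA_HOME reflects how PyTorch extensions are usually compiled, not a reading of this repository's install script.

```python
import os
import shutil


def check_cuda_toolchain(environ=None, which=shutil.which):
    """Report whether a CUDA compiler looks reachable (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    # CUDA_HOME is the usual variable; CUDA_PATH is checked as a fallback.
    cuda_home = environ.get("CUDA_HOME") or environ.get("CUDA_PATH")
    nvcc_on_path = which("nvcc") is not None
    nvcc_in_home = bool(cuda_home) and os.path.isfile(
        os.path.join(cuda_home, "bin", "nvcc")
    )
    return {"cuda_home": cuda_home, "nvcc_found": nvcc_on_path or nvcc_in_home}


if __name__ == "__main__":
    status = check_cuda_toolchain()
    if status["nvcc_found"]:
        print("CUDA compiler found; the GroundingDINO ops should be buildable.")
    else:
        print("No nvcc found; the build may fall back to CPU-only mode.")
```

If no compiler is reported, setting CUDA_HOME to the local CUDA installation before re-running install.sh is a reasonable first thing to try.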
Annotation download links: ullava modified annotations, LLaVA pretrain annotations, and LLaVA finetuning annotations
Image storage (download link can be found in the table):
image_root
├─ade20k
│ ├─annotations
│ └─images
├─coco2014
│ ├─test2014
│ ├─train2014
│ └─val2014
├─coco2017
│ ├─annotations
│ ├─train2017
│ └─val2017
├─cocostuff
│ ├─train2017
│ └─val2017
├─LLaVA-CC3M-Pretrain-595K
│ └─images
├─saiapr_tc-12
│ ├─00
│ └─01
└─vlpart
   ├─paco
   │ └─annotations
   └─pascal-part
      ├─Annotations_Part
      ├─examples
      └─VOCdevkit
where ade20k is extracted from ADEChallengeData2016.zip and cocostuff is extracted from stuffthingmaps_trainval2017.zip, respectively.
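Because a misplaced folder typically only surfaces once training starts, a quick layout check can save time. The following is a small sketch, not part of the repository: the function name missing_dirs and the selection of sub-directories are illustrative, and the nesting under vlpart is read from the tree above and may need adjusting.

```python
from pathlib import Path

# Sub-directories taken from the tree above, relative to image_root.
EXPECTED_DIRS = [
    "ade20k/annotations",
    "ade20k/images",
    "coco2014/train2014",
    "coco2017/train2017",
    "cocostuff/train2017",
    "LLaVA-CC3M-Pretrain-595K/images",
    "saiapr_tc-12",
    "vlpart/paco/annotations",
    "vlpart/pascal-part/VOCdevkit",
]


def missing_dirs(image_root, expected=EXPECTED_DIRS):
    """Return the expected sub-directories that do not exist under image_root."""
    root = Path(image_root)
    return [rel for rel in expected if not (root / rel).is_dir()]


if __name__ == "__main__":
    # Replace with the real image_root used in the dataset configs.
    for rel in missing_dirs("/path_to_image_root"):
        print(f"missing: {rel}")
```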
Dataset | Images/Videos | Annotations |
---|---|---|
LLaVA CC3M | LLaVA-CC3M-Pretrain-595K/image.zip | chat.json |
TGIF | TGIF - Quark Drive | tgif.json |
Note: We have renamed the TGIF dataset and removed invalid samples to facilitate training, but please follow the original LICENSE.
Dataset | Images | Annotations |
---|---|---|
LLaVA Instruction 150K | coco2017 | llava_instruct_150k.json |
RefCOCO | coco2014 | refcoco_train.json |
RefCOCOg | coco2014 | refcocog_train.json |
RefCOCO+ | coco2014 | refcoco+_train.json |
RefCLEF | saiapr_tc-12 | refclef_train.json |
ADE20K | ade20k | ade20k.json |
COCO Stuff | cocostuff | cocostuff.json |
VOC2010 | voc2010 | pascal_part.json |
PACO LVIS | paco | paco_lvis.json |
Salient 15K | msra | ullava_salinet_15k.json |
Note: Please download the MSRA-10K and MSRA-B images from the official site. We thank the authors for sharing them.
Dataset config example
dataset:
  llava:
    data_type: 'image'
    image_token_len: 256
    build_info:
      anno_dir: '/path_to_annotations/llava_instruct_150k.json'
      image_dir: '/path_to_image_root/coco2017/train2017'
      portion: 1.0
    vis_processor: 'clip_image'
  refcoco+:
    data_type: 'image'
    image_token_len: 256
    build_info:
      anno_dir: '/path_to_annotations/refcoco+_train.json'
      image_dir: '/path_to_image_root/coco2014'
      template_root: './datasets/templates/SEG.json'
      portion: 1.0
    vis_processor: 'clip_image'
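The portion field appears to control what fraction of each dataset enters the mixed training set. The sketch below shows one plausible way to apply it, using a seeded random subsample per dataset. This is an assumption about the semantics, not the repository's implementation; the names subsample_indices and mix_datasets and the dataset sizes are hypothetical.

```python
import random


def subsample_indices(num_samples, portion, seed=0):
    """Pick round(num_samples * portion) distinct indices, reproducibly."""
    if not 0.0 <= portion <= 1.0:
        raise ValueError("portion must be in [0, 1]")
    keep = round(num_samples * portion)
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_samples), keep))


def mix_datasets(sizes, portions, seed=0):
    """Map each dataset name to the indices kept for the mixed training set."""
    return {
        name: subsample_indices(size, portions.get(name, 1.0), seed)
        for name, size in sizes.items()
    }


if __name__ == "__main__":
    # Dataset sizes here are made up for illustration.
    mixed = mix_datasets(
        {"llava": 10, "refcoco+": 8},
        {"llava": 1.0, "refcoco+": 0.5},
    )
    print({name: len(idx) for name, idx in mixed.items()})
```

Fixing the seed keeps the subset stable across runs, which makes epoch-to-epoch metrics easier to compare.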
Note:
- We reorganized most of the dataset annotations for easier training, but everyone must still follow the rules that the original datasets require.
- Prepare Open-Source LLaMA models
Foundation model | Version | Path |
---|---|---|
Vicuna 7B HF | V1.1 | vicuna_7b_v1.1 |
LLaMA2 7B HF | - | meta-llama/Llama-2-7b-hf |
SAM | ViT-H | sam_vit_h_4b8939.pth |
GroundingDINO | swint_ogc | groundingdino_swint_ogc.pth |
Note:
- LLaMA2 is trained with bf16; convergence errors may occur if Stage I is trained with fp16.
- The default tokenizer.legacy of Llama-2 is False, which may raise a tokenization mismatch error with some conversation templates.
- Errata: The base LLM used in the paper is Vicuna-v1.1, not LLaMA2. Sorry about the mistake.
- Prepare datasets
- Set config in configs/train/ullava_core_stage1.yaml
Note: set all dataset paths and the output path according to your experiments.
- Train Stage I with multiple GPUs
./shells/pretrain.sh
or, for a single GPU:
python train_ullava_core.py --cfg_path './configs/train/ullava_core_stage1.yaml'
With 4 A100 80G GPUs and bf16, Stage I takes about 6 hours per epoch. The trained model is saved to the output_dir, for example './exp/ullava_core_7b'.
After Stage I training has finished, proceed to the next step: fine-tuning.
- Prepare datasets
- Set config in
configs/train/ullava_stage2_lora.yaml (for LoRA)
configs/train/ullava_stage2.yaml (without LoRA)
- Train Stage II with multiple GPUs
./shells/finetune.sh
or, for a single GPU:
python train_ullava.py --cfg_path './configs/train/ullava_stage2_lora.yaml'
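For scripted experiments, the single-GPU commands for both stages can be assembled programmatically. The snippet below is a minimal sketch that uses only the script names, the --cfg_path flag, and the config paths listed in this README. The helper build_train_command and the STAGE_SCRIPTS table are hypothetical and not part of the repository.

```python
import shlex
import sys

# Script and config paths as listed in this README.
# The inner dict maps "use LoRA?" to the matching config file.
STAGE_SCRIPTS = {
    1: ("train_ullava_core.py",
        {False: "./configs/train/ullava_core_stage1.yaml"}),
    2: ("train_ullava.py",
        {True: "./configs/train/ullava_stage2_lora.yaml",
         False: "./configs/train/ullava_stage2.yaml"}),
}


def build_train_command(stage, use_lora=False, python=sys.executable):
    """Return the single-GPU training command as an argument list."""
    script, configs = STAGE_SCRIPTS[stage]
    if use_lora not in configs:
        raise ValueError(f"no LoRA config listed for stage {stage}")
    return [python, script, "--cfg_path", configs[use_lora]]


if __name__ == "__main__":
    # Print the Stage II LoRA command in copy-pasteable form.
    print(shlex.join(build_train_command(2, use_lora=True)))
```

The returned list can be passed to subprocess.run(cmd, check=True) from the repository root.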
Q1: Which conv_type is used in training?
A1: Stage I: 'conv_simple'. Stage II: 'conv_sep2'
Q2: When is LoRA used?
A2: Stage I: LoRA is not used. Stage II: Optional, depending on your hardware.
- Set config
configs/eval/eval_res.yaml (for RES task)
configs/eval/eval_rec.yaml (for REC task)
configs/eval/eval_salient.yaml (for Salient segmentation task)
- Run
python evaluation/eval_ullava.py --cfg_path './configs/eval/eval_res.yaml' (for RES)
python evaluation/eval_ullava_grounding.py --cfg_path './configs/eval/eval_rec.yaml' (for REC)
python evaluation/eval_ullava.py --cfg_path './configs/eval/eval_salient.yaml' (for Salient)
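For orientation, REC is commonly scored as the fraction of predicted boxes whose IoU with the ground-truth box reaches 0.5. The sketch below illustrates that metric in plain Python. Whether eval_ullava_grounding.py uses exactly this threshold and box format is an assumption, and the function names box_iou and rec_accuracy are hypothetical.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes yield an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def rec_accuracy(preds, targets, threshold=0.5):
    """Fraction of predictions whose IoU with the target reaches threshold."""
    if not preds:
        return 0.0
    hits = sum(box_iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(preds)


if __name__ == "__main__":
    # Toy boxes for illustration only.
    preds = [(0, 0, 2, 2), (0, 0, 2, 2)]
    targets = [(0, 0, 2, 2), (1, 1, 3, 3)]
    print(rec_accuracy(preds, targets))
```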
Modify the parser in evaluation/inference_ullava_core.py and evaluation/inference_ullava.py for Stage I and Stage II, respectively.
python evaluation/eval_ullava.py
python evaluation/eval_ullava_grounding.py
Distributed under the Apache License. See LICENSE for more information.
@inproceedings{xu2024ullava,
title={u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model},
author={Xu, Jinjin and Xu, Liwu and Yang, Yuzhe and Li, Xiang and Wang, Fanyi and Xie, Yanchun and Huang, Yi-Jie and Li, Yaqian},
booktitle={Proceedings of the 27th European Conference on Artificial Intelligence},
year={2024}
}
- Visual Segmentation
- Instance Segmentation
We sincerely thank the open-source community for their contributions. This work is sponsored by the Shanghai Pujiang Program (23PJ1421800).
See the open issues for a full list of proposed features (and known issues).