
A Survey on Multimodal Large Language Models


Shukang Yin*, Chaoyou Fu*†, Sirui Zhao*, Ke Li,
Xing Sun, Tong Xu, and Enhong Chen, Fellow, IEEE

Abstract—Recently, Multimodal Large Language Models (MLLMs), represented by GPT-4V, have become a rising research hotspot. They use powerful Large Language Models (LLMs) as a brain to perform multimodal tasks. The surprising emergent capabilities of MLLMs, such as writing stories based on images and OCR-free math reasoning, are rare in traditional multimodal methods, suggesting a potential path to artificial general intelligence. To this end, both academia and industry have endeavored to develop MLLMs that can compete with or even outperform GPT-4V, pushing the limits of research at a surprising speed. In this paper, we aim to trace and summarize the recent progress of MLLMs. First of all, we present the basic formulation of the MLLM and delineate its related concepts, including architecture, training strategy and data, as well as evaluation. Then, we introduce research topics about how MLLMs can be extended to support more granularity, modalities, languages, and scenarios. We continue with multimodal hallucination and extended techniques, including Multimodal ICL (M-ICL), Multimodal CoT (M-CoT), and LLM-Aided Visual Reasoning (LAVR). To conclude the paper, we discuss existing challenges and point out promising research directions. In light of the fact that the era of MLLM has only just begun, we will keep updating this survey and hope it can inspire more research. An associated GitHub page collecting the latest papers is available at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.

Index Terms—Multimodal Large Language Model, Vision Language Model, Large Language Model.

• †Chaoyou Fu is the project leader.
• *Shukang Yin, Chaoyou Fu, and Sirui Zhao contribute equally.
• Shukang Yin, Sirui Zhao, Tong Xu, and Enhong Chen are with the Department of Data Science, University of Science and Technology of China, No. 96, JinZhai Road, Baohe District, Hefei, Anhui, 230026, China. E-mail: [email protected], [email protected]
• Chaoyou Fu, Ke Li, and Xing Sun are with the Tencent YouTu Lab, Shanghai 200233, China. E-mail: [email protected]
Corresponding authors: Chaoyou Fu, Sirui Zhao, and Enhong Chen.

1 INTRODUCTION

RECENT years have seen the remarkable progress of LLMs [1], [2], [3], [4], [5]. By scaling up data size and model size, these LLMs raise extraordinary emergent abilities, typically including instruction following [5], [6], In-Context Learning (ICL) [7], and Chain of Thought (CoT) [8]. Although LLMs have demonstrated surprising zero/few-shot reasoning performance on most Natural Language Processing (NLP) tasks, they are inherently "blind" to vision since they can only understand discrete text. Concurrently, Large Vision Models (LVMs) can see clearly [9], [10], [11], [12], but commonly lag in reasoning.

In light of this complementarity, LLM and LVM run towards each other, leading to the new field of Multimodal Large Language Model (MLLM). Formally, it refers to the LLM-based model with the ability to receive, reason, and output with multimodal information. Prior to MLLM, there have been a lot of works devoted to multimodality, which can be divided into discriminative [13], [14], [15] and generative [16], [17], [18] paradigms. CLIP [13], as a representative of the former, projects visual and textual information into a unified representation space, building a bridge for downstream multimodal tasks. In contrast, OFA [16] is a representative of the latter, which unifies multimodal tasks in a sequence-to-sequence manner. MLLM can be classified as the latter according to the sequence operation, but it manifests two representative traits compared with the traditional counterparts: (1) MLLM is based on LLMs with billion-scale parameters, which are not available in previous models. (2) MLLM uses new training paradigms to unleash its full potential, such as multimodal instruction tuning [19], [20] to encourage the model to follow new instructions. Armed with the two traits, MLLM exhibits new capabilities, such as writing website code based on images [21], understanding the deep meaning of a meme [22], and OCR-free math reasoning [23].

Ever since the release of GPT-4 [3], there has been a research frenzy over MLLMs because of the amazing multimodal examples it shows. Rapid development is fueled by efforts from both academia and industry. Preliminary research on MLLMs focuses on text content generation grounded in text prompts and image [20], [24]/video [25], [26]/audio [27]. Subsequent works have expanded the capabilities or the usage scenarios, including: (1) Better granularity support. Finer control on user prompts is developed to support specific regions through boxes [28] or a certain object through a click [29]. (2) Enhanced support of input and output modalities [30], [31], such as image, video, audio, and point cloud. Besides input, projects like NExT-GPT [32] further support output in different modalities. (3) Improved language support. Efforts have been made to extend the success of MLLMs to other languages (e.g. Chinese) with relatively limited training corpora [33], [34]. (4) Extension to more realms and usage scenarios. Some studies transfer the strong capabilities of MLLMs to other domains such as medical image understanding [35], [36], [37] and document parsing [38], [39], [40]. Moreover, multimodal agents are developed to assist in real-world interaction, e.g. embodied agents [41], [42] and GUI agents [43], [44], [45]. An MLLM timeline is illustrated in Fig. 1.

Fig. 1: A timeline of representative MLLMs. We are witnessing rapid growth in this field. More works can be found on our released GitHub page, which is updated daily.

In view of such rapid progress and the promising results of this field, we write this survey to provide researchers with a grasp of the basic idea, main methods, and current progress of MLLMs. Note that we mainly focus on the visual and language modalities, but also include works involving other modalities like video and audio. Specifically, we cover the most important aspects of MLLMs with corresponding summaries and open a GitHub page that is updated in real time. To the best of our knowledge, this is the first survey on MLLM.

The following parts of the survey are structured as follows: the survey starts with a comprehensive review of the essential aspects of MLLMs, including (1) mainstream architectures (§2); (2) a full recipe of training strategy and data (§3); (3) common practices of performance evaluation (§4). Then, we delve into a deeper discussion of some important topics about MLLMs, each focusing on a main problem: (1) What aspects can be further improved or extended (§5)? (2) How to relieve the multimodal hallucination issue (§6)? The survey continues with the introduction of three key techniques (§7), each specialized in a specific scenario: M-ICL (§7.1) is an effective technique commonly used at the inference stage to boost few-shot performance. Another important technique is M-CoT (§7.2), which is typically used in complex reasoning tasks. Afterward, we delineate a general idea to develop LLM-based systems to solve composite reasoning tasks or to address common user queries (§7.3). Finally, we finish our survey with a summary and potential research directions.

2 ARCHITECTURE

A typical MLLM can be abstracted into three modules, i.e. a pre-trained modality encoder, a pre-trained LLM, and a modality interface to connect them. Drawing an analogy to humans, modality encoders such as image/audio encoders are the human eyes/ears that receive and pre-process optical/acoustic signals, while the LLM is like the human brain that understands and reasons over the processed signals. In between, the modality interface serves to align the different modalities. Some MLLMs also include a generator to output modalities other than text. A diagram of the architecture is plotted in Fig. 2. In this section, we introduce each module in sequence.

2.1 Modality encoder

The encoders compress raw information, such as images or audio, into a more compact representation. Rather than training from scratch, a common approach is to use a pre-trained encoder that has already been aligned with other modalities. For example, CLIP [13] incorporates a visual encoder semantically aligned with text through large-scale pre-training on image-text pairs. Therefore, it is easier to use such initially pre-aligned encoders to align with LLMs through alignment pre-training (see §3.1).

Commonly used image encoders are summarized in Table 1. Apart from vanilla CLIP image encoders [13], some works also explore other variants. For example, MiniGPT-4 [21] adopts an EVA-CLIP [47], [48] (ViT-G/14) encoder, which is trained with improved training techniques. In contrast, Osprey [29] introduces a convolution-based ConvNext-L encoder [46] to utilize higher resolutions and multi-level features. Some works also explore encoder-free architectures. For instance, the image patches of Fuyu-8b [49] are directly projected before being sent to the LLM. Thus, the model naturally supports flexible input resolutions.

TABLE 1: A summary of commonly used image encoders.


Variants Pretraining Corpus Resolution Samples (B) Parameter Size (M)
OpenCLIP-ConvNext-L [46] LAION-2B 320 29 197.4
CLIP-ViT-L/14 [13] OpenAI’s WIT 224/336 13 304.0
EVA-CLIP-ViT-G/14 [47] LAION-2B,COYO-700M 224 11 1000.0
OpenCLIP-ViT-G/14 [46] LAION-2B 224 34 1012.7
OpenCLIP-ViT-bigG/14 [46] LAION-2B 224 34 1844.9

Fig. 2: An illustration of the typical MLLM architecture. It includes an encoder, a connector, and an LLM. An optional generator can be attached to the LLM to generate more modalities besides text. The encoder takes in images, audio, or videos and outputs features, which are processed by the connector so that the LLM can better understand them. There are broadly three types of connectors: projection-based, query-based, and fusion-based connectors. The former two types adopt token-level fusion, processing features into tokens to be sent along with text tokens, while the last type enables feature-level fusion inside the LLM.

When choosing encoders, one often considers factors like resolution, parameter size, and pre-training corpus. Notably, many works have empirically verified that using higher resolution achieves remarkable performance gains [34], [50], [51], [52]. The approaches for scaling up input resolution can be categorized into direct scaling and patch-division methods. The direct scaling way inputs images of higher resolution to the encoder, which often involves further tuning the encoder [34] or replacing it with a pre-trained encoder of higher resolution [50]. Similarly, CogAgent [44] uses a dual-encoder mechanism, where two encoders process high- and low-resolution images, respectively; high-resolution features are injected into the low-resolution branch through cross-attention. Patch-division methods cut a high-resolution image into patches and reuse the low-resolution encoder. For example, Monkey [51] and SPHINX [53] divide a large image into smaller patches and send the sub-images together with a downsampled high-resolution image to the image encoder, where the sub-images and the low-resolution image capture local and global features, respectively. In contrast, parameter size and training data composition are of less importance compared with input resolution, as found by empirical studies [52].
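To make the patch-division idea concrete, the sketch below crops a high-resolution image into fixed-size tiles and additionally keeps a downsampled global view, in the spirit of the Monkey/SPHINX recipe described above; the tile size, padding policy, and function name are illustrative assumptions rather than the exact procedure of either work.

```python
from PIL import Image

def patch_divide(image: Image.Image, tile: int = 448):
    """Split a high-resolution image into tile x tile sub-images plus a
    downsampled global view, so a low-resolution encoder can be reused.
    The tile size and padding policy are illustrative choices."""
    w, h = image.size
    # Round the canvas up to a multiple of the tile size and paste the image onto it.
    new_w = ((w + tile - 1) // tile) * tile
    new_h = ((h + tile - 1) // tile) * tile
    canvas = Image.new("RGB", (new_w, new_h))
    canvas.paste(image, (0, 0))

    # Local views: non-overlapping tiles that preserve fine details.
    sub_images = [
        canvas.crop((x, y, x + tile, y + tile))
        for y in range(0, new_h, tile)
        for x in range(0, new_w, tile)
    ]
    # Global view: the whole image downsampled to the encoder resolution.
    global_view = image.resize((tile, tile))
    return sub_images, global_view

# Each returned crop is encoded separately by the (low-resolution) image encoder;
# the resulting visual token sequences are then concatenated before the connector.
```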
Similar encoders are also available for other modalities. For example, Pengi [27] uses the CLAP [54] model as the audio encoder. ImageBind-LLM [30] uses the ImageBind [55] encoder, which supports encoding image, text, audio, depth, thermal, and Inertial Measurement Unit (IMU) data. Equipped with this strong encoder, ImageBind-LLM can respond to inputs of multiple modalities.

2.2 Pre-trained LLM

Instead of training an LLM from scratch, it is more efficient and practical to start with a pre-trained one. Through tremendous pre-training on web corpora, LLMs have been embedded with rich world knowledge and demonstrate strong generalization and reasoning capabilities.

We summarize the commonly used and publicly available LLMs in Table 2. Notably, most LLMs fall into the causal decoder category, following GPT-3 [7]. Among them, the Flan-T5 [56] series are relatively early LLMs used in works like BLIP-2 [59] and InstructBLIP [60]. The LLaMA series [5], [57] and the Vicuna family [4] are representative open-sourced LLMs that have attracted much academic attention. Since the two LLMs are predominantly pre-trained on English corpora, they are limited in multi-language support, such as Chinese. In contrast, Qwen [58] is a bilingual LLM that supports Chinese and English well.

It should be noted that scaling up the parameter size of LLMs also brings additional gains, similar to the case of increasing input resolution. Specifically, Liu et al. [50], [61] find that simply scaling up the LLM from 7B to 13B brings comprehensive improvements on various benchmarks. Furthermore, when using a 34B LLM, the model shows emergent zero-shot Chinese capability, given that only English multimodal data are used during training. Lu et al. [62] see a similar phenomenon when scaling up LLMs from 13B to 35B and 65B/70B, where the larger model size brings consistent gains on benchmarks specifically designed for MLLMs.

There are also works that use smaller LLMs to facilitate deployment on mobile devices. For example, the MobileVLM series [63], [64] use downscaled LLaMA [5] (termed MobileLLaMA 1.4B/2.7B), enabling efficient inference on mobile processors.

Recently, explorations of Mixture-of-Experts (MoE) architectures for LLMs have garnered rising attention [65], [66], [67]. Compared with dense models, the sparse architecture enables scaling up the total parameter size without increasing computational cost, by selectively activating parameters. Empirically, MM1 [52] and MoE-LLaVA [68] find that the MoE implementation achieves better performance than its dense counterpart on almost all benchmarks.

TABLE 2: A summary of commonly used open-sourced LLMs. en, zh, fr, and de stand for English, Chinese, French, and
German, respectively.
Model Release Date Pretrain Data Scale Parameter Size (B) Language Support Architecture
Flan-T5-XL/XXL [56] Oct-2022 - 3/ 11 en, fr, de Encoder-Decoder
LLaMA [5] Feb-2023 1.4T tokens 7/ 13/ 33/ 65 en Causal Decoder
Vicuna [4] Mar-2023 1.4T tokens 7/ 13/ 33 en Causal Decoder
LLaMA-2 [57] Jul-2023 2T tokens 7/ 13/ 70 en Causal Decoder
Qwen [58] Sep-2023 3T tokens 1.8 / 7/ 14/ 72 en, zh Causal Decoder

2.3 Modality interface

Since LLMs can only perceive text, bridging the gap between natural language and other modalities is necessary. However, it would be costly to train a large multimodal model in an end-to-end manner. A more practical way is to introduce a learnable connector between the pre-trained visual encoder and the LLM. Another approach is to translate images into language with the help of expert models and then send the language to the LLM.

Learnable Connector. It is responsible for bridging the gap between different modalities. Specifically, the module projects information into a space that the LLM can understand efficiently. Based on how multimodal information is fused, there are broadly two ways to implement such interfaces, i.e. token-level and feature-level fusion.

For token-level fusion, features output from encoders are transformed into tokens and concatenated with text tokens before being sent into the LLM. A common and feasible solution is to leverage a group of learnable query tokens to extract information in a query-based manner [69], first implemented in BLIP-2 [59] and subsequently inherited by a variety of works [26], [60], [70]. Such Q-Former-style approaches compress visual tokens into a smaller number of representation vectors. In contrast, some methods simply use an MLP-based interface to bridge the modality gap [20], [37], [71], [72]. For example, the LLaVA series adopts one or two linear layers [20], [50] to project visual tokens and align the feature dimension with word embeddings.
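A minimal sketch of an MLP-style token-level connector in the spirit of the LLaVA-like projectors mentioned above: each visual token is mapped from the encoder's feature dimension to the LLM's embedding dimension so it can be concatenated with text embeddings. The two-layer design, GELU activation, and dimensions are illustrative choices, not a specification from any particular paper.

```python
import torch
import torch.nn as nn

class MLPConnector(nn.Module):
    """Project visual tokens into the LLM embedding space (token-level fusion).
    Dimensions are illustrative: e.g. 1024-d ViT features -> 4096-d LLM embeddings."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, num_visual_tokens, vision_dim)
        return self.proj(visual_feats)  # (batch, num_visual_tokens, llm_dim)

# Usage: the projected visual tokens are concatenated with the text token
# embeddings before the combined sequence is fed to the LLM, e.g.
# inputs_embeds = torch.cat([connector(visual_feats), text_embeds], dim=1)
```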
On a related note, MM1 [52] has ablated design choices of the connector and found that for token-level fusion, the type of modality adapter is far less important than the number of visual tokens and the input resolution. Nevertheless, Zeng et al. [73] compare the performance of token-level and feature-level fusion and empirically reveal that the token-level fusion variant performs better on VQA benchmarks. Regarding the performance gap, the authors suggest that cross-attention models might require a more complicated hyper-parameter searching process to achieve comparable performance.

As another line, feature-level fusion inserts extra modules that enable deep interaction and fusion between text features and visual features. For example, Flamingo [74] inserts extra cross-attention layers between frozen Transformer layers of the LLM, thereby augmenting language features with external visual cues. Similarly, CogVLM [75] plugs a visual expert module into each Transformer layer to enable dual interaction and fusion between vision and language features. For better performance, the QKV weight matrices of the introduced module are initialized from the pre-trained LLM. Similarly, LLaMA-Adapter [76] introduces learnable prompts into Transformer layers. These prompts are first embedded with visual knowledge and then concatenated with text features as prefixes.
matrix of the introduced module is initialized from the pre-trained knowledge. Some methods [34], [81], [82] also
pre-trained LLM. Similarly, LLaMA-Adapter [76] introduces unfreeze more modules (e.g. visual encoder) to enable more
learnable prompts into Transformer layers. These prompts trainable parameters for alignment. It should be noted that

the training scheme is closely related to the data quality. For short and noisy caption data, a lower resolution (e.g. 224) can be adopted to speed up the training process, while for longer and cleaner data, it is better to utilize higher resolutions (e.g. 448 or higher) to mitigate hallucinations. Besides, ShareGPT4V [83] finds that with high-quality caption data in the pre-training stage, unlocking the vision encoder promotes better alignment.
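The freeze-and-align recipe described in this subsection can be expressed in a few lines: keep the vision encoder and the LLM frozen and pass only the connector's parameters to the optimizer. A minimal sketch, where the three modules are placeholders for whatever encoder, connector, and LLM a given MLLM uses:

```python
import torch

def setup_pretraining(vision_encoder, connector, llm, lr: float = 1e-3):
    """Alignment pre-training setup: freeze the pre-trained modules and make
    only the lightweight connector trainable. All three arguments are assumed
    to be torch.nn.Module instances built elsewhere."""
    # Freeze the pre-trained modules so alignment does not erase their knowledge.
    for module in (vision_encoder, llm):
        for p in module.parameters():
            p.requires_grad = False

    # Only the connector is updated with the captioning (cross-entropy) loss.
    optimizer = torch.optim.AdamW(connector.parameters(), lr=lr)
    return optimizer

# Methods that also unfreeze the vision encoder would simply add its parameters
# to the optimizer's parameter groups, usually with a smaller learning rate.
```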
Input: <image>
Response: {caption}

TABLE 3: A simplified template to structure the caption data. <image> is the placeholder for the visual tokens, and {caption} is the caption for the image. Note that only the {caption} part is used for loss calculation.

TABLE 4: Common datasets used for pre-training.
Dataset Samples Date
Coarse-grained Image-Text
CC-3M [84] 3.3M 2018
CC-12M [85] 12.4M 2020
SBU Captions [86] 1M 2011
LAION-5B [87] 5.9B Mar-2022
LAION-2B [87] 2.3B Mar-2022
LAION-COCO [88] 600M Sep-2022
COYO-700M [90] 747M Aug-2022
Fine-grained Image-Text
ShareGPT4V-PT [83] 1.2M Nov-2023
LVIS-Instruct4V [91] 111K Nov-2023
ALLaVA [92] 709K Feb-2024
Video-Text
MSR-VTT [93] 200K 2016
Audio-Text
WavCaps [94] 24K Mar-2023
3.1.2 Data

Pre-training data mainly serve two purposes, i.e. (1) aligning different modalities and (2) providing world knowledge. The pre-training corpora can be divided into coarse-grained and fine-grained data according to granularity, which we introduce sequentially. We summarize commonly used pre-training datasets in Table 4.

Coarse-grained caption data share some typical traits: (1) The data volume is large, since samples are generally sourced from the internet. (2) Because of the web-crawled nature, the captions are usually short and noisy, since they originate from the alt-text of web images. These data can be cleaned and filtered via automatic tools, for example, using the CLIP [13] model to filter out image-text pairs whose similarities are lower than a pre-defined threshold.
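A minimal sketch of this CLIP-based filtering step, using OpenAI's open-source CLIP package: pairs whose image-text cosine similarity falls below a threshold are dropped. The threshold value is an illustrative assumption and would be tuned per corpus.

```python
import clip              # OpenAI's CLIP package (pip install git+https://github.com/openai/CLIP.git)
import torch
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device="cpu")

@torch.no_grad()
def keep_pair(image_path: str, caption: str, threshold: float = 0.25) -> bool:
    """Return True if the image-caption pair passes the CLIP similarity filter.
    The 0.25 threshold is an illustrative value, not a universal constant."""
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    text = clip.tokenize([caption], truncate=True)

    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(text)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)

    cosine = (img_emb * txt_emb).sum().item()
    return cosine >= threshold
```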
In what follows, we introduce some representative coarse-grained datasets.

CC. CC-3M [84] is a web-scale caption dataset of 3.3M image-caption pairs, where the raw descriptions are derived from the alt-text associated with the images. The authors design a complicated pipeline to clean the data: (1) For images, those with inappropriate content or aspect ratio are filtered. (2) For text, NLP tools are used to obtain text annotations, with samples filtered according to designed heuristics. (3) For image-text pairs, images are assigned labels via classifiers; if the text annotations do not overlap with the image labels, the corresponding samples are dropped.

CC-12M [85] is a follow-up of CC-3M and contains 12.4M image-caption pairs. Compared with the previous work, CC-12M relaxes and simplifies the data-collection pipeline, thus collecting more data.

SBU Captions [86]. It is a captioned photo dataset containing 1M image-text pairs, with images and descriptions sourced from Flickr. Specifically, an initial set of images is acquired by querying the Flickr website with a large number of query terms, and the descriptions attached to the images serve as captions. Then, to ensure that descriptions are relevant to the images, the retained images must fulfill two requirements: (1) Descriptions of the images are of satisfactory length, decided by observation. (2) Descriptions of the images contain at least 2 words from the predefined term lists and a prepositional word (e.g. "on", "under") that generally suggests spatial relationships.

LAION. This series comprises large web-scale datasets, with images crawled from the internet and the associated alt-text as captions. To filter the image-text pairs, the following steps are performed: (1) Text with short lengths or images with too small or too large sizes are dropped. (2) Images are deduplicated based on URL. (3) CLIP [13] embeddings are extracted for images and text and used to drop possibly illegal content as well as image-text pairs with low cosine similarity between embeddings. Here we offer a brief summary of some typical variants:
• LAION-5B [87]: It is a research-purpose dataset of 5.85B image-text pairs. The dataset is multilingual, with a 2B English subset.
• LAION-COCO [88]: It contains 600M images extracted from the English subset of LAION-5B. The captions are synthetic, using BLIP [89] to generate various image captions and CLIP [13] to pick the best fit for the image.

COYO-700M [90]. It contains 747M image-text pairs, which are extracted from CommonCrawl. For data filtering, the authors design the following strategies: (1) For images, those with inappropriate size, content, format, or aspect ratio are filtered. Moreover, the images are filtered based on the pHash value to remove images overlapping with public datasets such as ImageNet and MS-COCO. (2) For text, only English text with satisfactory length, noun forms, and appropriate words is saved. Whitespace before and after the sentence is removed, consecutive whitespace characters are replaced with a single whitespace, and text appearing more than 10 times (e.g. "image for") is dropped. (3) For image-text pairs, duplicated samples are removed based on the (image pHash, text) tuple.

Recently, more works [83], [91], [92] have explored generating high-quality fine-grained data by prompting strong MLLMs (e.g. GPT-4V). Compared with coarse-grained data, these data generally contain longer and more accurate descriptions of the images, thus enabling finer-grained alignment between image and text modalities. However, since this approach generally requires calling commercial MLLMs, the cost is higher and the data volume is relatively smaller. Notably, ShareGPT4V [83] strikes a balance by first training a captioner with GPT-4V-generated

100K data, then scaling up the data volume to 1.2M using the pre-trained captioner.

Fig. 3: Comparison of three typical learning paradigms: (A) pretrain-finetune (e.g. BERT, T5), (B) prompting (e.g. GPT-3), and (C) instruction tuning (e.g. FLAN). The image is from [19].

3.2 Instruction-tuning

3.2.1 Introduction

Instruction refers to the description of tasks. Intuitively, instruction tuning aims to teach models to better understand the instructions from users and fulfill the demanded tasks. Tuned in this way, LLMs can generalize to unseen tasks by following new instructions, thus boosting zero-shot performance. This simple yet effective idea has sparked the success of subsequent NLP works, such as ChatGPT [2], InstructGPT [95], FLAN [19], [56], and OPT-IML [96].

Comparisons between instruction tuning and related typical learning paradigms are illustrated in Fig. 3. The supervised fine-tuning approach usually requires a large amount of task-specific data to train a task-specific model. The prompting approach reduces the reliance on large-scale data and can fulfill a specialized task via prompt engineering. In this case, though the few-shot performance has been improved, the zero-shot performance is still quite average [7]. Differently, instruction tuning learns how to generalize to unseen tasks rather than fitting specific tasks like the two counterparts. Moreover, instruction tuning is highly related to multi-task prompting [97].

In this section, we delineate the format of instruction samples, the training objectives, typical ways to gather instruction data, and corresponding commonly used datasets.

3.2.2 Training Detail

A multimodal instruction sample often includes an optional instruction and an input-output pair. The instruction is typically a natural language sentence describing the task, such as "Describe the image in detail." The input can be an image-text pair, as in the VQA task [99], or only an image, as in the image captioning task [100]. The output is the answer to the instruction conditioned on the input. The instruction template is flexible and subject to manual design [20], [25], [98], as exemplified in Table 5. Note that the instruction template can also be generalized to the case of multi-round conversations [20], [37], [71], [98].

Below is an instruction that describes a task. Write a response that appropriately completes the request
Instruction: <instruction>
Input: {<image>, <text>}
Response: <output>

TABLE 5: A simplified template to structure the multimodal instruction data. <instruction> is a textual description of the task. {<image>, <text>} and <output> are the input and output from the data sample. Note that <text> in the input may be missing for some datasets; for example, image caption datasets merely have <image>. The example is adapted from [98].

Formally, a multimodal instruction sample can be denoted in a triplet form, i.e. (I, M, R), where I, M, R represent the instruction, the multimodal input, and the ground-truth response, respectively. The MLLM predicts an answer given the instruction and the multimodal input:

A = f(I, M; \theta)   (1)

Here, A denotes the predicted answer, and \theta denotes the parameters of the model. The training objective is typically the original auto-regressive objective used to train LLMs [20], [37], [71], [101], based on which the MLLM is encouraged to predict the next token of the response. The objective can be expressed as:

L(\theta) = -\sum_{i=1}^{N} \log p(R_i \mid I, R_{<i}; \theta)   (2)

where N is the length of the ground-truth response.
The comparisons between instruction tuning and related
typical learning paradigms are illustrated in Fig. 3. The 3.2.3 Data Collection
supervised fine-tuning approach usually requires a large Since instruction data are more flexible in formats and
amount of task-specific data to train a task-specific model. varied in task formulations, it is usually trickier and more
The prompting approach reduces the reliance on large-scale costly to collect data samples. In this section, we summarize
data and can fulfill a specialized task via prompt engi- three typical ways to harvest instruction data at scale, i.e.
neering. In such a case, though the few-shot performance data adaptation, self-instruction, and data mixture.
has been improved, the zero-shot performance is still quite Data Adaptation. Task-specific datasets are rich sources of
average [7]. Differently, instruction tuning learns how to high-quality data. Hence, abundant works [60], [70], [76],
generalize to unseen tasks rather than fitting specific tasks [82], [101], [102], [103], [104] have utilized existing high-
like the two counterparts. Moreover, instruction tuning is quality datasets to construct instruction-formatted datasets.
highly related to multi-task prompting [97]. Take the transformation of VQA datasets for an example,
In this section, we delineate the format of instruction the original sample is an input-out pair where the input
samples, the training objectives, typical ways to gather in- comprises an image and a natural language question, and
struction data, and corresponding commonly used datasets. the output is the textual answer to the question conditioned
on the image. The input-output pairs of these datasets could
3.2.2 Training Detail naturally comprise the multimodal input and response of
A multimodal instruction sample often includes an optional the instruction sample (see §3.2.2). The instructions, i.e. the
instruction and an input-output pair. The instruction is descriptions of the tasks, can either derive from manual
typically a natural language sentence describing the task, design or from semi-automatic generation aided by GPT.
such as, “Describe the image in detail.” The input can be an Specifically, some works [21], [35], [60], [70], [102], [105]
image-text pair like the VQA task [99] or only an image hand-craft a pool of candidate instructions and sample one
of them during training. We offer an example of instruction
templates for the VQA datasets as shown in Table 6. The
Below is an instruction that describes a task. Write a response other works manually design some seed instructions and
that appropriately completes the request use these to prompt GPT to generate more [25], [82], [98].
Instruction: <instruction> Note that since the answers of existing VQA and caption
Input: {<image>, <text>} datasets are usually concise, directly using these datasets for
Response: <output>
instruction tuning may limit the output length of MLLMs.
There are two common strategies to tackle this problem. The
TABLE 5: A simplified template to structure the multimodal first one is to specify explicitly in instructions. For example,
instruction data. <instruction> is a textual description of the ChatBridge [104] explicitly declares short and brief for short-
task. {<image>, <text>} and <output> are input and output answer data, as well as a sentence and single sentence for
from the data sample. Note that <text> in the input may be conventional coarse-grained caption data. The second one is
missed for some datasets, such as image caption datasets to extend the length of existing answers [105]. For example,
merely have <image>. The example is adapted from [98]. M3 IT [105] proposes to rephrase the original answer by

• <Image> {Question}
• <Image> Question: {Question}
• <Image> {Question} A short answer to the question is
• <Image> Q: {Question} A:
• <Image> Question: {Question} Short answer:
• <Image> Given the image, answer the following question with no more than three words. {Question}
• <Image> Based on the image, respond to this question with a short answer: {Question}. Answer:
• <Image> Use the provided image to answer the question: {Question} Provide your answer as short as possible:
• <Image> What is the answer to the following question? "{Question}"
• <Image> The question "{Question}" can be answered using the image. A short answer is

TABLE 6: Instruction templates for VQA datasets, cited from [60]. <Image> and {Question} are the image and the question
in the original VQA datasets, respectively.

TABLE 7: A summary of popular datasets generated by self-instruction. For input/output modalities, I: Image, T: Text, V:
Video, A: Audio. For data composition, M-T and S-T denote multi-turn and single-turn, respectively.
Dataset Sample Modality Source Composition
LLaVA-Instruct 158K I+T→T MS-COCO 23K caption + 58K M-T QA + 77K reasoning
LVIS-Instruct 220K I+T→T LVIS 110K caption + 110K M-T QA
ALLaVA 1.4M I+T→T VFlan, LAION 709K caption + 709K S-T QA
Video-ChatGPT 100K V+T→T ActivityNet 7K description + 4K M-T QA
VideoChat 11K V+T → T WebVid description + summarization + creation
Clotho-Detail 3.9K A+T→T Clotho caption

For example, M3IT [105] proposes to rephrase the original answer by prompting ChatGPT with the original question, answer, and contextual information of the image (e.g. caption and OCR).
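A minimal sketch of the data-adaptation idea: an existing VQA sample is wrapped into the (instruction, multimodal input, response) triplet of §3.2.2 by sampling one of a pool of hand-crafted templates, in the spirit of Table 6. The template strings and field names below are illustrative.

```python
import random

# Hand-crafted candidate instructions in the spirit of Table 6 (illustrative).
VQA_TEMPLATES = [
    "<Image> {question} A short answer to the question is",
    "<Image> Question: {question} Short answer:",
    "<Image> Based on the image, respond to this question with a short answer: {question}",
]

def vqa_to_instruction(sample: dict) -> dict:
    """Convert a raw VQA sample {'image': ..., 'question': ..., 'answer': ...}
    into the (instruction, multimodal input, response) triplet of Sec. 3.2.2."""
    template = random.choice(VQA_TEMPLATES)   # template sampling adds prompt diversity
    return {
        "instruction": template.format(question=sample["question"]),
        "image": sample["image"],             # multimodal input M
        "response": sample["answer"],         # ground-truth response R
    }
```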
Self-Instruction. Although existing multi-task datasets can contribute a rich source of data, they usually do not meet human needs well in real-world scenarios, such as multiple rounds of conversation. To tackle this issue, some works collect samples through self-instruction [106], which utilizes LLMs to generate textual instruction-following data from a few hand-annotated samples. Specifically, some instruction-following samples are hand-crafted as demonstrations, after which ChatGPT/GPT-4 is prompted to generate more instruction samples with the demonstrations as guidance. LLaVA [20] extends the approach to the multimodal field by translating images into texts of captions and bounding boxes, and prompting text-only GPT-4 to generate new data with the guidance of requirements and demonstrations. In this way, a multimodal instruction dataset is constructed, called LLaVA-Instruct-150k. Following this idea, subsequent works such as MiniGPT-4 [21], ChatBridge [104], GPT4Tools [107], and DetGPT [72] develop different datasets catering to different needs. Recently, with the release of the more powerful multimodal model GPT-4V, many works have adopted GPT-4V to generate data of higher quality, as exemplified by LVIS-Instruct4V [91] and ALLaVA [92]. We summarize the popular datasets generated through self-instruction in Table 7.

Data Mixture. Apart from the multimodal instruction data, language-only user-assistant conversation data can also be used to improve conversational proficiency and instruction-following ability [81], [98], [101], [103]. LaVIN [101] directly constructs a minibatch by randomly sampling from both language-only and multimodal data. MultiInstruct [102] probes different strategies for training with a fusion of single-modal and multimodal data, including mixed instruction tuning (combining both types of data and randomly shuffling them) and sequential instruction tuning (text data followed by multimodal data).
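A minimal sketch of the mixed instruction-tuning strategy just described: language-only and multimodal samples are pooled and minibatches are drawn from the shuffled union; sequential tuning would instead iterate over one source and then the other. The batch size and shuffling policy are illustrative.

```python
import random

def mixed_minibatches(multimodal_data: list, text_only_data: list,
                      batch_size: int = 32, seed: int = 0):
    """Yield minibatches drawn from the shuffled union of both data sources
    (mixed instruction tuning). Sequential instruction tuning would instead
    iterate over text_only_data first and multimodal_data afterwards."""
    pool = multimodal_data + text_only_data
    rng = random.Random(seed)
    rng.shuffle(pool)
    for i in range(0, len(pool), batch_size):
        yield pool[i:i + batch_size]
```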

3.2.4 Data Quality

Recent research has revealed that the data quality of instruction-tuning samples is no less important than quantity. Lynx [73] finds that models pre-trained on large-scale but noisy image-text pairs do not perform as well as models pre-trained on smaller but cleaner datasets. Similarly, Wei et al. [108] find that less instruction-tuning data of higher quality can achieve better performance. For data filtering, the work proposes some metrics to evaluate data quality and, correspondingly, a method to automatically filter out inferior vision-language data. Here we discuss two important aspects regarding data quality.

Prompt Diversity. The diversity of instructions has been found to be critical for model performance. Lynx [73] empirically verifies that diverse prompts help improve model performance and generalization ability.

Task Coverage. In terms of tasks involved in training data, Du et al. [109] perform an empirical study and find that the visual reasoning task is superior to captioning and QA tasks for boosting model performance. Moreover, the study suggests that enhancing the complexity of instructions might be more beneficial than increasing task diversity and incorporating fine-grained spatial annotations.

3.3 Alignment tuning

3.3.1 Introduction

Alignment tuning is more often used in scenarios where models need to be aligned with specific human preferences, e.g. responses with fewer hallucinations (see §6). Currently, Reinforcement Learning with Human Feedback (RLHF) and Direct Preference Optimization (DPO) are the two main techniques for alignment tuning. In this section, we introduce

the main ideas of the two techniques in sequence, offer some examples of how they are utilized in addressing practical problems, and finally give a compilation of the related datasets.

3.3.2 Training Detail

RLHF [110], [111]. This technique aims to utilize reinforcement learning algorithms to align LLMs with human preferences, with human annotations as supervision in the training loop. As exemplified in InstructGPT [95], RLHF incorporates three key steps:
1) Supervised fine-tuning. This step aims to fine-tune a pre-trained model to present the preliminary desired output behavior. The fine-tuned model in the RLHF setting is called a policy model. Note that this step might be skipped, since the supervised policy model \pi^{SFT} can be initialized from an instruction-tuned model (see §3.2).
2) Reward modeling. A reward model is trained using preference pairs in this step. Given a multimodal prompt (e.g. image and text) x and a response pair (y_w, y_l), the reward model r_\theta learns to give a higher reward to the preferred response y_w, and vice versa for y_l, according to the following objective (a code sketch of this loss is given after the list):

L(\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim D}\left[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\right]   (3)

where D = \{(x, y_w, y_l)\} is the comparison dataset labeled by human annotators. In practice, the reward model r_\theta shares a similar structure with the policy model.
3) Reinforcement learning. In this step, the Proximal Policy Optimization (PPO) algorithm is adopted to optimize the RL policy model \pi_\phi^{RL}. A per-token KL penalty is often added to the training objective to avoid deviating too far from the original policy [95], resulting in the objective:

L(\phi) = -\mathbb{E}_{x \sim D,\, y \sim \pi_\phi^{RL}(y|x)}\left[r_\theta(x, y) - \beta \cdot \mathbb{D}_{KL}\big(\pi_\phi^{RL}(y|x)\,\|\,\pi^{REF}(y|x)\big)\right]   (4)

where \beta is the coefficient of the KL penalty term. Typically, both the RL policy \pi_\phi^{RL} and the reference model \pi^{REF} are initialized from the supervised model \pi^{SFT}. The obtained RL policy model is expected to align with human preferences through this tuning process.
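The pairwise objective of Eq. (3), referenced from step 2 above, is a binary logistic loss on reward differences. A minimal sketch, assuming the reward model returns one scalar reward per (prompt, response) pair in a batch:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_model, prompts, chosen, rejected) -> torch.Tensor:
    """Eq. (3): encourage r(x, y_w) > r(x, y_l) for the human-preferred response y_w.
    `reward_model(prompts, responses)` is assumed to return a tensor of shape (B,)."""
    r_chosen = reward_model(prompts, chosen)      # rewards for preferred responses
    r_rejected = reward_model(prompts, rejected)  # rewards for dispreferred responses
    # -log sigmoid(r_w - r_l), averaged over the comparison batch.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```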
Researchers have explored using RLHF techniques for better multimodal alignment. For example, LLaVA-RLHF [112] collects human preference data and tunes a model with fewer hallucinations based on LLaVA [20].

DPO [113]. It learns from human preference labels utilizing a simple binary classification loss. Compared with the PPO-based RLHF algorithm, DPO is exempt from learning an explicit reward model, thus simplifying the whole pipeline to two steps, i.e. human preference data collection and preference learning. The learning objective is as follows:

L(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim D}\left[\log \sigma\left(\beta \log \frac{\pi_\phi^{RL}(y_w|x)}{\pi^{REF}(y_w|x)} - \beta \log \frac{\pi_\phi^{RL}(y_l|x)}{\pi^{REF}(y_l|x)}\right)\right]   (5)
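Eq. (5) can be computed directly from the summed log-probabilities that the trainable policy and the frozen reference model assign to the preferred and dispreferred responses. A minimal sketch, with those per-sequence log-probabilities assumed to be precomputed:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w: torch.Tensor, policy_logp_l: torch.Tensor,
             ref_logp_w: torch.Tensor, ref_logp_l: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Eq. (5): each argument is the per-sequence log-probability (shape (B,))
    of the preferred (w) or dispreferred (l) response under the trainable policy
    or the frozen reference model. beta is the usual temperature coefficient."""
    chosen_logratio = policy_logp_w - ref_logp_w      # log pi(y_w|x) - log pi_ref(y_w|x)
    rejected_logratio = policy_logp_l - ref_logp_l    # log pi(y_l|x) - log pi_ref(y_l|x)
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```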
RLHF-V [114] collects fine-grained (segment-level) preference data pairs by correcting hallucinations in the model response and uses the obtained data to perform dense DPO. Silkie [115] instead collects preference data via prompting GPT-4V and distills the preference supervision into an instruction-tuned model through DPO.

3.3.3 Data

The gist of data collection for alignment tuning is to collect feedback for model responses, i.e. to decide which response is better. It is generally more expensive to collect such data, and the amount of data used for this phase is typically even less than that used in previous stages. In this part, we introduce some datasets and summarize them in Table 8.

TABLE 8: A summary of datasets for alignment tuning. For input/output modalities, I: Image, T: Text.
Dataset Sample Modality Source
LLaVA-RLHF [112] 10K I+T→T Human
RLHF-V [114] 5.7K I+T→T Human
VLFeedback [115] 380K I+T→T GPT-4V

LLaVA-RLHF [112]. It contains 10K preference pairs collected from human feedback in terms of honesty and helpfulness. The dataset mainly serves to reduce hallucinations in model responses.
RLHF-V [114]. It has 5.7K fine-grained human feedback data collected through segment-level hallucination corrections.
VLFeedback [115]. It utilizes AI to provide feedback on model responses. The dataset contains more than 380K comparison pairs scored by GPT-4V in terms of helpfulness, faithfulness, and ethical concerns.

4 EVALUATION

Evaluation is an essential part of developing MLLMs since it provides feedback for model optimization and helps to compare the performance of different models. Compared with evaluation methods for traditional multimodal models, the evaluation of MLLMs exhibits several new traits: (1) Since MLLMs are generally versatile, it is important to evaluate MLLMs comprehensively. (2) MLLMs exhibit many emergent capabilities that require special attention (e.g. OCR-free math reasoning) and thus require new evaluation schemes. The evaluation of MLLMs can be broadly categorized into two types according to the question genres: closed-set and open-set.

4.1 Closed-set

Closed-set questions refer to a type of question where the possible answer options are predefined and limited to a finite set. The evaluation is usually performed on task-specific datasets. In this case, the responses can be naturally judged by benchmark metrics [20], [60], [70], [76], [101], [102], [103], [104]. For example, InstructBLIP [60] reports the accuracy on ScienceQA [116], as well as the CIDEr score [117] on NoCaps [118] and Flickr30K [119]. The evaluation settings are typically zero-shot [60], [102], [104], [105] or finetuning [20], [35], [60], [70], [76], [101], [103], [105]. The first setting often selects a wide range of datasets covering different general tasks and splits them into held-in and held-out datasets. After tuning on the former, zero-shot performance is evaluated on the latter with unseen datasets

or even unseen tasks. In contrast, the second setting is often observed in the evaluation of domain-specific tasks. For example, LLaVA [20] and LLaMA-Adapter [76] report fine-tuned performance on ScienceQA [116], and LLaVA-Med [35] reports results on biomedical VQA [120], [121], [122].

The above evaluation methods are usually limited to a small range of selected tasks or datasets, lacking a comprehensive quantitative comparison. To this end, some efforts have endeavored to develop new benchmarks specially designed for MLLMs [123], [124], [125], [126], [127], [128], [129]. For example, Fu et al. [123] construct a comprehensive evaluation benchmark, MME, that includes a total of 14 perception and cognition tasks. All instruction-answer pairs in MME are manually designed to avoid data leakage. MMBench [124] is a benchmark specifically designed for evaluating multiple dimensions of model capabilities, using ChatGPT to match open responses with pre-defined choices. Video-ChatGPT [130] and Video-Bench [131] focus on video domains and propose specialized benchmarks as well as evaluation tools for assessment. There are also evaluation strategies designed to evaluate a specific aspect of the model [102], as exemplified by POPE [132] for assessing the degree of hallucination.

4.2 Open-set

In contrast to closed-set questions, the responses to open-set questions can be more flexible, where MLLMs usually play a chatbot role. Because the content of the chat can be arbitrary, it is trickier to judge than the closed-ended output. The criteria can be classified into manual scoring, GPT scoring, and case study. Manual scoring requires humans to assess the generated responses. This kind of approach often involves hand-crafted questions designed to assess specific dimensions. For example, mPLUG-Owl [81] collects a visually related evaluation set to judge capabilities like natural image understanding, as well as diagram and flowchart understanding. Similarly, GPT4Tools [107] builds two sets for the finetuning and zero-shot performance, respectively, and evaluates the responses in terms of thought, action, arguments, and the whole.

Since manual assessment is labor intensive, some researchers have explored rating with GPT, namely GPT scoring. This approach is often used to evaluate performance on multimodal dialogue. LLaVA [20] proposes to score the responses via text-only GPT-4 in terms of different aspects, such as helpfulness and accuracy. Specifically, 30 images are sampled from the COCO [133] validation set, each associated with a short question, a detailed question, and a complex reasoning question via self-instruction on GPT-4. The answers generated by both the model and GPT-4 are sent to GPT-4 for comparison. Subsequent works follow this idea and prompt ChatGPT [81] or GPT-4 [35], [70], [101], [104], [105] to rate results [35], [70], [81], [101], [104] or to judge which one is better [103].

A main issue of applying text-only GPT-4 as an evaluator is that the judgment is based only on image-related text content, such as captions or bounding box coordinates, without access to the image [35]. Thus, it may be questionable to set GPT-4 as the performance upper bound in this case. With the release of the vision interface of GPT, some works [77], [134] exploit the more advanced GPT-4V model to assess the performance of MLLMs. For example, Woodpecker [77] adopts GPT-4V to judge the response quality of model answers based on the image. The evaluation is expected to be more accurate than using text-only GPT-4 since GPT-4V has direct access to the image.

A supplementary approach is to compare the different capabilities of MLLMs through case studies. For instance, some studies evaluate two typical advanced commercial models, GPT-4V and Gemini. Yang et al. [135] perform an in-depth qualitative analysis of GPT-4V by crafting a series of samples across various domains and tasks, spanning from preliminary skills, such as captioning and object counting, to complex tasks that require world knowledge and reasoning, such as joke understanding and indoor navigation as an embodied agent. Wen et al. [136] make a more focused evaluation of GPT-4V by designing samples targeting automatic driving scenarios. Fu et al. [137] carry out a comprehensive evaluation of Gemini-Pro by comparing the model against GPT-4V. The results suggest that GPT-4V and Gemini exhibit comparable visual reasoning abilities in spite of different response styles.

5 EXTENSIONS

Recent studies have made significant strides in extending the capabilities of MLLMs, spanning from more potent foundational abilities to broader coverage of scenarios. We trace the principal developments of MLLMs in this regard.

Granularity Support. To facilitate better interaction between agents and users, researchers have developed MLLMs with finer support of granularity in terms of model inputs and outputs. On the input side, models that support finer control from user prompts are developed progressively, evolving from images to regions [28], [138], [139] and even pixels [29], [140], [141]. Specifically, Shikra [28] supports region-level input and understanding. Users may interact with the assistant more flexibly by referring to specific regions, which are represented as bounding boxes in natural language form. Ferret [141] takes a step further and supports more flexible referring by devising a hybrid representation scheme. The model supports different forms of prompts, including point, box, and sketch. Similarly, Osprey [29] supports point input by utilizing a segmentation model [9]. Aided by the exceptional capabilities of the pre-trained segmentation model, Osprey enables specifying a single entity or part of it with a single click. On the output side, grounding capabilities are improved in line with the development of input support. Shikra [28] supports responses grounded in the image with box annotations, resulting in higher precision and a finer referring experience. LISA [142] further supports mask-level understanding and reasoning, which makes pixel-level grounding possible.

Modality Support. Increased support for modalities is a tendency of MLLM studies. On the one hand, researchers have explored adapting MLLMs to support the input of more multimodal content, such as 3D point clouds [41], [143], [144], [145]. On the other hand, MLLMs are also extended to generate responses in more modalities, such as image [32], [146], [147], [148], audio [32], [147], [149], [150], and video [32], [151]. For example, NExT-GPT [32]

proposes a framework that supports inputs and outputs in mixed modalities, specifically combinations of text, image, audio, and video, with the help of diffusion models [152], [153] attached to the MLLM. The framework applies an encoder-decoder architecture and uses the LLM as a pivot for understanding and reasoning.

Language Support. Current models are predominantly unilingual, probably due to the fact that high-quality non-English training corpora are scarce. Some works have been devoted to developing multilingual models so that a broader range of users can be covered. VisCPM [33] transfers model capabilities to the multilingual setting by designing a multi-stage training scheme. Specifically, the scheme takes English, with its abundant training corpus, as a pivotal language. Utilizing a pre-trained bilingual LLM, the multimodal capabilities are transferred to Chinese by adding some translated samples during instruction tuning. Taking a similar approach, Qwen-VL [34] is developed from the bilingual LLM Qwen [58] and supports both Chinese and English. During pre-training, Chinese data is mixed into the training corpus to preserve the bilingual capabilities of the model, taking up 22.7% of the whole data volume.

Scenario/Task Extension. Apart from developing common general-purpose assistants, some studies have focused on more specific scenarios where practical conditions should be considered, while others extend MLLMs to downstream tasks that require specific expertise.

A typical tendency is to adapt MLLMs to more specific real-life scenarios. MobileVLM [63] explores developing small-size variants of MLLMs for resource-limited scenarios. Several designs and techniques are utilized for deployment on mobile devices, such as LLMs of smaller size and quantization techniques to speed up computation. Other works develop agents that interact with the real world [41], [154], [155], e.g. user-friendly assistants specially designed for Graphical User Interfaces (GUIs), as exemplified by CogAgent [44], AppAgent [43], and Mobile-Agent [45]. These assistants excel in planning and guiding through each step to fulfill a task specified by users, acting as helpful agents for human-machine interaction. Another line is to augment MLLMs with specific skills for solving tasks in different domains, e.g. document understanding [38], [39], [156], [157] and medical domains [35], [36], [37]. For document understanding, mPLUG-DocOwl [38] utilizes various forms of document-level data for tuning, resulting in an enhanced model for OCR-free document understanding. TextMonkey [39] incorporates multiple tasks related to document understanding to improve model performance. Apart from conventional document image and scene text datasets, position-related tasks are added to reduce hallucinations and help models learn to ground responses in the visual information. MLLMs can also be extended to medical domains by instilling knowledge of the medical domain. For example, LLaVA-Med [158] injects medical knowledge into vanilla LLaVA [20] and develops an assistant specialized in medical image understanding and question answering.

6 MULTIMODAL HALLUCINATION

Multimodal hallucination refers to the phenomenon of responses generated by MLLMs being inconsistent with the image content [77]. As a fundamental and important problem, the issue has received increasing attention. In this section, we briefly introduce some related concepts and research developments.

6.1 Preliminaries

Current research on multimodal hallucinations can be further categorized into three types [159]:

1) Existence Hallucination is the most basic form, meaning that models incorrectly claim the existence of certain objects in the image.
2) Attribute Hallucination means describing the attributes of certain objects in a wrong way, e.g. failing to identify a dog's color correctly. It is typically associated with existence hallucination, since descriptions of the attributes should be grounded in objects present in the image.
3) Relationship Hallucination is a more complex type and is also based on the existence of objects. It refers to false descriptions of relationships between objects, such as relative positions and interactions.

In what follows, we first introduce some specific evaluation methods (§6.2), which are useful to gauge the performance of methods for mitigating hallucinations (§6.3). Then, we will discuss in detail the current methods for reducing hallucinations, according to the main category each method falls into.

6.2 Evaluation Methods

CHAIR [160] is an early metric that evaluates hallucination levels in open-ended captions. The metric measures the proportion of sentences with hallucinated objects, or of hallucinated objects among all the objects mentioned. In contrast, POPE [132] is a method that evaluates closed-set choices. Specifically, multiple prompts with binary choices are formulated, each querying whether a specific object exists in the image. The method also covers more challenging settings to evaluate the robustness of MLLMs, with data statistics taken into consideration. The final evaluation uses a simple watchword mechanism, i.e. detecting the keywords "yes/no", to convert open-ended responses into closed-set binary choices. With a similar evaluation approach, MME [123] provides a more comprehensive evaluation, covering aspects of existence, count, position, and color, as exemplified in [77].
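Both matching-based evaluations above are simple to express in code. The sketch below computes CHAIR-style instance and sentence rates from the objects mentioned in generated captions versus the objects actually present, and shows the keyword ("watchword") rule used to binarize free-form answers in POPE-style protocols; object extraction and synonym handling are deliberately simplified.

```python
def chair_scores(captions_objects, image_objects):
    """captions_objects: list of sets of objects mentioned per generated caption;
    image_objects: list of sets of objects annotated as present in each image.
    Returns (CHAIR_i, CHAIR_s): hallucinated-object rate and hallucinated-caption rate."""
    total_mentions, hallucinated_mentions, hallucinated_captions = 0, 0, 0
    for mentioned, present in zip(captions_objects, image_objects):
        halluc = mentioned - present
        total_mentions += len(mentioned)
        hallucinated_mentions += len(halluc)
        hallucinated_captions += int(bool(halluc))
    chair_i = hallucinated_mentions / max(total_mentions, 1)
    chair_s = hallucinated_captions / max(len(captions_objects), 1)
    return chair_i, chair_s


def to_binary_choice(response: str) -> str:
    """POPE-style watchword rule: map an open-ended answer to 'yes'/'no'
    by simple keyword detection."""
    return "yes" if "yes" in response.lower() else "no"
```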
Different from previous approaches that use matching mechanisms to detect and decide hallucinations, HaELM [161] proposes using text-only LLMs as a judge to automatically decide whether MLLMs' captions are correct against reference captions. In light of the fact that text-only LLMs can only access limited image context and require reference annotations, Woodpecker [77] uses GPT-4V to directly assess model responses grounded in the image. FaithScore [162] is a more fine-grained metric based on a routine that breaks down descriptive sub-sentences and evaluates each sub-sentence separately. Based on previous studies, AMBER [163] is an LLM-free benchmark that encompasses both discriminative tasks and generative tasks and involves three types of possible hallucinations (see §6.1).
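For the judge-based protocols above, the core step can be sketched as a single prompt to a text-only model; the prompt wording and the judge_llm interface below are illustrative assumptions rather than the exact setup of HaELM [161].

def llm_judge_hallucination(judge_llm, reference, response):
    # judge_llm: assumed callable, prompt string -> answer string.
    # reference: reference description/annotations of the image.
    # response: the MLLM-generated caption to be checked.
    prompt = ("Reference description of the image:\n" + reference +
              "\n\nModel response:\n" + response +
              "\n\nDoes the response describe content that is not supported "
              "by the reference? Answer yes or no.")
    return judge_llm(prompt).strip().lower().startswith("yes")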
6.3 Mitigation Methods

According to high-level ideas, the current methods can be roughly divided into three categories: pre-correction, in-process-correction, and post-correction.

Pre-correction. An intuitive and straightforward solution for hallucination is to collect specialized data (e.g. negative data) and use the data for fine-tuning, thus resulting in models with fewer hallucinated responses.

LRV-Instruction [164] introduces a visual instruction tuning dataset. Apart from common positive instructions, the dataset incorporates delicately designed negative instructions at different semantic levels to encourage responses faithful to the image content. LLaVA-RLHF [112] collects human-preference pairs and finetunes models with reinforcement learning techniques, leading to models more aligned with less hallucinated answers.

In-process-correction. Another line is to make improvements in architectural design or feature representation. These works try to explore the reasons for hallucinations and design corresponding remedies to mitigate them in the generation process.

HallE-Switch [159] performs an empirical analysis of possible factors of object existence hallucinations and hypothesizes that existence hallucinations derive from objects not grounded by visual encoders, which are actually inferred based on knowledge embedded in the LLM. Based on this assumption, a continuous controlling factor and a corresponding training scheme are introduced to control the extent of imagination in model output during inference.

VCD [165] suggests that object hallucinations derive from two primary causes, i.e. statistical bias in the training corpus and the strong language prior embedded in LLMs. The authors take notice of the phenomenon that when injecting noise into the image, MLLMs tend to lean towards the language prior rather than the image content for response generation, leading to hallucinations. Correspondingly, this work designs an amplify-then-contrast decoding scheme to offset the false bias.
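A rough sketch of such an amplify-then-contrast step is given below: the next-token distribution conditioned on a distorted image serves as a proxy for the language prior and is subtracted from the distribution conditioned on the clean image. The mllm generation interface is an assumption, and additional constraints used by VCD are omitted for brevity.

import torch

def contrastive_decode_step(logits_clean, logits_noisy, alpha=1.0):
    # logits_clean: next-token logits conditioned on the original image.
    # logits_noisy: next-token logits conditioned on a noised/distorted image.
    contrasted = (1.0 + alpha) * logits_clean - alpha * logits_noisy
    return torch.softmax(contrasted, dim=-1)

# Hypothetical greedy decoding loop:
#   logits_clean = mllm(image, prompt + generated)
#   logits_noisy = mllm(add_noise(image), prompt + generated)
#   next_token = contrastive_decode_step(logits_clean, logits_noisy).argmax(-1)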
HACL [166] investigates the embedding space of vision and language. Based on the observation, a contrastive learning scheme is devised to pull paired cross-modal representations closer while pushing hallucinated text representations away from non-hallucinated ones.

Post-correction. Different from the previous paradigms, post-correction mitigates hallucinations in a post-remedy manner, correcting hallucinations after output generation. Woodpecker [77] is a training-free general framework for hallucination correction. Specifically, the method incorporates expert models to supplement contextual information of the image and crafts a pipeline to correct hallucinations step by step. The method is interpretable in that intermediate results of each step can be checked, and objects are grounded in the image. Another method, LURE [167], trains a specialized revisor to mask objects with high uncertainty in the descriptions and then regenerate the responses.
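A minimal sketch of this revise-and-regenerate idea follows; the way uncertain objects are detected, the placeholder tag, and the revisor interface are illustrative assumptions rather than the exact recipe of LURE [167].

def post_correct(image, caption, uncertain_objects, revisor):
    # uncertain_objects: object words flagged as likely hallucinated
    # (e.g. low-confidence or late-position mentions).
    # revisor: assumed callable (image, masked_caption) -> corrected caption.
    masked = caption
    for obj in uncertain_objects:
        masked = masked.replace(obj, "[UNSURE]")  # placeholder to be rewritten
    return revisor(image, masked)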
7 EXTENDED TECHNIQUES

7.1 Multimodal In-Context Learning

ICL is one of the important emergent abilities of LLMs. There are two good traits of ICL: (1) Different from traditional supervised learning paradigms that learn implicit patterns from abundant data, the crux of ICL is to learn from analogy [168]. Specifically, in the ICL setting, LLMs learn from a few examples along with an optional instruction and extrapolate to new questions, thereby solving complex and unseen tasks in a few-shot manner [22], [169], [170]. (2) ICL is usually implemented in a training-free manner [168] and thus can be flexibly integrated into different frameworks at the inference stage. A closely related technique to ICL is instruction-tuning (see §3.2), which is shown empirically to enhance the ICL ability [19].

In the context of MLLM, ICL has been extended to more modalities, leading to Multimodal ICL (M-ICL). Building upon the setting in §3.2, at inference time, M-ICL can be implemented by adding a demonstration set, i.e. a set of in-context samples, to the original sample. In this case, the template can be extended as illustrated in Table 9. Note that we list two in-context examples for illustration, but the number and the ordering of examples can be flexibly adjusted. In fact, models are commonly sensitive to the arrangement of demonstrations [168], [171].

<BOS> Below are some examples and an instruction that describes a task. Write a response that appropriately completes the request
### Instruction: {instruction}
### Image: <image>
### Response: {response}

### Image: <image>
### Response: {response}
- - - - - - - - - - - - - - - - - - - -
### Image: <image>
### Response: <EOS>

TABLE 9: A simplified example of the template to structure an M-ICL query, adapted from [98]. For illustration, we list two in-context examples and a query divided by a dashed line. {instruction} and {response} are texts from the data sample. <image> is a placeholder to represent the multimodal input (an image in this case). <BOS> and <EOS> are tokens denoting the start and the end of the input to the LLM, respectively.
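As a concrete illustration, a small helper that assembles such a query from in-context samples could look as follows; the instruction and demonstration texts are made-up placeholders, and <BOS>/<EOS> handling is assumed to be done by the tokenizer.

def build_micl_prompt(instruction, demos, query_image="<image>"):
    # demos: list of (image_placeholder, response) in-context examples.
    parts = ["Below are some examples and an instruction that describes a "
             "task. Write a response that appropriately completes the request",
             "### Instruction: " + instruction]
    for image_placeholder, response in demos:
        parts += ["### Image: " + image_placeholder,
                  "### Response: " + response]
    parts += ["### Image: " + query_image, "### Response:"]  # to be completed
    return "\n".join(parts)

prompt = build_micl_prompt(
    "Describe the image briefly.",
    demos=[("<image>", "A dog playing in the snow."),
           ("<image>", "Two children riding bicycles.")])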
7.1.1 Improvement on ICL capabilities

Recently, a growing amount of work has focused on enhancing ICL performance under various scenarios. In this section, we trace the development of this field and summarize some relevant works.

MIMIC-IT [172] combines in-context learning with instruction tuning by building an instruction dataset formatted with multimodal context. The model instruction-tuned on the introduced dataset shows improved few-shot performance in the caption task. Emu [173] extends the idea of Flamingo [74] by introducing extra modalities in model generation and the corresponding training corpus. Aided by the introduced vision decoder, i.e. Stable Diffusion, the model learns from extra vision supervision and supports more flexibility in output format and in-context reasoning. Specifically, apart from answering in pure text, the model can also give responses in the form of images. Sheng et al. [174] adopt a similar idea and try to extend output modalities into both text and image. Instead of adopting a specialized encoder for images, the work adopts a unified quantization scheme with a shared embedding layer.

Some other works explore improving few-shot learning performance under specific settings. Link-context learning [175] focuses on strengthening the causal link between image-label pairs and casts a contrast training scheme by formulating positive and negative image-description pairs. MMICL [176] aims to augment the capabilities in reasoning with multiple related images. To strengthen the link between image and text, the work proposes a context scheme to transform interleaved image-text data into a uniform format. Jeong [177] finds that when inserting a small fraction of incoherent images/text as noise, MLLMs can be misled to give responses inconsistent with the context. Based on the observation, the work accordingly proposes a pre-filtering method to remove irrelevant context and facilitate more coherent responses.

7.1.2 Applications

In terms of applications in multimodality, M-ICL is mainly used in two scenarios: (1) solving various visual reasoning tasks [22], [74], [178], [179], [180] and (2) teaching LLMs to use external tools [169], [170], [181]. The former usually involves learning from a few task-specific examples and generalizing to a new but similar question. From the information provided in instructions and demonstrations, LLMs get a sense of what the task is doing and what the output template is, and finally generate expected answers. In contrast, examples of tool usage are more fine-grained. They typically comprise a chain of steps that could be sequentially executed to fulfill the task. Thus, the second scenario is closely related to CoT (see §7.2).

7.2 Multimodal Chain of Thought

As the pioneer work [8] points out, CoT is “a series of intermediate reasoning steps”, which has been proven to be effective in complex reasoning tasks [8], [182], [183]. The main idea of CoT is to prompt LLMs to output not only the final answer but also the reasoning process that leads to the answer, resembling the cognitive process of humans.

Inspired by the success in NLP, multiple works [184], [185], [186], [187] have been proposed to extend the unimodal CoT to Multimodal CoT (M-CoT). We first introduce different paradigms for acquiring the M-CoT ability (§7.2.1). Then, we delineate more specific aspects of M-CoT, including the chain configuration (§7.2.2) and the pattern (§7.2.3).

7.2.1 Learning Paradigms

The learning paradigm is also an aspect worth investigating. There are broadly three ways to acquire the M-CoT ability, i.e. through finetuning and through training-free few/zero-shot learning. The sample size requirement for the three ways is in descending order.

Intuitively, the finetuning approach often involves curating specific datasets for M-CoT learning. For example, Lu et al. [116] construct a scientific question-answering dataset ScienceQA with lectures and explanations, which can serve as sources of learning CoT reasoning, and finetune the model on this proposed dataset. Multimodal-CoT [185] also uses the ScienceQA benchmark but generates the output in a two-step fashion, i.e. the rationale (chain of reasoning steps) and the final answer based on the rationale. CoT-PT [187] learns an implicit chain of reasoning through a combination of prompt tuning and step-specific visual bias.

Compared with finetuning, few/zero-shot learning is more computationally efficient. The main difference between them is that few-shot learning typically requires hand-crafting some in-context examples so that the model can learn to reason step by step more easily. In contrast, zero-shot learning does not require any specific example for CoT learning. In this case, models learn to use the embedded knowledge and the reasoning ability without explicit guidance, by prompting with designed instructions like “Let's think frame by frame” or “What happened between these two keyframes” [184], [186]. Similarly, some works [22], [188] prompt models with descriptions of the task and tool usage to decompose complex tasks into sub-tasks.
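The two-step rationale-then-answer inference of Multimodal-CoT [185] mentioned above can be sketched as follows; model(image, text) is an assumed generation interface and the prompt wording is illustrative.

def two_stage_mcot(model, image, question, context=""):
    # Step 1: generate the rationale from the multimodal input.
    rationale = model(image, context + "\nQuestion: " + question +
                      "\nSolution: Let's think step by step.")
    # Step 2: generate the final answer conditioned on the rationale.
    answer = model(image, context + "\nQuestion: " + question +
                   "\nRationale: " + rationale + "\nAnswer:")
    return rationale, answer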
7.2.2 Chain Configuration

Structure and length are two critical aspects of the reasoning chains. In terms of structure, current methods can be divided into single-chain and tree-shaped methods. Reasoning with a single chain is a paradigm widely used in various methods [116], [185]. Specifically, the step-by-step reasoning process forms a single question-rationale-answer chain. Recently, some methods have explored using a more complicated scheme, i.e. a tree-shaped chain, for reasoning. Specifically, DDCoT [189] breaks down a question into multiple sub-questions, each of which is solved by the LLM itself or by visual experts to generate rationales. Then the LLM aggregates and reasons with the rationales to form the final answer. With respect to chain length, the configuration can be categorized into adaptive and pre-defined formations. The former requires LLMs to decide on their own when to halt the reasoning chains [22], [116], [169], [170], [185], [188], while the latter stops the chains at a pre-defined length [79], [184], [186], [187].
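A tree-shaped configuration of this kind (cf. DDCoT [189]) can be sketched as below; llm(text) and vqa_expert(image, text) are assumed interfaces, and the decomposition prompt is illustrative rather than the exact one used in [189].

def tree_shaped_cot(llm, vqa_expert, image, question):
    # 1) Decompose the question into sub-questions (one per line).
    subs = llm("Decompose the question into sub-questions, one per line:\n" +
               question)
    # 2) Answer each sub-question with a visual expert to collect rationales.
    rationales = ["Q: " + sq + " A: " + vqa_expert(image, sq)
                  for sq in subs.splitlines() if sq.strip()]
    # 3) Aggregate the rationales into the final answer.
    return llm("Answer the question using the rationales below.\nQuestion: " +
               question + "\nRationales:\n" + "\n".join(rationales))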
7.2.3 Generation Patterns

How the chain is constructed is a question worth studying. We summarize the current works into (1) an infilling-based pattern and (2) a predicting-based pattern. Specifically, the infilling-based pattern demands deducing steps between surrounding context (previous and following steps) to fill the logical gaps [184], [186]. In contrast, the predicting-based pattern requires extending the reasoning chains given conditions such as instructions and previous reasoning history [22], [116], [169], [170], [185], [188]. The two types of patterns share the requirement that the generated steps should be consistent and correct.

7.3 LLM-Aided Visual Reasoning

7.3.1 Introduction

Inspired by the success of tool-augmented LLMs [190], [191], [192], [193], some research has explored the possibility of invoking external tools [22], [107], [169], [170] or vision foundation models [22], [79], [80], [188], [194], [195], [196] for visual reasoning tasks.
Taking LLMs as helpers with different roles, these works build task-specific [79], [197], [198] or general-purpose [22], [169], [170], [181], [188] visual reasoning systems.

Compared with conventional visual reasoning models [199], [200], [201], these works manifest several good traits: (1) Strong generalization abilities. Equipped with rich open-world knowledge learned from large-scale pretraining, these systems can easily generalize to unseen objects or concepts with remarkable zero/few-shot performance [169], [170], [195], [197], [198], [202]. (2) Emergent abilities. Aided by the strong reasoning abilities of LLMs, these systems can perform complex tasks. For example, given an image, MM-REACT [22] can interpret the meaning beneath the surface, such as explaining why a meme is funny. (3) Better interactivity and control. Traditional models typically allow a limited set of control mechanisms and often entail expensive curated datasets [203], [204]. In contrast, LLM-based systems enable fine-grained control through a user-friendly interface (e.g. clicks and natural language queries) [79].

For this part, we start with introducing different training paradigms employed in the construction of LLM-Aided Visual Reasoning systems (§7.3.2). Then, we delve into the primary roles that LLMs play within these systems (§7.3.3).

7.3.2 Training Paradigms

According to training paradigms, LLM-Aided Visual Reasoning systems can be divided into two types, i.e. training-free and finetuning.

Training-free. With abundant prior knowledge stored in pre-trained LLMs, an intuitive and simple way is to freeze pre-trained models and directly prompt LLMs to fulfill various needs. According to the setting, the reasoning systems can be further categorized into few-shot models [22], [169], [170], [181] and zero-shot models [79], [197]. The few-shot models entail a few hand-crafted in-context samples (see §7.1) to guide LLMs to generate a program or a sequence of execution steps. These programs or execution steps serve as instructions for corresponding foundation models or external tools/modules. The zero-shot models take a step further by directly utilizing LLMs' linguistics/semantics knowledge or reasoning abilities. For example, PointCLIP V2 [197] prompts GPT-3 to generate descriptions with 3D-related semantics for better alignment with corresponding images. In CAT [79], LLMs are instructed to refine the captions according to user queries.

Finetuning. Some works adopt further finetuning to improve the planning abilities with respect to tool usage [107] or to improve the localization capabilities [142], [205] of the system. For example, GPT4Tools [107] introduces the instruction-tuning approach (see §3.2). Accordingly, a new tool-related instruction dataset is collected and used to finetune the model.
7.3.3 Functions

In order to further inspect what roles LLMs exactly play in LLM-Aided Visual Reasoning systems, existing related works are divided into three types:
• LLM as a Controller
• LLM as a Decision Maker
• LLM as a Semantics Refiner

The first two roles are related to CoT (see §7.2), which is frequently used because complex tasks need to be broken down into intermediate simpler steps. When LLMs act as controllers, the systems often finish the task in a single round, while multi-round interaction is more common in the case of the decision maker. We delineate how LLMs serve these roles in the following parts.

LLM as a Controller. In this case, LLMs act as a central controller that (1) breaks down a complex task into simpler sub-tasks/steps and (2) assigns these tasks to appropriate tools/modules. The first step is often finished by leveraging the CoT ability of LLMs. Specifically, LLMs are prompted explicitly to output task planning [181] or, more directly, the modules to call [107], [169], [170]. For example, VisProg [170] prompts GPT-3 to output a visual program, where each program line invokes a module to perform a sub-task. In addition, LLMs are required to output argument names for the module input. To handle these complex requirements, some hand-crafted in-context examples are used as references [169], [170], [181]. This is closely related to the optimization of reasoning chains (see §7.2), or more specifically, the least-to-most prompting [206] technique. In this way, complex problems are broken down into sub-problems that are solved sequentially.
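To make the controller pattern concrete, the sketch below lets an LLM emit a plan as a list of module calls that the host program then dispatches; the JSON plan format, the tool registry, and the stub tools are illustrative assumptions rather than the actual interfaces of VisProg [170] or HuggingGPT [181].

import json

TOOLS = {  # illustrative registry; real systems register vision experts here
    "caption": lambda image: "a man riding a horse",        # stub
    "detect": lambda image, query: [(10, 20, 50, 80)],      # stub boxes
    "count": lambda boxes: len(boxes),
}

def run_controller(llm, image, task):
    # The LLM (guided by hand-crafted in-context examples, omitted) returns
    # e.g. [{"tool": "detect", "args": {"query": "dog"}, "out": "boxes"},
    #       {"tool": "count", "args": {"boxes": "$boxes"}, "out": "n"}].
    plan = json.loads(llm("Plan tool calls as a JSON list for: " + task))
    memory = {"image": image}
    for step in plan:
        args = {k: memory[v[1:]] if isinstance(v, str) and v.startswith("$")
                else v for k, v in step.get("args", {}).items()}
        if step["tool"] in ("caption", "detect"):
            args.setdefault("image", image)  # vision modules see the image
        memory[step["out"]] = TOOLS[step["tool"]](**args)
    return memory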
LLM as a Decision Maker. In this case, complex tasks are solved in a multi-round manner, often in an iterative way [195]. Decision-makers often fulfill the following responsibilities: (1) Summarize the current context and the history information, and decide if the information available at the current step is sufficient to answer the question or complete the task; (2) Organize and summarize the answer to present it in a user-friendly way.
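The multi-round behavior can be sketched as a loop in which the LLM either answers or requests more visual evidence; llm(text) and ask_vision(query) are assumed interfaces, and the control tokens are illustrative.

def decision_maker_loop(llm, ask_vision, question, max_rounds=4):
    evidence = []
    for _ in range(max_rounds):
        verdict = llm("Question: " + question +
                      "\nEvidence so far:\n" + "\n".join(evidence) +
                      "\nIf the evidence is sufficient, reply 'ANSWER: ...'; "
                      "otherwise reply 'ASK: <a sub-question for the vision "
                      "model>'.")
        if verdict.startswith("ANSWER:"):
            return verdict[len("ANSWER:"):].strip()          # final answer
        evidence.append(ask_vision(verdict[len("ASK:"):].strip()))
    # Fall back to summarizing whatever evidence was gathered.
    return llm("Give the best final answer.\nQuestion: " + question +
               "\nEvidence:\n" + "\n".join(evidence))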
LLM as a Semantics Refiner. When the LLM is used as a Semantics Refiner, researchers mainly utilize its rich linguistics and semantics knowledge. Specifically, LLMs are often instructed to integrate information into consistent and fluent natural language sentences [202] or to generate texts according to different specific needs [79], [197], [198].

8 CHALLENGES AND FUTURE DIRECTIONS

The development of MLLMs is still in a rudimentary stage and thus leaves much room for improvement, which we summarize below:
• Current MLLMs are limited in processing multimodal information of long context. This restricts the development of advanced models with more multimodal tokens, e.g. long-video understanding, and long documents interleaved with images and text.
• MLLMs should be upgraded to follow more complicated instructions. For example, a mainstream approach to generating high-quality question-answer pair data is still prompting closed-source GPT-4V because of its advanced instruction-following capabilities, which other models generally fail to match.
• There is still a large space for improvement in techniques like M-ICL and M-CoT. Current research on the two techniques is still rudimentary, and the related capabilities of MLLMs are weak. Thus, explorations of the underlying mechanisms and potential improvements are promising.
• Developing embodied agents based on MLLMs is a heated topic. It would be meaningful to develop such agents that can interact with the real world. Such endeavors require models with critical capabilities, including perception, reasoning, planning, and execution.
• Safety issues. Similar to LLMs, MLLMs can be vulnerable to crafted attacks [177], [207], [208]. In other words, MLLMs can be misled to output biased or undesirable responses. Thus, improving model safety will be an important topic.

9 CONCLUSION

In this paper, we perform a survey of the existing MLLM literature and offer a broad view of its main directions, including the basic recipe and related extensions. Moreover, we underscore the current research gaps that need to be filled and point out some promising research directions. We hope this survey can offer readers a clear picture of the current progress of MLLM and inspire more works.
REFERENCES

[1] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong et al., “A survey of large language models,” arXiv:2303.18223, 2023. 1
[2] OpenAI, “Chatgpt: A language model for conversational ai,” OpenAI, Tech. Rep., 2023. [Online]. Available: https://www.openai.com/research/chatgpt 1, 6
[3] ——, “Gpt-4 technical report,” arXiv:2303.08774, 2023. 1
[4] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez et al., “Vicuna: An open-source chatbot impressing gpt-4 with 90% chatgpt quality,” 2023. [Online]. Available: https://vicuna.lmsys.org 1, 3, 4
[5] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv:2302.13971, 2023. 1, 3, 4
[6] B. Peng, C. Li, P. He, M. Galley, and J. Gao, “Instruction tuning with gpt-4,” arXiv:2304.03277, 2023. 1
[7] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” NeurIPS, 2020. 1, 3, 6
[8] J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. Chi, Q. Le, and D. Zhou, “Chain of thought prompting elicits reasoning in large language models,” arXiv:2201.11903, 2022. 1, 12
[9] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., “Segment anything,” arXiv:2304.02643, 2023. 1, 9
[10] Y. Shen, C. Fu, P. Chen, M. Zhang, K. Li, X. Sun, Y. Wu, S. Lin, and R. Ji, “Aligning and prompting everything all at once for universal visual perception,” in CVPR, 2024. 1
[11] H. Zhang, F. Li, S. Liu, L. Zhang, H. Su, J. Zhu, L. M. Ni, and H.-Y. Shum, “Dino: Detr with improved denoising anchor boxes for end-to-end object detection,” arXiv:2203.03605, 2022. 1
[12] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby et al., “Dinov2: Learning robust visual features without supervision,” arXiv:2304.07193, 2023. 1
[13] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in ICML, 2021. 1, 2, 3, 5
[14] J. Li, R. Selvaraju, A. Gotmare, S. Joty, C. Xiong, and S. C. H. Hoi, “Align before fuse: Vision and language representation learning with momentum distillation,” NeurIPS, 2021. 1
[15] Y.-C. Chen, L. Li, L. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng, and J. Liu, “Uniter: Universal image-text representation learning,” in ECCV, 2020. 1
[16] P. Wang, A. Yang, R. Men, J. Lin, S. Bai, Z. Li, J. Ma, C. Zhou, J. Zhou, and H. Yang, “Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework,” in ICML, 2022. 1
[17] J. Cho, J. Lei, H. Tan, and M. Bansal, “Unifying vision-and-language tasks via text generation,” in ICML, 2021. 1
[18] Z. Wang, J. Yu, A. W. Yu, Z. Dai, Y. Tsvetkov, and Y. Cao, “Simvlm: Simple visual language model pretraining with weak supervision,” arXiv:2108.10904, 2021. 1
[19] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le, “Finetuned language models are zero-shot learners,” arXiv:2109.01652, 2021. 1, 6, 11
[20] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” arXiv:2304.08485, 2023. 1, 4, 6, 7, 8, 9, 10
[21] D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “Minigpt-4: Enhancing vision-language understanding with advanced large language models,” arXiv:2304.10592, 2023. 1, 2, 6, 7
[22] Z. Yang, L. Li, J. Wang, K. Lin, E. Azarnasab, F. Ahmed, Z. Liu, C. Liu, M. Zeng, and L. Wang, “Mm-react: Prompting chatgpt for multimodal reasoning and action,” arXiv:2303.11381, 2023. 1, 11, 12, 13
[23] D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu et al., “Palm-e: An embodied multimodal language model,” arXiv:2303.03378, 2023. 1
[24] A. Awadalla, I. Gao, J. Gardner, J. Hessel, Y. Hanafy, W. Zhu, K. Marathe, Y. Bitton, S. Gadre, S. Sagawa et al., “Openflamingo: An open-source framework for training large autoregressive vision-language models,” arXiv:2308.01390, 2023. 1
[25] K. Li, Y. He, Y. Wang, Y. Li, W. Wang, P. Luo, Y. Wang, L. Wang, and Y. Qiao, “Videochat: Chat-centric video understanding,” arXiv:2305.06355, 2023. 1, 4, 6
[26] H. Zhang, X. Li, and L. Bing, “Video-llama: An instruction-tuned audio-visual language model for video understanding,” arXiv:2306.02858, 2023. 1, 4
[27] S. Deshmukh, B. Elizalde, R. Singh, and H. Wang, “Pengi: An audio language model for audio tasks,” NeurIPS, 2024. 1, 3
[28] K. Chen, Z. Zhang, W. Zeng, R. Zhang, F. Zhu, and R. Zhao, “Shikra: Unleashing multimodal llm’s referential dialogue magic,” arXiv:2306.15195. 1, 9
[29] Y. Yuan, W. Li, J. Liu, D. Tang, X. Luo, C. Qin, L. Zhang, and J. Zhu, “Osprey: Pixel understanding with visual instruction tuning,” arXiv:2312.10032. 1, 2, 9
[30] J. Han, R. Zhang, W. Shao, P. Gao, P. Xu, H. Xiao, K. Zhang, C. Liu, S. Wen, Z. Guo et al., “Imagebind-llm: Multi-modality instruction tuning,” arXiv:2309.03905, 2023. 1, 3
[31] S. Moon, A. Madotto, Z. Lin, T. Nagarajan, M. Smith, S. Jain, C.-F. Yeh, P. Murugesan, P. Heidari, Y. Liu et al., “Anymal: An efficient and scalable any-modality augmented language model,” arXiv:2309.16058, 2023. 1
[32] S. Wu, H. Fei, L. Qu, W. Ji, and T.-S. Chua, “Next-gpt: Any-to-any multimodal llm,” arXiv:2309.05519, 2023. 1, 9
[33] J. Hu, Y. Yao, C. Wang, S. Wang, Y. Pan, Q. Chen, T. Yu, H. Wu, Y. Zhao, H. Zhang et al., “Large multilingual models pivot zero-shot multimodal learning across languages,” arXiv:2308.12038, 2023. 1, 10
[34] J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou, “Qwen-vl: A frontier large vision-language model with versatile abilities,” arXiv:2308.12966, 2023. 1, 3, 4, 10
[35] C. Li, C. Wong, S. Zhang, N. Usuyama, H. Liu, J. Yang, T. Naumann, H. Poon, and J. Gao, “Llava-med: Training a large language-and-vision assistant for biomedicine in one day,” arXiv:2306.00890, 2023. 1, 4, 6, 8, 9, 10
[36] M. Moor, Q. Huang, S. Wu, M. Yasunaga, Y. Dalmia, J. Leskovec, C. Zakka, E. P. Reis, and P. Rajpurkar, “Med-flamingo: a multimodal medical few-shot learner,” in Machine Learning for Health (ML4H), 2023. 1, 10
[37] X. Zhang, C. Wu, Z. Zhao, W. Lin, Y. Zhang, Y. Wang, and W. Xie, “Pmc-vqa: Visual instruction tuning for medical visual question answering,” arXiv:2305.10415, 2023. 1, 4, 6, 10
[38] J. Ye, A. Hu, H. Xu, Q. Ye, M. Yan, Y. Dan, C. Zhao, G. Xu, C. Li, J. Tian et al., “mplug-docowl: Modularized multimodal large language model for document understanding,” arXiv:2307.02499, 2023. 1, 10
[39] Y. Liu, B. Yang, Q. Liu, Z. Li, Z. Ma, S. Zhang, and X. Bai, “Textmonkey: An ocr-free large multimodal model for understanding document,” arXiv:2403.04473, 2024. 1, 10
[40] A. Hu, Y. Shi, H. Xu, J. Ye, Q. Ye, M. Yan, C. Li, Q. Qian, J. Zhang, and F. Huang, “mplug-paperowl: Scientific diagram analysis with the multimodal large language model,” arXiv:2311.18248, 2023. 1
[41] J. Huang, S. Yong, X. Ma, X. Linghu, P. Li, Y. Wang, Q. Li, S.-C. [65] S. Shen, L. Hou, Y. Zhou, N. Du, S. Longpre, J. Wei, H. W. Chung,
Zhu, B. Jia, and S. Huang, “An embodied generalist agent in 3d B. Zoph, W. Fedus, X. Chen et al., “Mixture-of-experts meets
world,” arXiv:2311.12871, 2023. 1, 9, 10 instruction tuning: A winning combination for large language
[42] Z. Peng, W. Wang, L. Dong, Y. Hao, S. Huang, S. Ma, and F. Wei, models,” arXiv:2305.14705, 2023. 3
“Kosmos-2: Grounding multimodal large language models to the [66] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary,
world,” arXiv:2306.14824, 2023. 1 C. Bamford, D. S. Chaplot, D. d. l. Casas, E. B. Hanna, F. Bressand
[43] Z. Yang, J. Liu, Y. Han, X. Chen, Z. Huang, B. Fu, and et al., “Mixtral of experts,” arXiv:2401.04088, 2024. 3
G. Yu, “Appagent: Multimodal agents as smartphone users,” [67] W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling
arXiv:2312.13771, 2023. 1, 10 to trillion parameter models with simple and efficient sparsity,”
[44] W. Hong, W. Wang, Q. Lv, J. Xu, W. Yu, J. Ji, Y. Wang, Z. Wang, JMLR, 2022. 3
Y. Dong, M. Ding et al., “Cogagent: A visual language model for [68] B. Lin, Z. Tang, Y. Ye, J. Cui, B. Zhu, P. Jin, J. Zhang, M. Ning, and
gui agents,” arXiv:2312.08914, 2023. 1, 3, 10 L. Yuan, “Moe-llava: Mixture of experts for large vision-language
[45] J. Wang, H. Xu, J. Ye, M. Yan, W. Shen, J. Zhang, F. Huang, and models,” arXiv:2401.15947, 2024. 3
J. Sang, “Mobile-agent: Autonomous multi-modal mobile device [69] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and
agent with visual perception,” arXiv:2401.16158, 2024. 1, 10 S. Zagoruyko, “End-to-end object detection with transformers,”
[46] M. Cherti, R. Beaumont, R. Wightman, M. Wortsman, G. Ilharco, in ECCV, 2020. 4
C. Gordon, C. Schuhmann, L. Schmidt, and J. Jitsev, “Repro- [70] F. Chen, M. Han, H. Zhao, Q. Zhang, J. Shi, S. Xu, and B. Xu, “X-
ducible scaling laws for contrastive language-image learning,” llm: Bootstrapping advanced large language models by treating
in CVPR, 2023. 2, 3 multi-modalities as foreign languages,” arXiv:2305.04160, 2023. 4,
[47] Q. Sun, Y. Fang, L. Wu, X. Wang, and Y. Cao, “Eva-clip: Improved 6, 8, 9
training techniques for clip at scale,” arXiv:2303.15389, 2023. 2, 3 [71] Y. Su, T. Lan, H. Li, J. Xu, Y. Wang, and D. Cai, “Pandagpt: One
[48] Y. Fang, W. Wang, B. Xie, Q. Sun, L. Wu, X. Wang, T. Huang, model to instruction-follow them all,” arXiv:2305.16355, 2023. 4,
X. Wang, and Y. Cao, “Eva: Exploring the limits of masked visual 6
representation learning at scale,” in CVPR, 2023. 2 [72] R. Pi, J. Gao, S. Diao, R. Pan, H. Dong, J. Zhang, L. Yao, J. Han,
[49] R. Bavishi, E. Elsen, C. Hawthorne, M. Nye, A. Odena, A. Somani, H. Xu, and L. K. T. Zhang, “Detgpt: Detect what you need via
and S. Taşırlar, “Introducing our multimodal models,” 2023. reasoning,” arXiv:2305.14167, 2023. 4, 7
[Online]. Available: https://www.adept.ai/blog/fuyu-8b 2 [73] Y. Zeng, H. Zhang, J. Zheng, J. Xia, G. Wei, Y. Wei, Y. Zhang, and
[50] H. Liu, C. Li, Y. Li, and Y. J. Lee, “Improved baselines with visual T. Kong, “What matters in training a gpt4-style language model
instruction tuning,” arXiv:2310.03744, 2023. 3, 4 with multimodal inputs?” arXiv:2307.02469, 2023. 4, 7
[51] Z. Li, B. Yang, Q. Liu, Z. Ma, S. Zhang, J. Yang, Y. Sun, Y. Liu, and [74] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson,
X. Bai, “Monkey: Image resolution and text label are important K. Lenc, A. Mensch, K. Millican, M. Reynolds et al., “Flamingo: a
things for large multi-modal models,” arXiv:2311.06607, 2023. 3 visual language model for few-shot learning,” NeurIPS, 2022. 4,
[52] B. McKinzie, Z. Gan, J.-P. Fauconnier, S. Dodge, B. Zhang, 11, 12
P. Dufter, D. Shah, X. Du, F. Peng, F. Weers et al., “Mm1: [75] W. Wang, Q. Lv, W. Yu, W. Hong, J. Qi, Y. Wang, J. Ji, Z. Yang,
Methods, analysis & insights from multimodal llm pre-training,” L. Zhao, X. Song et al., “Cogvlm: Visual expert for pretrained
arXiv:2403.09611, 2024. 3, 4 language models,” arXiv:2311.03079, 2023. 4
[53] Z. Lin, C. Liu, R. Zhang, P. Gao, L. Qiu, H. Xiao, H. Qiu, C. Lin, [76] R. Zhang, J. Han, A. Zhou, X. Hu, S. Yan, P. Lu, H. Li, P. Gao,
W. Shao, K. Chen et al., “Sphinx: The joint mixing of weights, and Y. Qiao, “Llama-adapter: Efficient fine-tuning of language
tasks, and visual embeddings for multi-modal large language models with zero-init attention,” arXiv:2303.16199, 2023. 4, 6, 8, 9
models,” arXiv:2311.07575, 2023. 3 [77] S. Yin, C. Fu, S. Zhao, T. Xu, H. Wang, D. Sui, Y. Shen, K. Li,
[54] B. Elizalde, S. Deshmukh, M. Al Ismail, and H. Wang, “Clap X. Sun, and E. Chen, “Woodpecker: Hallucination correction for
learning audio concepts from natural language supervision,” in multimodal large language models,” arXiv:2310.16045, 2023. 4, 9,
ICASSP, 2023. 3 10, 11
[55] R. Girdhar, A. El-Nouby, Z. Liu, M. Singh, K. V. Alwala, A. Joulin, [78] J. Guo, J. Li, D. Li, A. M. H. Tiong, B. Li, D. Tao, and S. Hoi, “From
and I. Misra, “Imagebind: One embedding space to bind them images to textual prompts: Zero-shot visual question answering
all,” in CVPR, 2023. 3 with frozen large language models,” in CVPR, 2023. 4
[56] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, E. Li, [79] T. Wang, J. Zhang, J. Fei, Y. Ge, H. Zheng, Y. Tang, Z. Li, M. Gao,
X. Wang, M. Dehghani, S. Brahma et al., “Scaling instruction- S. Zhao, Y. Shan et al., “Caption anything: Interactive image
finetuned language models,” arXiv:2210.11416, 2022. 3, 4, 6 description with diverse multimodal controls,” arXiv:2305.02677,
[57] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, 2023. 4, 12, 13
Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale [80] D. Zhu, J. Chen, K. Haydarov, X. Shen, W. Zhang, and M. El-
et al., “Llama 2: Open foundation and fine-tuned chat models,” hoseiny, “Chatgpt asks, blip-2 answers: Automatic questioning
arXiv:2307.09288, 2023. 3, 4 towards enriched visual descriptions,” arXiv:2303.06594, 2023. 4,
[58] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, 12
W. Ge, Y. Han, F. Huang et al., “Qwen technical report,” [81] Q. Ye, H. Xu, G. Xu, J. Ye, M. Yan, Y. Zhou, J. Wang, A. Hu,
arXiv:2309.16609, 2023. 3, 4, 10 P. Shi, Y. Shi et al., “mplug-owl: Modularization empowers large
[59] J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language models with multimodality,” arXiv:2304.14178, 2023. 4,
language-image pre-training with frozen image encoders and 7, 9
large language models,” arXiv:2301.12597, 2023. 3, 4 [82] W. Wang, Z. Chen, X. Chen, J. Wu, X. Zhu, G. Zeng, P. Luo, T. Lu,
[60] W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, J. Zhou, Y. Qiao et al., “Visionllm: Large language model is also an
B. Li, P. Fung, and S. Hoi, “Instructblip: Towards general- open-ended decoder for vision-centric tasks,” arXiv:2305.11175,
purpose vision-language models with instruction tuning,” 2023. 4, 6
arXiv:2305.06500, 2023. 3, 4, 6, 7, 8 [83] L. Chen, J. Li, X. Dong, P. Zhang, C. He, J. Wang, F. Zhao, and
[61] H. Liu, C. Li, Y. Li, B. Li, Y. Zhang, S. Shen, and Y. J. Lee, D. Lin, “Sharegpt4v: Improving large multi-modal models with
“Llava-next: Improved reasoning, ocr, and world knowledge,” better captions,” arXiv:2311.12793, 2023. 5
January 2024. [Online]. Available: https://llava-vl.github.io/ [84] P. Sharma, N. Ding, S. Goodman, and R. Soricut, “Conceptual
blog/2024-01-30-llava-next/ 3 captions: A cleaned, hypernymed, image alt-text dataset for
[62] Y. Lu, C. Li, H. Liu, J. Yang, J. Gao, and Y. Shen, “An empir- automatic image captioning,” in ACL, 2018. 5
ical study of scaling instruct-tuned large multimodal models,” [85] S. Changpinyo, P. Sharma, N. Ding, and R. Soricut, “Conceptual
arXiv:2309.09958, 2023. 3 12m: Pushing web-scale image-text pre-training to recognize
[63] X. Chu, L. Qiao, X. Lin, S. Xu, Y. Yang, Y. Hu, F. Wei, long-tail visual concepts,” in CVPR, 2021. 5
X. Zhang, B. Zhang, X. Wei et al., “Mobilevlm: A fast, repro- [86] V. Ordonez, G. Kulkarni, and T. Berg, “Im2text: Describing im-
ducible and strong vision language assistant for mobile devices,” ages using 1 million captioned photographs,” NeurIPS, 2011. 5
arXiv:2312.16886, 2023. 3, 10 [87] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman,
[64] X. Chu, L. Qiao, X. Zhang, S. Xu, F. Wei, Y. Yang, X. Sun, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman et al.,
Y. Hu, X. Lin, B. Zhang et al., “Mobilevlm v2: Faster and stronger “Laion-5b: An open large-scale dataset for training next genera-
baseline for vision language model,” arXiv:2402.03766, 2024. 3 tion image-text models,” NeurIPS, 2022. 5
[88] C. Schuhmann, A. Köpf, R. Vencu, T. Coombes, and R. Beau- [111] N. Stiennon, L. Ouyang, J. Wu, D. Ziegler, R. Lowe, C. Voss,
mont, “Laion coco: 600m synthetic captions from laion2b-en.” A. Radford, D. Amodei, and P. F. Christiano, “Learning to sum-
https://laion.ai/blog/laion-coco/, 2022. 5 marize with human feedback,” NeurIPS, 2020. 8
[89] J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language- [112] Z. Sun, S. Shen, S. Cao, H. Liu, C. Li, Y. Shen, C. Gan, L.-Y. Gui,
image pre-training for unified vision-language understanding Y.-X. Wang, Y. Yang et al., “Aligning large multimodal models
and generation,” in ICML, 2022. 5 with factually augmented rlhf,” arXiv:2309.14525, 2023. 8, 11
[90] M. Byeon, B. Park, H. Kim, S. Lee, W. Baek, and S. Kim, [113] R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and
“Coyo-700m: Image-text pair dataset,” https://github.com/ C. Finn, “Direct preference optimization: Your language model is
kakaobrain/coyo-dataset, 2022. 5 secretly a reward model,” NeurIPS, 2023. 8
[91] J. Wang, L. Meng, Z. Weng, B. He, Z. Wu, and Y.-G. Jiang, “To [114] T. Yu, Y. Yao, H. Zhang, T. He, Y. Han, G. Cui, J. Hu, Z. Liu,
see is to believe: Prompting gpt-4v for better visual instruction H.-T. Zheng, M. Sun et al., “Rlhf-v: Towards trustworthy mllms
tuning,” arXiv:2311.07574, 2023. 5, 7 via behavior alignment from fine-grained correctional human
[92] G. H. Chen, S. Chen, R. Zhang, J. Chen, X. Wu, Z. Zhang, feedback,” arXiv:2312.00849, 2023. 8
Z. Chen, J. Li, X. Wan, and B. Wang, “Allava: Harness- [115] L. Li, Z. Xie, M. Li, S. Chen, P. Wang, L. Chen, Y. Yang, B. Wang,
ing gpt4v-synthesized data for a lite vision-language model,” and L. Kong, “Silkie: Preference distillation for large visual
arXiv:2402.11684, 2024. 5, 7 language models,” arXiv:2312.10665, 2023. 8
[93] J. Xu, T. Mei, T. Yao, and Y. Rui, “Msr-vtt: A large video descrip- [116] P. Lu, S. Mishra, T. Xia, L. Qiu, K.-W. Chang, S.-C. Zhu, O. Tafjord,
tion dataset for bridging video and language,” in CVPR, 2016. P. Clark, and A. Kalyan, “Learn to explain: Multimodal reasoning
5 via thought chains for science question answering,” NeurIPS,
[94] X. Mei, C. Meng, H. Liu, Q. Kong, T. Ko, C. Zhao, M. D. Plumbley, 2022. 8, 9, 12
Y. Zou, and W. Wang, “Wavcaps: A chatgpt-assisted weakly- [117] R. Vedantam, C. Lawrence Zitnick, and D. Parikh, “Cider:
labelled audio captioning dataset for audio-language multimodal Consensus-based image description evaluation,” in CVPR, 2015.
research,” arXiv:2303.17395, 2023. 5 8
[95] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, [118] H. Agrawal, K. Desai, Y. Wang, X. Chen, R. Jain, M. Johnson,
P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., D. Batra, D. Parikh, S. Lee, and P. Anderson, “Nocaps: Novel
“Training language models to follow instructions with human object captioning at scale,” in ICCV, 2019. 8
feedback,” NeurIPS, 2022. 6, 8 [119] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, “From image
[96] S. Iyer, X. V. Lin, R. Pasunuru, T. Mihaylov, D. Simig, P. Yu, descriptions to visual denotations: New similarity metrics for
K. Shuster, T. Wang, Q. Liu, P. S. Koura et al., “Opt-iml: Scaling semantic inference over event descriptions,” TACL, 2014. 8
language model instruction meta learning through the lens of [120] X. He, Y. Zhang, L. Mou, E. Xing, and P. Xie, “Pathvqa:
generalization,” arXiv:2212.12017, 2022. 6 30000+ questions for medical visual question answering,”
[97] V. Sanh, A. Webson, C. Raffel, S. H. Bach, L. Sutawika, arXiv:2003.10286, 2020. 9
Z. Alyafeai, A. Chaffin, A. Stiegler, T. L. Scao, A. Raja et al., “Mul- [121] J. J. Lau, S. Gayen, A. Ben Abacha, and D. Demner-Fushman,
titask prompted training enables zero-shot task generalization,” “A dataset of clinically generated visual questions and answers
arXiv:2110.08207, 2021. 6 about radiology images,” Sci. Data, 2018. 9
[98] T. Gong, C. Lyu, S. Zhang, Y. Wang, M. Zheng, Q. Zhao, K. Liu, [122] B. Liu, L.-M. Zhan, L. Xu, L. Ma, Y. Yang, and X.-M. Wu, “Slake:
W. Zhang, P. Luo, and K. Chen, “Multimodal-gpt: A vision and A semantically-labeled knowledge-enhanced dataset for medical
language model for dialogue with humans,” arXiv:2305.04790, visual question answering,” in ISBI, 2021. 9
2023. 6, 7, 11 [123] C. Fu, P. Chen, Y. Shen, Y. Qin, M. Zhang, X. Lin, Z. Qiu, W. Lin,
Z. Qiu, W. Lin et al., “Mme: A comprehensive evaluation bench-
[99] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick,
mark for multimodal large language models,” arXiv:2306.13394,
and D. Parikh, “Vqa: Visual question answering,” in ICCV, 2015.
2023. 9, 10
6
[124] Y. Liu, H. Duan, Y. Zhang, B. Li, S. Zhang, W. Zhao, Y. Yuan,
[100] A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for
J. Wang, C. He, Z. Liu et al., “Mmbench: Is your multi-modal
generating image descriptions,” in CVPR, 2015. 6
model an all-around player?” arXiv:2307.06281, 2023. 9
[101] G. Luo, Y. Zhou, T. Ren, S. Chen, X. Sun, and R. Ji, “Cheap
[125] W. Yu, Z. Yang, L. Li, J. Wang, K. Lin, Z. Liu, X. Wang, and
and quick: Efficient vision-language instruction tuning for large
L. Wang, “Mm-vet: Evaluating large multimodal models for
language models,” arXiv:2305.15023, 2023. 6, 7, 8, 9
integrated capabilities,” arXiv:2308.02490, 2023. 9
[102] Z. Xu, Y. Shen, and L. Huang, “Multiinstruct: Improv- [126] B. Li, R. Wang, G. Wang, Y. Ge, Y. Ge, and Y. Shan, “Seed-
ing multi-modal zero-shot learning via instruction tuning,” bench: Benchmarking multimodal llms with generative compre-
arXiv:2212.10773, 2022. 6, 7, 8, 9 hension,” in CVPR, 2024. 9
[103] P. Gao, J. Han, R. Zhang, Z. Lin, S. Geng, A. Zhou, W. Zhang, [127] P. Lu, H. Bansal, T. Xia, J. Liu, C. Li, H. Hajishirzi, H. Cheng, K.-W.
P. Lu, C. He, X. Yue et al., “Llama-adapter v2: Parameter-efficient Chang, M. Galley, and J. Gao, “Mathvista: Evaluating mathemat-
visual instruction model,” arXiv:2304.15010, 2023. 6, 7, 8, 9 ical reasoning of foundation models in visual contexts,” in ICLR,
[104] Z. Zhao, L. Guo, T. Yue, S. Chen, S. Shao, X. Zhu, Z. Yuan, 2024. 9
and J. Liu, “Chatbridge: Bridging modalities with large language [128] X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens,
model as a language catalyst,” arXiv:2305.16103, 2023. 6, 7, 8, 9 D. Jiang, W. Ren, Y. Sun et al., “Mmmu: A massive multi-
[105] L. Li, Y. Yin, S. Li, L. Chen, P. Wang, S. Ren, M. Li, Y. Yang, J. Xu, discipline multimodal understanding and reasoning benchmark
X. Sun, L. Kong, and Q. Liu, “M3 it: A large-scale dataset towards for expert agi,” arXiv:2311.16502, 2023. 9
multi-modal multilingual instruction tuning,” arXiv:2306.04387, [129] F. Liu, T. Guan, Z. Li, L. Chen, Y. Yacoob, D. Manocha, and
2023. 6, 8, 9 T. Zhou, “Hallusionbench: You see what you think? or you
[106] Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, think what you see? an image-context reasoning benchmark
and H. Hajishirzi, “Self-instruct: Aligning language model with challenging for gpt-4v (ision), llava-1.5, and other multi-modality
self generated instructions,” arXiv:2212.10560, 2022. 7 models,” in CVPR, 2024. 9
[107] R. Yang, L. Song, Y. Li, S. Zhao, Y. Ge, X. Li, and Y. Shan, [130] M. Maaz, H. Rasheed, S. Khan, and F. S. Khan, “Video-chatgpt:
“Gpt4tools: Teaching large language model to use tools via self- Towards detailed video understanding via large vision and lan-
instruction,” arXiv:2305.18752, 2023. 7, 9, 12, 13 guage models,” arXiv:2306.05424, 2023. 9
[108] L. Wei, Z. Jiang, W. Huang, and L. Sun, “Instructiongpt- [131] M. Ning, B. Zhu, Y. Xie, B. Lin, J. Cui, L. Yuan, D. Chen,
4: A 200-instruction paradigm for fine-tuning minigpt-4,” and L. Yuan, “Video-bench: A comprehensive benchmark and
arXiv:2308.12067, 2023. 7 toolkit for evaluating video-based large language models,”
[109] Y. Du, H. Guo, K. Zhou, W. X. Zhao, J. Wang, C. Wang, M. Cai, arXiv:2311.16103, 2023. 9
R. Song, and J.-R. Wen, “What makes for good visual instruc- [132] Y. Li, Y. Du, K. Zhou, J. Wang, W. X. Zhao, and J.-R. Wen,
tions? synthesizing complex visual reasoning instructions for “Evaluating object hallucination in large vision-language mod-
visual instruction tuning,” arXiv:2311.01487, 2023. 7 els,” arXiv:2305.10355, 2023. 9, 10
[110] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, [133] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan,
D. Amodei, P. Christiano, and G. Irving, “Fine-tuning language P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in
models from human preferences,” arXiv:1909.08593, 2019. 8 context,” in ECCV, 2014. 9
[134] M. Li, L. Li, Y. Yin, M. Ahmed, Z. Liu, and Q. Liu, “Red teaming large language-and-vision assistant for biomedicine in one day,”
visual language models,” arXiv:2401.12915, 2024. 9 arXiv:2306.00890, 2023. 10
[135] Z. Yang, L. Li, K. Lin, J. Wang, C.-C. Lin, Z. Liu, and L. Wang, [159] B. Zhai, S. Yang, X. Zhao, C. Xu, S. Shen, D. Zhao, K. Keutzer,
“The dawn of lmms: Preliminary explorations with gpt-4v M. Li, T. Yan, and X. Fan, “Halle-switch: Rethinking and con-
(ision),” arXiv:2309.17421. 9 trolling object existence hallucinations in large vision language
[136] L. Wen, X. Yang, D. Fu, X. Wang, P. Cai, X. Li, T. Ma, Y. Li, models for detailed caption,” arXiv:2310.01779, 2023. 10, 11
L. Xu, D. Shang et al., “On the road with gpt-4v (ision): Early [160] A. Rohrbach, L. A. Hendricks, K. Burns, T. Darrell, and K. Saenko,
explorations of visual-language model on autonomous driving,” “Object hallucination in image captioning,” in EMNLP, 2018. 10
arXiv:2311.05332. 9 [161] J. Wang, Y. Zhou, G. Xu, P. Shi, C. Zhao, H. Xu, Q. Ye, M. Yan,
[137] C. Fu, R. Zhang, H. Lin, Z. Wang, T. Gao, Y. Luo, Y. Huang, J. Zhang, J. Zhu et al., “Evaluation and analysis of hallucination
Z. Zhang, L. Qiu, G. Ye et al., “A challenger to gpt-4v? early in large vision-language models,” arXiv:2308.15126, 2023. 10
explorations of gemini in visual expertise,” arXiv:2312.12436. 9 [162] L. Jing, R. Li, Y. Chen, M. Jia, and X. Du, “Faithscore:
[138] S. Zhang, P. Sun, S. Chen, M. Xiao, W. Shao, W. Zhang, K. Chen, Evaluating hallucinations in large vision-language models,”
and P. Luo, “Gpt4roi: Instruction tuning large language model on arXiv:2311.01477, 2023. 10
region-of-interest,” arXiv:2307.03601, 2023. 9 [163] J. Wang, Y. Wang, G. Xu, J. Zhang, Y. Gu, H. Jia, M. Yan,
[139] S. Xuan, Q. Guo, M. Yang, and S. Zhang, “Pink: Unveiling J. Zhang, and J. Sang, “An llm-free multi-dimensional benchmark
the power of referential comprehension for multi-modal llms,” for mllms hallucination evaluation,” arXiv:2311.07397, 2023. 10
arXiv:2310.00582, 2023. 9 [164] F. Liu, K. Lin, L. Li, J. Wang, Y. Yacoob, and L. Wang, “Mitigating
[140] H. Rasheed, M. Maaz, S. Shaji, A. Shaker, S. Khan, H. Cholakkal, hallucination in large multi-modal models via robust instruction
R. M. Anwer, E. Xing, M.-H. Yang, and F. S. Khan, “Glamm: Pixel tuning,” in ICLR, 2024. 11
grounding large multimodal model,” arXiv:2311.03356. 9 [165] S. Leng, H. Zhang, G. Chen, X. Li, S. Lu, C. Miao, and L. Bing,
[141] H. You, H. Zhang, Z. Gan, X. Du, B. Zhang, Z. Wang, L. Cao, “Mitigating object hallucinations in large vision-language models
S.-F. Chang, and Y. Yang, “Ferret: Refer and ground anything through visual contrastive decoding,” in CVPR, 2024. 11
anywhere at any granularity,” arXiv:2310.07704, 2023. 9 [166] C. Jiang, H. Xu, M. Dong, J. Chen, W. Ye, M. Yan, Q. Ye,
[142] X. Lai, Z. Tian, Y. Chen, Y. Li, Y. Yuan, S. Liu, and J. Jia, J. Zhang, F. Huang, and S. Zhang, “Hallucination augmented
“Lisa: Reasoning segmentation via large language model,” contrastive learning for multimodal large language model,”
arXiv:2308.00692, 2023. 9, 13 arXiv:2312.06968, 2023. 11
[143] R. Xu, X. Wang, T. Wang, Y. Chen, J. Pang, and D. Lin, [167] Y. Zhou, C. Cui, J. Yoon, L. Zhang, Z. Deng, C. Finn, M. Bansal,
“Pointllm: Empowering large language models to understand and H. Yao, “Analyzing and mitigating object hallucination in
point clouds,” arXiv:2308.16911, 2023. 9 large vision-language models,” arXiv:2310.00754, 2023. 11
[144] S. Chen, X. Chen, C. Zhang, M. Li, G. Yu, H. Fei, H. Zhu, [168] Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun, J. Xu,
J. Fan, and T. Chen, “Ll3da: Visual interactive instruction and Z. Sui, “A survey for in-context learning,” arXiv:2301.00234,
tuning for omni-3d understanding, reasoning, and planning,” 2022. 11
arXiv:2311.18651, 2023. 9
[169] P. Lu, B. Peng, H. Cheng, M. Galley, K.-W. Chang, Y. N. Wu, S.-
[145] Y. Hong, H. Zhen, P. Chen, S. Zheng, Y. Du, Z. Chen, and C. Gan,
C. Zhu, and J. Gao, “Chameleon: Plug-and-play compositional
“3d-llm: Injecting the 3d world into large language models,”
reasoning with large language models,” arXiv:2304.09842, 2023.
NeurIPS, 2023. 9
11, 12, 13
[146] Q. Sun, Q. Yu, Y. Cui, F. Zhang, X. Zhang, Y. Wang, H. Gao, J. Liu,
[170] T. Gupta and A. Kembhavi, “Visual programming: Composi-
T. Huang, and X. Wang, “Generative pretraining in multimodal-
tional visual reasoning without training,” in CVPR, 2023. 11,
ity,” in ICLR, 2024. 9
12, 13
[147] J. Zhan, J. Dai, J. Ye, Y. Zhou, D. Zhang, Z. Liu, X. Zhang, R. Yuan,
[171] Y. Lu, M. Bartolo, A. Moore, S. Riedel, and P. Stenetorp, “Fan-
G. Zhang, L. Li et al., “Anygpt: Unified multimodal llm with
tastically ordered prompts and where to find them: Overcoming
discrete sequence modeling,” arXiv:2402.12226, 2024. 9
few-shot prompt order sensitivity,” arXiv:2104.08786, 2021. 11
[148] E. Aiello, L. Yu, Y. Nie, A. Aghajanyan, and B. Oguz,
“Jointly training large autoregressive multimodal models,” [172] B. Li, Y. Zhang, L. Chen, J. Wang, F. Pu, J. Yang, C. Li, and
arXiv:2309.15564, 2023. 9 Z. Liu, “Mimic-it: Multi-modal in-context instruction tuning,”
arXiv:2306.05425, 2023. 11
[149] D. Zhang, S. Li, X. Zhang, J. Zhan, P. Wang, Y. Zhou, and X. Qiu,
“Speechgpt: Empowering large language models with intrinsic [173] Q. Sun, Q. Yu, Y. Cui, F. Zhang, X. Zhang, Y. Wang, H. Gao, J. Liu,
cross-modal conversational abilities,” arXiv:2305.11000, 2023. 9 T. Huang, and X. Wang, “Generative pretraining in multimodal-
ity,” arXiv:2307.05222, 2023. 11
[150] P. K. Rubenstein, C. Asawaroengchai, D. D. Nguyen, A. Bapna,
Z. Borsos, F. d. C. Quitry, P. Chen, D. E. Badawy, W. Han, [174] D. Sheng, D. Chen, Z. Tan, Q. Liu, Q. Chu, J. Bao, T. Gong,
E. Kharitonov et al., “Audiopalm: A large language model that B. Liu, S. Xu, and N. Yu, “Towards more unified in-context visual
can speak and listen,” arXiv:2306.12925, 2023. 9 understanding,” arXiv:2312.02520, 2023. 12
[151] X. Wang, B. Zhuang, and Q. Wu, “Modaverse: Efficiently trans- [175] Y. Tai, W. Fan, Z. Zhang, F. Zhu, R. Zhao, and Z. Liu, “Link-
forming modalities with llms,” arXiv:2401.06395, 2024. 9 context learning for multimodal llms,” arXiv:2308.07891, 2023. 12
[152] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic [176] H. Zhao, Z. Cai, S. Si, X. Ma, K. An, L. Chen, Z. Liu, S. Wang,
models,” NeurIPS, 2020. 10 W. Han, and B. Chang, “Mmicl: Empowering vision-language
[153] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, model with multi-modal in-context learning,” arXiv:2309.07915,
“High-resolution image synthesis with latent diffusion models,” 2023. 12
in CVPR, 2022. 10 [177] J. Jeong, “Hijacking context in large multi-modal models,”
[154] R. Gong, Q. Huang, X. Ma, H. Vo, Z. Durante, Y. Noda, Z. Zheng, arXiv:2312.07553, 2023. 12, 14
S.-C. Zhu, D. Terzopoulos, L. Fei-Fei et al., “Mindagent: Emergent [178] Z. Yang, Z. Gan, J. Wang, X. Hu, Y. Lu, Z. Liu, and L. Wang, “An
gaming interaction,” arXiv:2309.09971, 2023. 10 empirical study of gpt-3 for few-shot knowledge-based vqa,” in
[155] Y. Mu, Q. Zhang, M. Hu, W. Wang, M. Ding, J. Jin, B. Wang, J. Dai, AAAI, 2022. 12
Y. Qiao, and P. Luo, “Embodiedgpt: Vision-language pre-training [179] M. Tsimpoukelli, J. L. Menick, S. Cabi, S. Eslami, O. Vinyals,
via embodied chain of thought,” arXiv:2305.15021, 2023. 10 and F. Hill, “Multimodal few-shot learning with frozen language
[156] A. Hu, H. Xu, J. Ye, M. Yan, L. Zhang, B. Zhang, C. Li, J. Zhang, models,” NeurIPS, 2021. 12
Q. Jin, F. Huang et al., “mplug-docowl 1.5: Unified structure [180] B. Li, Y. Zhang, L. Chen, J. Wang, J. Yang, and Z. Liu, “Ot-
learning for ocr-free document understanding,” arXiv:2403.12895, ter: A multi-modal model with in-context instruction tuning,”
2024. 10 arXiv:2305.03726, 2023. 12
[157] J. Ye, A. Hu, H. Xu, Q. Ye, M. Yan, G. Xu, C. Li, J. Tian, Q. Qian, [181] Y. Shen, K. Song, X. Tan, D. Li, W. Lu, and Y. Zhuang, “Hugging-
J. Zhang et al., “Ureader: Universal ocr-free visually-situated lan- gpt: Solving ai tasks with chatgpt and its friends in huggingface,”
guage understanding with multimodal large language model,” arXiv:2303.17580, 2023. 12, 13
in EMNLP, 2023. 10 [182] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, “Large
[158] C. Li, C. Wong, S. Zhang, N. Usuyama, H. Liu, J. Yang, language models are zero-shot reasoners,” arXiv:2205.11916,
T. Naumann, H. Poon, and J. Gao, “Llava-med: Training a 2022. 12
[183] Z. Zhang, A. Zhang, M. Li, and A. Smola, “Automatic chain of [207] Y. Zhao, T. Pang, C. Du, X. Yang, C. Li, N.-M. Cheung, and M. Lin,
thought prompting in large language models,” arXiv:2210.03493, “On evaluating adversarial robustness of large vision-language
2022. 12 models,” arXiv:2305.16934, 2023. 14
[184] D. Rose, V. Himakunthala, A. Ouyang, R. He, A. Mei, Y. Lu, [208] E. Shayegani, Y. Dong, and N. Abu-Ghazaleh, “Jailbreak in
M. Saxon, C. Sonar, D. Mirza, and W. Y. Wang, “Visual chain pieces: Compositional adversarial attacks on multi-modal lan-
of thought: Bridging logical gaps with multimodal infillings,” guage models,” in ICLR, 2023. 14
arXiv:2305.02317, 2023. 12
[185] Z. Zhang, A. Zhang, M. Li, H. Zhao, G. Karypis, and A. Smola,
“Multimodal chain-of-thought reasoning in language models,”
arXiv:2302.00923, 2023. 12
[186] V. Himakunthala, A. Ouyang, D. Rose, R. He, A. Mei, Y. Lu,
C. Sonar, M. Saxon, and W. Y. Wang, “Let’s think frame by
frame: Evaluating video chain of thought with video infilling and
prediction,” arXiv:2305.13903, 2023. 12
[187] J. Ge, H. Luo, S. Qian, Y. Gan, J. Fu, and S. Zhan,
“Chain of thought prompt tuning in vision language models,”
arXiv:2304.07919, 2023. 12
[188] C. Wu, S. Yin, W. Qi, X. Wang, Z. Tang, and N. Duan, “Visual
chatgpt: Talking, drawing and editing with visual foundation
models,” arXiv:2303.04671, 2023. 12, 13
[189] G. Zheng, B. Yang, J. Tang, H.-Y. Zhou, and S. Yang, “Ddcot:
Duty-distinct chain-of-thought prompting for multimodal rea-
soning in language models,” in NeurIPS, 2023. 12
[190] A. Parisi, Y. Zhao, and N. Fiedel, “Talm: Tool augmented lan-
guage models,” arXiv:2205.12255, 2022. 12
[191] L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang,
J. Callan, and G. Neubig, “Pal: Program-aided language models,”
arXiv:2211.10435, 2022. 12
[192] T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli,
L. Zettlemoyer, N. Cancedda, and T. Scialom, “Toolformer:
Language models can teach themselves to use tools,”
arXiv:2302.04761, 2023. 12
[193] R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse,
S. Jain, V. Kosaraju, W. Saunders et al., “Webgpt: Browser-assisted
question-answering with human feedback,” arXiv:2112.09332,
2021. 12
[194] A. Zeng, A. Wong, S. Welker, K. Choromanski, F. Tombari,
A. Purohit, M. Ryoo, V. Sindhwani, J. Lee, V. Vanhoucke et al.,
“Socratic models: Composing zero-shot multimodal reasoning
with language,” arXiv:2204.00598, 2022. 12
[195] H. You, R. Sun, Z. Wang, L. Chen, G. Wang, H. A. Ayyubi,
K.-W. Chang, and S.-F. Chang, “Idealgpt: Iteratively decompos-
ing vision and language reasoning via large language models,”
arXiv:2305.14985, 2023. 12, 13
[196] V. Udandarao, A. Gupta, and S. Albanie, “Sus-x: Training-
free name-only transfer of vision-language models,”
arXiv:2211.16198, 2022. 12
[197] X. Zhu, R. Zhang, B. He, Z. Zeng, S. Zhang, and P. Gao, “Point-
clip v2: Adapting clip for powerful 3d open-world learning,”
arXiv:2211.11682, 2022. 13
[198] R. Zhang, X. Hu, B. Li, S. Huang, H. Deng, Y. Qiao, P. Gao,
and H. Li, “Prompt, generate, then cache: Cascade of foundation
models makes strong few-shot learners,” in CVPR, 2023. 13
[199] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould,
and L. Zhang, “Bottom-up and top-down attention for image
captioning and visual question answering,” in CVPR, 2018. 13
[200] Z. Yu, J. Yu, Y. Cui, D. Tao, and Q. Tian, “Deep modular co-
attention networks for visual question answering,” in CVPR,
2019. 13
[201] P. Gao, Z. Jiang, H. You, P. Lu, S. C. Hoi, X. Wang, and H. Li,
“Dynamic fusion with intra-and inter-modality attention flow for
visual question answering,” in CVPR, 2019. 13
[202] A. Zeng, A. Wong, S. Welker, K. Choromanski, F. Tombari,
A. Purohit, M. Ryoo, V. Sindhwani, J. Lee, V. Vanhoucke et al.,
“Socratic models: Composing zero-shot multimodal reasoning
with language,” arXiv:2204.00598, 2022. 13
[203] C. Gan, Z. Gan, X. He, J. Gao, and L. Deng, “Stylenet: Generating
attractive visual captions with styles,” in CVPR, 2017. 13
[204] A. Mathews, L. Xie, and X. He, “Senticap: Generating image
descriptions with sentiments,” in AAAI, 2016. 13
[205] P. Wu and S. Xie, “V*: Guided visual search as a core mechanism
in multimodal llms,” arXiv:2312.14135, 2023. 13
[206] D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang,
D. Schuurmans, O. Bousquet, Q. Le, and E. Chi, “Least-to-
most prompting enables complex reasoning in large language
models,” arXiv:2205.10625, 2022. 13