\cftpagenumbersoff

figure \cftpagenumbersofftable

Large Language Models for Disease Diagnosis: A Scoping Review

Shuang Zhou Division of Computational Health Sciences, Department of Surgery, University of Minnesota, Minneapolis, MN, USA Zidu Xu School of Nursing, Columbia University, New York, New York, USA Mian Zhang Erik Jonsson School of Engineering and Computer Science, University of Texas at Dallas, Richardson, TX, USA Chunpu Xu Department of Computing, The Hong Kong Polytechnic University, Hong Kong, Hong Kong SAR Yawen Guo Department of Informatics, University of California, Irvine, Irvine, CA, USA Zaifu Zhan Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN, USA Sirui Ding Bakar Computational Health Sciences Institute, University of California San Francisco, San Francisco, CA, USA Jiashuo Wang Department of Computing, The Hong Kong Polytechnic University, Hong Kong, Hong Kong SAR Kaishuai Xu Department of Computing, The Hong Kong Polytechnic University, Hong Kong, Hong Kong SAR Yi Fang Department of Computer Science, New York University (Shanghai), Shanghai, CN Liqiao Xia Department of Industrial and Systems Engineering, The Hong Kong Polytechnic University, Hong Kong, Hong Kong SAR Jeremy Yeung Division of Computational Health Sciences, Department of Surgery, University of Minnesota, Minneapolis, MN, USA Daochen Zha Department of Computer Science, Rice University, Houston, TX, USA Genevieve B. Melton Institute for Health Informatics and Division of Colon and Rectal Surgery, Department of Surgery, University of Minnesota, Minneapolis, MN, USA Mingquan Lin Division of Computational Health Sciences, Department of Surgery, University of Minnesota, Minneapolis, MN, USA Rui Zhang Division of Computational Health Sciences, Department of Surgery, University of Minnesota, Minneapolis, MN, USA
Abstract

Automatic disease diagnosis has become increasingly valuable in clinical practice. The advent of large language models (LLMs) has catalyzed a paradigm shift in artificial intelligence, with growing evidence supporting the efficacy of LLMs in diagnostic tasks. Despite the increasing attention in this field, a holistic view is still lacking. Many critical aspects remain unclear, such as the diseases and clinical data to which LLMs have been applied, the LLM techniques employed, and the evaluation methods used. In this article, we perform a comprehensive review of LLM-based methods for disease diagnosis. Our review examines the existing literature across various dimensions, including disease types and associated clinical specialties, clinical data, LLM techniques, and evaluation methods. Additionally, we offer recommendations for applying and evaluating LLMs for diagnostic tasks. Furthermore, we assess the limitations of current research and discuss future directions. To our knowledge, this is the first comprehensive review for LLM-based disease diagnosis.

#Equal contribution
*Correspondence: [email protected]

Introduction

Automatic disease diagnosis is a crucial task in clinical scenarios that takes clinical data as input, analyzes patterns, and generates potential diagnoses with minimal or no human intervention 1. Its significance in healthcare is multifaceted. First, it enhances diagnostic accuracy, supports physicians in clinical decision-making, and addresses disparities in healthcare access by providing more high-quality diagnostic services 2. Second, automatic diagnosis improves the efficiency of healthcare professionals 3; 4, which is particularly valuable for clinicians managing larger panels of patients with increasing age and multiple morbidities 5. For instance, DXplain 6 was a diagnostic system that utilized patients’ signs, symptoms, and laboratory data to generate a list of potential diagnoses, along with a justification for why each condition should be considered. Additionally, online services further facilitate early diagnosis or large-scale screening of certain diseases 4; 7, such as mental health disorders, by raising awareness in the early stages and helping to prevent potential risks. For example, several studies investigated using social media posts for large-scale depression identification 8 and suicide risk prediction 9.

Recent advancements in artificial intelligence (AI) have driven the development of automated diagnostic systems through two stages 10; 11; 12; 13. Initially, machine learning techniques such as support vector machines and decision trees were employed for disease classification 14; 15, which typically involved four steps: data processing, feature extraction, model optimization, and disease prediction. With larger datasets and sufficient computational power, deep learning methods later dominated the development of diagnostic tasks 16; 2. These approaches leveraged deep neural networks (DNNs), including convolutional neural networks 1; 17, recurrent neural networks 18, and generative adversarial networks 19, enabling end-to-end feature extraction and model training. For example, a convolutional DNN with 34343434 layers achieved cardiologist-level performance in arrhythmia diagnosis 20. However, these models generally require extensive labeled data for supervised learning and are typically task-specific 1; 20, limiting their adaptability to other tasks or new demands 17.

In recent years, the paradigm of AI has shifted from traditional deep learning to the emergence of large language models (LLMs). Unlike supervised learning, LLMs, such as generative pre-trained transformers (GPT) 21 and LLaMA 22, are generative models pre-trained on vast amounts of unlabeled data through self-supervised learning. These models, typically comprising billions of parameters, excel in language processing and adapt to various tasks. To date, LLMs have demonstrated superior performance in clinical scenarios 23, including question answering (QA) 24, information retrieval 25, and clinical report generation 26; 27. Recently, increasing numbers of studies have verified the effectiveness of LLMs for diagnostic tasks. For instance, PathChat 28, a vision-language generalist LLM fine-tuned on hundreds of thousands of instructions, achieved state-of-the-art performance in human pathology. Med-MLLM 27, a multimodal LLM pre-trained and fine-tuned on extensive medical data, including chest X-rays, CT scans, and clinical notes, demonstrated notable accuracy in COVID-19 diagnosis. Additionally, Kim et al. 29 employed GPT-4 with prompt engineering and found it surpassed mental health professionals in identifying obsessive-compulsive disorder, which underscores LLM’s potential in mental health diagnostics.

Although this research field has drawn wide attention, many key questions remain under-explored. For instance, which diseases and medical data have been investigated in LLM-based diagnostic tasks (Q1)? What LLM techniques have been applied to disease diagnosis and how to choose appropriate ones (Q2)? What evaluation methods are appropriate for assessing performance (Q3)? Despite numerous review papers have investigated the studies of applying LLMs in medicine domain 30; 31; 32; 33; 34; 35; 36; 37, these efforts typically provide a broad overview of various clinical applications without underscoring disease diagnosis. For instance, Pressman et al. 38 offered a comprehensive summary of potential clinical applications of LLMs, including pre-consultation, treatment, postoperative management, discharge, and patient education. Additionally, none of these review papers address the nuances and challenges of applying LLMs to disease diagnosis or answer the aforementioned questions, highlighting a critical research gap.

The primary aim of our review is to provide an overview of studies utilizing LLMs for disease diagnosis. The review introduced various disease types, disease-associated clinical specialties, clinical data, LLM techniques, and evaluation methods from existing works. Additionally, we provided recommendations for data preparation, selecting appropriate LLM techniques, and employing suitable evaluation strategies for diagnostic tasks. Further, our review characterized the limitations of current studies and shed insight into the challenges and future directions in this field. To the best of our knowledge, it is the first review that focused on disease diagnosis with LLMs and provided a comprehensive overview of this domain. In summary, this review outlined a blueprint for LLM-based disease diagnosis and helped to inspire and streamline future research efforts.

Results

Refer to caption
Figure 1: PRISMA flowchart of study records. PRISMA flowchart showing the study selection process.
Refer to caption
Figure 2: Overview of the investigated scope. It illustrated disease types and the associated clinical specialties, clinical data types, modalities of the utilized data, the applied LLM techniques, and evaluation methods. We only presented part of the clinical specialties and some representative diseases.

Overview of the scope

This section presented the scope of our review. Figure 2 not only illustrated disease types, the associated clinical specialties, clinical data types, and data modalities (Q1) but also introduced the applied LLM techniques (Q2) and evaluation methods (Q3), which answered the aforementioned questions. Specifically, we investigated 19 clinical specialties and over 15 types of clinical data in disease diagnosis. The clinical data spanned various data modalities, including text, image, video, audio, time series, and multimodal cases. Besides, we categorized existing works for disease diagnosis based on the applied LLM techniques, such as prompt (zero-shot), retrieval-augmented generation (RAG), and pre-training. Table 1 summarized the taxonomy of the mainstream LLM techniques. Figure 4 showcased the association of clinical specialties, data modalities, and the LLM techniques of the included papers. The above figures comprehensively revealed the current development of LLM-based disease diagnosis. Additionally, Figure 3 showed the meta-information analysis of our review, involving publication tendencies of different regions, a summary of widely-used LLMs for training and inference, and the statistics of data sources, evaluation methods, and data privacy status.

Refer to caption
Figure 3: Metadata of information from LLM-based diagnostic studies in the scoping review. a Quarterly breakdown of LLM-based diagnostic studies. Since the information for 2024-Q3 is incomplete, our statistics only cover up to 2024-Q2. b The top 5 widely-used LLMs for inference and training. c Breakdown of the data source by regions. d Breakdown of evaluation methods (note some papers utilized multiple evaluation methods). e Breakdown of the employed datasets by privacy status.
Refer to caption
Figure 4: Summary of the association between clinical specialties (left), data modalities (middle), and LLM techniques (right) across the included papers.
Table 1: Overview of LLM techniques for disease diagnosis.
Techniques Types Characteristics Representative studies
Prompting Zero-shot A single instruction describing the task Text 39; 40, image 41; 42, audio 43; 44, text-image 45, text-time series 46; 47, text-tabular 48
Few-shot An instruction supplemented with several demonstrations Text 49; 50, image 51, text-image 52; 53, text-image-tabular 54
CoT Decomposes a problem into multiple linear steps Text 55; 56, audio 57, time series 58, text-image 59; 60
Self-consistency Generates multiple reasoning paths Text 61, audio 62, text-image-tabular-time series 63
Soft prompt Continuous vector embeddings with learnability Text 64 , image 65, tabular-time series 66; 67, text-image-graph68,
RAG Knowledge graph External knowledge is stored in graphical structure Text 69; 70; 71, text-time series 72
Corpus External knowledge comes from high-quality corpora Text 73; 74, text-image 75; 76, text-time series 77
Database External medical knowledge comes from databases Text 78; 79; 80, text-image 81; 82, text-time series 83
Fine-tuning SFT Injects medical knowledge via supervised learning Text 84; 85; 86, text-image 87; 88; 89, text-video 90; 91, text-audio 92; 93, text-tabular 94; 95; 48
RLHF Aligns the model with human preferences Text 96; 97; 98, text-image 99
PEFT Fine-tunes a small number of (extra) model parameters Text 100; 101; 84, text-image 102
Pre-training - Learns general knowledge with unsupervised learning Text 103; 104; 101, text-image 88; 105; 106, text-tabular 107; 48, text-video 93, text-omics 106

Note: SFT = supervised fine-tuning, RLHF = reinforcement learning from human feedback, PEFT = parameter-efficient fine-tuning.

Prompt-based disease diagnosis

A customized prompt typically comprises four components 108: instruction (specifying the task), context (defining the scenario or domain), input data (identifying the data to be processed), and output indicators (directing the model on the desired style or role). Over 60% (N=278) of included studies were prompt-based methods. We identified five distinct techniques that fall into two primary categories: hard prompts and soft prompts. Hard prompts include methods such as zero-shot, few-shot, Chain-of-Thought (CoT), and self-consistency prompting. These prompts are static and interpretable, written in natural language, which makes them particularly effective when the input and output structures are well-defined 109. On the other hand, soft prompts are continuous vector embeddings generated by a small, trainable model and then fed into an LLM. This technique, known as prompt tuning, encodes input data into task-specific embeddings, enabling the LLM to adapt to the task more effectively 110.

Among the prompt-based studies, zero-shot prompting, which consists of a single instruction without labeled examples, was the most prevalent (N=194). CoT-based methods (N=37) were featured by breaking down complex problems into smaller, manageable parts, allowing the model to address these sequentially in multiple steps 111; 112. For instance, in differential diagnoses, LLMs using CoT reasoning can follow clinical guidelines to sequentially interpret medical images, radiology reports, and symptom descriptions, providing intermediate outputs at each step that feed into subsequent analyses 59; 60; 55. This step-by-step approach allows the model to integrate context throughout the reasoning process, ultimately enabling a holistic final diagnosis. Few-shot prompt-based methods (N=35) expanded zero-shot prompting with a few labeled examples to enhance task performance. Studies based on self-consistency prompting (N=4) were characterized by generating multiple reasoning paths to enhance the reliability and robustness of LLMs 113; 63. For example,  Kim et al. 63 employed self-consistency prompting to predict depression scores (PHQ-4) by synthesizing diverse information from demographics, health domain literature, self-reported symptoms, and wearable sensor data to select the most consistent response among multiple reasoning paths. Soft prompt-based studies (N=6) involved training continuous vector embeddings before feeding them into LLMs, which enabled them to adapt LLMs’ behavior for specific tasks. It has been mainly utilized to encode multimodal electronic health records (EHR), including medical images, clinical notes, and lab results. A key advantage of the soft prompt is its capacity to integrate external domain knowledge, such as medical concept embeddings, with contextual information like individual clinical profiles. This allows the model to generate nuanced disease diagnoses with detailed explanations, making it well-suited for complex clinical scenarios 66; 65.

The majority of prompt-based studies involved unimodal data exploration (N=221), with most studies focusing exclusively on text data (N=171). Clinical text data such as clinical notes 114; 115, medical imaging reports 116; 117; 56, and clinical case reports 45; 118 were predominantly utilized. These studies typically input clinical notes or case reports and ask LLMs for suggested disease diagnosis 119; 120; 121; 122. Some studies (N=19) applied prompt engineering to medical image data. Commonly studied medical images included CT scans 51; 123, X-rays 68; 124, magnetic resonance imaging (MRI) 125; 51, and pathological images 126; 127. The primary use is to detect abnormalities on medical images and provide supporting evidence for differential diagnoses 75; 41; 126; 128.

With the rapid development of multimodal LLMs, an increasing number of studies have explored using these models for disease diagnosis with prompt engineering (N=57). A key advancement in this area is visual-language models (VLMs) (e.g., GPT-4V, LLaVA, and Flamingo), which have made image-text pairs the most prevalent input combinations for multimodal LLMs (N=37). Differing from the unimodal LLMs, VLMs were given more comprehensive clinical profiles, i.e., medical images and complementary textual descriptions, and were able to justify the diagnosis decisions with more details 129; 130; 131. For instance, Upadhyaya et al. 75 demonstrated that incorporating ophthalmologist feedback and contextual information (e.g., image location, purpose) with eye movement images significantly enhanced GPT-4V’s diagnostic accuracy for amblyopia.

More advanced multimodal LLMs, such as GPT-4o and Gemini-1.5 Pro, enabled prompt-based research to extend beyond text and image and include diverse data modalities for disease diagnosis. Specifically, many efforts leveraged audio and video data to facilitate the diagnosis of neurological and neurodegenerative disorders, such as autism 43; 132 and dementia 44; 68. Some studies investigated using omics data for the detection of rare genetic disorders 133 and Alzheimer’s disease 134. Additionally, a wide range of risk prediction tasks tended to incorporate multimodal data for early warning, including time series data, such as ECG signals 46; 135; 47 and wearable sensor data 58; 63; tabular data, such as user demographics 136; 134, and lab test results 66; 137. The applications included depression and anxiety screening 63, emergency triage 138, and arrhythmia detection 46; 139; 135. Another study further combined multimodal LLMs with a medical concept graph for neurological disorder diagnosis 68.

Retrieval-augmented LLMs for diagnosis

To enhances the accuracy and credibility of the diagnosis, alleviate hallucination issues and update LLMs’ stored medical knowledge without needing re-training, recent studies 140; 141; 142; 79; 70; 69 have incorporated external medical knowledge into diagnostic tasks. The external knowledge primarily comes from corpus 140; 141; 74; 143; 77; 144; 75; 145; 74; 76; 73; 146; 147; 148, databases 142; 79; 61; 149; 80; 123; 81; 83; 150; 82; 151; 135; 152; 78; 153, and knowledge graph 70; 154; 69; 72; 71, in the included papers. Based on the data modality, these RAG-based studies can be roughly categorized into text-based, text-image-based, and time-series-based augmentations.

In text-based RAG, the majority of research 79; 143; 152; 74; 151; 78; 142; 140; 149; 153; 148; 145 has adopted a basic retrieval strategy. In this approach, external knowledge is encoded into vector representations using sentence transformers (e.g., OpenAI’s text-embedding-ada-002), which serve as retrieval sources. Queries were similarly encoded, allowing the system to identify and fetch the most relevant knowledge by calculating the similarity between query vectors and source vectors. This combined information was then fed into LLMs using specially designed prompts to generate diagnostic results. However, two papers employed LLMs to search similar medical cases from the given content 144; 146. Zhenzhu et al. 144 designed guideline-based GPT-agents to summarize and retrieve content for traumatic brain injury rehabilitation‑related questions. McInerney et al. 146 utilized the LLM to extract evidence fragments from previous notes for evaluating the risk factors for cancer, pneumonia, and pulmonary edema. Four studies retrieved relevant content from knowledge graphs 70; 154; 69; 147; 71. One study leveraged regular expressions to match useful knowledge for pulmonary hypertension diagnosis 141. Different from previous studies where only one LLM was utilized for diagnosis, Wang et al. 80 employed several LLMs, each of which was equipped with specific medical knowledge, for joint diagnosis.

In text-image data processing, a common approach 81; 75; 123; 151; 82 involves extracting features from input images, converting these features into textual descriptions, and subsequently applying text-based enhancement techniques. For instance, Ferber et al. 151 employed advanced models like GPT-4V to extract critical information from images to facilitate the retrieval of relevant documents in oncology diagnosis. Similarly, Ranjit et al. 73 employed multimodal models to directly compute similarities between image and text features for document retrieval. Notably, two studies fine-tuned LLMs using the retrieved documents to enhance diagnostic accuracy 150; 76.

For time-series RAG, most studies focused on the electrocardiogram (ECG) analysis 77; 83; 135. For instance, Yu et al. 77 converted fundamental ECG conditions into improved text descriptions by utilizing the retrieved relevant information. Yu et al. 135 constructed a local database with specific domain knowledge for diagnosing arrhythmia and sleep apnea. Chen et al. 83 pretrained a model with a public ECG-Report dataset and fine-tuned the model for hypertension and myocardial infarction diagnosis. One study utilized the RAG method for readmission prediction based on multimodal EHR 72.

Fine-tuning LLMs for diagnosis

Fine-tuning a LLM typically encompasses two pivotal stages: supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). During the SFT stage, the model is trained on task-specific instruction-response pairs, enabling it to interpret instructions and generate responses across diverse modalities. This phase is crucial for establishing a foundational understanding of the model, facilitating the processing of inputs to produce desired outputs. Subsequently, the RLHF phase further refines the model by aligning its behavior with human preferences. Utilizing reinforcement learning, the model is optimized to generate responses that are more helpful, truthful, and congruent with human values 155, thereby ensuring compliance with societal expectations for ethical and effective AI.

Medical SFT enhances the in-context learning, reasoning, planning, and role-playing capabilities of LLMs, leading to improved diagnostic performance. During this process, inputs from various data modalities are integrated into the LLM’s word embedding space. Following the approach outlined in LLaVA 156, visual information is first converted into visual token embeddings using an image encoder and a projector. These embeddings, which match the dimensionality of language token embeddings, are then fed into the LLM for end-to-end training. In this review, many studies focused on conducting SFT on medical texts for diagnostic purposes (N=49). The medical texts can be clinical notes 157; 84; 95, clinical QA pairs 84; 158; 159; 160; 104, medical dialogues 161; 162; 163; 100; 164, or medical reports 90; 165; 166; 167; 102. Lot of studies combined both medical texts and images to enhance disease diagnosis (N=43), such as X-ray images 168; 90; 169; 165; 170, MRI images 170; 171; 102, or pathology images 106; 172; 92. A few studies also explored the detection of diseases from medical videos 90; 91, where video frames were sampled and transformed into visual token embeddings. To perform SFT effectively, it is crucial to collect high-quality responses to task-specific instructions. These instructions should be well-defined and diverse, covering a wide range of scenarios to ensure comprehensive training.

RLHF methods could be divided into two categories: online and offline. Online RLHF, a key process for the success of ChatGPT 173, first fits a reward model to datasets of prompts and human preferences over responses, then uses reinforcement learning algorithms like PPO 174 to update the LLM to maximize the learned reward model. Some explorations showed online RLHF could effectively improve the diagnostic ability of medical LLMs 99; 97; 98. For example, Zhang et al. 98 aligned their model with the characteristics of doctors and achieved robust performance on a wide range of medical QA tasks, including condition diagnosis and etiological analysis. However, the overall performance of online RLHF highly relies on the quality of the reward model, which is expected to give accurate rewards to LLM responses, and several works demonstrated that the reward model could suffer from issues like over-optimization 175 and shifting form initial data distribution 176. Meanwhile, the training process for reinforcement learning is often characterized by instability and challenges in control 177. Offline RLHF methods like DPO 178 cast RLHF as optimizing a simple classification loss, eliminating the need for a reward model. These methods are also more stable and computationally lightweight and have proven useful in medical LLMs alignment 179; 96; 101. Yang et al. 101 found that if the offline RLHF phase is removed, their model exhibited significant performance drops in doctor evaluations on pediatric benchmarks. To conduct RLHF, a high-quality dataset of prompts and responses with human preferences is crucial to train a well-calibrated reward model 180 for online RLHF or ensure the better convergence of DPO like offline RLHF algorithms 181, whether from human experts 173 or powerful AI models 182.

As the size of LLMs increases, their capabilities are correspondingly enhanced. Consequently, larger models are often preferred to ensure a robust foundational capacity for adaptation to downstream tasks. However, scaling up model size renders full training increasingly impractical, as it demands extensive GPU resources. Parameter-efficient fine-tuning (PEFT) offers a solution to this challenge by minimizing the number of parameters requiring fine-tuning. The most popular PEFT method is Low-Rank Adaptation (LoRA) 183, which introduces trainable rank decomposition matrices into each layer without modifying the model’s architecture. LoRA is particularly favored due to its advantage of not adding inference latency. In this review, all the PEFT-based studies (N=7) used LoRA to reduce the training cost 102; 100; 101; 184; 84; 185; 186.

Pre-training LLMs for diagnosis

LLMs are initially pre-trained on extensive text corpora to perform next-token prediction. During this phase, the model learns the structure of language and acquires a vast amount of knowledge about the world. When pre-trained on medical texts, LLMs gain foundational medical knowledge, which proves valuable when adapting them for various downstream medical tasks, including medical diagnosis. In this review, five studies perform text-only pretraining on the LLMs from different sources 103; 187; 104; 188; 189, such as clinical notes, medical QA texts, dialogues, and Wikipedia. Moreover, eight studies injected medical visual knowledge into multimodal LLMs via pretraining 88; 190; 107; 191; 189; 106; 105. For instance, Chen et al. 105 and Wang et al. 189 pre-trained their models on visual question-answering (VQA) data. Specifically, Chen et al. 105 employed an off-the-shelf multimodal LLM to reformat image-text pairs from PubMed into VQA data points for training their model. To improve the quality of the image encoder, pretraining tasks like reconstructing images at tile-level or slide-level 106, and aligning similar images or image-text pairs 88 are common choices.

Table 2: Overview of evaluation metrics for disease diagnosis
Type Evaluation metric Purpose
Automatic metric Accuracy 158 The ratio of all correct predictions to the total predictions
Precision95 The ratio of true positives to the total number of positive predictions
Recall95 The ratio of true positives to the total number of actual positive cases
F127 Calculated as the harmonic mean of precision and recall
AUC192 The area under the Receiver Operating Characteristic curve
AUPR 193 The area under the precision-recall curve
Top-k accuracy194 The ratio of instances with the true label in the top k predictions to total instances
Top-k precision124 The ratio of true positives to total positive predictions within the top k predictions
Top-k recall 195 The ratio of true positives within the top k predictions to actual positive cases
Mean square error196 The average of the squared differences between predicted and actual values
Mean absolute error114 The average of the absolute differences between predicted and actual values
Cohen’s κ𝜅\kappaitalic_κ 197 Measure the agreement between predicted score and actual score
BLUE 198 Calculate precision by counting matching n-grams between reference and generated text
ROUGE 49 Calculate F1-score by matching n-grams between reference and generated text
CIDEr 199 Evaluate n-gram similarity, emphasizing alignment across multiple reference texts
BERTScore 200 Measure similarity by comparing embeddings of reference and generated text
METEOR 201 Evaluate text similarity by considering precision, recall, word order, and synonym matches
Human evaluation Necessity 49 Whether the response or prediction assists in advancing the diagnosis
Acceptance 202 The degree of acceptance of the response without any revision
Reliability 203 The trustworthiness of the evidence in the response or prediction
Explainability144 Whether the response or prediction is explainable
Human or LLM evaluation Correctness 204 Whether the response or prediction is medically correct
Consistency 205 Whether the response or prediction is consistent with the ground-truth or input
Clarity 79 Whether the response or prediction is clearly clarified
Professionality203 The rationality of the evidence based on domain knowledge
Completeness49 Whether the response or prediction is sufficient and comprehensive
Satisfaction206 Whether the response or prediction is satisfying
Hallucination205 Response contains inconsistent or unmentioned information with previous context
Relevance79 Whether the response or prediction is relevant to the context
Coherence 207 Assess logical consistency with the dialogue history

Evaluation strategy

As evaluating diagnostic performance is crucial, we further summarized and analyzed the evaluation strategies for diagnostic tasks. Generally, existing evaluation methods fall into three categories: automatic evaluation, human evaluation, and LLM evaluation (shown in Table 2). An overview of the advantages and limitations of the evaluation strategies is depicted in Figure 5.

Most studies assessed diagnostic effectiveness using automatic metrics, which can be broadly categorized into three types. The first type primarily uses classification-based metrics such as accuracy, precision, and recall, which are suitable for single-disease prediction. For example, Liu et al. 27 adopted AUC, accuracy, and F1 score to evaluate COVID-19 diagnosis effectiveness. The second type is generally used in multi-label scenarios, where predictions involve multiple potential diagnoses, including top-k accuracy and top-k precision. For instance, Tu et al. 194 utilized top-k accuracy to measure the percentage of correct diagnoses appearing within the top-k positions of the diagnosis list. The third type applies to risk prediction tasks, where mean absolute error (MAE) or mean squared error (MSE) measures the deviation between predicted values and the actual ones 196; 114. In summary, automatic metrics offer advantages such as time and cost efficiency, ease of implementation, and suitability for large-scale data. However, they require ground-truth answers, which are often unavailable in many scenarios. Additionally, these metrics typically lack human-centric perspectives, such as assessing the reliability or overall usefulness of the prediction. Furthermore, they generally fall short in evaluating complex scenarios, such as determining whether a diagnostic reasoning process is medically correct 208.

Refer to caption
Figure 5: Summary of the evaluation strategies for diagnostic tasks.

Many studies evaluated diagnostic performance through human efforts 24; 209. This method relies on domain experts to evaluate the quality of model predictions based on their medical knowledge. One advantage lies in that it typically does not require ground-truth answers. Additionally, it accommodates human-centric perspectives and can address complex tasks that necessitate extensive human intelligence or domain knowledge. However, human evaluation presents several limitations, including significant time and cost demands, as well as a susceptibility to human error. Consequently, this strategy is usually applied for small-scale data assessment.

Additionally, some studies have utilized LLMs to replace human experts in diagnostic evaluation 210; 211; 212. LLM evaluation combines the benefits of human-centric evaluation with the efficiency of automated metrics. Although ground-truth is not strictly required for this approach 212; 205, its inclusion further enhances the reliability of LLM evaluation 209. Commonly used LLMs for this purpose include GPT-3.5, GPT-4, and LLaMA-3. However, this approach is limited by the performance of the employed LLMs, which are susceptible to hallucination issues 205. Moreover, LLM-based evaluation may struggle with handling complex clinical scenarios 213.

In summary, the above evaluation strategies have their advantages and limitations. The balance between accurate evaluation and cost-effectiveness varies depending on the specific scenario. Our analyses, presented in Figure 5, provide convenience in selecting appropriate evaluation strategies, catering to the requirements of various applications.

Discussion

This section presented notable findings from the included studies, discussed the data preparation for the mainstream LLM techniques, and highlighted key challenges and potential future research directions. Our review revealed that most studies utilized LLMs for disease diagnosis through prompt learning. The phenomenon might be explained as follows. Firstly, it requires minimal data. For instance, zero-shot and few-shot prompts enable the development of diagnostic systems with just a few dozen examples 214; 39. Secondly, prompt-based methods are user-friendly and require minimal setup, making them accessible to researchers with limited machine-learning expertise. Additionally, it significantly reduces computational overhead, making implementation feasible on ordinary hardware. Furthermore, when used appropriately, large-scale LLMs like GPT-4 or GPT-3.5, which own extensive medical knowledge, demonstrate fair performance across various diagnostic tasks 24; 214.

Refer to caption
Figure 6: Summary of the advantages and limitations of the mainstream LLM techniques for diagnosis.

We summarized the advantages and limitations of mainstream LLM techniques of the included papers in Figure 6 and discussed the data preparation as follows. Generally, the selection of LLM techniques for developing diagnostic systems depends on the quantity and quality of available data. Specifically, prompt engineering is highly flexible and effective when annotated data is limited. Generally, designing an appropriate instruction supplemented with several examples as demonstration is sufficient for prompting 24. Zero-shot prompting even allows models to perform diagnosis without annotated examples while still achieve fair performance 214. To effectively apply RAG to diagnosis, a comprehensive and high-quality external knowledge base is indispensable. This knowledge base can be databases 79, corpora 143; 78 or knowledge graphs 70 from which LLMs can retrieve accurate information during inference. Effective fine-tuning necessitates a well-annotated, domain-specific dataset that includes labeled examples reflecting the target diagnostic tasks, such as annotated clinical notes or medical images, and a substantial number of samples 27. Pre-training requires extensive and diverse datasets that encompass a wide spectrum of medical knowledge, including unstructured text (e.g., clinical notes, medical literature) or structured data (e.g., lab test results) 94; 54. The quality and diversity of the pre-training datasets are crucial for establishing the model’s foundational knowledge and its ability to generalize across various medical contexts. While pre-training and fine-tuning would achieve promising performance and reliability 27; 190, they demand significant resources, such as advanced graphics cards and millions of medical data, which are usually hard to obtain. In contrast, not all scenarios require expert-level performance for disease diagnosis, such as large-scale screening 8; 215, health risk alerts from mobile devices 58, or public health education 30; 32. Balancing the trade-off between accuracy and cost-effectiveness varies by scenario. In summary, the analyses presented in Figure 6 guide users in selecting appropriate LLM techniques for disease diagnosis based on available resources.

Refer to caption
Figure 7: Summary of the limitation and future direction for LLM-based disease diagnosis.

Despite the progress in LLM-based methods for disease diagnosis, this scoping review identifies several barriers that impede their clinical utility (Figure 7). In the information-gathering process, a notable limitation is that only a small subset of studies integrated comprehensive multimodal data for diagnosis 216, such as text, image, time series, and other modalities. For example, Deng et al. 217 developed a multimodal LLM incorporating text, images, video, and speech for autism spectrum disorder screening. This discrepancy contrasts with real-world diagnostic scenarios, where comprehensive patient information spans multiple data modalities 160, particularly for complex conditions affecting multiple organs. Therefore, future research should emphasize collecting and fusing information from diverse modalities to simulate real-world scenarios.

Another limitation is that most studies implicitly assume the collected patient information is sufficient for disease diagnosis. Nevertheless, this assumption usually hardly holds, particularly in initial consultations or with complicated diseases, and using incomplete data would likely cause misdiagnosis 218; 219. In practice, clinical information gathering is an iterative process, beginning with the collection of initial patient data (e.g., subjective symptoms), narrowing down potential diagnoses, and then conducting medical examinations for further data collection and disease screening 220. This process typically requires extensive domain expertise from experienced clinicians. To alleviate the reliance on professionals, an increasing number of studies are exploring diagnostic conversations that collect relevant patient information through multi-round dialogues 221; 222. For example, AIME utilized LLMs for clinical history-taking and diagnostic dialogue 194, while MEDIQ asked follow-up questions to gather essential information for clinical reasoning 213. Following this tendency, future research can integrate the awareness of incomplete information into diagnostic models or develop advanced methods for automatic diagnostic queries 223; 224.

Some barriers lie in the information integration process. Although adhering to clinical guidelines is critical in medical scenarios, only a few studies considered this factor. For instance, Kresevic et al. 143 aimed to improve clinical decision support systems through accurate interpretation of medical guidelines for chronic Hepatitis C Virus infection management. Future works can integrate clinical guidelines for developing diagnostic systems. Besides, the integration and interpretation of lab test results pose significant value in healthcare. For example, He et al. 225 exploited LLMs to generate lab test-related responses to answer patients’ queries, thus gaining patients’ trust. A future direction is leveraging LLMs to interpret lab test results for professionals and patients.

Exploring the interaction between clinicians, patients, and diagnostic systems presents a promising avenue for research 226; 221; 222. In medical settings, diagnostic systems could function as assistants that provide supplementary information to enhance the accuracy or efficiency of clinicians 51; 157; 227; 228. Besides, these systems should incorporate feedback from medical experts, facilitating continuous refinement and adaptation. Additionally, a user-friendly interface is expected for human-machine interaction. For instance, doctors directly talk with the diagnostic systems to input patients’ information and perform discussions. In brief, future studies could explore how the effective application of diagnostic algorithms can further enhance clinical significance 229.

Another barriers lie in the decision-making step. While many studies emphasize diagnostic accuracy, they usually ignore human-centric perspectives such as model interpretability, patient privacy, safety, and fairness 30; 230; 231. Specifically, providing diagnostic predictions alone is insufficient in clinical scenarios, as the black-box nature of LLMs often undermines trust 205; 208. Accordingly, it is essential to provide interpretative insights into the diagnoses 208. For example, Dual-Inf is a prompt-based framework that not only offers potential diagnoses but also explains the rationale behind them 209. Regarding privacy, adherence to regulations like the Health Insurance Portability and Accountability Act (HIPAA) and the General Data Protection Regulation (GDPR) is essential, such as the de-identification of sensitive information 25; 232. To date, only a few works have investigated the issue 233; 80. For instance, SkinGPT-4 is a dermatology diagnostic system designed for local deployment to protect user privacy 233. Fairness is another concern, ensuring patients are not discriminated against based on gender, age, or race 230. Research addressing the fairness issue in LLM-based diagnosis remains limited 234; 235. In short, future research should integrate these human-centric perspectives into diagnostic systems to address these critical issues.

In terms of technical aspects, integrating multimodal data for disease diagnosis draws increasing attention 12. However, several challenges remain, including eliminating data noise 236, fusing heterogeneous data from various modalities 237, and performing efficient learning. Besides, many domain-specific LLMs are constrained by smaller parameter scales compared to general-domain LLMs 238; 203. This may be due to the lack of substantial corpora and computational resources necessary for training large-scale medical models 194. However, pre-training on vast medical datasets can embed more medical knowledge into LLMs, thereby enhancing their reasoning abilities and improving performance on rare diseases and complex cases 239; 240. Future work can also investigate employing multiple specialist models to boost diagnostic accuracy, as it simulates interdisciplinary clinical discussions for complex disease cases involving multiple clinical specialties 80; 241; 242. Additionally, hallucination is a long-standing issue in LLMs, which severely jeopardizes the reliability of diagnostic systems 243. To mitigate data-related hallucination, which is rooted in the misinformation or knowledge gap from training data, future studies can investigate knowledge editing 244 or retrieve external knowledge 143; 79 for diagnosis. For the training-related hallucinations that are raised by the intrinsic limitations of the architecture or training strategies in LLMs 245, future works can explore novel model architectures or pre-training strategies 239; 246.

Another critical area is the development of diagnostic systems. Many studies utilized private datasets, which are often inaccessible due to privacy concerns 143; 247. However, the advancement of diagnostic systems necessitates a greater availability of public data. The other issue is that the scarcity of annotated data poses a significant challenge to the development of this field. This is because well-annotated datasets enable exploiting automatic metrics for evaluation, reducing the need for extensive human effort in performance assessment 209. Therefore, constructing and releasing annotated benchmark datasets would significantly contribute to the research community. Moreover, performance evaluation should also be highlighted. Currently, there is no standardized guideline for evaluating diagnostic performance, particularly regarding human-centric metrics 49; 207; 248. A generic principle is to consider metrics from different aspects, such as effectiveness, robustness, reliability, and explainability, thereby providing a comprehensive evaluation.

In practice, the deployment of diagnostic systems remains a considerable challenge. Many studies reported that LLMs struggle to provide stable responses or predictions 249; 231. For instance, Hager et al. 231 discovered that the changes in instructions could result in large obvious changes in diagnostic accuracy. However, a stable and reproducible clinical decision is crucial in clinical scenarios. Therefore, future works can explore ensuring the stability of LLMs for diagnostic tasks. The other direction is to deploy diagnostic algorithms on mobile devices that can continuously and automatically collect basic signs and information from the human body, such as electroencephalogram rhythms and ECG rhythms. This enables mobile devices to send health-related risk alerts for early warning. In addition, early diagnosis draws wide attention and creates significant value 16; 237. For instance, early diagnosis of lung adenocarcinoma can increase the 5-year survival rate to 52% 250. However, only a few studies exploited LLMs for this purpose 121; 75. The difficulty lies in that many diseases typically lack obvious symptoms in the early stages and are hard to identify. Future directions can further explore how to deploy diagnostic systems for early diagnosis.

In conclusion, our study provided a comprehensive review of LLM-based methods for disease diagnosis. Our contributions were multifaceted. First, we summarized the disease types, the associated clinical specialties, clinical data, the employed LLM techniques, and evaluation methods within this research domain. Second, we compared the advantages and limitations of mainstream LLM techniques and evaluation methods, offering recommendations for developing diagnostic systems based on varying user demands. Third, we identified intriguing phenomena from the current studies and provided insights into their underlying causes. Lastly, we analyzed the current challenges and outlined the future directions of this research field. In summary, our review presented an in-depth analysis of LLM-based disease diagnosis, outlined its blueprint, inspired future research, and helped streamline efforts in developing diagnostic systems.

Methods

Search strategy and selection criteria

This scoping review is reported in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analysis (PRISMA) guidelines, as shown in Figure 1. We performed literature search from various resources to find relevant articles published between 1 Jan 2019 and 18 July 2024. We searched seven electronic databases, including PubMed, CINAHL, Scopus, Web of Science, Google Scholar, ACM Digital Library, and IEEE Xplore. The search terms were selected based on consensus expert opinion and used for each database (see Supplementary Data 1).

We performed a two-stage screening process to focus on LLMs for human disease diagnosis. The first stage involved using the title and abstract for paper exclusion. The criterion was as follows: (a) articles were not published in English; (b) articles irrelevant to LLMs or foundation models; and (c) articles irrelevant to the health domain. The second stage was full-text screening, emphasizing using language models for diagnosis-related tasks. We excluded review papers, editorials, and papers not explicitly used for disease diagnosis. Notably, the scope of “disease diagnosis” in this review was not confined to tasks that directly produced diagnoses, such as medical image classification; it also encompassed diagnosis-related tasks, such as depression identification 8 and suicide risk prediction 9. See Supplementary Data 2 for details of the scope. We also excluded studies concerning foundation models that do not incorporate text modalities, including visual foundation models. Full texts of studies reserved from the initial screening were independently evaluated for final eligibility by at least two examiners. Any disagreements were resolved by consensus or a third member.

Data extraction

Information garnered from the articles consists of four categories. (1) Basic information, including title, published venue, published time (year and month), and region of correspondence. (2) Data-related information, including data sources (continents), dataset type, modality (e.g., text, image, video, or text-image), clinical specialty, disease name, data availability (i.e., private or public data), and data size. (3) Model-related information, which comprises base LLM type, parameter size, and technique type. (4) Evaluation, which includes evaluation schema (e.g., automatic or human evaluation) and evaluation metric (e.g., accuracy and precision). See Supplementary Table 1 for details of the data extraction form.

Data synthesis

We synthesized insights from the data extraction to highlight the principal themes in LLM-based disease diagnosis. Firstly, we presented the scope of our review, spanning disease-associated clinical specialties, clinical data, data modalities, and LLM techniques. We also calculated the statistics of the meta-information, including development tendencies, the most widely used LLMs, and the distribution of the data sources. We then summarized various LLM-based techniques and evaluation strategies, analyzing their strengths and weaknesses, and offering targeted recommendations. Diving deeper into technical aspects, we detailed modeling approaches into four categories (prompt-based methods, RAG, fine-tuning, and pre-training), and fine-grained subtypes. We also examined the challenges faced by current research and outlined potential future directions. In summary, our synthesis encompassed a broad range of perspectives, assessing studies across data, LLM techniques, performance evaluation, and application scenarios, which are in line with established reporting standards.

Data availability

The analyzed data are included in this article. Aggregate data analyzed in this study will be released upon the acceptance of this paper.

References

  • Liu et al. 2020 Yuan Liu, Ayush Jain, Clara Eng, David H Way, Kang Lee, Peggy Bui, Kimberly Kanada, Guilherme de Oliveira Marinho, Jessica Gallegos, Sara Gabriele, et al. A deep learning system for differential diagnosis of skin diseases. Nature medicine, 26(6):900–908, 2020.
  • Mei et al. 2020 Xueyan Mei, Hao-Chih Lee, Kai-yue Diao, Mingqian Huang, Bin Lin, Chenyu Liu, Zongyu Xie, Yixuan Ma, Philip M Robson, Michael Chung, et al. Artificial intelligence–enabled rapid diagnosis of patients with covid-19. Nature medicine, 26(8):1224–1228, 2020.
  • Li et al. 2021 Xiaoqing Li, Dan Tian, Weihua Li, Bin Dong, Hansong Wang, Jiajun Yuan, Biru Li, Lei Shi, Xulin Lin, Liebin Zhao, et al. Artificial intelligence-assisted reduction in patients’ waiting time for outpatient process: a retrospective cohort study. BMC health services research, 21:1–11, 2021.
  • Li et al. 2024a Bing Li, Huan Chen, Weihong Yu, Ming Zhang, Fang Lu, Jingxue Ma, Yuhua Hao, Xiaorong Li, Bojie Hu, Lijun Shen, et al. The performance of a deep learning system in assisting junior ophthalmologists in diagnosing 13 major fundus diseases: a prospective multi-center clinical trial. npj Digital Medicine, 7(1):8, 2024a.
  • Qiu et al. 2020 Shangran Qiu, Prajakta S Joshi, Matthew I Miller, Chonghua Xue, Xiao Zhou, Cody Karjadi, Gary H Chang, Anant S Joshi, Brigid Dwyer, Shuhan Zhu, et al. Development and validation of an interpretable deep learning framework for alzheimer’s disease classification. Brain, 143(6):1920–1933, 2020.
  • Barnett et al. 1987 G Octo Barnett, James J Cimino, Jon A Hupp, and Edward P Hoffer. Dxplain: an evolving diagnostic decision-support system. Jama, 258(1):67–74, 1987.
  • Su et al. 2020 Chang Su, Zhenxing Xu, Jyotishman Pathak, and Fei Wang. Deep learning in mental health outcome research: a scoping review. Translational Psychiatry, 10(1):116, 2020.
  • Gkotsis et al. 2017 George Gkotsis, Anika Oellrich, Sumithra Velupillai, Maria Liakata, Tim JP Hubbard, Richard JB Dobson, and Rina Dutta. Characterisation of mental health conditions in social media using informed deep learning. Scientific reports, 7(1):1–11, 2017.
  • Du et al. 2018 Jingcheng Du, Yaoyun Zhang, Jianhong Luo, Yuxi Jia, Qiang Wei, Cui Tao, and Hua Xu. Extracting psychiatric stressors for suicide from social media using deep learning. BMC medical informatics and decision making, 18:77–87, 2018.
  • Sajda 2006 Paul Sajda. Machine learning for detection and diagnosis of disease. Annu. Rev. Biomed. Eng., 8(1):537–565, 2006.
  • Stafford et al. 2020 Imogen S Stafford, Melina Kellermann, Enrico Mossotto, Robert Mark Beattie, Ben D MacArthur, and Sarah Ennis. A systematic review of the applications of artificial intelligence and machine learning in autoimmune diseases. NPJ digital medicine, 3(1):30, 2020.
  • Kline et al. 2022 Adrienne Kline, Hanyin Wang, Yikuan Li, Saya Dennis, Meghan Hutch, Zhenxing Xu, Fei Wang, Feixiong Cheng, and Yuan Luo. Multimodal machine learning in precision health: A scoping review. npj Digital Medicine, 5(1):171, 2022.
  • Aggarwal et al. 2021 Ravi Aggarwal, Viknesh Sounderajah, Guy Martin, Daniel SW Ting, Alan Karthikesalingam, Dominic King, Hutan Ashrafian, and Ara Darzi. Diagnostic accuracy of deep learning in medical imaging: a systematic review and meta-analysis. NPJ digital medicine, 4(1):65, 2021.
  • Myszczynska et al. 2020 Monika A Myszczynska, Poojitha N Ojamies, Alix MB Lacoste, Daniel Neil, Amir Saffari, Richard Mead, Guillaume M Hautbergue, Joanna D Holbrook, and Laura Ferraiuolo. Applications of machine learning to diagnosis and treatment of neurodegenerative diseases. Nature reviews neurology, 16(8):440–456, 2020.
  • Fatima and Pasha 2017 Meherwar Fatima and Maruf Pasha. Survey of machine learning algorithms for disease diagnostic. Journal of Intelligent Learning Systems and Applications, 9(01):1–16, 2017.
  • Choy et al. 2023 Shern Ping Choy, Byung Jin Kim, Alexandra Paolino, Wei Ren Tan, Sarah Man Lin Lim, Jessica Seo, Sze Ping Tan, Luc Francis, Teresa Tsakok, Michael Simpson, et al. Systematic review of deep learning image analyses for the diagnosis and monitoring of skin disease. NPJ Digital Medicine, 6(1):180, 2023.
  • Zhou et al. 2024a Shuang Zhou, Xiao Huang, Ninghao Liu, Wen Zhang, Yuan-Ting Zhang, and Fu-Lai Chung. Open-world electrocardiogram classification via domain knowledge-driven contrastive learning. Neural Networks, 179:106551, 2024a.
  • Mei et al. 2023 Xueyan Mei, Zelong Liu, Ayushi Singh, Marcia Lange, Priyanka Boddu, Jingqi QX Gong, Justine Lee, Cody DeMarco, Chendi Cao, Samantha Platt, et al. Interstitial lung disease diagnosis and prognosis using an ai system integrating longitudinal data. Nature communications, 14(1):2272, 2023.
  • Zhou et al. 2021 Qianwei Zhou, Margarita Zuley, Yuan Guo, Lu Yang, Bronwyn Nair, Adrienne Vargo, Suzanne Ghannam, Dooman Arefan, and Shandong Wu. A machine and human reader study on ai diagnosis model safety under attacks of adversarial images. Nature communications, 12(1):7281, 2021.
  • Hannun et al. 2019 Awni Y Hannun, Pranav Rajpurkar, Masoumeh Haghpanahi, Geoffrey H Tison, Codie Bourn, Mintu P Turakhia, and Andrew Y Ng. Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network. Nature medicine, 25(1):65–69, 2019.
  • Brown et al. 2020 Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, 2020.
  • Touvron et al. 2023 Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  • Lu et al. 2024a Zhiyong Lu, Yifan Peng, Trevor Cohen, Marzyeh Ghassemi, Chunhua Weng, and Shubo Tian. Large language models in biomedicine and health: current research landscape and future directions. Journal of the American Medical Informatics Association, 31(9):1801–1811, 2024a.
  • Singhal et al. 2023 Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023.
  • Peng et al. 2024a Le Peng, Gaoxiang Luo, Sicheng Zhou, Jiandong Chen, Ziyue Xu, Ju Sun, and Rui Zhang. An in-depth evaluation of federated learning on biomedical natural language processing for information extraction. NPJ Digital Medicine, 7(1):127, 2024a.
  • Chen et al. 2024a Xiaolan Chen, Weiyi Zhang, Pusheng Xu, Ziwei Zhao, Yingfeng Zheng, Danli Shi, and Mingguang He. Ffa-gpt: an automated pipeline for fundus fluorescein angiography interpretation and question-answer. npj Digital Medicine, 7(1):111, 2024a.
  • Liu et al. 2023a Fenglin Liu, Tingting Zhu, Xian Wu, Bang Yang, Chenyu You, Chenyang Wang, Lei Lu, Zhangdaihong Liu, Yefeng Zheng, Xu Sun, et al. A medical multimodal large language model for future pandemics. NPJ Digital Medicine, 6(1):226, 2023a.
  • Lu et al. 2024b Ming Y Lu, Bowen Chen, Drew FK Williamson, Richard J Chen, Melissa Zhao, Aaron K Chow, Kenji Ikemura, Ahrong Kim, Dimitra Pouli, Ankush Patel, et al. A multimodal generative ai copilot for human pathology. Nature, pages 1–3, 2024b.
  • Kim et al. 2024a Jiyeong Kim, Kimberly G Leonte, Michael L Chen, John B Torous, Eleni Linos, Anthony Pinto, and Carolyn I Rodriguez. Large language models outperform mental and medical health care professionals in identifying obsessive-compulsive disorder. NPJ Digital Medicine, 7(1):193, 2024a.
  • Thirunavukarasu et al. 2023 Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. Large language models in medicine. Nature medicine, 29(8):1930–1940, 2023.
  • Zhou et al. 2023a Hongjian Zhou, Boyang Gu, Xinyu Zou, Yiru Li, Sam S Chen, Peilin Zhou, Junling Liu, Yining Hua, Chengfeng Mao, Xian Wu, et al. A survey of large language models in medicine: Progress, application, and challenge. arXiv preprint arXiv:2311.05112, 2023a.
  • Meng et al. 2024 Xiangbin Meng, Xiangyu Yan, Kuo Zhang, Da Liu, Xiaojuan Cui, Yaodong Yang, Muhan Zhang, Chunxia Cao, Jingjia Wang, Xuliang Wang, et al. The application of large language models in medicine: A scoping review. Iscience, 27(5), 2024.
  • Zhang et al. 2024a Yunkun Zhang, Jin Gao, Zheling Tan, Lingfeng Zhou, Kexin Ding, Mu Zhou, Shaoting Zhang, and Dequan Wang. Data-centric foundation models in computational healthcare: A survey. arXiv preprint arXiv:2401.02458, 2024a.
  • Du et al. 2024a Xinsong Du, Zhengyang Zhou, Yifei Wang, Ya-Wen Chuang, Richard Yang, Wenyu Zhang, Xinyi Wang, Rui Zhang, Pengyu Hong, David W Bates, et al. Generative large language models in electronic health records for patient care since 2023: A systematic review. medRxiv, pages 2024–08, 2024a.
  • Wang et al. 2024a Chong Wang, Mengyao Li, Junjun He, Zhongruo Wang, Erfan Darzi, Zan Chen, Jin Ye, Tianbin Li, Yanzhou Su, Jing Ke, et al. A survey for large language models in biomedicine. arXiv preprint arXiv:2409.00133, 2024a.
  • Li et al. 2024b Lingyao Li, Jiayan Zhou, Zhenxiang Gao, Wenyue Hua, Lizhou Fan, Huizi Yu, Loni Hagen, Yonfeng Zhang, Themistocles L Assimes, Libby Hemphill, et al. A scoping review of using large language models (llms) to investigate electronic health records (ehrs). arXiv preprint arXiv:2405.03066, 2024b.
  • He et al. 2023 Kai He, Rui Mao, Qika Lin, Yucheng Ruan, Xiang Lan, Mengling Feng, and Erik Cambria. A survey of large language models for healthcare: from data, technology, and applications to accountability and ethics. arXiv preprint arXiv:2310.05694, 2023.
  • Pressman et al. 2024 Sophia M Pressman, Sahar Borna, Cesar A Gomez-Cabello, Syed Ali Haider, Clifton R Haider, and Antonio Jorge Forte. Clinical and surgical applications of large language models: A systematic review. Journal of Clinical Medicine, 13(11):3041, 2024.
  • Wu et al. 2024a D. Wu, J. Yang, C. Liu, T. C. Hsieh, E. Marchi, J. Blair, P. Krawitz, C. Weng, W. Chung, G. J. Lyon, I. D. Krantz, J. M. Kalish, and K. Wang. Gestaltmml: Enhancing rare genetic disease diagnosis through multimodal machine learning combining facial images and clinical texts. ArXiv, 2024a. ISSN 2331-8422.
  • Mizuta et al. 2024 K. Mizuta, T. Hirosawa, Y. Harada, and T. Shimizu. Can chatgpt-4 evaluate whether a differential diagnosis list contains the correct diagnosis as accurately as a physician? Diagnosis (Berl), 11(3):321–324, 2024. ISSN 2194-802x. 10.1515/dx-2024-0027.
  • Noda et al. 2024 M. Noda, H. Yoshimura, T. Okubo, R. Koshu, Y. Uchiyama, A. Nomura, M. Ito, and Y. Takumi. Feasibility of multimodal artificial intelligence using gpt-4 vision for the classification of middle ear disease: Qualitative study and validation. Jmir ai, 3:e58342, 2024. ISSN 2817-1705. 10.2196/58342.
  • Overgaard Olesen et al. 2024 Anne Sophie Overgaard Olesen, Kristina Cecilia Miger, Olav Wendelboe Nielsen, and Johannes Grand. How does chatgpt-4 match radiologists in detecting pulmonary congestion on chest x-ray? Journal of Medical Artificial Intelligence, 7, 2024. ISSN 2617-2496.
  • Hu et al. 2024a Chuanbo Hu, Wenqi Li, Mindi Ruan, Xiangxu Yu, Lynn K Paul, Shuo Wang, and Xin Li. Exploiting chatgpt for diagnosing autism-associated language disorders and identifying distinct features. arXiv preprint arXiv:2405.01799, 2024a.
  • Rezaii et al. 2024 Neguine Rezaii, Daisy Hochberg, Megan Quimby, Bonnie Wong, Michael Brickhouse, Alexandra Touroutoglou, Bradford C Dickerson, and Phillip Wolff. Artificial intelligence classifies primary progressive aphasia from connected speech. Brain, 147(9):3070–3082, 2024.
  • Benary et al. 2023 Manuela Benary, Xing David Wang, Max Schmidt, Dominik Soll, Georg Hilfenhaus, Mani Nassir, Christian Sigler, Maren Knödler, Ulrich Keller, Dieter Beule, Ulrich Keilholz, Ulf Leser, and Damian T. Rieke. Leveraging large language models for decision support in personalized oncology. JAMA Network Open, 6(11):e2343689–e2343689, 2023. ISSN 2574-3805. 10.1001/jamanetworkopen.2023.43689.
  • Liu et al. 2023b Chunyu Liu, Yongpei Ma, Kavitha Kothur, Armin Nikpour, and Omid Kavehei. Biosignal copilot: Leveraging the power of llms in drafting reports for biomedical signals. medRxiv, page 2023.06.28.23291916, 2023b. 10.1101/2023.06.28.23291916.
  • Liu et al. 2023c Xin Liu, Daniel McDuff, Geza Kovacs, Isaac Galatzer-Levy, Jacob Sunshine, Jiening Zhan, Ming-Zher Poh, Shun Liao, Paolo Di Achille, and Shwetak Patel. Large language models are few-shot health learners. arXiv preprint arXiv:2305.15525, 2023c.
  • Slack and Singh 2023 Dylan Slack and Sameer Singh. Tablet: Learning from instructions for tabular data. arXiv preprint arXiv:2304.13188, 2023.
  • Dou et al. 2023 Chengfeng Dou, Zhi Jin, Wenping Jiao, Haiyan Zhao, Zhenwei Tao, and Yongqiang Zhao. Plugmed: Improving specificity in patient-centered medical dialogue generation using in-context learning. arXiv preprint arXiv:2305.11508, 2023.
  • Yang et al. 2024a Zhichao Yang, Avijit Mitra, Sunjae Kwon, and Hong Yu. Clinicalmamba: A generative clinical language model on longitudinal clinical notes, 2024a.
  • Siepmann et al. 2024 Robert Siepmann, Marc Huppertz, Annika Rastkhiz, Matthias Reen, Eric Corban, Christian Schmidt, Stephan Wilke, Philipp Schad, Can Yüksel, Christiane Kuhl, Daniel Truhn, and Sven Nebelung. The virtual reference radiologist: comprehensive ai assistance for clinical image reading and interpretation. European Radiology, 2024. ISSN 1432-1084. 10.1007/s00330-024-10727-2.
  • Mai et al. 2024 A. Shaaban Mai, Khan Adnan, and Yaqub Mohammad. Medpromptx: Grounded multimodal prompting for chest x-ray diagnosis. ArXiv, abs/2403.15585, 2024.
  • Xia et al. 2024a Peng Xia, Ze Chen, Juanxi Tian, Yangrui Gong, Ruibo Hou, Yue Xu, Zhenbang Wu, Zhiyuan Fan, Yiyang Zhou, Kangyu Zhu, et al. Cares: A comprehensive benchmark of trustworthiness in medical vision language models. arXiv preprint arXiv:2406.06007, 2024a.
  • Bae et al. 2024 Seongsu Bae, Daeun Kyung, Jaehee Ryu, Eunbyeol Cho, Gyubok Lee, Sunjun Kweon, Jungwoo Oh, Lei Ji, Eric Chang, Tackeun Kim, et al. Ehrxqa: A multi-modal question answering dataset for electronic health records with chest x-ray images. Advances in Neural Information Processing Systems, 36, 2024.
  • Wada et al. 2024 A. Wada, T. Akashi, G. Shih, A. Hagiwara, M. Nishizawa, Y. Hayakawa, J. Kikuta, K. Shimoji, K. Sano, K. Kamagata, A. Nakanishi, and S. Aoki. Optimizing gpt-4 turbo diagnostic accuracy in neuroradiology through prompt engineering and confidence thresholds. Diagnostics (Basel), 14(14), 2024. ISSN 2075-4418 (Print) 2075-4418. 10.3390/diagnostics14141541.
  • Moallem et al. 2024 Golnaz Moallem, Aneysis De Las Mercedes Gonzalez, Atman Desai, and Mirabela Rusu. Automated labeling of spondylolisthesis cases through spinal mri radiology report interpretation using chatgpt. In Medical Imaging 2024: Computer-Aided Diagnosis, volume 12927, pages 702–706. SPIE, 2024.
  • Chen et al. 2023a Zhiyu Chen, Yujie Lu, and William Yang Wang. Empowering psychotherapy with large language models: Cognitive distortion detection through diagnosis of thought prompting. arXiv preprint arXiv:2310.07146, 2023a.
  • Englhardt et al. 2024 Zachary Englhardt, Chengqian Ma, Margaret E Morris, Chun-Cheng Chang, Xuhai” Orson” Xu, Lianhui Qin, Daniel McDuff, Xin Liu, Shwetak Patel, and Vikram Iyer. From classification to clinical insights: Towards analyzing and reasoning about mobile and behavioral health data with large language models. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 8(2):1–25, 2024.
  • Parth et al. 2024 Vashisht Parth, Lodha Abhilasha, Maddipatla Mukta, Yao Zonghai, Mitra Avijit, Yang Zhichao, Wang Junda, Kwon Sunjae, and Yu Hong. Umass-bionlp at mediqa-m3g 2024: Dermprompt - a systematic exploration of prompt engineering with gpt-4v for dermatological diagnosis. ArXiv, abs/2404.17749, 2024.
  • Busch et al. 2024 Felix Busch, Tianyu Han, Marcus R Makowski, Daniel Truhn, Keno K Bressem, and Lisa Adams. Integrating text and image analysis: Exploring gpt-4v’s capabilities in advanced radiological applications across subspecialties. J Med Internet Res, 26:e54948, 2024. ISSN 1438-8871. 10.2196/54948.
  • Abdullahi et al. 2024 Tassallah Abdullahi, Ritambhara Singh, Carsten Eickhoff, et al. Learning to make rare and complex diagnoses with generative ai assistance: qualitative study of popular large language models. JMIR Medical Education, 10(1):e51391, 2024.
  • Lim et al. 2024 Sehee Lim, Yejin Kim, Chi-Hyun Choi, Jy-yong Sohn, and Byung-Hoon Kim. Erd: A framework for improving llm reasoning for cognitive distortion classification. arXiv preprint arXiv:2403.14255, 2024.
  • Kim et al. 2024b Yu Han Kim, Xuhai Orson Xu, Daniel McDuff, Cynthia Breazeal, and Hae Won Park. Health-llm: Large language models for health prediction via wearable sensor data. ArXiv, abs/2401.06866, 2024b.
  • Peng et al. 2024b Cheng Peng, Zehao Yu, Kaleb E Smith, Wei-Hsuan Lo-Ciganic, Jiang Bian, and Yonghui Wu. Improving generalizability of extracting social determinants of health using large language models through prompt-tuning. arXiv preprint arXiv:2403.12374, 2024b.
  • Wenshuo et al. 2023 Zhou Wenshuo, Ye Zhiyu, Yang Yehui, Wang Siqi, Huang Haifeng, Wang Rongjie, and Yang Dalu. Transferring pre-trained large language-image model for medical image captioning, 2023.
  • Niu et al. 2024 Shuai Niu, Jing Ma, Liang Bai, Zhihua Wang, Li Guo, and Xian Yang. Ehr-knowgen: Knowledge-enhanced multimodal learning for disease diagnosis generation. Information Fusion, 102:102069, 2024. ISSN 1566-2535. https://doi.org/10.1016/j.inffus.2023.102069.
  • Belyaeva et al. 2023 Anastasiya Belyaeva, Justin Cosentino, Farhad Hormozdiari, Krish Eswaran, Shravya Shetty, Greg Corrado, Andrew Carroll, Cory Y McLean, and Nicholas A Furlotte. Multimodal llms for health grounded in individual-specific data. In Workshop on Machine Learning for Multimodal Healthcare Data, pages 86–102. Springer, 2023.
  • Peng et al. 2024c Liang Peng, Songyue Cai, Zongqian Wu, Huifang Shang, Xiaofeng Zhu, and Xiaoxiao Li. Mmgpl: Multimodal medical data analysis with graph prompt learning. Medical Image Analysis, 97:103225, 2024c.
  • Wen et al. 2023a Yilin Wen, Zifeng Wang, and Jimeng Sun. Mindmap: Knowledge graph prompting sparks graph of thoughts in large language models. ArXiv, abs/2308.09729, 2023a.
  • Wu et al. 2024b Jiageng Wu, Xian Wu, and Jie Yang. Guiding clinical reasoning with large language models via knowledge seeds. arXiv preprint arXiv:2403.06609, 2024b.
  • Gao et al. 2023a Y. Gao, R. Li, Emma Croxford, Samuel Tesch, Daniel To, J. Caskey, B. W. Patterson, Matthew M. Churpek, T. Miller, D. Dligach, and M. Afshar. Large language models and medical knowledge grounding for diagnosis prediction. In medRxiv, 2023a.
  • Zhu et al. 2024 Yinghao Zhu, Changyu Ren, Zixiang Wang, Xiaochen Zheng, Shiyun Xie, Junlan Feng, Xi Zhu, Zhoujun Li, Liantao Ma, and Chengwei Pan. Emerge: Integrating rag for improved multimodal ehr predictive modeling. ArXiv, abs/2406.00036, 2024.
  • Ranjit et al. 2023 Mercy Prasanna Ranjit, Gopinath Ganapathy, Ranjit Frederick Manuel, and Tanuja Ganu. Retrieval augmented chest x-ray report generation using openai gpt models. ArXiv, abs/2305.03660, 2023.
  • Ge et al. 2023 Jin Ge, Steve Sun, Joseph Owens, Victor Galvez, Oksana Gologorskaya, Jennifer C. Lai, Mark J Pletcher, and Ki Lai. Development of a liver disease-specific large language model chat interface using retrieval augmented generation. medRxiv, 2023.
  • Upadhyaya et al. 2024 Aasef G. Upadhyaya, Dipak P.and Shaikh, Gokce Busra Cakir, Katrina Prantzalos, Pedram Golnari, Fatema F. Ghasia, and Satya S. Sahoo. A 360° view for large language models: Early detection of amblyopia in children using multi-view eye movement recordings. pages 165–175, 2024.
  • Xia et al. 2024b Peng Xia, Kangyu Zhu, Haoran Li, Hongtu Zhu, Yun Li, Gang Li, Linjun Zhang, and Huaxiu Yao. Rule: Reliable multimodal rag for factuality in medical vision language models. ArXiv, abs/2407.05131, 2024b.
  • Yu et al. 2024 Han Yu, Peikun Guo, and Akane Sano. Ecg semantic integrator (esi): A foundation ecg model pretrained with llm-enhanced cardiological text. ArXiv, abs/2405.19366, 2024.
  • Rau et al. 2023 Alexander Rau, Stephan Rau, Daniela Zoeller, Anna Fink, Hien Tran, Caroline Wilpert, Johanna Nattenmueller, Jakob Neubauer, Fabian Bamberg, Marco Reisert, and Maximilian Frederik Russe. A context-based chatbot surpasses trained radiologists and generic chatgpt in following the acr appropriateness guidelines. Radiology, 308 1:e230970, 2023.
  • Shi et al. 2023 Wenqi Shi, Yuchen Zhuang, Yuanda Zhu, Henry Iwinski, Michael Wattenbarger, and May Dongmei Wang. Retrieval-augmented large language models for adolescent idiopathic scoliosis patients in shared decision-making. In Proceedings of the 14th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pages 1–10, 2023.
  • Wang et al. 2024b Hao Wang, Sendong Zhao, Zewen Qiang, Nuwa Xi, Bing Qin, and Ting Liu. Beyond direct diagnosis: Llm-based multi-specialist agent consultation for automatic diagnosis. ArXiv, abs/2401.16107, 2024b.
  • Rifat Ahmmad Rashid et al. 2024 Mohammad Rifat Ahmmad Rashid, Mahamudul Hasan, Akibul Haque, Angon Bhadra Antu, Anika Tabassum Tanha, Anisur Rahman, and M Saddam Hossain Khan. A respiratory disease management framework by combining large language models and convolutional neural networks for effective diagnosis. International Journal of Computing and Digital Systems, 16(1):189–202, 2024.
  • Panagoulias et al. 2024 Dimitrios P. Panagoulias, Evridiki Tsoureli-Nikita, Maria K. Virvou, and George A. Tsihrintzis. Dermacen analytica: A novel methodology integrating multi-modal large language models with machine learning in tele-dermatology. ArXiv, abs/2403.14243, 2024.
  • Chen et al. 2024b Chen Chen, Lei Li, Marcel Beetz, Abhirup Banerjee, Ramneek Gupta, and Vicente Grau. Large language model-informed ecg dual attention network for heart failure risk prediction. ArXiv, abs/2403.10581, 2024b.
  • Toma et al. 2023 Augustin Toma, Patrick R Lawler, Jimmy Ba, Rahul G Krishnan, Barry B Rubin, and Bo Wang. Clinical camel: An open expert-level medical language model with dialogue-based knowledge encoding. arXiv, 2023.
  • Ting et al. 2024 Daniel Shu Wei Ting, Jasmine Chiat Ling Ong, Liyuan Jin, Elangovan Kabilan, Gilbert Yong San Lim, Daniel Yan Zheng Lim, Gerald Gui Ren Sng, Yuhe Ke, Joshua Yi Min Tung, Ryan Jian Zhong, Christopher Ming Yao Koh, Keane Zhi Hao Lee, Xiang Chen, Jack Kian Ch’ng, Than Aung, and Ken Junyang Goh. Development and testing of a novel large language model-based clinical decision support systems for medication safety in 12 clinical specialties. Research Square, 2024.
  • Vithanage et al. 2024 Dinithi Vithanage, Chao Deng, Lei Wang, Mengyang Yin, Mohammad Alkhalaf, Zhenyua Zhang, Yunshu Zhu, Alan Christy Soewargo, and Ping Yu. Evaluating machine learning approaches for multi-label classification of unstructured electronic health records with a generative large language model. bioRxiv, 2024.
  • Liu et al. 2023d Junwen Liu, Zheyu Zhang, Jifeng Xiao, Zhijia Jin, Xuekun Zhang, Yuanyuan Ma, Fuhua Yan, and Ning Wen. Large language model locally fine-tuning (LLMLF) on chinese medical imaging reports. In Proceedings of the 2023 6th International Conference on Big Data Technologies, New York, NY, USA, 2023d. ACM.
  • Liu et al. 2023e Fenglin Liu, Tingting Zhu, Xian Wu, Bang Yang, Chenyu You, Chenyang Wang, Lei Lu, Zhangdaihong Liu, Yefeng Zheng, Xu Sun, Yang Yang, Lei Clifton, and David A Clifton. A medical multimodal large language model for future pandemics. NPJ Digit. Med., 6(1):226, 2023e.
  • Song et al. 2024 Meiyue Song, Jiarui Wang, Zhihua Yu, Jiaxin Wang, Le Yang, Yuting Lu, Baicun Li, Xue Wang, Xiaoxu Wang, Qinghua Huang, Zhijun Li, Nikolaos I Kanellakis, Jiangfeng Liu, Jing Wang, Binglu Wang, and Juntao Yang. PneumoLLM: Harnessing the power of large language model for pneumoconiosis diagnosis. Med. Image Anal., 97(103248):103248, 2024.
  • Yang et al. 2024b Lin Yang, Shawn Xu, Andrew Sellergren, Timo Kohlberger, Yuchen Zhou, Ira Ktena, Atilla Kiraly, Faruk Ahmed, Farhad Hormozdiari, Tiam Jaroensri, Eric Wang, Ellery Wulczyn, Fayaz Jamil, Theo Guidroz, Chuck Lau, Siyuan Qiao, Yun Liu, Akshay Goel, Kendall Park, Arnav Agharwal, Nick George, Yang Wang, Ryutaro Tanno, David G T Barrett, Wei-Hung Weng, S Sara Mahdavi, Khaled Saab, Tao Tu, Sreenivasa Raju Kalidindi, Mozziyar Etemadi, Jorge Cuadros, Gregory Sorensen, Yossi Matias, Katherine Chou, Greg Corrado, Joelle Barral, Shravya Shetty, David Fleet, S M Ali Eslami, Daniel Tse, Shruthi Prabhakara, Cory McLean, Dave Steiner, Rory Pilgrim, Christopher Kelly, Shekoofeh Azizi, and Daniel Golden. Advancing multimodal medical capabilities of gemini. arXiv, 2024b.
  • Zhang et al. 2024b Xiangyu Zhang, Hexin Liu, Kaishuai Xu, Qiquan Zhang, Daijiao Liu, Beena Ahmed, and Julien Epps. When LLMs meets acoustic landmarks: An efficient approach to integrate speech into large language models for depression detection. arXiv, 2024b.
  • Sun et al. 2024a Yuxuan Sun, Chenglu Zhu, Sunyi Zheng, Kai Zhang, Lin Sun, Zhongyi Shui, Yunlong Zhang, Honglin Li, and Lin Yang. Pathasst: A generative foundation ai assistant towards artificial general intelligence of pathology. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 5034–5042, 2024a.
  • Liu and Zuo 2023 Weihua Liu and Yong Zuo. Stone needle: A general multimodal large-scale model framework towards healthcare. arXiv preprint arXiv:2306.16034, 2023.
  • Kraljevic et al. 2024 Zeljko Kraljevic, Dan Bean, Anthony Shek, Rebecca Bendayan, Harry Hemingway, Joshua Au Yeung, Alexander Deng, Alfred Baston, Jack Ross, Esther Idowu, et al. Foresight—a generative pretrained transformer for modelling of patient timelines using electronic health records: a retrospective modelling study. The Lancet Digital Health, 6(4):e281–e290, 2024.
  • Jiang et al. 2023 Lavender Yao Jiang, Xujin Chris Liu, Nima Pour Nejatian, Mustafa Nasir-Moin, Duo Wang, Anas Abidin, Kevin Eaton, Howard Antony Riina, Ilya Laufer, Paawan Punjabi, et al. Health system-scale language models are all-purpose prediction engines. Nature, 619(7969):357–362, 2023.
  • Dou et al. 2024a Chengfeng Dou, Zhi Jin, Wenpin Jiao, Haiyan Zhao, Yongqiang Zhao, and Zhenwei Tao. Integrating physician diagnostic logic into large language models: Preference learning from process feedback. arXiv, 2024a.
  • Wang et al. 2023a Guangyu Wang, Guoxing Yang, Zongxin Du, Longjun Fan, and Xiaohu Li. ClinicalGPT: Large language models finetuned with diverse medical data and comprehensive evaluation. arXiv, 2023a.
  • Zhang et al. 2023a Hongbo Zhang, Junying Chen, Feng Jiang, Fei Yu, Zhihong Chen, Jianquan Li, Guiming Chen, Xiangbo Wu, Zhiyi Zhang, Qingying Xiao, Xiang Wan, Benyou Wang, and Haizhou Li. HuatuoGPT, towards taming language model to be a doctor. arXiv, 2023a.
  • Zhou et al. 2024b Zijian Zhou, Miaojing Shi, Meng Wei, Oluwatosin Alabi, Zijie Yue, and Tom Vercauteren. Large model driven radiology report generation with clinical quality reinforcement learning. arXiv, 2024b.
  • Sun 2024 Maojun Sun. LlamaCare: A large medical language model for enhancing healthcare knowledge sharing. arXiv, 2024.
  • Yang et al. 2024c Dingkang Yang, Jinjie Wei, Dongling Xiao, Shunli Wang, Tong Wu, Gang Li, Mingcheng Li, Shuaibing Wang, Jiawei Chen, Yue Jiang, Qingyao Xu, Ke Li, Peng Zhai, and Lihua Zhang. PediatricsGPT: Large language models as chinese medical assistants for pediatric applications. arXiv, 2024c.
  • Chen et al. 2024c Zhixuan Chen, Luyang Luo, Yequan Bie, and Hao Chen. Dia-LLaMA: Towards large language model-driven CT report generation. arXiv, 2024c.
  • Rajashekar et al. 2024a Niroop Channa Rajashekar, Yeo Eun Shin, Yuan Pu, Sunny Chung, Kisung You, Mauro Giuffre, Colleen E Chan, Theo Saarinen, Allen Hsiao, Jasjeet Sekhon, Ambrose H Wong, Leigh V Evans, Rene F Kizilcec, Loren Laine, Terika Mccall, and Dennis Shung. Human-algorithmic interaction using a large language model-augmented artificial intelligence clinical decision support system. In Proceedings of the CHI Conference on Human Factors in Computing Systems, volume 37, pages 1–20, New York, NY, USA, 2024a. ACM.
  • Labrak et al. 2024 Yanis Labrak, Adrien Bazoge, Emmanuel Morin, Pierre-Antoine Gourraud, Mickael Rouvier, and Richard Dufour. BioMistral: A collection of open-source pretrained large language models for medical domains. arXiv, 2024.
  • Chen et al. 2024d Junying Chen, Ruyi Ouyang, Anningzhe Gao, Shunian Chen, Guiming Hardy Chen, Xidong Wang, Ruifei Zhang, Zhenyang Cai, Ke Ji, Guangjun Yu, Xiang Wan, and Benyou Wang. HuatuoGPT-vision, towards injecting medical visual knowledge into multimodal LLMs at scale. arXiv, 2024d.
  • Xu et al. 2024a Hanwen Xu, Naoto Usuyama, Jaspreet Bagga, Sheng Zhang, Rajesh Rao, Tristan Naumann, Cliff Wong, Zelalem Gero, Javier González, Yu Gu, Yanbo Xu, Mu Wei, Wenhui Wang, Shuming Ma, Furu Wei, Jianwei Yang, Chunyuan Li, Jianfeng Gao, Jaylen Rosemon, Tucker Bower, Soohee Lee, Roshanthi Weerasinghe, Bill J Wright, Ari Robicsek, Brian Piening, Carlo Bifulco, Sheng Wang, and Hoifung Poon. A whole-slide foundation model for digital pathology from real-world data. Nature, 630(8015):181–188, 2024a.
  • Ding et al. 2024 Jun-En Ding, Nguyen Minh Thao Phan, Wen-Chih Peng, Jian-Zhe Wang, Chun-Cheng Chug, Min-Chen Hsieh, Yun-Chien Tseng, Ling Chen, Dongsheng Luo, Chenwei Wu, Chi-Te Wang, Chih-Ho Hsu, Yi-Tui Chen, Pei-Fu Chen, Feng Liu, and Fang-Ming Hung. Large language multimodal models for new-onset type 2 diabetes prediction using five-year cohort electronic health records. Research Square, 2024.
  • Liu et al. 2023f Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Comput. Surv., 55(9), 2023f. ISSN 0360-0300. 10.1145/3560815.
  • Jiaqi et al. 2023 Wang Jiaqi, Shi Enze, Yu Sigang, Wu Zihao, Ma Chong, Dai Haixing, Yang Qiushi, Kang Yanqing, Wu Jinru, Hu Huawen, Yue Chenxi, Zhang Haiyang, Liu Yi-Hsueh, Li Xiang, Ge Bao, Zhu Dajiang, Yuan Yixuan, Shen Dinggang, Liu Tianming, and Zhang Shu. Prompt engineering for healthcare: Methodologies and applications. ArXiv, abs/2304.14670, 2023.
  • Gao et al. 2023b Zhangyang Gao, Yuqi Hu, Cheng Tan, and Stan Z. Li. Prefixmol: Target- and chemistry-aware molecule design via prefix embedding. ArXiv preprint, abs/2302.07120, 2023b.
  • Kojima et al. 2022 Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022.
  • Wei et al. 2024 Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2024.
  • Wang et al. 2023b Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Huai hsin Chi, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. International Conference on Learning Representations, 2023b.
  • Safranek et al. 2024 Conrad W Safranek, Thomas Huang, Donald S Wright, Catherine X Wright, Vimig Socrates, Rohit B Sangal, Mark Iscoe, David Chartash, and R Andrew Taylor. Automated heart score determination via chatgpt: Honing a framework for iterative prompt development. Journal of the American College of Emergency Physicians Open, 5(2):e13133, 2024.
  • Chung et al. 2024 Philip Chung, Christine T Fong, Andrew M Walters, Nima Aghaeepour, Meliha Yetisgen, and Vikas N O’Reilly-Shah. Large language model capabilities in perioperative risk prediction and prognostication. JAMA surgery, 2024.
  • Delsoz et al. 2023 M. Delsoz, Y. Madadi, W. M. Munir, B. Tamm, S. Mehravaran, M. Soleimani, A. Djalilian, and S. Yousefi. Performance of chatgpt in diagnosis of corneal eye diseases. medRxiv, 2023. 10.1101/2023.08.25.23294635.
  • Fink et al. 2023 Matthias A. Fink, Arved Bischoff, Christoph A. Fink, Martin Moll, Jonas Kroschke, Luca Dulz, Claus Peter Heusel, Hans-Ulrich Kauczor, and Tim F. Weber. Potential of chatgpt and gpt-4 for data mining of free-text ct reports on lung cancer. Radiology, 308(3):e231362, 2023. 10.1148/radiol.231362.
  • Reese et al. 2023 Justin T Reese, Daniel Danis, J Harry Caufield, Tudor Groza, Elena Casiraghi, Giorgio Valentini, Christopher J Mungall, and Peter N Robinson. On the limitations of large language models in clinical diagnosis. medRxiv, 2023.
  • Sarangi et al. 2024 Pradosh Kumar Sarangi, Aparna Irodi, Swaha Panda, Debasish Swapnesh Kumar Nayak, and Himel Mondal. Radiological differential diagnoses based on cardiovascular and thoracic imaging patterns: perspectives of four large language models. Indian Journal of Radiology and Imaging, 34(02):269–275, 2024.
  • Wang et al. 2024c Jiankun Wang, Sumyeong Ahn, Taykhoom Dalal, Xiaodan Zhang, Weishen Pan, Qiannan Zhang, Bin Chen, Hiroko H Dodge, Fei Wang, and Jiayu Zhou. Augmented risk prediction for the onset of alzheimer’s disease from electronic health records with large language models. arXiv preprint arXiv:2405.16413, 2024c.
  • Du et al. 2024b Xinsong Du, John Novoa-Laurentiev, Joseph M Plasaek, Ya-Wen Chuang, Liqin Wang, Gad Marshall, Stephanie K Mueller, Frank Chang, Surabhi Datta, Hunki Paek, et al. Enhancing early detection of cognitive decline in the elderly: A comparative study utilizing large language models in clinical notes. medRxiv, 2024b.
  • Haider et al. 2024 Syed Ali Haider, Sophia M Pressman, Sahar Borna, Cesar A Gomez-Cabello, Ajai Sehgal, Bradley C Leibovich, and Antonio Jorge Forte. Evaluating large language model (llm) performance on established breast classification systems. Diagnostics, 14(14):1491, 2024.
  • Wu et al. 2024c Xuzhou Wu, Guangxin Li, Xing Wang, Zeyu Xu, Yingni Wang, Jianming Xian, Xueyu Wang, Gong Li, and Kehong Yuan. Diagnosis assistant for liver cancer utilizing a large language model with three types of knowledge. 2024c.
  • Xu et al. 2023 Shawn Xu, Lin Yang, Christopher Kelly, Marcin Sieniek, Timo Kohlberger, Martin Ma, Wei-Hung Weng, Atilla Kiraly, Sahar Kazemzadeh, Zakkai Melamed, et al. Elixr: Towards a general purpose x-ray artificial intelligence system through alignment of large language models and radiology vision encoders. arXiv preprint arXiv:2308.01317, 2023.
  • Gertz et al. 2024 Roman Johannes Gertz, Thomas Dratsch, Alexander Christian Bunck, Simon Lennartz, Andra-Iza Iuga, Martin Gunnar Hellmich, Thorsten Persigehl, Lenhard Pennig, Carsten Herbert Gietzen, Philipp Fervers, et al. Potential of gpt-4 for detecting errors in radiology reports: Implications for reporting accuracy. Radiology, 311(1):e232714, 2024.
  • Ono et al. 2024 D. Ono, D. W. Dickson, and S. Koga. Evaluating the efficacy of few-shot learning for gpt-4vision in neurodegenerative disease histopathology: A comparative analysis with convolutional neural network model. Neuropathol Appl Neurobiol, 50(4):e12997, 2024. ISSN 0305-1846. 10.1111/nan.12997.
  • Dai et al. 2021 Yin Dai, Yifan Gao, and Fayu Liu. Transmed: Transformers advance multi-modal medical image classification. Diagnostics, 11(8):1384, 2021.
  • Antaki et al. 2024 F. Antaki, R. Chopra, and P. A. Keane. Vision-language models for feature detection of macular diseases on optical coherence tomography. JAMA Ophthalmol, 142(6):573–576, 2024. ISSN 2168-6165 (Print) 2168-6165. 10.1001/jamaophthalmol.2024.1165.
  • Peng et al. 2023a Zhiyu Peng, Ruiqi Ma, Yihan Zhang, Mingxu Yan, Jie Lu, Qian Cheng, Jingjing Liao, Yunqiu Zhang, Jinghan Wang, Yue Zhao, et al. Development and evaluation of multimodal ai for diagnosis and triage of ophthalmic diseases using chatgpt and anterior segment images: protocol for a two-stage cross-sectional study. Frontiers in Artificial Intelligence, 6:1323924, 2023a.
  • Suh et al. 2024 Pae Sun Suh, Woo Hyun Shim, Chong Hyun Suh, Hwon Heo, Chae Ri Park, Hye Joung Eom, Kye Jin Park, Jooae Choe, Pyeong Hwa Kim, Hyo Jung Park, et al. Comparing diagnostic accuracy of radiologists versus gpt-4v and gemini pro vision using image inputs from diagnosis please cases. Radiology, 312(1):e240273, 2024.
  • Pugliese et al. 2023 Giorgia Pugliese, Alberto Maccari, Elena Felisati, Giovanni Felisati, Leonardo Giudici, Chiara Rapolla, Antonia Pisani, and Alberto Maria Saibene. Are artificial intelligence large language models a reliable tool for difficult differential diagnosis? an a posteriori analysis of a peculiar case of necrotizing otitis externa. Clinical Case Reports, 11(9):e7933, 2023.
  • Shijian et al. 2024 Deng Shijian, E. Kosloski Erin, Patel Siddhi, A. Barnett Zeke, Nan Yiyang, Kaplan Alexander, Aarukapalli Sisira, T. Doan William, Wang Matthew, Singh Harsh, Rollins Pamela Rosenthal, and Tian Yapeng. Hear me, see me, understand me: Audio-visual autism behavior recognition. ArXiv, abs/2406.02554, 2024.
  • Liang et al. 2024 Lungang Liang, Yulan Chen, Taifu Wang, Dan Jiang, Jishuo Jin, Yanmeng Pang, Qin Na, Qiang Liu, Xiaosen Jiang, Wentao Dai, et al. Genetic transformer: An innovative large language model driven approach for rapid and accurate identification of causative variants in rare genetic diseases. medRxiv, pages 2024–07, 2024.
  • Feng et al. 2023 Yingjie Feng, Xiaoyin Xu, Yueting Zhuang, and Min Zhang. Large language models improve alzheimer’s disease diagnosis using multi-modality data. In 2023 IEEE International Conference on Medical Artificial Intelligence (MedAI), pages 61–66. IEEE, 2023.
  • Yu et al. 2023a Han Yu, Peikun Guo, and Akane Sano. Zero-shot ecg diagnosis with large language models and retrieval-augmented generation. In ML4H@NeurIPS, 2023a.
  • Wu et al. 2023 Da Wu, Jingye Yang, Steven Klein, Cong Liu, Tzung-Chien Hsieh, Peter Krawitz, Chunhua Weng, Gholson J Lyon, Jennifer M Kalish, and Kai Wang. Multimodal machine learning combining facial images and clinical texts improves diagnosis of rare genetic diseases. arXiv preprint arXiv:2312.15320, 2023.
  • Ma et al. 2024 Mingyu Derek Ma, Chenchen Ye, Yu Yan, Xiaoxuan Wang, Peipei Ping, Timothy S Chang, and Wei Wang. Clibench: Multifaceted evaluation of large language models in clinical decisions on diagnoses, procedures, lab tests orders and prescriptions. arXiv preprint arXiv:2406.09923, 2024.
  • Fraser et al. 2023 Hamish Fraser, Daven Crossland, Ian Bacher, Megan Ranney, Tracy Madsen, and Ross Hilliard. Comparison of diagnostic and triage accuracy of ada health and webmd symptom checkers, chatgpt, and physicians for patients in an emergency department: Clinical data analysis study. JMIR Mhealth Uhealth, 11:e49995, Oct 2023.
  • Yu et al. 2023b Han Yu, Peikun Guo, and Akane Sano. Zero-shot ecg diagnosis with large language models and retrieval-augmented generation. In Stefan Hegselmann, Antonio Parziale, Divya Shanmugam, Shengpu Tang, Mercy Nyamewaa Asiedu, Serina Chang, Tom Hartvigsen, and Harvineet Singh, editors, Proceedings of the 3rd Machine Learning for Health Symposium, volume 225 of Proceedings of Machine Learning Research, pages 650–663. PMLR, 10 Dec 2023b.
  • Zhang et al. 2023b Haodi Zhang, Jiahong Li, Yichi Wang, and Yuanfeng Song. Integrating automated knowledge extraction with large language models for explainable medical decision-making. 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 1710–1717, 2023b.
  • Thompson et al. 2023 Will Thompson, David M. Vidmar, Jessica K. De Freitas, John M. Pfeifer, Brandon K. Fornwalt, Ruijun Chen, Gabriel Altay, Kabir Manghnani, Andrew C. Nelsen, Kellie Morland, Martin C. Stumpe, and Riccardo Miotto. Large language models with retrieval-augmented generation for zero-shot disease phenotyping. ArXiv, abs/2312.06457, 2023.
  • Zhao et al. 2024 Wenting Zhao, Zhongfen Deng, Shweta Yadav, and Philip S. Yu. Heterogeneous knowledge grounding for medical question answering with retrieval augmented large language model. Companion Proceedings of the ACM on Web Conference 2024, 2024.
  • Kresevic et al. 2024 Simone Kresevic, Mauro Giuffrè, Milos Ajcevic, Agostino Accardo, Lory S Crocè, and Dennis L Shung. Optimization of hepatological clinical guidelines interpretation by large language models: a retrieval augmented generation-based framework. NPJ Digital Medicine, 7(1):102, 2024.
  • Zhenzhu et al. 2024 Li Zhenzhu, Zhang Jingfeng, Zhou Wei, Zheng Jianjun, and Xia Yinshui. Gpt-agents based on medical guidelines can improve the responsiveness and explainability of outcomes for traumatic brain injury rehabilitation. Scientific Reports, 14(1):7626, 2024.
  • Ghersin et al. 2023 Itai Ghersin, R Weisshof, Eduard Koifman, Haggai Bar-Yoseph, Dana Ben Hur, Itay Maza, Erez Hasnis, Roni Nasser, Baruch Ovadia, Dikla Dror Zur, Matti Waterman, and Yuri Gorelik. Comparative evaluation of a language model and human specialists in the application of european guidelines for the management of inflammatory bowel diseases and malignancies. Endoscopy, 2023.
  • McInerney et al. 2024a Denis Jered McInerney, William Dickinson, Lucy Flynn, Andrea Young, Geoffrey Young, Jan-Willem van de Meent, and Byron C. Wallace. Towards reducing diagnostic errors with interpretable risk prediction. ArXiv, abs/2402.10109, 2024a.
  • Vaid et al. 2024 Akhil Vaid, Joshua Lampert, Juhee Lee, Ashwin Sawant, Donald Apakama, Ankit Sakhuja, Ali Soroush, Denise Lee, Isotta Landi, Nicole Bussola, Ismail Nabeel, Robbie Freeman, Patricia H. Kovatch, Brendan G. Carr, Benjamin S. Glicksberg, Edgar Argulian, Stamatios Lerakis, Monica Kraft, Alexander Charney, and Girish N. Nadkarni. Generative large language models are autonomous practitioners of evidence-based medicine. CoRR, abs/2401.02851, 2024. 10.48550/ARXIV.2401.02851.
  • Oniani et al. 2024 David Oniani, Xizhi Wu, Shyam Visweswaran, Sumit Kapoor, Shravan Kooragayalu, Katelyn Polanska, and Yanshan Wang. Enhancing large language models for clinical decision support by incorporating clinical practice guidelines. 2024 IEEE 12th International Conference on Healthcare Informatics (ICHI), pages 694–702, 2024.
  • Molinet et al. 2024 Benjamin Molinet, Santiago Marro, Elena Cabrio, and Serena Villata. Explanatory argumentation in natural language for correct and incorrect medical diagnoses. Journal of Biomedical Semantics, 15, 2024.
  • Sharma 2024 Naman Sharma. Cxr-agent: Vision-language models for chest x-ray interpretation with uncertainty aware radiology reporting. ArXiv, abs/2407.08811, 2024.
  • Ferber et al. 2024 Dyke Ferber, Omar S. M. El Nahhas, Georg Wölflein, Isabella C. Wiest, Jan Clusmann, Marie-Elisabeth Lessman, Sebastian Foersch, Jacqueline Lammert, Maximilian Tschochohei, Dirk Jäger, Manuel Salto-Tellez, Nikolaus Schultz, Daniel Truhn, and Jakob Nikolas Kather. Autonomous artificial intelligence agents for clinical decision making in oncology. ArXiv, abs/2404.04667, 2024.
  • Soong et al. 2023 David Soong, Sriram Sridhar, Han Si, J. S. Wagner, Ana Caroline Costa S’a, Christina Y. Yu, Kubra Karagoz, Meijian Guan, Hisham K Hamadeh, and Brandon Higgs. Improving accuracy of gpt-3/4 results on biomedical data using a retrieval-augmented language model. PLOS Digital Health, 3, 2023.
  • Choi et al. 2024 Hongyoon Choi, Dongjoo Lee, and Yeon koo Kang. Empowering pet imaging reporting with retrieval-augmented large language models and reading reports database: A pilot single center study. In medRxiv, 2024.
  • Jia et al. 2024 Mingyi Jia, Junwen Duan, Yan Song, and Jianxin Wang. medikal: Integrating knowledge graphs as assistants of llms for enhanced clinical diagnosis on emrs. ArXiv, abs/2406.14326, 2024.
  • Askell et al. 2021 Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Jackson Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, and Jared Kaplan. A general language assistant as a laboratory for alignment. arXiv, 2021.
  • Liu et al. 2023g Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023g.
  • McDuff et al. 2023 Daniel McDuff, Mike Schaekermann, Tao Tu, Anil Palepu, Amy Wang, Jake Garrison, Karan Singhal, Yash Sharma, Shekoofeh Azizi, Kavita Kulkarni, et al. Towards accurate differential diagnosis with large language models. arXiv preprint arXiv:2312.00164, 2023.
  • Zhang et al. 2024c Kai Zhang, Rong Zhou, Eashan Adhikarla, Zhiling Yan, Yixin Liu, Jun Yu, Zhengliang Liu, Xun Chen, Brian D Davison, Hui Ren, et al. A generalist vision–language foundation model for diverse biomedical tasks. Nature Medicine, pages 1–13, 2024c.
  • Yang et al. 2023 Songhua Yang, Hanjie Zhao, Senbin Zhu, Guangyu Zhou, Hongfei Xu, Yuxiang Jia, and Hongying Zan. Zhongjing: Enhancing the chinese medical capabilities of large language model through expert feedback and real-world multi-turn dialogue. arXiv, 2023.
  • Saab et al. 2024 Khaled Saab, Tao Tu, Wei-Hung Weng, Ryutaro Tanno, David Stutz, Ellery Wulczyn, Fan Zhang, Tim Strother, Chunjong Park, Elahe Vedadi, et al. Capabilities of gemini models in medicine. arXiv, 2024.
  • Wu et al. 2024d Jiageng Wu, Xian Wu, Yefeng Zheng, and Jie Yang. MedKP: Medical dialogue with knowledge enhancement and clinical pathway encoding. arXiv, 2024d.
  • He et al. 2024a Yuhong He, Yongqi Zhang, Shizhu He, and Jun Wan. BP4ER: Bootstrap prompting for explicit reasoning in medical dialogue generation. arXiv, 2024a.
  • Luo et al. 2024 Tongxu Luo, Jiahe Lei, Fangyu Lei, Weihao Liu, Shizhu He, Jun Zhao, and Kang Liu. MoELoRA: Contrastive learning guided mixture of experts on parameter-efficient fine-tuning for large language models. arXiv, 2024.
  • Xu et al. 2024b Kaishuai Xu, Yi Cheng, Wenjun Hou, Qiaoyu Tan, and Wenjie Li. Reasoning like a doctor: Improving medical dialogue systems via diagnostic reasoning process alignment. arXiv, 2024b.
  • He et al. 2024b Sunan He, Yuxiang Nie, Zhixuan Chen, Zhiyuan Cai, Hongmei Wang, Shu Yang, and Hao Chen. MedDr: Diagnosis-guided bootstrapping for large-scale medical vision-language learning. arXiv, 2024b.
  • Zhou et al. 2024c Hong-Yu Zhou, Subathra Adithan, Julián Nicolás Acosta, Eric J Topol, and Pranav Rajpurkar. A generalist learner for multifaceted medical image interpretation. arXiv, 2024c.
  • Ji et al. 2024 Jia Ji, Yongshuai Hou, Xinyu Chen, Youcheng Pan, and Yang Xiang. Vision-language model for generating textual descriptions from clinical images: Model development and validation study. JMIR Form. Res., 8:e32690, 2024.
  • Liu et al. 2023h Zhengliang Liu, Yiwei Li, Peng Shu, Aoxiao Zhong, Longtao Yang, Chao Ju, Zihao Wu, Chong Ma, Jie Luo, Cheng Chen, et al. Radiology-llama2: Best-in-class large language model for radiology. arXiv preprint arXiv:2309.06419, 2023h.
  • Alkhaldi et al. 2024 Asma Alkhaldi, Raneem Alnajim, Layan Alabdullatef, Rawan Alyahya, Jun Chen, Deyao Zhu, Ahmed Alsinan, and Mohamed Elhoseiny. MiniGPT-med: Large language model as a general interface for radiology diagnosis. arXiv, 2024.
  • Lee et al. 2023 Seowoo Lee, Jiwon Youn, Hyungjin Kim, Mansu Kim, and Soon Ho Yoon. CXR-LLAVA: a multimodal large language model for interpreting chest X-ray images. arXiv, 2023.
  • Kwon et al. 2024 Taeyoon Kwon, Kai Tzu-iunn Ong, Dongjin Kang, Seungjun Moon, Jeong Ryong Lee, Dosik Hwang, Beomseok Sohn, Yongsik Sim, Dongha Lee, and Jinyoung Yeo. Large language models are clinical reasoners: Reasoning-aware diagnosis framework with prompt-generated rationales. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 18417–18425, 2024.
  • Zhou et al. 2023b Juexiao Zhou, Xiaonan He, Liyuan Sun, Jiannan Xu, Xiuying Chen, Yuetan Chu, Longxi Zhou, Xingyu Liao, Bin Zhang, and Xin Gao. SkinGPT-4: An interactive dermatology diagnostic system with visual large language model. arXiv, 2023b.
  • Ouyang et al. 2022 Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke E Miller, Maddie Simens, Amanda Askell, P Welinder, P Christiano, J Leike, and Ryan J Lowe. Training language models to follow instructions with human feedback. ArXiv preprint, abs/2203.02155, 2022.
  • Schulman et al. 2017 John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv, 2017.
  • Gao et al. 2022 Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. arXiv, 2022.
  • Marks et al. 2023 Luke Marks, Amir Abdullah, Clement Neo, Rauno Arike, Philip Torr, and Fazl Barez. Beyond training objectives: Interpreting reward model divergence in large language models. arXiv, 2023.
  • Henderson et al. 2018 Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. In AAAI, pages 3207–3214, 2018.
  • Rafailov et al. 2023 Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. arXiv, 2023.
  • Ye et al. 2023 Qichen Ye, Junling Liu, Dading Chong, Peilin Zhou, Yining Hua, Fenglin Liu, Meng Cao, Ziming Wang, Xuxin Cheng, Zhu Lei, and Zhenhua Guo. Qilin-med: Multi-stage knowledge injection advanced medical large language model. arXiv, 2023.
  • Guo et al. 2017 Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 1321–1330. PMLR, 2017.
  • Tajwar et al. 2024 Fahim Tajwar, Anikait Singh, Archit Sharma, Rafael Rafailov, Jeff Schneider, Tengyang Xie, Stefano Ermon, Chelsea Finn, and Aviral Kumar. Preference fine-tuning of LLMs should leverage suboptimal, on-policy data. arXiv, 2024.
  • Bai et al. 2022 Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan. Constitutional AI: Harmlessness from AI feedback. arXiv, 2022.
  • Hu et al. 2022 Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.
  • Liu et al. 2024a Qidong Liu, Xian Wu, Xiangyu Zhao, Yuanshao Zhu, Derong Xu, Feng Tian, and Yefeng Zheng. When MOE meets LLMs: Parameter efficient fine-tuning for multi-task medical applications, 2024a.
  • Chen et al. 2024e Xiaolan Chen, Ziwei Zhao, Weiyi Zhang, Pusheng Xu, Le Gao, Mingpu Xu, Yue Wu, Yinwen Li, Danli Shi, and Mingguang He. EyeGPT: Ophthalmic assistant with large language models. arXiv, 2024e.
  • Ren et al. 2024a Xiaoyu Ren, Yuanchen Bai, Huiyu Duan, Lei Fan, Erkang Fei, Geer Wu, Pradeep Ray, Menghan Hu, Chenyuan Yan, and Guangtao Zhai. ChatASD: LLM-based AI therapist for ASD. In Communications in Computer and Information Science, Communications in computer and information science, pages 312–324. Springer Nature Singapore, Singapore, 2024a.
  • Yang et al. 2022 Xi Yang, Nima Pour Nejatian, Hoo Chang Shin, Kaleb Smith, Christopher Parisien, Colin Compas, Cheryl Martin, Mona Flores, Ying Zhang, Tanja Magoc, Christopher Harle, Gloria Lipori, Duane Mitchell, William Hogan, Elizabeth Shenkman, Jiang Bian, and Yonghui Wu. GatorTron: A large clinical language model to unlock patient information from unstructured electronic health records. bioRxiv, 2022.
  • Acharya et al. 2024 Angeela Acharya, Sulabh Shrestha, Anyi Chen, Joseph Conte, Sanja Avramovic, Siddhartha Sikdar, Antonios Anastasopoulos, and Sanmay Das. Clinical risk prediction using language models: benefits and considerations. J. Am. Med. Inform. Assoc., 2024.
  • Wang et al. 2024d Jianfeng Wang, Kah Phooi Seng, Yi Shen, Li-Minn Ang, and Difeng Huang. Image to label to answer: An efficient framework for enhanced clinical applications in medical visual question answering. Electronics (Basel), 13(12):2273, 2024d.
  • Xie et al. 2023 Weidi Xie, Chaoyi Wu, Xiaoman Zhang, Ya Zhang, and Yanfeng Wang. Towards generalist foundation model for radiology. Research Square, 2023.
  • Phan et al. 2024 Vu Minh Hieu Phan, Yutong Xie, Yuankai Qi, Lingqiao Liu, Liyang Liu, Bowen Zhang, Zhibin Liao, Qi Wu, Minh-Son To, and Johan W Verjans. Decomposing disease descriptions for enhanced pathology detection: A multi-aspect vision-language pre-training framework. arXiv, 2024.
  • Zhang et al. 2023c Xiaoman Zhang, Chaoyi Wu, Ya Zhang, Weidi Xie, and Yanfeng Wang. Knowledge-enhanced visual-language pre-training on chest radiology images. Nature Communications, 14(1):4542, 2023c.
  • Du et al. 2024c Jiawei Du, Jia Guo, Weihang Zhang, Shengzhu Yang, Hanruo Liu, Huiqi Li, and Ningli Wang. Ret-clip: A retinal image foundation model pre-trained with clinical diagnostic reports. arXiv preprint arXiv:2405.14137, 2024c.
  • Tu et al. 2024 Tao Tu, Anil Palepu, Mike Schaekermann, Khaled Saab, Jan Freyberg, Ryutaro Tanno, Amy Wang, Brenna Li, Mohamed Amin, Nenad Tomasev, et al. Towards conversational diagnostic ai. NEJM AI, 2024.
  • Chen et al. 2021 Pei-Fu Chen, Ssu-Ming Wang, Wei-Chih Liao, Lu-Cheng Kuo, Kuan-Chih Chen, Yu-Cheng Lin, Chi-Yu Yang, Chi-Hao Chiu, Shu-Chih Chang, Feipei Lai, et al. Automatic icd-10 coding and training system: deep neural network based on supervised learning. JMIR Medical Informatics, 9(8):e23230, 2021.
  • Zhang et al. 2024d Tiantian Zhang, Manxi Lin, Hongda Guo, Xiaofan Zhang, Ka Fung Peter Chiu, Aasa Feragen, and Qi Dou. Incorporating clinical guidelines through adapting multi-modal large language model for prostate cancer pi-rads scoring. arXiv preprint arXiv:2405.08786, 2024d.
  • Pedro et al. 2024 Tiago Pedro, José Maria Sousa, Luísa Fonseca, Manuel G Gama, Goreti Moreira, Mariana Pintalhão, Paulo C Chaves, Ana Aires, Gonçalo Alves, Luís Augusto, et al. Exploring the use of chatgpt in predicting anterior circulation stroke functional outcomes after mechanical thrombectomy: a pilot study. Journal of NeuroInterventional Surgery, 2024.
  • Zhou et al. 2024d Zijian Zhou, Miaojing Shi, Meng Wei, Oluwatosin Alabi, Zijie Yue, and Tom Vercauteren. Large model driven radiology report generation with clinical quality reinforcement learning. arXiv preprint arXiv:2403.06728, 2024d.
  • Yang et al. 2024d Lin Yang, Shawn Xu, Andrew Sellergren, Timo Kohlberger, Yuchen Zhou, Ira Ktena, Atilla Kiraly, Faruk Ahmed, Farhad Hormozdiari, Tiam Jaroensri, et al. Advancing multimodal medical capabilities of gemini. arXiv preprint arXiv:2405.03162, 2024d.
  • Wen et al. 2023b Yilin Wen, Zifeng Wang, and Jimeng Sun. Mindmap: Knowledge graph prompting sparks graph of thoughts in large language models. arXiv preprint arXiv:2308.09729, 2023b.
  • Weng et al. 2023 Yixuan Weng, Bin Li, Fei Xia, Minjun Zhu, Bin Sun, Shizhu He, Kang Liu, and Jun Zhao. Large language models need holistically thought in medical conversational qa. arXiv preprint arXiv:2305.05410, 2023.
  • Kottlors et al. 2023 Jonathan Kottlors, Grischa Bratke, Philip Rauen, Christoph Kabbasch, Thorsten Persigehl, Marc Schlamann, and Simon Lennartz. Feasibility of differential diagnosis based on imaging patterns using a large language model. Radiology, 308(1):e231167, 2023.
  • Yang et al. 2024e Kailai Yang, Tianlin Zhang, Ziyan Kuang, Qianqian Xie, Jimin Huang, and Sophia Ananiadou. Mentallama: interpretable mental health analysis on social media with large language models. In Proceedings of the ACM on Web Conference 2024, pages 4489–4500, 2024e.
  • Chen et al. 2024f Xiaolan Chen, Weiyi Zhang, Ziwei Zhao, Pusheng Xu, Yingfeng Zheng, Danli Shi, and Mingguang He. Icga-gpt: report generation and question answering for indocyanine green angiography images. British Journal of Ophthalmology, 2024f.
  • Wu et al. 2024e Jiageng Wu, Xian Wu, Yefeng Zheng, and Jie Yang. Medkp: Medical dialogue with knowledge enhancement and clinical pathway encoding. arXiv preprint arXiv:2403.06611, 2024e.
  • Guo et al. 2024 Siyin Guo, Ruicen Li, Genpeng Li, Wenjie Chen, Jing Huang, Linye He, Yu Ma, Liying Wang, Hongping Zheng, Chunxiang Tian, et al. Comparing chatgpt’s and surgeon’s responses to thyroid-related questions from patients. The Journal of Clinical Endocrinology & Metabolism, page dgae235, 2024.
  • He et al. 2024c Yuhong He, Yongqi Zhang, Shizhu He, and Jun Wan. Bp4er: Bootstrap prompting for explicit reasoning in medical dialogue generation. arXiv preprint arXiv:2403.19414, 2024c.
  • Savage et al. 2024 Thomas Savage, Ashwin Nayak, Robert Gallo, Ekanath Rangan, and Jonathan H Chen. Diagnostic reasoning prompts reveal the potential for large language model interpretability in medicine. NPJ Digital Medicine, 7(1):20, 2024.
  • Zhou et al. 2024e Shuang Zhou, Sirui Ding, Jiashuo Wang, Mingquan Lin, Genevieve B Melton, and Rui Zhang. Interpretable differential diagnosis with dual-inference large language models. arXiv, 2024e.
  • Kang et al. 2024 Seil Kang, Donghyun Kim, Junhyeok Kim, Hyo Kyung Lee, and Seong Jae Hwang. Wolf: Wide-scope large language model framework for cxr understanding. CoRR, 2024.
  • Xu et al. 2024c Kaishuai Xu, Yi Cheng, Wenjun Hou, Qiaoyu Tan, and Wenjie Li. Reasoning like a doctor: Improving medical dialogue systems via diagnostic reasoning process alignment. In Findings of the Association for Computational Linguistics ACL 2024, pages 6796–6814. Association for Computational Linguistics, 2024c.
  • Wang et al. 2023c Guangyu Wang, Guoxing Yang, Zongxin Du, Longjun Fan, and Xiaohu Li. Clinicalgpt: large language models finetuned with diverse medical data and comprehensive evaluation. arXiv preprint arXiv:2306.09968, 2023c.
  • Li et al. 2024c Shuyue Stella Li, Vidhisha Balachandran, Shangbin Feng, Jonathan Ilgen, Emma Pierson, Pang Wei Koh, and Yulia Tsvetkov. Mediq: Question-asking llms for adaptive and reliable medical reasoning. arXiv preprint arXiv:2406.00922, 2024c.
  • Sandmann et al. 2024 Sarah Sandmann, Sarah Riepenhausen, Lucas Plagwitz, and Julian Varghese. Systematic analysis of chatgpt, google search and llama 2 for clinical decision support tasks. Nature Communications, 15(1):2050, 2024.
  • Hu et al. 2024b Jiaxiong Hu, Junze Li, Yuhang Zeng, Dongjie Yang, Danxuan Liang, Helen Meng, and Xiaojuan Ma. Designing scaffolding strategies for conversational agents in dialog task of neurocognitive disorders screening. In Proceedings of the CHI Conference on Human Factors in Computing Systems, pages 1–21, 2024b.
  • Kim et al. 2024c Yubin Kim, Xuhai Xu, Daniel McDuff, Cynthia Breazeal, and Hae Won Park. Health-llm: Large language models for health prediction via wearable sensor data. Conference on Health, Inference, and Learning, 2024c.
  • Deng et al. 2024 Shijian Deng, Erin E Kosloski, Siddhi Patel, Zeke A Barnett, Yiyang Nan, Alexander Kaplan, Sisira Aarukapalli, William T Doan, Matthew Wang, Harsh Singh, et al. Hear me, see me, understand me: Audio-visual autism behavior recognition. arXiv preprint arXiv:2406.02554, 2024.
  • McInerney et al. 2024b Denis Jered McInerney, William Dickinson, Lucy Flynn, Andrea Young, Geoffrey Young, Jan-Willem van de Meent, and Byron C Wallace. Towards reducing diagnostic errors with interpretable risk prediction. arXiv preprint arXiv:2402.10109, 2024b.
  • Sun et al. 2024b Zhoujian Sun, Cheng Luo, and Zhengxing Huang. Conversational disease diagnosis via external planner-controlled large language models. arXiv preprint arXiv:2404.04292, 2024b.
  • Adler-Milstein et al. 2021 Julia Adler-Milstein, Jonathan H Chen, and Gurpreet Dhaliwal. Next-generation artificial intelligence for diagnosis: from predicting diagnostic labels to “wayfinding”. Jama, 326(24):2467–2468, 2021.
  • Shi et al. 2024 Xiaoming Shi, Zeming Liu, Li Du, Yuxuan Wang, Hongru Wang, Yuhang Guo, Tong Ruan, Jie Xu, Xiaofan Zhang, and Shaoting Zhang. Medical dialogue system: A survey of categories, methods, evaluation and challenges. In Findings of the Association for Computational Linguistics ACL 2024, 2024.
  • Yi et al. 2024 Zihao Yi, Jiarui Ouyang, Yuwen Liu, Tianhao Liao, Zhe Xu, and Ying Shen. A survey on recent advances in llm-based multi-turn dialogue systems. arXiv preprint arXiv:2402.18013, 2024.
  • Zou et al. 2024 Xuan Zou, Weijie He, Yu Huang, Yi Ouyang, Zhen Zhang, Yu Wu, Yongsheng Wu, Lili Feng, Sheng Wu, Mengqi Yang, et al. Ai-driven diagnostic assistance in medical inquiry: Reinforcement learning algorithm development and validation. Journal of Medical Internet Research, 26:e54616, 2024.
  • Giuffrè et al. 2024 Mauro Giuffrè, Simone Kresevic, Kisung You, Johannes Dupont, Jack Huebner, Alyssa Ann Grimshaw, and Dennis Legen Shung. Systematic review: The use of large language models as medical chatbots in digestive diseases. Alimentary Pharmacology & Therapeutics, 2024.
  • He et al. 2024d Zhe He, Balu Bhasuran, Qiao Jin, Shubo Tian, Karim Hanna, Cindy Shavor, Lisbeth Garcia Arguello, Patrick Murray, and Zhiyong Lu. Quality of answers of generative large language models vs peer patients for interpreting lab test results for lay patients: Evaluation study. ArXiv, 2024d.
  • Ren et al. 2024b Zhiyao Ren, Yibing Zhan, Baosheng Yu, Liang Ding, and Dacheng Tao. Healthcare copilot: Eliciting the power of general llms for medical consultation. arXiv preprint arXiv:2402.13408, 2024b.
  • Rajashekar et al. 2024b Niroop Channa Rajashekar, Yeo Eun Shin, Yuan Pu, Sunny Chung, Kisung You, Mauro Giuffre, Colleen E Chan, Theo Saarinen, Allen Hsiao, Jasjeet Sekhon, et al. Human-algorithmic interaction using a large language model-augmented artificial intelligence clinical decision support system. In Proceedings of the CHI Conference on Human Factors in Computing Systems, pages 1–20, 2024b.
  • Zöller et al. 2024 Nikolas Zöller, Julian Berger, Irving Lin, Nathan Fu, Jayanth Komarneni, Gioele Barabucci, Kyle Laskowski, Victor Shia, Benjamin Harack, Eugene A Chu, et al. Human-ai collectives produce the most accurate differential diagnoses. arXiv preprint arXiv:2406.14981, 2024.
  • Wang et al. 2023d Sheng Wang, Zihao Zhao, Xi Ouyang, Qian Wang, and Dinggang Shen. Chatcad: Interactive computer-aided diagnosis on medical image using large language models. arXiv preprint arXiv:2302.07257, 2023d.
  • Haltaufderheide and Ranisch 2024 Joschka Haltaufderheide and Robert Ranisch. The ethics of chatgpt in medicine and healthcare: a systematic review on large language models (llms). NPJ Digital Medicine, 7(1):183, 2024.
  • Hager et al. 2024 Paul Hager, Friederike Jungmann, Robbie Holland, Kunal Bhagat, Inga Hubrecht, Manuel Knauer, Jakob Vielhauer, Marcus Makowski, Rickmer Braren, Georgios Kaissis, et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nature medicine, pages 1–10, 2024.
  • Yue and Zhou 2020 Xiang Yue and Shuang Zhou. Phicon: Improving generalization of clinical text de-identification models via data augmentation. In Clinical Natural Language Processing Workshop, 2020.
  • Zhou et al. 2024f Juexiao Zhou, Xiaonan He, Liyuan Sun, Jiannan Xu, Xiuying Chen, Yuetan Chu, Longxi Zhou, Xingyu Liao, Bin Zhang, Shawn Afvari, et al. Pre-trained multimodal large language model enhances dermatological diagnosis using skingpt-4. Nature Communications, 15(1):5649, 2024f.
  • Poulain et al. 2024 Raphael Poulain, Hamed Fayyaz, and Rahmatollah Beheshti. Bias patterns in the application of llms for clinical decision support: A comprehensive study. arXiv preprint arXiv:2404.15149, 2024.
  • Spitale et al. 2024 Micol Spitale, Jiaee Cheong, and Hatice Gunes. Underneath the numbers: Quantitative and qualitative gender fairness in llms for depression prediction. arXiv, 2024.
  • Zhou et al. 2024g Shuang Zhou, Daochen Zha, Xiao Shen, Xiao Huang, Rui Zhang, and Korris Chung. Denoising-aware contrastive learning for noisy time series. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 2024g.
  • Mohsen et al. 2023 Farida Mohsen, Hamada RH Al-Absi, Noha A Yousri, Nady El Hajj, and Zubair Shah. A scoping review of artificial intelligence-based methods for diabetes risk prediction. NPJ Digital Medicine, 6(1):197, 2023.
  • Wang et al. 2024e Hanyin Wang, Chufan Gao, Christopher Dantona, Bryan Hull, and Jimeng Sun. Drg-llama: tuning llama model to predict diagnosis-related group for hospitalized patients. npj Digital Medicine, 7(1):16, 2024e.
  • Chen et al. 2023b Zeming Chen, Alejandro Hernández Cano, Angelika Romanou, Antoine Bonnet, Kyle Matoba, Francesco Salvi, Matteo Pagliardini, Simin Fan, Andreas Köpf, Amirkeivan Mohtashami, et al. Meditron-70b: Scaling medical pretraining for large language models. arXiv preprint arXiv:2311.16079, 2023b.
  • Xie et al. 2024 Qianqian Xie, Qingyu Chen, Aokun Chen, Cheng Peng, Yan Hu, Fongci Lin, Xueqing Peng, Jimin Huang, Jeffrey Zhang, Vipina Keloth, et al. Me llama: Foundation large language models for medical applications. arXiv preprint arXiv:2402.12749, 2024.
  • Ke et al. 2024 Yu He Ke, Rui Yang, Sui An Lie, Taylor Xin Yi Lim, Hairil Rizal Abdullah, Daniel Shu Wei Ting, and Nan Liu. Enhancing diagnostic accuracy through multi-agent conversations: Using large language models to mitigate cognitive bias. arXiv preprint arXiv:2401.14589, 2024.
  • Kim et al. 2024d Yubin Kim, Chanwoo Park, Hyewon Jeong, Yik Siu Chan, Xuhai Xu, Daniel McDuff, Cynthia Breazeal, and Hae Won Park. Adaptive collaboration strategy for llms in medical decision making. arXiv preprint arXiv:2404.15155, 2024d.
  • Dou et al. 2024b Chengfeng Dou, Ying Zhang, Yanyuan Chen, Zhi Jin, Wenpin Jiao, Haiyan Zhao, and Yu Huang. Detection, diagnosis, and explanation: A benchmark for chinese medial hallucination evaluation. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 4784–4794, 2024b.
  • Xu et al. 2024d Derong Xu, Ziheng Zhang, Zhihong Zhu, Zhenxi Lin, Qidong Liu, Xian Wu, Tong Xu, Xiangyu Zhao, Yefeng Zheng, and Enhong Chen. Editing factual knowledge and explanatory ability of medical large language models. arXiv preprint arXiv:2402.18099, 2024d.
  • Huang et al. 2023 Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232, 2023.
  • Peng et al. 2023b Cheng Peng, Xi Yang, Aokun Chen, Kaleb E Smith, Nima PourNejatian, Anthony B Costa, Cheryl Martin, Mona G Flores, Ying Zhang, Tanja Magoc, et al. A study of generative large language model for medical research and healthcare. NPJ digital medicine, 6(1):210, 2023b.
  • Barile et al. 2024 Joseph Barile, Alex Margolis, Grace Cason, Rachel Kim, Saia Kalash, Alexis Tchaconas, and Ruth Milanaik. Diagnostic accuracy of a large language model in pediatric case studies. JAMA pediatrics, 178(3):313–315, 2024.
  • Chen et al. 2024g Xiaolan Chen, Ziwei Zhao, Weiyi Zhang, Pusheng Xu, Le Gao, Mingpu Xu, Yue Wu, Yinwen Li, Danli Shi, and Mingguang He. Eyegpt: Ophthalmic assistant with large language models. arXiv preprint arXiv:2403.00840, 2024g.
  • Liu et al. 2024b Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12, 2024b.
  • Huang et al. 2020 Lin Huang, Lin Wang, Xiaomeng Hu, Sen Chen, Yunwen Tao, Haiyang Su, Jing Yang, Wei Xu, Vadanasundari Vedarethinam, Shu Wu, et al. Machine learning of serum metabolic patterns encodes early-stage lung adenocarcinoma. Nature Communications, 11(1):3556, 2020.

Acknowledgments

This work was supported by the National Institutes of Health’s National Center for Complementary and Integrative Health under grant number R01AT009457, National Institute on Aging under grant number R01AG078154, and National Cancer Institute under grant number R01CA287413. The content is solely the responsibility of the authors and does not represent the official views of the National Institutes of Health. We also acknowledge the support from the Center for Learning Health System Sciences.

Author contributions

S.Z. conceptualized the study and led the work. Z.Z., S.Z., J.Y., and M.Z. searched papers. S.Z., Z.X., M.Z., C.X., Y.G., Z.Z., S.D., J.W., K.X., Y.F., L.X., and J.Y. conducted paper screening and data extraction. S.Z., Z.X., M.Z., and C.X. performed data synthesis and contributed to the writing. D.Z., G.M., and R.Z. revised the manuscript. R.Z. supervised the study. All authors read and approved the final version.

Competing interests

The authors declare no competing interests.