LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report

Justin Zhao, Timothy Wang
Wael Abid, Geoffrey Angus, Arnav Garg, Jeffery Kinnison, Alex Sherstinsky,
Piero Molino, Travis Addair, Devvret Rishi
Predibase

Abstract

Low Rank Adaptation (LoRA) has emerged as one of the most widely adopted methods for Parameter Efficient Fine-Tuning (PEFT) of Large Language Models (LLMs). LoRA reduces the number of trainable parameters and memory usage while achieving comparable performance to full fine-tuning. We aim to assess the viability of training and serving LLMs fine-tuned with LoRA in real-world applications. First, we measure the quality of LLMs fine-tuned with quantized low rank adapters across 10 base models and 31 tasks for a total of 310 models. We find that 4-bit LoRA fine-tuned models outperform base models by 34 points and GPT-4 by 10 points on average. Second, we investigate the most effective base models for fine-tuning and assess the correlative and predictive capacities of task complexity heuristics in forecasting the outcomes of fine-tuning. Finally, we evaluate the latency and concurrency capabilities of LoRAX, an open-source Multi-LoRA inference server that facilitates the deployment of multiple LoRA fine-tuned models on a single GPU using shared base model weights and dynamic adapter loading. LoRAX powers LoRA Land, a web application that hosts 25 LoRA fine-tuned Mistral-7B LLMs on a single NVIDIA A100 GPU with 80GB memory. LoRA Land highlights the quality and cost-effectiveness of employing multiple specialized LLMs over a single, general-purpose LLM.

Figure 1: Average model performance for GPT-3.5, GPT-4, and 310 LLMs, before and after fine-tuning with LoRA, across 31 different tasks and 10 different base models. Zephyr-7b and Mistral-7b models exhibit the best performance after LoRA-based fine-tuning.

1 Introduction

Fine-tuning Large Language Models (LLMs) [23, 3] is a highly effective way to improve their performance, and add desirable or remove undesirable behaviors [28, 12, 13, 29]. Low Rank Adaptation (LoRA) [14] is one of the most widely adopted methods for fine-tuning LLMs, showing significant promise for enabling smaller, specialized models to outperform larger, more general models on specific tasks, with a fraction of trainable parameters, challenging the notion that bigger general models always outperform smaller ones.

Despite the rapid advancement and release of new base models, such as Gemma [36], Llama [37], and Mistral [15], which claim ease of fine-tuning across various tasks, comprehensive evaluations of these models remain scarce. Broad knowledge and reasoning-based benchmarks like MMLU [11] and HellaSwag [44] are commonly used in leaderboards like the Open LLM Leaderboard [2]; however, these benchmarks are not necessarily representative of task-specific performance, before or after fine-tuning. Technical reports [36, 37, 15, 26, 35] often leave training configurations unspecified, with claims of ease of fine-tuning left unmeasured. While the effectiveness of fine-tuning has been broadly demonstrated [17, 45], the lack of large-scale experimentation leaves several pivotal questions unanswered, particularly regarding the consistency and predictability of performance improvements through fine-tuning, and the impact of model size, base model, and task complexity.

Evaluations are sensitive to prompting, and there is significant variation in the formulations used in publications and libraries (see, e.g., https://github.com/openai/simple-evals). Technical reports often showcase model performance using specialized, dataset-specific prompting strategies such as role-playing prompts (e.g. "Assume you are an expert"), maj@k voting [40], varied n-shot [34], MedPrompt [25], and chain-of-thought [43] prompting. While these methods are intended to highlight the optimal capabilities of models, the use of such diverse prompting techniques can make direct comparisons across models and tasks challenging.

In this work, we seek to bridge these gaps by conducting an extensive analysis of LoRA-based fine-tuning across 10 base models and 31 tasks, totaling 310 LLMs fine-tuned with LoRA. We deliberately fine-tune all LLMs with the same training parameters and query them with zero- or single-shot, completion-style prompts that use simple instructions like "Solve the following multiple choice problem". Altogether, this provides a standardized framework to compare and assess the intrinsic capabilities of different base models when fine-tuned with LoRA under consistent conditions, across specific tasks.

We also aim to explore the viability of serving multiple LoRA models in a real-world production application. LoRAX [1] enables serving multiple LoRA models simultaneously on a single GPU by leveraging shared base model weights and dynamic adapter loading [12]. We measure latency and concurrency metrics of this library. We use LoRAX to deploy 25 fine-tuned LLMs on a single A100 (https://www.nvidia.com/en-us/data-center/a100/) in the LoRA Land web application. Our successful implementation showcases the economic efficiency of serving multiple LoRA-adapted LLMs for specialized tasks.

Finally, we release all 25 of the fine-tuned models on the LoRA Land web application and their training recipes on Hugging Face to allow further analysis and replication by the community.

2 Related work

Parameter-Efficient Fine-Tuning (PEFT) methods are designed to reduce the high expense of fine-tuning large-scale models. They achieve this by training a relatively small subset of parameters, compared to the total number of parameters, for adapting to downstream tasks. Existing PEFT strategies can be divided into two categories. Prompt-based methods add extra soft tokens (prompts) to the initial input and focus solely on fine-tuning these trainable vectors [19, 31, 42]. Adapter-based methods introduce additional trainable modules into the original frozen backbone [12, 32, 30, 33]. LoRA [14] expands upon adapter-based fine-tuning by adding a small number of trainable low-rank matrices alongside layers of frozen weights, which introduces a negligible inference overhead. Variants of LoRA include works like [22], which employs singular value decomposition (SVD) to prune less significant singular values for more efficient updates. Another variation, DoRA [21], decomposes pre-trained weights into magnitude and direction components while applying LoRA to the latter. QLoRA [8] optimizes LoRA's design one step further, using 4-bit NF4 weights, double quantization to reduce the memory footprint, and paged optimizers to alleviate memory spikes. In our experiments, we focus on the original implementation of LoRA with 4-bit quantization.
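
To make the low-rank update concrete, the following is a minimal NumPy sketch of a single LoRA-adapted linear layer, assuming the standard formulation in which the frozen weight W is augmented by a scaled product of two small matrices B and A. The layer sizes and the scaling factor alpha are illustrative assumptions and are not values taken from this report; only the rank of 8 mirrors the training setup described later.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions); rank 8 mirrors the rank used in this study.
d_in, d_out, rank, alpha = 64, 64, 8, 16

# Frozen pre-trained weight: never updated during fine-tuning.
W = rng.normal(size=(d_out, d_in))

# Trainable low-rank factors. A starts random and B starts at zero,
# so the adapted layer initially behaves exactly like the frozen one.
A = rng.normal(scale=0.01, size=(rank, d_in))
B = np.zeros((d_out, rank))


def lora_forward(x, W, A, B, alpha, rank):
    """Compute x W^T + (alpha / rank) * (x A^T) B^T for a batch of row vectors."""
    return x @ W.T + (alpha / rank) * (x @ A.T) @ B.T


x = rng.normal(size=(4, d_in))
y = lora_forward(x, W, A, B, alpha, rank)

full_params = W.size            # parameters updated by full fine-tuning
lora_params = A.size + B.size   # parameters updated by LoRA
print(f"trainable fraction for this layer: {lora_params / full_params:.3f}")
```

For realistic layer widths (thousands of units) the trainable fraction is far smaller than in this toy configuration, since the adapter grows linearly with width while the full weight grows quadratically.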

Efficient serving of LoRA models. The main challenges for serving multiple fine-tuned models efficiently are:

1. Scalability: As the demand for model inference grows, the system must scale efficiently to handle the increased load. This involves not just scaling up the computational resources but also managing the load distribution among models to maintain performance.
2. Cost: The computational resources required to serve multiple fine-tuned models can lead to significant costs. Efficiently managing these costs while maintaining high performance and availability is a major challenge.

Techniques like Segmented Gather Matrix-Vector Multiplication (SGMV) [4] aim to address these challenges by optimizing the way computations are performed and resources are used. Open source tools like DeepSpeed (https://github.com/microsoft/DeepSpeed), FasterTransformer (https://github.com/NVIDIA/FasterTransformer), and vLLM [18] also aim to enable cost-effective and scalable serving of fine-tuned models. In this paper, we use LoRAX (https://github.com/predibase/lorax), which is specifically designed for the efficient serving of LLMs fine-tuned with LoRA. LoRAX supports dynamic adapter loading so adapters can be downloaded asynchronously during inference, multiple model families like Llama [37] and Mistral [15], and models quantized with bitsandbytes (https://github.com/TimDettmers/bitsandbytes).
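
The idea of sharing base weights across adapters can be illustrated with a small NumPy sketch. This is not LoRAX code and not the SGMV kernel; it is a simplified illustration, under the assumption that each request in a batch names one adapter, of how the dense base computation is done once while the low-rank corrections are applied per group of requests. The adapter names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, rank = 32, 4

# Base weights, loaded once and shared by every request.
W = rng.normal(size=(d, d))

# Hypothetical adapter registry: adapter name -> (A, B) low-rank factors.
adapters = {
    name: (rng.normal(size=(rank, d)), rng.normal(size=(d, rank)))
    for name in ("sql", "ner", "summarize")
}


def batched_forward(x, adapter_ids):
    """Apply the shared base layer once, then add each request's adapter delta."""
    out = x @ W.T  # one dense matmul for the whole mixed batch
    for name in sorted(set(adapter_ids)):
        A, B = adapters[name]
        # Rows of the batch that requested this adapter (one "segment").
        rows = [i for i, a in enumerate(adapter_ids) if a == name]
        out[rows] += (x[rows] @ A.T) @ B.T
    return out


x = rng.normal(size=(5, d))
ids = ["sql", "ner", "sql", "summarize", "ner"]
y = batched_forward(x, ids)
```

Because each adapter only stores two thin matrices, many adapters fit in GPU memory next to a single copy of the base model, which is what makes serving many fine-tuned models from one GPU plausible.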

3 Methodology

3.1 Task selection

In selecting datasets and tasks for our study, we prioritize those that are widely accessible via Kaggle (https://www.kaggle.com) and HuggingFace (https://huggingface.co) and those that are commonly used for benchmarking, such as those on the Open LLM Leaderboard [2].

Our selection includes datasets like MMLU [11] for broad domain knowledge, Jigsaw [6] for content moderation, WikiSQL [46] for SQL generation, and GLUE benchmarks [39]. We categorize the tasks encompassed by these datasets into 5 types:

• Classic NLP: Tasks derived from common NLP datasets published between 2018 and 2022 covering tasks like named entity recognition, data-to-text, and headline generation.
• Coding: SQL query generation, and Python programming questions, which are mostly centered on algorithms and object-oriented design.
• Knowledge: Knowledge-based multiple choice questions.
• Reasoning: Reasoning-based multiple choice questions.
• Math: Numerical, math-based word problems.

Category	Task Name	Task Description	Dataset Link	Metric	Range # Tokens	P95 # Tokens	# Train Examples	# Validation Examples	# Test Examples	Split Used for Evaluation
Classic NLP	bc5cdr	Chemical and disease recognition	hf://tner/bc5cdr	rouge	143 - 570	226	5228	5330	5865	validation
	conllpp	Named entity recognition	hf://conllpp	rouge	110 - 401	170	14041	3250	3453	test
	e2e_nlg	Translation from meaning representation to natural language	hf://e2e_nlg	rouge	92 - 213	153	42061	4672	4693	test
	tldr_content_gen	Content generation given a headline	hf://JulesBelveze/tldr_news	rouge	46 - 425	204	7138	–	794	test
	tldr_headline_gen	Headline generation given news content	hf://JulesBelveze/tldr_news	rouge	41 - 420	199	7138	–	794	test
	viggo	Translation of video game meaning representations to natural language	hf://GEM/viggo	rouge	151 - 304	240	5103	714	1083	test
	webnlg	Translation of triples to natural language	hf://web_nlg (release_v3.0_en)	rouge	88 - 345	215	13211	1667	5713	test
Coding	magicoder	Coding tasks in multiple languages	hf://ise-uiuc/Magicoder-OSS-Instruct-75K	humaneval	141 - 1661	805	75197	–	–	(human_eval)
Coding	wikisql	SQL generation given a table and question	hf://wikisql	rouge	198 - 72472	1941	56355	8421	15878	test
Knowledge	boolq	Knowledge-based yes/no questions.	hf://google/boolq	accuracy	30 - 898	271	9427	3270	–	validation
	dbpedia	Topic extraction from a news article and title	hf://fancyzhx/dbpedia_14	accuracy	102 - 387	211	560000	–	70000	test
	customer_support	Customer support call classification given call transcript	github://cricketclub/gridspace-stanford-harper-valley	accuracy	151 - 679	377	1245	245	391	test
	glue_qnli	Does the response answer the question?	hf://glue/viewer/qnli	accuracy	52 - 350	123	104743	5463	5463	validation
	glue_stsb	How similar are the sentences?	hf://glue/viewer/stsb	mae	74 - 187	124	5749	1500	1379	validation
	legal	Legal document classification	kaggle://bahushruth/legalclausedataset	rouge	143 - 885	489	17000	2000	1000	test
	reuters	Topic extraction from Reuters news articles	hf://reuters21578/viewer/ModLewis (modlewis)	rouge	51 - 2056	637	13625	–	6188	test
	mmlu	General domain multiple-choice questions	hf://cais/mmlu/viewer/all	accuracy	47 - 1491	578	99842	1531	14042	validation
Reasoning	winogrande	Common sense 2-option task	hf://winogrande	accuracy	48 - 75	63	9248	1767	1267	test
	arc_combined	Multiple-choice science questions	hf://allenai/ai2_arc	accuracy	68 - 232	143	3370	869	3548	test
	glue_cola	Grammar and syntax acceptability	hf://glue/viewer/cola	accuracy	45 - 87	59	8551	1043	1063	validation
	glue_mnli	Does the hypothesis entail the premise?	hf://nyu-mll/glue/viewer/mnli	accuracy	64 - 339	128	392702	19647	19643	validation
	glue_mrpc	Do the sentences have the same meaning?	hf://glue/viewer/mrpc	accuracy	67 - 157	123	3668	408	1725	validation
	glue_qqp	Do the questions have the same meaning?	hf://glue/viewer/qqp	accuracy	60 - 351	102	363846	40430	390965	validation
	glue_sst2	Binary sentiment detection	hf://glue/viewer/sst2	accuracy	33 - 91	63	67349	872	1821	validation
	glue_wnli	Pronoun resolution	hf://glue/viewer/wnli	accuracy	73 - 160	134	635	71	146	validation
	covid	Sentiment detection of COVID-19 tweets	kaggle://datatattle/covid-19-nlp-text-classification	accuracy	131 - 292	223	37361	–	3798	test
	hellaswag	Multiple-choice sentence completion	hf://Rowan/hellaswag	accuracy	120 - 407	341	39905	10003	10042	validation
	hellaswag_processed	Sentence completion	hf://Rowan/hellaswag	rouge	75 - 205	185	39905	10003	10042	validation
	jigsaw	Toxic comment classification	kaggle://c/jigsaw-unintended-bias-in-toxicity-classification	accuracy	409 - 715	601	159571	–	153164	test
	drop	Question answering given a passage	hf://drop	rouge	87 - 2275	571	77400	9535	–	validation
Math	gsm8k	Grade school math problems	hf://gsm8k (main)	accuracy	58 - 465	276	7473	–	1319	test

Table 1: Tasks and datasets used. The tldr_news and hellaswag datasets are used for multiple tasks. The length of the texts varies substantially across tasks. Many tasks and datasets exhibit a long-tail distribution, where a small number of examples have significantly longer sequences than the average. Token counts are based on the tiktoken package [27].

3.2 Prompt selection

Previous studies have demonstrated the potential of leveraging prompt engineering techniques, such as the use of majority voting [48], the inclusion of multiple in-context examples (n-shot) [34], MedPrompt [25], chain-of-thought prompting [43], etc., to enhance model performance on specific tasks.

In our evaluations, we consciously choose not to employ additional prompt engineering or tuning strategies for any specific dataset, task, or model. Although using more in-context examples or a more selective approach in n-shot prompting might yield better results, we prioritize reproducibility and the minimization of biases that could arise from customized in-context learning. Instead, we opt to use simple zero or single-shot completion-style prompts for all tasks. Our prompts are written in the completion style, described in Figure 2, to provide a fair comparison across fine-tuned, instruction-tuned, and auto-complete models. For classification tasks, the prompt includes all possible classes to inform the model’s responses. For more specialized tasks, where describing the expected output format is challenging, we use a single in-context example – the first example from the published training split – to guide the model.

Table 2: Examples of prompts that are used in this study, all written in completion style. For more specialized tasks, where describing the expected output format is challenging (e.g. bc5cdr), we use a single in-context example (the first example from the published training split) to guide the model.

Finally, we follow prescribed prompt tagging conventions for each model, as outlined in the respective model’s documentation on HuggingFace, to ensure proper querying of pre-trained and instruction-tuned base models. This includes using "<s>[INST] … [/INST]" for prompts intended for Mistral Instruct, and "<bos><start_of_turn>user … <end_of_turn><start_of_turn><model>" for Gemma’s instruction-tuned models. For detailed information on the exact prompt templates applied to each task and model, please see Appendix A.
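
As a rough illustration of model-specific prompt tagging, the sketch below wraps the same completion-style prompt for a Mistral Instruct model and leaves it untouched for a pre-trained model. The helper names are hypothetical, the exact whitespace inside the tags is an assumption, and the authoritative templates are the ones in each model's documentation and in Appendix A.

```python
def wrap_mistral_instruct(prompt: str) -> str:
    """Wrap a prompt in the Mistral Instruct tags quoted in the text.

    The single spaces around the prompt are an assumption.
    """
    return f"<s>[INST] {prompt} [/INST]"


def wrap_pretrained(prompt: str) -> str:
    """Pre-trained (auto-complete) models receive the prompt unchanged (assumption)."""
    return prompt


# Hypothetical registry keyed by model name.
PROMPT_WRAPPERS = {
    "mistral-7b-instruct": wrap_mistral_instruct,
    "mistral-7b": wrap_pretrained,
}

task_prompt = "Solve the following multiple choice problem.\n\nQuestion: ...\nAnswer:"
tagged = PROMPT_WRAPPERS["mistral-7b-instruct"](task_prompt)
```

Keeping the task prompt separate from the model-specific wrapper means the same task text is sent to every model, with only the required tags changing.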

3.3 Base models

All base models are listed in Table 3. We use GPT-4 (gpt-4-0613) and GPT-3.5-Turbo (gpt-3.5-turbo-0125) as two strong LLM baselines. Our selection of these ten base models was guided by several key considerations, including their widespread adoption within the AI community, availability with permissive licenses, and availability of technical reports. We specifically choose base models with $\leq 8$ billion parameters to ensure that each model can be efficiently trained within the resource limits of a single A10G GPU.

Model Name	Creator	# of Parameters	Date Released
Llama-2-7b	Meta	7B	July 18, 2023
Llama-2-7b-chat	Meta	7B	July 18, 2023
Mistral-7b-v0.1	Mistral AI	7.24B	September 20, 2023
Mistral-7b-Instruct-v0.1	Mistral AI	7.24B	September 27, 2023
Zephyr-7b	Hugging Face	7.24B	October 26, 2023
Phi-2b	Microsoft	2.78B	December 13, 2023
Gemma-2b	Google	2.51B	February 21, 2024
Gemma-2b-it	Google	2.51B	February 21, 2024
Gemma-7b	Google	8.54B	February 21, 2024
Gemma-7b-it	Google	8.54B	February 21, 2024

Table 3: Base models used in LoRA-based fine-tuning experiments. To train all models on A10G hardware, all chosen base models are 7B parameters or smaller.

3.4 Training parameters

Each model is trained with published train splits. (customer_support and legal are the only two tasks in our list without official splits; the exact splits for these datasets are published at github.com/predibase/lora-bakeoff.) Each model is trained for 40000 training steps with batch size 1, 4-bit quantization using bitsandbytes, and a LoRA rank of 8. We use the paged Adam optimizer [8], a learning rate of 0.002, and a cosine learning rate scheduler with a 0.03 warm-up fraction (1200 training steps). Gradients are applied over 16 accumulation steps for an effective batch size of 16.
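
The arithmetic behind these settings can be checked with a short sketch. The constants come from the paragraph above; the exact shape of the schedule (linear warm-up followed by cosine decay to zero) is an assumption, since the text only states that a cosine scheduler with a warm-up fraction is used.

```python
import math

TOTAL_STEPS = 40_000
BASE_LR = 0.002
WARMUP_FRACTION = 0.03
MICRO_BATCH_SIZE = 1
GRAD_ACCUM_STEPS = 16

warmup_steps = round(TOTAL_STEPS * WARMUP_FRACTION)
effective_batch_size = MICRO_BATCH_SIZE * GRAD_ACCUM_STEPS


def learning_rate(step: int) -> float:
    """Linear warm-up to BASE_LR, then cosine decay to zero (assumed shape)."""
    if step < warmup_steps:
        return BASE_LR * step / warmup_steps
    progress = (step - warmup_steps) / (TOTAL_STEPS - warmup_steps)
    return 0.5 * BASE_LR * (1.0 + math.cos(math.pi * progress))


print(warmup_steps, effective_batch_size)
for step in (0, 600, 1200, 20_000, 40_000):
    print(step, f"{learning_rate(step):.6f}")
```

The warm-up length of 0.03 × 40000 = 1200 steps and the effective batch size of 1 × 16 = 16 match the figures quoted in the text.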

These training parameters, combined with gradient checkpointing, allow each LLM to be fine-tuned on a single A10 GPU with 24 GB of memory. For tasks where training on the full sequence lengths would still produce a GPU Out-Of-Memory (OOM) error, we first truncate example inputs to a maximum sequence length set as the 95th percentile of all task inputs.
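
A minimal sketch of the truncation rule follows, assuming inputs are already tokenized into lists of token ids. The function name and the toy data are hypothetical, and the interpolation used for the percentile is NumPy's default rather than something specified in the text.

```python
import numpy as np


def truncate_to_p95(token_ids_per_example):
    """Truncate every input to the 95th-percentile length of the task's inputs."""
    lengths = [len(ids) for ids in token_ids_per_example]
    max_len = int(np.percentile(lengths, 95))
    return [ids[:max_len] for ids in token_ids_per_example], max_len


# Toy token-id sequences with a long tail: lengths 10..109 plus one outlier.
examples = [list(range(n)) for n in range(10, 110)] + [list(range(5000))]
truncated, max_len = truncate_to_p95(examples)
print(max_len, max(len(t) for t in truncated))
```

With a long-tailed length distribution, a single very long example no longer dictates peak memory, while roughly 95% of examples are left untouched.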

To guarantee a consistent and straightforward basis of comparison across models, no additional hyperparameter tuning is applied to any specific dataset, task, or base model.

Training recipes for each model are provided as Ludwig [24] configurations for each of the fine-tuned LLMs and can be found at https://huggingface.co/predibase. Figure 3 shows an example of a config.

3.5 Evaluation

As specified in Table 1, models are evaluated on the test split if it exists and is labeled, or on the validation split otherwise. (MMLU has a published test set with labels; however, we use the validation split to be consistent with the HELM benchmark [20].) We employ a tailored set of evaluation metrics to assess performance across all of the tasks. We use accuracy for classification tasks, (1 - mean absolute error) for regression tasks, and ROUGE-L for generation tasks. Mean absolute error (MAE) is used for regression because the target values are integer-like and span a small range. Text generation tasks are complicated to evaluate automatically [16]. ROUGE-L is a widely adopted proxy metric based on the longest common subsequence between the generated text and the reference text, which rewards shared content that appears in the same order rather than relying solely on exact contiguous matches. ROUGE-L may not fully capture aspects like fluency and coherence, and should be used in conjunction with other metrics and human evaluations to provide a fuller assessment of text generation quality. The WikiSQL dataset has its own evaluation suite; however, due to challenges integrating it, we adopt ROUGE as a proxy for assessing query quality. Although ROUGE is not tailored for SQL queries, it offers a viable alternative for gauging the alignment between generated and target queries. For coding, we use HumanEval [5]. For GSM8K [7], a regex-based heuristic [9] is used to extract the mathematical answer, to be consistent with the Open LLM Leaderboard [2]. All metrics are on a 0 to 1 scale, where 0 is the worst possible score and 1 the best possible score.
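
For readers unfamiliar with the two less common metrics, here is a self-contained sketch of an LCS-based ROUGE-L F1 score and of the (1 - MAE) regression score. It is an illustration only: whitespace tokenization and the F1 variant are assumptions, the evaluation in this study may rely on a library implementation with different preprocessing, and the regression helper assumes targets have already been scaled so that the score stays within 0 to 1.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L F1 on whitespace tokens (tokenization is an assumption)."""
    pred, ref = prediction.split(), reference.split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def one_minus_mae(predictions, targets):
    """Regression score reported as 1 - mean absolute error (pre-scaled targets)."""
    errors = [abs(p - t) for p, t in zip(predictions, targets)]
    return 1.0 - sum(errors) / len(errors)


print(rouge_l_f1("SELECT name FROM users", "SELECT name FROM users WHERE id = 1"))
```

The SQL example shows why ROUGE is only a proxy for query quality: a prediction that drops the WHERE clause still shares a long subsequence with the reference and receives a fairly high score.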

Non-fine-tuned models often generate more varied outputs, including unintended artifacts such as additional words or explanations not specified in the prompt. For classification tasks, these models will sometimes generate the actual class string like "Yes/No", "positive/negative" or "True/False" spelled out, instead of the true "1/0" label in the dataset, even when instructed. To minimize metric deductions due to response parsing strictness, we first use a regex-based extraction step to map the model's response to the ground truth vocabulary. If there are multiple matches in the generated text, the first valid match is used. The code for regex-based pre-metric response extraction is available at github.com/predibase/lora-bakeoff.
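
The extraction step can be sketched as follows. This is not the code from the repository mentioned above; the label vocabulary, the patterns, and the function name are hypothetical and only illustrate the "first valid match wins" rule for a binary task.

```python
import re

# Hypothetical mapping from spelled-out answers to a "1/0" label vocabulary.
LABEL_PATTERNS = {
    "1": r"\b(1|yes|true|positive)\b",
    "0": r"\b(0|no|false|negative)\b",
}


def extract_label(response: str):
    """Return the label whose pattern matches earliest in the response, else None."""
    best = None
    for label, pattern in LABEL_PATTERNS.items():
        match = re.search(pattern, response, flags=re.IGNORECASE)
        if match and (best is None or match.start() < best[0]):
            best = (match.start(), label)
    return best[1] if best else None


print(extract_label("Yes, the response answers the question."))
print(extract_label("The answer is No. Saying yes would be wrong."))
print(extract_label("I am not sure."))
```

Returning None when nothing matches lets the caller count the response as incorrect instead of silently guessing a label.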

Financial constraints associated with LLM APIs are not trivial. For example, using GPT-4 to assess the complete WikiSQL test set of 15,878 examples would cost approximately $400, considering the average input (805) and output (16) token counts per example. Such costs can be prohibitive, especially for organizations or researchers operating on limited budgets.
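
The estimate above can be reproduced with a back-of-the-envelope calculation. The per-token prices below are an assumption (the commonly quoted list prices for gpt-4-0613 of $0.03 per 1,000 input tokens and $0.06 per 1,000 output tokens); the example count and average token counts are the ones stated in the text.

```python
# Assumed list prices for gpt-4-0613, in USD per 1,000 tokens.
INPUT_PRICE_PER_1K = 0.03
OUTPUT_PRICE_PER_1K = 0.06

NUM_EXAMPLES = 15_878      # WikiSQL test set size
AVG_INPUT_TOKENS = 805
AVG_OUTPUT_TOKENS = 16

cost_per_example = (
    AVG_INPUT_TOKENS * INPUT_PRICE_PER_1K
    + AVG_OUTPUT_TOKENS * OUTPUT_PRICE_PER_1K
) / 1000
total_cost = NUM_EXAMPLES * cost_per_example

print(f"cost per example: ${cost_per_example:.5f}")
print(f"estimated total:  ${total_cost:,.2f}")
```

Under these assumed prices the total comes out just under $400, and nearly all of it is driven by input tokens, so long prompts dominate the evaluation budget.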

To manage costs while maintaining rigor, we restrict evaluations to the first 1000 examples for datasets with evaluation splits larger than 1000 examples. We acknowledge that this method may introduce selection bias and affect the generalizability of our findings. We recommend that future research considers more expansive evaluations as resources permit.

4 Results

Fine-tuning with LoRA provides a consistent and significant performance boost across base models and tasks, as seen in Figure 4. Before fine-tuning, GPT-4 and GPT-3.5 have the strongest performance out of the box compared to all other base models, with overall scores of 0.661 and 0.599, respectively. Performance boosts from fine-tuning range from +26.3 to +51.2 points of improvement depending on the base model, and +38.7 on average (Table 5). Depending on the task, the best fine-tuned LLM outperforms the best base model by +2.8 to +67.5 points, +25.0 points on average (Table 4).

Task	Metric	Best BM	Best FT	GPT-4	Lift over BM	Lift over GPT-4
magicoder	humaneval	0.201	0.433	0.829	0.232	-0.396
mmlu	accuracy	0.506	0.589	0.774	0.083	-0.185
glue_wnli	accuracy	0.437	0.873	0.93	0.436	-0.057
arc_combined	accuracy	0.673	0.915	0.947	0.242	-0.032
wikisql	rouge	0.301	0.898	0.909	0.597	-0.011
boolq	accuracy	0.764	0.909	0.911	0.145	-0.002
customer_support	accuracy	0.850	1.000	1.000	0.150	0.000
glue_cola	accuracy	0.797	0.872	0.864	0.075	0.008
winogrande	accuracy	0.576	0.84	0.832	0.264	0.008
glue_sst2	accuracy	0.933	0.961	0.942	0.028	0.019
dbpedia	accuracy	0.868	0.988	0.965	0.120	0.023
hellaswag	accuracy	0.393	0.834	0.805	0.441	0.029
glue_qnli	accuracy	0.743	0.931	0.902	0.188	0.029
e2e_nlg	rouge	0.482	0.552	0.513	0.070	0.039
glue_qqp	accuracy	0.708	0.883	0.841	0.175	0.042
bc5cdr	rouge	0.703	0.972	0.89	0.269	0.082
glue_mnli	accuracy	0.455	0.899	0.803	0.444	0.096
webnlg	rouge	0.563	0.681	0.583	0.118	0.098
tldr_content_gen	rouge	0.183	0.23	0.125	0.047	0.105
glue_mrpc	accuracy	0.694	0.887	0.777	0.193	0.11
jigsaw	accuracy	0.704	0.867	0.754	0.163	0.113
hellaswag_processed	rouge	0.146	0.261	0.134	0.115	0.127
viggo	rouge	0.374	0.505	0.374	0.131	0.131
glue_stsb	mae	0.814	0.913	0.773	0.099	0.14
gsm8k	accuracy	0.364	0.569	0.373	0.205	0.196
conllpp	rouge	0.733	0.989	0.742	0.256	0.247
tldr_headline_gen	rouge	0.174	0.441	0.175	0.267	0.266
drop	rouge	0.066	0.741	0.393	0.675	0.348
legal	rouge	0.158	0.683	0.305	0.525	0.378
reuters	rouge	0.010	0.479	0.014	0.469	0.465
covid	accuracy	0.322	0.843	0.309	0.521	0.534
Average		0.506	0.756	0.661	0.250	0.095

Table 4: Best model performance for each task, before and after fine-tuning, compared to GPT-4.
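
The two lift columns in Table 4 are simple differences, which the following sketch recomputes for three rows copied from the table. The helper name is hypothetical.

```python
# (task, best base model, best fine-tuned, GPT-4), copied from Table 4.
rows = [
    ("magicoder", 0.201, 0.433, 0.829),
    ("wikisql", 0.301, 0.898, 0.909),
    ("covid", 0.322, 0.843, 0.309),
]


def lifts(best_bm, best_ft, gpt4):
    """Return (lift over best base model, lift over GPT-4), rounded to 3 decimals."""
    return round(best_ft - best_bm, 3), round(best_ft - gpt4, 3)


for task, best_bm, best_ft, gpt4 in rows:
    over_bm, over_gpt4 = lifts(best_bm, best_ft, gpt4)
    print(f"{task:10s} lift over BM = {over_bm:+.3f}, lift over GPT-4 = {over_gpt4:+.3f}")
```

A negative lift over GPT-4, as for magicoder and wikisql, means the best fine-tuned model still trails GPT-4 on that task even though it improves on its own base model.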

After fine-tuning, 301/310 models surpass their base model counterpart, while 224/310 fine-tuned LLMs surpass the benchmark set by GPT-4 (Table 5). (Most instances where fine-tuning was worse than the base model were in the Gemma family. This is possibly due to bugs in the Gemma family of models identified by Unsloth [10], which were not accounted for when benchmarks were collected.) Gemma-2b is the worst performing base model before fine-tuning (0.145), but also experiences the largest lift from fine-tuning overall (+0.512), which suggests that models with lower initial scores stand to benefit the most from fine-tuning (Figure 1).

By overall average across all tasks, all fine-tuned models perform better than GPT-3.5, and all 7B fine-tuned models perform better than GPT-4, except for gemma-7b and gemma-7b-it. Phi-2, with as few as 2 billion parameters, exhibits performance competitive with GPT-4 after fine-tuning, consistent with the findings of the Phi-2 technical report [46].

Averaged over 31 tasks, the overall performance of the best fine-tuned LLMs (0.756) is significantly higher than that of GPT-4 (0.661) (Table 4). A detailed breakdown of performance per model, per task, can be found in Appendix C.

Base Model	No FT	With FT	Average lift from FT	Average lift from FT vs. GPT-4	Frequency FT > No FT	Frequency FT > GPT-4	Frequency FT = max(task)
gpt-3.5-turbo	0.599	-					0/31
gemma-2b-instruct	0.326	0.645	0.319	-0.016	96.7% (30/31)	64.5% (20/31)	0/31
gemma-7b	0.187	0.645	0.458	-0.016	93.5% (29/31)	64.5% (20/31)	1/31
gemma-7b-instruct	0.377	0.656	0.279	-0.005	83.8% (26/31)	64.5% (20/31)	0/31
gemma-2b	0.145	0.657	0.512	-0.004	100.0% (31/31)	67.7% (21/31)	0/31
gpt-4	0.661	-					6/31
phi-2	0.274	0.677	0.403	0.016	100.0% (31/31)	71.0% (22/31)	1/31
llama-2-7b	0.252	0.696	0.444	0.035	96.7% (30/31)	67.7% (21/31)	0/31
llama-2-7b-chat	0.370	0.708	0.337	0.047	100.0% (31/31)	74.2% (23/31)	0/31
mistral-7b-instruct	0.462	0.724	0.263	0.063	100.0% (31/31)	77.4% (24/31)	3/31
mistral-7b	0.271	0.732	0.461	0.071	100.0% (31/31)	83.8% (26/31)	10/31
zephyr-7b-beta	0.350	0.742	0.392	0.081	100.0% (31/31)	87.1% (27/31)	8/31
Average	0.301	0.688	0.387	0.027	97.1% (301/310)	72.3% (224/310)	

Table 5: Model performance by base model averaged over 31 tasks, before and after fine-tuning.

5 Discussion and Analysis

5.1 Which Base Model is the best for LoRA Fine-tuning?

Mistral-7B and Zephyr-7b-beta emerge as leaders, albeit in different categories. Mistral-7B achieves top performance on the largest number of tasks (10/31), suggesting high adaptability (Figure 5). Zephyr, by contrast, has the highest overall average performance (0.742, Table 5). Mistral-7b, Mistral-7b-instruct, and Zephyr-7b-beta (which is itself derived from Mistral-7b [38]) lead the pack for LoRA fine-tuning performance, ahead of the Llama, Phi, and Gemma families.

5.2 Does size matter for LoRA fine-tuning? 2B vs. 7B

The 2B parameter Phi-2 model, after fine-tuning, outperforms all of the 2B and 7B Gemma models by overall average, and is only 1.9 points behind the next highest performing 7B model, Llama-2-7b (0.677 vs. 0.696). Despite this, we find that fine-tuned 7B models are almost always better than fine-tuned 2B models (29/31 tasks). Among 2B parameter models in particular (Phi and Gemma), all Gemma instruct models were better than Phi out of the box; however, Phi-2 performs better than all Gemma models after fine-tuning.

5.3 Is fine-tuning better with Instruction-tuned or Auto-complete models?

In Figure 6, we observe that before fine-tuning, instruction-tuned models outperform auto-complete models, despite using completion-style prompts. A qualitative analysis shows that auto-complete models were much more likely to "go off the rails" and generate long, irrelevant text sequences, whereas instruction-tuned models were more consistent in correctly attempting the task at hand.

After fine-tuning, the performance disparities between the models narrow. The average instruction-tuned model slightly outperforms the average auto-complete model by a margin of +0.009; however, the reverse is true when comparing the best fine-tuned instruction-tuned model and the best fine-tuned auto-complete model (-0.002). Auto-complete models, possibly due to their broader and less specialized knowledge base, may be inherently more adaptable to a variety of tasks. However, with adequate fine-tuning, both types of models achieve comparable performance levels. We encourage further research to explore how the foundational design of instruction-tuned models influences their adaptability and effectiveness in task-specific fine-tuning.

5.4 When does GPT-4 consistently outperform fine-tuned models?

We observe a distinct advantage for fine-tuned LLMs on narrowly scoped tasks, such as those within the GLUE benchmarks. On these tasks, which are primarily classification-oriented, fine-tuned LLMs achieve near 90% accuracy, outperforming GPT-4. GPT-4 continues to outperform fine-tuned models in 6 out of 31 tasks, particularly in broader, more complex domains such as Python coding and MMLU.

5.5 Quantifying the relationship between fine-tuning quality lift and task complexity

If fine-tuned models perform better on specialized "narrow" tasks and worse on "broader" tasks, can we establish a predictive relationship between the complexity of a task and the efficacy of LoRA fine-tuning? Identifying such a relationship could provide a valuable predictive tool for assessing the potential benefits of fine-tuning enhancements on new tasks before the fine-tuning process begins.

5.5.1 Heuristics for fine-tuning quality, quality lift, and task complexity

To quantify task complexity, we use several heuristics:

• Number of training examples.
• Lengths of inputs and outputs ($\mu$, $\sigma$, and 95th percentile).
• Compressibility (https://docs.python.org/3/library/gzip.html) ($\mu$ and $\sigma$).
• Diversity of content, which we approximate by measuring the ROUGE-L similarity between inputs and outputs [41] ($\mu$ and $\sigma$).
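
Two of the heuristics listed above can be sketched with the Python standard library. The precise definitions used in the study are not spelled out here, so the following are assumptions: compressibility is taken as the ratio of gzip-compressed size to raw size in bytes, and the 95th percentile uses the nearest-rank method. Function names are hypothetical.

```python
import gzip
from statistics import mean, pstdev


def compressibility(text: str) -> float:
    """Ratio of gzip-compressed size to raw size in bytes (assumed definition).

    Expects non-empty text; lower values mean more repetitive content.
    """
    raw = text.encode("utf-8")
    return len(gzip.compress(raw)) / len(raw)


def length_stats(lengths):
    """Mean, population standard deviation, and nearest-rank 95th percentile."""
    ordered = sorted(lengths)
    rank = -(-95 * len(ordered) // 100)  # ceil(0.95 * n) using integer arithmetic
    return mean(ordered), pstdev(ordered), ordered[max(0, rank - 1)]


repetitive = "abc " * 500
print(f"compressibility of repetitive text: {compressibility(repetitive):.3f}")
print(length_stats(list(range(1, 101))))
```

Per-task values of $\mu$ and $\sigma$ would then be obtained by applying such functions to every example in a dataset and aggregating the results.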

For task complexity heuristic,

For model quality measurements, we track:

• Baseline GPT-4 score
• Lift from the best fine-tuned model vs. GPT-4 ("Max GPT-4 Lift")
• Average fine-tuning lift over the base model
• Best base model score without fine-tuning
• Average base model score without fine-tuning
• Best fine-tuned model score
• Average fine-tuned model score

Refer to Table 6 for a complete example.

	Metric	arc_combined	bc5cdr	boolq
Model quality measurements	Max GPT-4 Lift	-0.03	0.08	0.00
	Average Base Model Lift	0.32	0.75	0.19
	Best Base Model Score	0.67	0.70	0.76
	Average Base Model Score	0.41	0.22	0.64
	Best Fine-tuned Score	0.92	0.97	0.91
	Average Fine-Tuned Score	0.73	0.97	0.82
Task complexity heuristics	Input length p95	143.00	175.00	270.70
	Input length $\mu$	102.89	142.15	145.23
	Input length $\sigma$	21.68	19.17	69.03
	Output length p95	1.00	58.00	1.00
	Output length $\mu$	1.00	37.11	1.00
	Output length $\sigma$	0.00	11.27	0.00
	Example length $\mu$	102.92	178.26	146.23
	Example length p95	143.00	226.05	271.70
	Example length $\sigma$	21.66	27.84	69.03
	I/O rougeL similarity $\mu$	0.03	0.19	0.00
	I/O rougeL similarity $\sigma$	0.01	0.03	0.00
	Compressibility $\mu$	0.64	0.55	0.60
	Compressibility $\sigma$	0.06	0.01	0.07
	# training examples	3370	5228	9427

Table 6: Model quality measurements and task complexity heuristics for 3 different tasks (example). Refer to Appendix C for all measurements and heuristics for all 31 tasks.

5.5.2 Correlating fine-tuning quality and quality lift with task complexity

We find several intriguing correlations suggesting significant interactions between our task complexity heuristics and measurements of model performance. Key observations include:

• Compressibility exhibited a dual influence, correlating positively with both best and average base model scores (0.36), while correlating negatively with these scores when the variance in compressibility increased (-0.37). This indicates that while uniform compressibility supports model performance, higher variability in compressibility tends to degrade it.
• Input and Output Lengths: Longer and more varied output lengths correlated positively with the maximum lift of fine-tuned models over GPT-4, suggesting that tasks with longer and more varied outputs are not detrimental for fine-tuning. Conversely, longer and more varied input and output lengths negatively correlate with absolute base and fine-tuned model scores.
• Input and Output Rouge-L Similarity: A higher standard deviation in input/output Rouge-L similarity correlates negatively with both base and fine-tuned model scores. This suggests that greater variability in content similarity within a dataset may pose difficulties for model learning.
• Number of training examples: No significant correlation was found with the number of training examples, pointing to the possibility that once a sufficient sample size is achieved, additional examples do not necessarily contribute to improved fine-tuning efficacy.
• Model quality inter-correlations reveal that better average scores (both base and fine-tuned) strongly predict the best scores obtained, suggesting a general consistency in model performance across different training instances.

Overall, these observations are consistent with our hypothesis that narrower, easier tasks are more likely to see success with fine-tuned adapters.
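As a rough illustration of how one such correlation can be computed, the sketch below evaluates a Pearson coefficient between a task complexity heuristic and a model quality measurement. The choice of Pearson correlation is an assumption (this section does not name the coefficient used), and the inputs are only the three example tasks from Table 6, so the resulting value is illustrative and is not expected to match the correlations reported over all 31 tasks.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant series
    return cov / math.sqrt(var_x * var_y)

# Values copied from the three example tasks in Table 6 (illustrative only).
compressibility_mu = [0.64, 0.55, 0.60]
best_base_model_score = [0.67, 0.70, 0.76]

r = pearson(compressibility_mu, best_base_model_score)
print(f"Pearson r over 3 example tasks: {r:.2f}")
```

With only three points the estimate is very noisy; the same function applied to the full per-task table in Appendix C would be the meaningful computation.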

5.5.3 Predicting fine-tuning quality and quality lift given task complexity heuristics

We train linear regression models to predict the quality lift achievable through adapter-based fine-tuning, using z-score normalized dataset complexity heuristics (described in Table 6) as predictors. Results are summarized in Table 7, where we find that linear models yield root mean squared errors (RMSE) of 0.092 to 0.166, depending on the model quality metric in question.

Incorporating the average base model score without fine-tuning as an additional feature improves prediction accuracy for all model quality metrics, reducing RMSE by 0.004 to 0.069. This suggests that base model performance carries some predictive power for anticipating potential gains from fine-tuning. The RMSE values are rather low, suggesting that upfront heuristics-based measurements of dataset complexity can be reasonable indicators of positive fine-tuning impact.

Model Quality Metric	Without average base model score as a feature (RMSE)	With average base model score as a feature (RMSE)
GPT-4 Score	0.140	0.121
Max GPT-4 Lift	0.092	0.085
Average Base Model Score	0.099	N/A (0.000)
Best Base Model Score	0.166	0.097
Average Base Model Lift	0.099	0.095
Average Fine-Tuned Score	0.119	0.095
Best Fine-tuned Score	0.097	0.091

Table 7: The performance (RMSE) of linear regression models predicting model quality metrics before and after fine-tuning, given z-score normalized dataset complexity heuristics, with and without a representative base model score as a feature.
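The setup behind Table 7 can be sketched as follows: z-score normalize the heuristics, fit an ordinary least squares model, and compare RMSE with and without the base model score as an extra feature. This is a minimal sketch under stated assumptions: the data is synthetic (generated from a fixed seed, not the paper's measurements), the number of heuristics is arbitrary, and RMSE is computed in-sample because this section does not describe a train/test protocol.

```python
import numpy as np

def zscore(X):
    """Column-wise z-score normalization; constant columns are left at zero."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)
    return (X - mu) / sigma

def fit_linear(X, y):
    """Ordinary least squares with an intercept term."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Synthetic stand-in data: 31 tasks, 5 hypothetical heuristics.
rng = np.random.default_rng(0)
n_tasks, n_heuristics = 31, 5
heuristics = rng.normal(size=(n_tasks, n_heuristics))
base_score = rng.uniform(0.1, 0.8, size=n_tasks)
lift = (0.5 - 0.4 * base_score + 0.05 * heuristics[:, 0]
        + rng.normal(scale=0.05, size=n_tasks))

X_without = zscore(heuristics)
X_with = zscore(np.column_stack([heuristics, base_score]))

rmse_without = rmse(lift, predict(fit_linear(X_without, lift), X_without))
rmse_with = rmse(lift, predict(fit_linear(X_with, lift), X_with))
print(f"RMSE without base score: {rmse_without:.3f}")
print(f"RMSE with base score:    {rmse_with:.3f}")
```

Note that for an in-sample least squares fit, adding a feature can never increase RMSE, so a held-out or cross-validated comparison would be the more informative test of whether the base model score adds predictive power.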

6 Performance Benchmarks of LoRAX Deployments

To assess the viability of serving many LoRA fine-tuned LLMs simultaneously in a real-world application, we launch LoRA Land, a web application that serves 25 fine-tuned Mistral-7b LLMs to thousands of users from a single A100 GPU.

6.1 LoRAX in a Nutshell

LoRA Exchange (LoRAX) [1] is an open source Multi-LoRA inference server specifically designed for serving many fine-tuned models at once using a shared set of GPU resources. Compared with conventional dedicated LLM deployments, LoRAX consists of three novel components:

• Dynamic Adapter Loading, allowing each set of fine-tuned LoRA weights to be loaded from storage just-in-time as requests come in at runtime, without blocking concurrent requests.
• Continuous Multi-Adapter Batching, a fair scheduling policy for optimizing aggregate throughput of the system that extends the popular continuous batching strategy to work across multiple sets of LoRA adapters in parallel.
• Tiered Weight Caching, to support fast exchanging of LoRA adapters between requests, and offloading of adapter weights to CPU and disk to avoid out-of-memory errors.
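To make the idea of tiered weight caching concrete, the following is a toy sketch and not LoRAX's actual implementation. It assumes a least-recently-used eviction policy, fixed slot counts per tier, and a hypothetical `load_from_disk` callable; the class name `TieredAdapterCache` and all parameters are invented for illustration.

```python
from collections import OrderedDict

class TieredAdapterCache:
    """Toy two-tier LRU cache: a small 'GPU' tier backed by a 'CPU' tier.

    Adapters evicted from the GPU tier are offloaded to the CPU tier;
    adapters evicted from the CPU tier are dropped and must be reloaded
    from disk on their next use.
    """

    def __init__(self, gpu_slots, cpu_slots, load_from_disk):
        self.gpu = OrderedDict()
        self.cpu = OrderedDict()
        self.gpu_slots = gpu_slots
        self.cpu_slots = cpu_slots
        self.load_from_disk = load_from_disk  # hypothetical loader callable

    def get(self, adapter_id):
        if adapter_id in self.gpu:
            self.gpu.move_to_end(adapter_id)  # mark as most recently used
            return self.gpu[adapter_id]
        if adapter_id in self.cpu:
            weights = self.cpu.pop(adapter_id)  # promote from CPU tier
        else:
            weights = self.load_from_disk(adapter_id)
        self._put_gpu(adapter_id, weights)
        return weights

    def _put_gpu(self, adapter_id, weights):
        self.gpu[adapter_id] = weights
        while len(self.gpu) > self.gpu_slots:
            old_id, old_weights = self.gpu.popitem(last=False)
            self.cpu[old_id] = old_weights  # offload instead of discarding
            while len(self.cpu) > self.cpu_slots:
                self.cpu.popitem(last=False)  # still available on disk

# Usage: record which requests had to go all the way to disk.
disk_loads = []

def fake_loader(adapter_id):
    disk_loads.append(adapter_id)
    return f"weights-for-{adapter_id}"

cache = TieredAdapterCache(gpu_slots=2, cpu_slots=1, load_from_disk=fake_loader)
for adapter in ["a", "b", "c", "a", "d", "b"]:
    cache.get(adapter)
print(disk_loads)
```

In this run, the second request for adapter "a" is served from the CPU tier, while the second request for "b" falls through to disk because "b" had already been evicted from both tiers.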

6.2 Benchmarking Results

We run benchmarks in order to understand the impact of serving multiple adapters on the relevant metrics, described below. We also test the scalability of the system with respect to the following factors:

• Number of concurrent users submitting LLM prompts
• Number of adapters concurrently being queried
• Number of input tokens
• Number of output tokens

LLM serving performance metrics include: time to first token (TTFT), total request time, token streaming time, and throughput (tokens per second). We run our benchmarks from a t3.2xlarge EC2 instance in the AWS region us-west-2. All benchmarks are based on the Mistral-7b-instruct LLM, deployed on an A100 GPU with 80GB of memory. The script used to benchmark LLM serving performance can be found in Appendix B.

The following is a summary of relevant terminology:

• Total request time (ms): total time from when the request is sent to when the last token is streamed back to the client.
• Time to first token, TTFT (ms): time from when the request is sent to when the first token is received by the client.
• Token streaming time (ms): time from when the first token is received by the client to when the last token is received by the client.
• Throughput (token/s): number of tokens generated per second, computed as (number of output tokens / token streaming time in seconds), i.e. the inverse of the per-token latency (token streaming time (ms) / number of output tokens).
• Concurrent users: number of users that make requests to the LLM, wait until they receive a full response, then make another request until the end of testing time.
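The metric definitions above can be sketched in code as follows. This is an illustrative helper, not the benchmarking script from Appendix B; the `RequestTrace` type and its field names are assumptions. Per-token latency is taken as token streaming time divided by the number of output tokens, and throughput as its inverse.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class RequestTrace:
    sent_at: float            # client-side send time, in seconds
    token_times: List[float]  # client-side arrival time of each token, in seconds

@dataclass
class RequestMetrics:
    total_request_ms: float
    ttft_ms: float
    streaming_ms: float
    tokens_per_s: Optional[float]  # None when streaming time is zero

def summarize(trace: RequestTrace) -> RequestMetrics:
    if not trace.token_times:
        raise ValueError("trace contains no tokens")
    first, last = trace.token_times[0], trace.token_times[-1]
    total_ms = (last - trace.sent_at) * 1000.0
    ttft_ms = (first - trace.sent_at) * 1000.0
    streaming_ms = (last - first) * 1000.0
    # Per-token latency as defined above; throughput is its inverse.
    ms_per_token = streaming_ms / len(trace.token_times)
    tokens_per_s = 1000.0 / ms_per_token if ms_per_token > 0 else None
    return RequestMetrics(total_ms, ttft_ms, streaming_ms, tokens_per_s)

# Example: five tokens, the first arriving 125 ms after the request is sent.
example = summarize(RequestTrace(sent_at=0.0,
                                 token_times=[0.125, 0.25, 0.375, 0.5, 0.625]))
print(example)
```

A single-token response has zero streaming time, so throughput is undefined for it; the helper returns `None` in that case.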

6.3 Latency from adapter switching and concurrent users

The following reported benchmarks come from 2-minute runs that continuously stream requests to the LLM deployment. Our experiments indicate that a duration of two minutes provides an adequate volume of data to obtain stable and reliable metrics.

Table 8 shows the impact on LLM query performance isolated to adapter switching mechanics. In the multi-adapter, multi-user case, the average token streaming time is essentially unchanged (70ms vs. 70.14ms), but the average total request time is 7.21ms higher, which illustrates the cost of handling requests from 100 concurrent users that lead to switching between 25 adapters.

	0 adapters (base model), 1 concurrent user		25 adapters, 100 concurrent users
	Average	p90	Average	p90
Total request time (ms)	191.81	192.3	199.02	201.82
Time to first token, TTFT (ms)	122.19	191.16	128.79	199.11
Token streaming time (ms)	70	92.38	70.14	96.62

Table 8: Measuring LLM querying metrics from adapter switching mechanics only. To eliminate extra, non-adapter-switching factors related to input and generation, simulated requests contain 1 input token and max_new_tokens is capped at 1. Throughput metrics are excluded since only 1 output token is generated.

# concurrent users		1	5	10	20	50
Total request time (ms)	average	943.03	1165.71	1359.39	2004.9	2981.66
Total request time (ms)	p90	1567.66	1925.96	2147.84	3287.21	4673.52
Time to first token, TTFT (ms)	average	121.84	121.80	143.68	135.43	136.17
Time to first token, TTFT (ms)	p90	191.08	195.85	199.98	199.76	199.54
Token streaming time (ms)	average	821.09	1043.79	1215.6	1869.36	2845.38
Token streaming time (ms)	p90	1468.76	1804.16	2007.89	3130.72	4544.64

Table 9: Benchmarking base LLM deployments on 1xA100 with queries that simulate real load.

To simulate realistic traffic payloads, we generate random payloads with 30-500 input tokens and 1-120 output tokens, modeled on the tasks defined in Table 1. We vary the number of concurrent users from 1 to 50, and payloads are issued randomly among 25 different adapter endpoints.

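When scaling from 1 to 50 concurrent users, which also increases load by 50X, the average time to first token (TTFT) is only slightly affected: per Table 9, the largest increase over the single-user case is +21.84ms (17.9%), observed at 10 concurrent users. We see a 3.46X decrease in throughput (average token streaming time grows from 821.09ms to 2845.38ms) for the same 50X increase in load.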
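The scaling figures quoted above can be checked directly against the averages in Table 9. The short script below only re-does that arithmetic; the variable names are ours. In Table 9, the +21.84ms gap is the one between 1 and 10 concurrent users, and the gap between 1 and 50 users is smaller.

```python
# Average values copied from Table 9 (base LLM deployment).
concurrent_users = [1, 5, 10, 20, 50]
ttft_avg_ms = [121.84, 121.80, 143.68, 135.43, 136.17]
streaming_avg_ms = [821.09, 1043.79, 1215.6, 1869.36, 2845.38]

baseline_ttft = ttft_avg_ms[0]
ttft_increase_ms = [t - baseline_ttft for t in ttft_avg_ms]
ttft_increase_pct = [100.0 * d / baseline_ttft for d in ttft_increase_ms]

# Index of the load level with the largest TTFT increase over 1 user.
worst = max(range(len(concurrent_users)), key=lambda i: ttft_increase_ms[i])

# Ratio of average token streaming time at 50 users vs. 1 user.
streaming_slowdown = streaming_avg_ms[-1] / streaming_avg_ms[0]

for users, d_ms, d_pct in zip(concurrent_users, ttft_increase_ms, ttft_increase_pct):
    print(f"{users:>3} users: TTFT change {d_ms:+.2f} ms ({d_pct:+.1f}%)")
print(f"largest TTFT increase at {concurrent_users[worst]} users")
print(f"streaming time ratio, 50 users vs 1 user: {streaming_slowdown:.3f}")
```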

# concurrent users		1	5	10	20	50
Total request time (ms)	average	956.56	1272.16	1528.99	1896.1	3336.27
Total request time (ms)	p90	1758.53	2164.08	2612.05	3222.73	5330.84
Time to first token, TTFT (ms)	average	170.62	148.14	157.49	167.28	153.89
Time to first token, TTFT (ms)	p90	199.36	198.98	199.41	200.99	200.2
Token streaming time (ms)	average	785.82	1123.91	1371.39	1728.71	3182.27
Token streaming time (ms)	p90	1594.65	2023.33	2468.87	3047.92	5169.05

Table 10: Benchmarking 25 adapters on 1xA100 with queries that simulate real load.

Table 10 shows that there’s no significant difference between querying the base LLM vs. the 25 adapters when it comes to TTFT or throughput. The cost of adapter switching is overshadowed by the time it takes to generate tokens once requests come in. Comparing average case numbers vs. p90 numbers for TTFT, the largest disparity is between 121.8ms (average) and 195.95ms (p90) for a 60.87% increase. Additionally, we consistently see that TTFT is at or under the 200ms mark.

On throughput, we observe that it takes between 12 and 13.5ms to generate a single token on an A100 GPU both for base deployments and deployments where adapter weights have been added. This means that the aggregate throughput for the LLM deployment on that GPU is between 74 tokens/s and 83 tokens/s.
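The conversion from per-token latency to tokens per second used above is a simple inversion. Writing $t_{\text{token}}$ for the time to generate one token in milliseconds (a symbol introduced here for illustration), the two endpoints work out as:

```latex
\[
\text{throughput (tokens/s)} = \frac{1000}{t_{\text{token}}\ (\text{ms})},
\qquad
\frac{1000}{13.5} \approx 74.1,
\qquad
\frac{1000}{12} \approx 83.3
\]
```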

6.4 Analyzing the performance impact of additional deployment replicas

In Table 11, we run benchmarks for 25 adapters queried concurrently by 50 users, with a LoRAX deployment on 1 replica. We then run benchmarks where we scale the LoRAX deployment to 2 replicas placed behind a round robin load balancer to route equal amounts of traffic to each replica, while also scaling the load to 100 concurrent users. We see that the numbers are stable across the board, signifying that replicas can be scaled linearly with load to achieve comparable metrics.

		50 Concurrent users, 1 replica	100 Concurrent users, 2 replicas
Total request time (ms)	average	3336.27	3368.53
Total request time (ms)	p90	5330.84	5382.61
Time to first token, TTFT (ms)	average	153.89	161.97
Time to first token, TTFT (ms)	p90	200.2	199.83
Token streaming time (ms)	average	3182.27	3206.46
Token streaming time (ms)	p90	5169.05	5248.97

Table 11: Benchmarking 25 adapters on 1 LoRAX replica vs. 2 replicas with queries that simulate real load.
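The round robin routing used in the two-replica setup can be illustrated with a few lines of code. This is a generic sketch of the policy, not the load balancer used in our deployment; the class name and replica identifiers are hypothetical.

```python
from collections import Counter
from itertools import cycle

class RoundRobinBalancer:
    """Routes each incoming request to the next replica in a fixed rotation."""

    def __init__(self, replicas):
        replicas = list(replicas)
        if not replicas:
            raise ValueError("at least one replica is required")
        self._rotation = cycle(replicas)

    def route(self):
        return next(self._rotation)

# 100 requests spread over 2 replicas: each replica receives 50 of them,
# matching the per-replica load of the single-replica run.
balancer = RoundRobinBalancer(["replica-0", "replica-1"])
assignments = Counter(balancer.route() for _ in range(100))
print(assignments)
```

Because each replica receives the same share of requests, per-replica load at 100 users across 2 replicas matches that of 50 users on 1 replica, which is consistent with the stable numbers in Table 11.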

7 Limitations

Our experimental design has many limitations, including:

• Restricted Evaluation Scope: Our evaluations are limited to the first 1000 examples of datasets with larger evaluation splits to manage costs while maintaining rigor. This may introduce selection bias and limit the generalizability of our findings. Future research should consider more comprehensive evaluations as resources allow.
• Prompt Engineering Constraints: Our study does not employ advanced prompt engineering techniques such as majority voting, n-shot prompting, or specialized tuning methods like MedPrompt or chain-of-thought prompting. We prioritize reproducibility and minimize biases from selective example choice by using simple zero or single-shot prompts across all tasks, even though these techniques have shown potential in enhancing task-specific performance.
• Training Constraints: All LLMs are fine-tuned with the same parameters: 40K examples, batch size of 1, 4-bit quantization, and a LoRA rank of 8, using an Adam optimizer and a cosine learning rate scheduler with specific settings. Training is conducted on a single A10 GPU, using gradient checkpointing to manage memory limitations. For datasets where full sequence lengths induce memory overflow, we truncate sequences to the 95th percentile length. This approach may impact the thoroughness of model training, particularly on datasets where 40K steps do not complete even one full epoch. Expanding hardware capabilities, increasing batch sizes, or adjusting hyperparameters like the learning rate or scheduler could potentially enhance outcomes.
• Limited Model Variety: Our experiments are limited to LoRA fine-tuning on two model sizes, 2B and 7B. Exploring a broader range of model sizes, including larger models such as 13B or 70B, could provide insights into the scalability and effectiveness of fine-tuning across different computational capacities.
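The 95th-percentile truncation mentioned under Training Constraints can be sketched as below. This is an assumption-laden illustration rather than the training code: it assumes NumPy's default linear-interpolation percentile, floors the cutoff to an integer, and keeps the leading tokens of each sequence, none of which is specified in this section.

```python
import numpy as np

def truncate_to_p95(sequences):
    """Truncate token sequences to the 95th percentile of their lengths.

    Returns the truncated sequences and the integer cutoff that was applied.
    """
    lengths = [len(seq) for seq in sequences]
    cutoff = int(np.percentile(lengths, 95))  # floor of the interpolated value
    return [seq[:cutoff] for seq in sequences], cutoff

# Toy dataset: 19 short sequences and one long outlier.
sequences = [list(range(n)) for n in [10] * 19 + [200]]
truncated, cutoff = truncate_to_p95(sequences)
print(cutoff, max(len(seq) for seq in truncated))
```

Only the outlier is shortened here; sequences already below the cutoff are returned unchanged.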

We maintain that LoRA Land successfully demonstrates the practical efficiency of training and serving several task-specialized LLMs that rival GPT-4 in a production application powered by LoRAX, despite these limitations.

8 Conclusion

In this study, we assess the efficacy of Low Rank Adaptation (LoRA) for fine-tuning Large Language Models (LLMs) across a broad range of tasks and models and the viability of serving multiple fine-tuned LoRA LLMs in production.

On model quality, our results confirm that LoRA fine-tuning significantly enhances LLM performance, surpassing non-fine-tuned bases and GPT-4. The standout performance of models like Mistral-7B across multiple tasks highlights the importance of base model selection in fine-tuning success. We find that dataset complexity heuristics can be reasonably leveraged as potential predictors of fine-tuning success, suggesting that the nature of the task plays an important role in the effectiveness of fine-tuning.

Despite these outcomes, limitations such as the scale of evaluations, training constraints, and the simplicity of our prompt engineering approaches suggest areas for future improvement. We release all of our models and training setups for further community validation and experimentation.

On serving, we demonstrate the practical deployment of these models using the LoRAX framework through the LoRA Land web application. We provide benchmarks for time to first token (TTFT), total request time, and token streaming time, and measure LoRAX’s latency robustness to up to 100 concurrent users.

Altogether, LoRA Land emphasizes the quality and cost-effectiveness of employing multiple specialized LLMs over a single, general-purpose LLM.

9 Acknowledgements

Justin Zhao led the research and wrote the paper. Justin Zhao and Timothy Wang designed the experiments, created the evaluation harness, ran experiments, and analyzed the data. Wael Abid led LoRAX performance benchmarks and wrote section 6 of the paper. Piero Molino was an early advocate for the idea and provided feedback on the writing, experiments, and data analysis. We thank Martin Davis, Kabir Brar, and Jackie Ho for designing and developing the LoRA Land web application. We thank Travis Addair, Geoffrey Angus, Magdy Saleh, Noah Yoshida, Jeffrey Tang, and open source contributors for developing LoRAX. We thank Noah Yoshida and Gyanesh Mishra for supporting deployments. We thank Arnav Garg, Geoffrey Angus, Jeff Kinnison, Alex Sherstinsky, Travis Addair, Piero Molino, and open source contributors for Ludwig. We thank Will Gorman, Michael Gonzales, and Devvret Rishi for support, discussion, and feedback.

References

Addair and Angus [2023] Travis Addair and Geoffrey Angus. LoRA Exchange (LoRAX): Serve 100s of Fine-Tuned LLMs for the Cost of 1 - Predibase — predibase.com. https://predibase.com/blog/lora-exchange-lorax-serve-100s-of-fine-tuned-llms-for-the-cost-of-one, 2023. [Accessed 15-04-2024].
Beeching et al. [2023] Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. Open llm leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023.
Bommasani et al. [2021] Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, S. Buch, Dallas Card, Rodrigo Castellon, Niladri S. Chatterji, Annie S. Chen, Kathleen A. Creel, Jared Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh, Li Fei-Fei, Chelsea Finn, Trevor Gale, Lauren E. Gillespie, Karan Goel, Noah D. Goodman, Shelby Grossman, Neel Guha, Tatsunori Hashimoto, Peter Henderson, John Hewitt, Daniel E. Ho, Jenny Hong, Kyle Hsu, Jing Huang, Thomas F. Icard, Saahil Jain, Dan Jurafsky, Pratyusha Kalluri, Siddharth Karamcheti, Geoff Keeling, Fereshte Khani, O. Khattab, Pang Wei Koh, Mark S. Krass, Ranjay Krishna, Rohith Kuditipudi, Ananya Kumar, Faisal Ladhak, Mina Lee, Tony Lee, Jure Leskovec, Isabelle Levent, Xiang Lisa Li, Xuechen Li, Tengyu Ma, Ali Malik, Christopher D. Manning, Suvir P. Mirchandani, Eric Mitchell, Zanele Munyikwa, Suraj Nair, Avanika Narayan, Deepak Narayanan, Benjamin Newman, Allen Nie, Juan Carlos Niebles, Hamed Nilforoshan, J. F. Nyarko, Giray Ogut, Laurel Orr, Isabel Papadimitriou, Joon Sung Park, Chris Piech, Eva Portelance, Christopher Potts, Aditi Raghunathan, Robert Reich, Hongyu Ren, Frieda Rong, Yusuf H. Roohani, Camilo Ruiz, Jack Ryan, Christopher Ré, Dorsa Sadigh, Shiori Sagawa, Keshav Santhanam, Andy Shih, Krishna Parasuram Srinivasan, Alex Tamkin, Rohan Taori, Armin W. Thomas, Florian Tramèr, Rose E. Wang, William Wang, Bohan Wu, Jiajun Wu, Yuhuai Wu, Sang Michael Xie, Michihiro Yasunaga, Jiaxuan You, Matei A. Zaharia, Michael Zhang, Tianyi Zhang, Xikun Zhang, Yuhui Zhang, Lucia Zheng, Kaitlyn Zhou, and Percy Liang. On the opportunities and risks of foundation models. ArXiv, 2021. URL https://crfm.stanford.edu/assets/report.pdf.
Chen et al. [2023] Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, and Arvind Krishnamurthy. Punica: Multi-tenant lora serving, 2023.
Chen et al. [2021] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code, 2021.
cjadams et al. [2019] cjadams, Daniel Borkan, inversion, Jeffrey Sorensen, Lucas Dixon, Lucy Vasserman, and nithum. Jigsaw unintended bias in toxicity classification, 2019. URL https://kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification.
Cobbe et al. [2021] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021.
Dettmers et al. [2023] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms, 2023.
Gao et al. [2023] Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, 12 2023. URL https://zenodo.org/records/10256836.
Han and Han [2024] Daniel Han and Michael Han. Unsloth Fixing Gemma bugs — unsloth.ai. https://unsloth.ai/blog/gemma-bugs, 2024. [Accessed 15-04-2024].
Hendrycks et al. [2020] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2020.
Houlsby et al. [2019] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp, 2019.
Howard and Ruder [2018] Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification, 2018.
Hu et al. [2021] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021.
Jiang et al. [2023] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023.
Kocmi et al. [2021] Tom Kocmi, Christian Federmann, Roman Grundkiewicz, Marcin Junczys-Dowmunt, Hitokazu Matsushita, and Arul Menezes. To ship or not to ship: An extensive evaluation of automatic metrics for machine translation, 2021.
Kohút and Hradiš [2023] Jan Kohút and Michal Hradiš. Finetuning is a surprisingly effective domain adaptation baseline in handwriting recognition, 2023.
Kwon et al. [2023] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
Lester et al. [2021] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning, 2021.
Liang et al. [2022] Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, and Yuta Koreeda. Holistic evaluation of language models. Published in Transactions on Machine Learning Research (TMLR), 2023, 2022.
Liu et al. [2024] Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. Dora: Weight-decomposed low-rank adaptation, 2024.
Meng et al. [2024] Xiangdi Meng, Damai Dai, Weiyao Luo, Zhe Yang, Shaoxiang Wu, Xiaochen Wang, Peiyi Wang, Qingxiu Dong, Liang Chen, and Zhifang Sui. Periodiclora: Breaking the low-rank bottleneck in lora optimization, 2024.
Minaee et al. [2024] Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. Large language models: A survey, 2024.
Molino et al. [2019] Piero Molino, Yaroslav Dudin, and Sai Sumanth Miryala. Ludwig: a type-based declarative deep learning toolbox, 2019.
Nori et al. [2023] Harsha Nori, Yin Tat Lee, Sheng Zhang, Dean Carignan, Richard Edgar, Nicolo Fusi, Nicholas King, Jonathan Larson, Yuanzhi Li, Weishung Liu, Renqian Luo, Scott Mayer McKinney, Robert Osazuwa Ness, Hoifung Poon, Tao Qin, Naoto Usuyama, Chris White, and Eric Horvitz. Can generalist foundation models outcompete special-purpose tuning? case study in medicine, 2023.
OpenAI [2023] OpenAI. Gpt-4 technical report, 2023.
OpenAI [2024] OpenAI. GitHub - openai/tiktoken: tiktoken is a fast BPE tokeniser for use with OpenAI’s models. — github.com. https://github.com/openai/tiktoken, 2024. [Accessed 15-04-2024].
Ouyang et al. [2022] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback, 2022.
Peters et al. [2019] Matthew E. Peters, Sebastian Ruder, and Noah A. Smith. To tune or not to tune? adapting pretrained representations to diverse tasks, 2019.
Pfeiffer et al. [2020] Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, and Iryna Gurevych. Adapterfusion: Non-destructive task composition for transfer learning. Proceedings of EACL 2021, 2020.
Razdaibiedina et al. [2023] Anastasia Razdaibiedina, Yuning Mao, Rui Hou, Madian Khabsa, Mike Lewis, and Amjad Almahairi. Progressive prompts: Continual learning for language models, 2023.
Rebuffi et al. [2017] Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Learning multiple visual domains with residual adapters, 2017.
Rücklé et al. [2020] Andreas Rücklé, Gregor Geigle, Max Glockner, Tilman Beck, Jonas Pfeiffer, Nils Reimers, and Iryna Gurevych. Adapterdrop: On the efficiency of adapters in transformers, 2020.
Song et al. [2022] Yisheng Song, Ting Wang, Subrota K Mondal, and Jyoti Prakash Sahoo. A comprehensive survey of few-shot learning: Evolution, applications, challenges, and opportunities, 2022.
Team [2023] Gemini Team. Gemini: A family of highly capable multimodal models, 2023.
Team [2024] Gemma Team. Gemma: Open models based on gemini research and technology, 2024.
Touvron et al. [2023] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023.
Tunstall et al. [2023] Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, Nathan Sarrazin, Omar Sanseviero, Alexander M. Rush, and Thomas Wolf. Zephyr: Direct distillation of lm alignment, 2023.
Wang et al. [2018] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding, 2018.
Wang et al. [2022a] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models, 2022a.
Wang et al. [2022b] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions, 2022b.
Wang et al. [2021] Zifeng Wang, Zizhao Zhang, Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, and Tomas Pfister. Learning to prompt for continual learning, 2021.
Wei et al. [2022] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2022.
Zellers et al. [2019] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence?, 2019.
Zheng et al. [2024] Jiawei Zheng, Hanghai Hong, Xiaoli Wang, Jingsong Su, Yonggui Liang, and Shikai Wu. Fine-tuning large language models for domain-specific machine translation, 2024.
Zhong et al. [2017] Victor Zhong, Caiming Xiong, and Richard Socher. Seq2sql: Generating structured queries from natural language using reinforcement learning. CoRR, abs/1709.00103, 2017.

Appendix A Prompts for all tasks

The preprocessing code, prompts, configuration, and splits used for all experiments can be found at https://github.com/predibase/lora_bakeoff.

Appendix B LoRAX benchmarking scripts

The load testing script and instructions can be found at https://github.com/predibase/lora_bakeoff.

Appendix C Full Results Tables

Category	Task	Metric	Microsoft	Google				Meta		Mistral		Hugging Face	OpenAI
Category	Task	Metric	phi-2	gemma-2b	gemma-2b-instruct	gemma-7b	gemma-7b-instruct	llama-2-7b	llama-2-7b-chat	mistral-7b	mistral-7b-instruct	zephyr-7b-beta	gpt-3.5-turbo	gpt-4
Classic NLP	bc5cdr	rouge	0.172	0.013	0.494	0.075	0.198	0.185	0.024	0.177	0.703	0.146	0.732	0.890
	conllpp	rouge	0.101	0.011	0.647	0.085	0.120	0.108	0.115	0.148	0.733	0.088	0.810	0.742
	e2e_nlg	rouge	0.132	0.174	0.281	0.152	0.434	0.087	0.442	0.167	0.482	0.122	0.467	0.513
	tldr_content_gen	rouge	0.158	0.117	0.160	0.089	0.141	0.148	0.183	0.153	0.163	0.164	0.173	0.125
	tldr_headline_gen	rouge	0.169	0.034	0.155	0.063	0.152	0.078	0.174	0.071	0.171	0.120	0.195	0.175
	viggo	rouge	0.133	0.093	0.237	0.123	0.313	0.141	0.356	0.044	0.374	0.193	0.372	0.374
	webnlg	rouge	0.120	0.055	0.312	0.257	0.453	0.148	0.563	0.091	0.541	0.512	0.581	0.583
Coding	magicoder	humaneval	0.012	0.037	0.024	0.030	0.018	0.012	0.134	0.201	0.152	0.049	0.683	0.829
Coding	wikisql	rouge	0.143	0.030	0.301	0.036	0.244	0.043	0.093	0.265	0.134	0.080	0.887	0.909
Knowledge	boolq	accuracy	0.691	0.447	0.661	0.300	0.735	0.645	0.759	0.669	0.764	0.683	0.870	0.911
	dbpedia	dbpedia	0.268	0.018	0.086	0.021	0.089	0.043	0.868	0.036	0.313	0.578	0.853	0.965
	customer_support	accuracy	0.250	0.120	0.380	0.100	0.850	0.110	0.630	0.030	0.730	0.540	1.000	1.000
	glue_qnli	accuracy	0.496	0.439	0.444	0.463	0.685	0.510	0.736	0.533	0.743	0.569	0.829	0.902
	glue_stsb	mae	0.682	0.197	0.590	0.537	0.729	0.651	0.680	0.672	0.723	0.814	0.857	0.773
	legal	rouge	0.008	0.010	0.037	0.019	0.053	0.009	0.026	0.001	0.158	0.039	0.266	0.305
	reuters	rouge	0.003	0.001	0.010	0.001	0.009	0.003	0.010	0.004	0.010	0.005	0.026	0.014
	mmlu	accuracy	0.339	0.160	0.279	0.302	0.460	0.189	0.349	0.402	0.446	0.506	0.504	0.774
Reasoning	winogrande	accuracy	0.380	0.309	0.515	0.390	0.576	0.503	0.515	0.498	0.546	0.532	0.569	0.832
	arc_combined	accuracy	0.323	0.180	0.254	0.272	0.657	0.304	0.379	0.573	0.673	0.497	0.926	0.947
	glue_cola	accuracy	0.463	0.152	0.642	0.062	0.749	0.691	0.691	0.691	0.797	0.788	0.843	0.864
	glue_mnli	accuracy	0.328	0.053	0.347	0.213	0.272	0.315	0.293	0.327	0.455	0.348	0.588	0.803
	glue_mrpc	accuracy	0.652	0.265	0.664	0.654	0.652	0.679	0.674	0.684	0.694	0.676	0.689	0.777
	glue_qqp	accuracy	0.327	0.138	0.337	0.316	0.396	0.345	0.340	0.327	0.708	0.340	0.830	0.841
	glue_sst2	accuracy	0.487	0.407	0.719	0.187	0.682	0.306	0.695	0.115	0.933	0.706	0.933	0.942
	glue_wnli	accuracy	0.437	0.183	0.437	0.366	0.437	0.423	0.437	0.437	0.437	0.437	0.521	0.930
	covid	accuracy	0.207	0.154	0.317	0.169	0.322	0.162	0.212	0.191	0.297	0.243	0.334	0.309
	hellaswag	accuracy	0.371	0.117	0.023	0.112	0.201	0.381	0.264	0.246	0.249	0.393	0.622	0.805
	hellaswag_processed	rouge	0.037	0.056	0.146	0.109	0.143	0.044	0.089	0.038	0.134	0.040	0.140	0.134
	jigsaw	accuracy	0.491	0.490	0.482	0.233	0.520	0.486	0.545	0.475	0.704	0.472	0.735	0.754
	drop	rouge	0.018	0.013	0.034	0.024	0.042	0.010	0.047	0.011	0.066	0.023	0.119	0.393
Math	gsm8k	accuracy	0.083	0.026	0.082	0.039	0.364	0.051	0.160	0.114	0.275	0.133	0.622	0.373

Table 12: Base model performance for every task and base model, before fine-tuning.
