How To Fine-Tune LLMs in 2024 With Hugging Face
go back to the drawing board. I want to mention that not all use cases require fine-
tuning and it is always recommended to evaluate and try out already fine-tuned
models or API-based models before fine-tuning your own model.
As an example, we are going to use the following use case:
“We want to fine-tune a model, which can generate SQL queries based on a
natural language instruction, which can then be integrated into our BI tool. The
goal is to reduce the time it takes to create a SQL query and make it easier for
non-technical users to create SQL queries.”
Text to SQL can be a good use case for fine-tuning LLMs, as it is a complex task that
requires a lot of (internal) knowledge about the data and the SQL language.
If you are using a GPU with the Ampere architecture (e.g. NVIDIA A10G or RTX
4090/3090) or newer, you can use Flash Attention. Flash Attention is a method that
reorders the attention computation and leverages classical techniques (tiling,
recomputation) to significantly speed it up and reduce memory usage from quadratic
to linear in sequence length. The TL;DR: it accelerates training up to 3x. Learn more at
FlashAttention.
Note: If your machine has less than 96GB of RAM and lots of CPU cores, reduce the
number of `MAX_JOBS`. On the `g5.2xlarge` we used `4`.
import torch; assert torch.cuda.get_device_capability()[0] >= 8, 'Hardware not supported for Flash Attention'
# install flash-attn
!pip install ninja packaging
!MAX_JOBS=4 pip install flash-attn --no-build-isolation
Installing flash attention can take quite a bit of time (10-45 minutes).
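Once the build has finished, a quick import confirms the wheel is usable; a minimal sketch (the package installs under the name `flash_attn`):

# Sanity check that flash-attn was built and can be imported
import flash_attn
print(flash_attn.__version__)

Flash Attention is then enabled when loading the model later by passing `attn_implementation="flash_attention_2"` to `from_pretrained`.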
We will use the Hugging Face Hub as a remote model versioning service. This means
we will automatically push our model, logs and information to the Hub during training.
You must register on Hugging Face for this. After you have an account, we will
use the `login` util from the `huggingface_hub` package to log into our account and
store our token (access key) on the disk.
from huggingface_hub import login

login(
    token="",  # ADD YOUR TOKEN HERE
    add_to_git_credential=True
)
Each of the methods has its own advantages and disadvantages and depends on the
budget, time, and quality requirements. For example, using an existing dataset is the
easiest but might not be tailored to your specific use case, while using humans might
be the most accurate but can be time-consuming and expensive. It is also possible to
combine several methods to create an instruction dataset, as shown in Orca:
Progressive Learning from Complex Explanation Traces of GPT-4.
In our example we will use an already existing dataset called sql-create-context,
which contains samples of natural language instructions, schema definitions and the
corresponding SQL query.
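Each raw sample already bundles the three pieces we need. A minimal sketch for peeking at one record (the column names match those used in the conversion code below; the hub id is the public `b-mc2/sql-create-context` dataset):

from datasets import load_dataset

# Load the raw dataset and inspect a single sample
ds = load_dataset("b-mc2/sql-create-context", split="train")
print(ds[0]["question"])  # natural language instruction
print(ds[0]["context"])   # CREATE TABLE statement(s) describing the schema
print(ds[0]["answer"])    # reference SQL query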
With the latest release of `trl` we now support popular instruction and conversation
dataset formats. This means we only need to convert our dataset to one of the
supported formats and `trl` will take care of the rest. Those formats include:
conversational format
{"messages": [{"role": "system", "content": "You are..."}, {"role": "user", "content": "..."}, {"
{"messages": [{"role": "system", "content": "You are..."}, {"role": "user", "content": "..."}, {"
{"messages": [{"role": "system", "content": "You are..."}, {"role": "user", "content": "..."}, {"
instruction format
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
In our example we are going to load our open-source dataset using the 🤗 Datasets
library and then convert it into the conversational format, where we include the
schema definition in the system message for our assistant. We'll then save the
dataset as a jsonl file, which we can then use to fine-tune our model. We randomly
downsample the dataset to 10,000 training samples and keep 2,500 samples for testing.
Note: This step can be different for your use case. For example, if you already have a
dataset from, e.g., working with OpenAI, you can skip this step and go directly to the
fine-tuning step.
from datasets import load_dataset

# System message with a placeholder for the per-sample schema (the exact wording is illustrative)
system_message = """You are a text-to-SQL translator. Users will ask you questions in English and you will generate a SQL query based on the provided SCHEMA.
SCHEMA:
{schema}"""

def create_conversation(sample):
    return {
        "messages": [
            {"role": "system", "content": system_message.format(schema=sample["context"])},
            {"role": "user", "content": sample["question"]},
            {"role": "assistant", "content": sample["answer"]}
        ]
    }

# load the dataset from the hub, downsample and convert to the conversational format
dataset = load_dataset("b-mc2/sql-create-context", split="train").shuffle().select(range(12500))
dataset = dataset.map(create_conversation, remove_columns=list(dataset.features), batched=False)
# split into 10,000 training and 2,500 test samples
dataset = dataset.train_test_split(test_size=2500/12500)

print(dataset["train"][345]["messages"])

# save the splits to disk as jsonl files
dataset["train"].to_json("train_dataset.json", orient="records")
dataset["test"].to_json("test_dataset.json", orient="records")
Next, we will load our LLM. For our use case we are going to use CodeLlama 7B.
CodeLlama is a Llama model trained for general code synthesis and understanding.
But we can easily swap out the model for another model, e.g. Mistral or Mixtral
models, TII Falcon, or any other LLMs by changing our `model_id` variable. We will use
bitsandbytes to quantize our model to 4-bit.
Note: Be aware that the bigger the model, the more memory it will require. In our example
we will use the 7B version, which can be tuned on 24GB GPUs. If you have a smaller
GPU, consider a smaller model or reduce the memory footprint further, e.g. with a smaller
batch size or sequence length.
Correctly preparing the model and tokenizer for training chat/conversational models
is crucial. We need to add new special tokens to the tokenizer and model to teach
them the different roles in a conversation. In `trl` we have a convenient method,
setup_chat_format, which:
Adds special tokens to the tokenizer, e.g. `<|im_start|>` and `<|im_end|>`, to
indicate the start and end of a conversation.
Resizes the model’s embedding layer to accommodate the new tokens.
Sets the `chat_template` of the tokenizer, which is used to format the input data
into a chat-like format. The default is `chatml` from OpenAI.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from trl import setup_chat_format

# Hugging Face model id
model_id = "codellama/CodeLlama-7b-hf"

# BitsAndBytes config for 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
)

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16, quantization_config=bnb_config
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = "right"  # to prevent warnings

# set chat template to OAI chatML, remove if you start from a fine-tuned model
model, tokenizer = setup_chat_format(model, tokenizer)
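To sanity-check the result, you can render a toy conversation with the newly configured template; a minimal sketch with a made-up message:

# Render a dummy conversation with the ChatML template set by setup_chat_format
example = [{"role": "user", "content": "How many singers do we have?"}]
print(tokenizer.apply_chat_template(example, tokenize=False, add_generation_prompt=True))
# Expected shape of the output:
# <|im_start|>user
# How many singers do we have?<|im_end|>
# <|im_start|>assistant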
The `SFTTrainer` supports a native integration with `peft`, which makes it super easy
to efficiently tune LLMs using, e.g. QLoRA. We only need to create
our `LoraConfig` and provide it to the trainer. Our `LoraConfig` parameters are defined
based on the QLoRA paper and Sebastian Raschka's blog post.
from peft import LoraConfig
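# LoRA configuration; the concrete values below are illustrative choices in the spirit of
# the QLoRA paper, adjust rank/alpha to your memory budget and quality requirements
peft_config = LoraConfig(
    lora_alpha=128,
    lora_dropout=0.05,
    r=256,
    bias="none",
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)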
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="code-llama-7b-text-to-sql", # directory to save and repository id
    num_train_epochs=3,                     # number of training epochs
    per_device_train_batch_size=3,          # batch size per device during training
    gradient_accumulation_steps=2,          # number of steps before performing a backward/update pass
    gradient_checkpointing=True,            # use gradient checkpointing to save memory
    optim="adamw_torch_fused",              # use fused adamw optimizer
    logging_steps=10,                       # log every 10 steps
    save_strategy="epoch",                  # save checkpoint every epoch
    learning_rate=2e-4,                     # learning rate, based on QLoRA paper
    bf16=True,                              # use bfloat16 precision
    tf32=True,                              # use tf32 precision
    max_grad_norm=0.3,                      # max gradient norm based on QLoRA paper
    warmup_ratio=0.03,                      # warmup ratio based on QLoRA paper
    lr_scheduler_type="constant",           # use constant learning rate scheduler
    push_to_hub=True,                       # push model to hub
    report_to="tensorboard",                # report metrics to tensorboard
)
We now have every building block to create our `SFTTrainer` and start training our
model.
from trl import SFTTrainer

max_seq_length = 3072 # max sequence length for model and packing of the dataset

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],   # training split created above
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    packing=True,
    dataset_kwargs={
        "add_special_tokens": False,  # we template with special tokens
        "append_concat_token": False, # no need to add additional separator token
    }
)
We can start training our model by calling the `train()` method on our `Trainer`
instance. This will start the training loop and train our model for 3 epochs. Since we
are using a PEFT method, we will only save the adapted model weights and not the
full model.
# start training, the model will be automatically saved to the hub and the output directory
trainer.train()
# save model
trainer.save_model()
The training with Flash Attention for 3 epochs with a dataset of 10k samples took
01:29:58 on a `g5.2xlarge`. The instance costs `1.212$/h`, which brings us to a total
cost of only `1.8$`.
# free the memory again
del model
del trainer
torch.cuda.empty_cache()
peft_model_id = "./code-llama-7b-text-to-sql"
# peft_model_id = args.output_dir
https://www.philschmid.de/fine-tune-llms-in-2024-with-trl 9/13
6/24/24, 6:14 PM How to Fine-Tune LLMs in 2024 with Hugging Face
model = AutoPeftModelForCausalLM.from_pretrained(
peft_model_id,
device_map="auto",
torch_dtype=torch.float16
)
tokenizer = AutoTokenizer.from_pretrained(peft_model_id)
# load into pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
from datasets import load_dataset
from random import randint

# Load our test dataset and pick a random sample
eval_dataset = load_dataset("json", data_files="test_dataset.json", split="train")
rand_idx = randint(0, len(eval_dataset) - 1)

# Test on sample
prompt = pipe.tokenizer.apply_chat_template(eval_dataset[rand_idx]["messages"][:2], tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=False, temperature=0.1, top_k=50, top_p=0.1, eos_token_id=pipe.tokenizer.eos_token_id, pad_token_id=pipe.tokenizer.pad_token_id)
print(f"Query:\n{eval_dataset[rand_idx]['messages'][1]['content']}")
print(f"Original Answer:\n{eval_dataset[rand_idx]['messages'][2]['content']}")
print(f"Generated Answer:\n{outputs[0]['generated_text'][len(prompt):].strip()}")
Nice! Our model was able to generate a SQL query based on the natural language
instruction. Let's evaluate our model on the full 2,500 samples of our test dataset.
Note: As mentioned above, evaluating generative models is not a trivial task. In our
example we used the accuracy of the generated SQL based on the ground truth SQL
query as our metric. An alternative way could be to automatically execute the
generated SQL query and compare the results with the ground truth. This would be a
more accurate metric but requires more work to set up.
from tqdm import tqdm

def evaluate(sample):
    prompt = pipe.tokenizer.apply_chat_template(sample["messages"][:2], tokenize=False, add_generation_prompt=True)
    outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.9, eos_token_id=pipe.tokenizer.eos_token_id, pad_token_id=pipe.tokenizer.pad_token_id)  # top_p=0.9 is an assumed value
    predicted_answer = outputs[0]['generated_text'][len(prompt):].strip()
    if predicted_answer == sample["messages"][2]["content"]:
        return 1
    else:
        return 0
success_rate = []
number_of_eval_samples = 1000
# iterate over eval dataset and predict
for s in tqdm(eval_dataset.shuffle().select(range(number_of_eval_samples))):
    success_rate.append(evaluate(s))

# compute accuracy
accuracy = sum(success_rate) / len(success_rate)
print(f"Accuracy: {accuracy*100:.2f}%")
We evaluated our model on 1,000 samples from the evaluation dataset and got an
accuracy of `79.50%`, which took ~25 minutes.
This is quite good, but as mentioned you need to take this metric with a grain of salt.
It would be better if we could evaluate our model by running the queries against a real
database and comparing the results, since there might be different "correct" SQL
queries for the same instruction. There are also several ways we could improve the
performance, e.g. with few-shot learning, RAG, or self-healing approaches to
generate the SQL query.
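As a starting point for such an execution-based check, the schema strings in this dataset are typically plain `CREATE TABLE` statements, so they can be loaded into an in-memory SQLite database. Below is a minimal sketch, assuming the queries are valid SQLite; note that without inserting sample rows most result sets will be empty, so in practice you would also want to populate the tables with test data:

import sqlite3

def run_query(schema_sql: str, query: str):
    """Execute a query against a fresh in-memory database built from the schema."""
    con = sqlite3.connect(":memory:")
    try:
        con.executescript(schema_sql)  # create the (empty) tables
        return con.execute(query).fetchall()
    finally:
        con.close()

def queries_agree(schema_sql: str, predicted_sql: str, reference_sql: str) -> bool:
    """Compare result sets instead of raw strings; invalid SQL counts as a miss."""
    try:
        return run_query(schema_sql, predicted_sql) == run_query(schema_sql, reference_sql)
    except sqlite3.Error:
        return False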
%%bash
# model=$PWD/{args.output_dir} # path to model
model=$(pwd)/code-llama-7b-text-to-sql # path to model
num_shard=1             # number of shards
max_input_length=1024   # max input length
max_total_tokens=2048   # max total tokens

# start a TGI (Text Generation Inference) container serving the fine-tuned model on port 8080
docker run -d --name tgi --gpus all -p 8080:80 -v $model:/workspace \
  -e MODEL_ID=/workspace -e NUM_SHARD=$num_shard \
  -e MAX_INPUT_LENGTH=$max_input_length -e MAX_TOTAL_TOKENS=$max_total_tokens \
  ghcr.io/huggingface/text-generation-inference:latest
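With the container running, we can send a request to the local TGI endpoint and reproduce the latency measurements below. The following is a minimal sketch, assuming the container from above exposes TGI's `/generate` route on port 8080, the `test_dataset.json` file created earlier is available, and the generation parameters are illustrative:

import requests
from random import randint
from datasets import load_dataset
from transformers import AutoTokenizer

# Load the tokenizer and test dataset again to build the prompt
tokenizer = AutoTokenizer.from_pretrained("code-llama-7b-text-to-sql")
eval_dataset = load_dataset("json", data_files="test_dataset.json", split="train")
rand_idx = randint(0, len(eval_dataset) - 1)

prompt = tokenizer.apply_chat_template(eval_dataset[rand_idx]["messages"][:2], tokenize=False, add_generation_prompt=True)
payload = {"inputs": prompt, "parameters": {"temperature": 0.2, "top_p": 0.95, "max_new_tokens": 256}}

# send the request to the local inference server
resp = requests.post("http://127.0.0.1:8080/generate", json=payload)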
output = resp.json()["generated_text"].strip()
time_per_token = resp.headers.get("x-time-per-token")
time_prompt_tokens = resp.headers.get("x-prompt-tokens")
# Print results
print(f"Query:\n{eval_dataset[rand_idx]['messages'][1]['content']}")
print(f"Original Answer:\n{eval_dataset[rand_idx]['messages'][2]['content']}")
print(f"Generated Answer:\n{output}")
print(f"Latency per token: {time_per_token}ms")
print(f"Latency prompt encoding: {time_prompt_tokens}ms")
Awesome! Don't forget to stop your container once you are done.
!docker stop tgi
Conclusion
The availability of Large Language Models and tools like TRL makes it an ideal time for
companies to invest in open LLM technology. Fine-tuning open LLMs for specific
tasks can significantly enhance efficiency and open new opportunities for innovation
and improved services. With the increasing accessibility and cost-effectiveness, there
has never been a better time to start using open LLMs.
Thanks for reading! If you have any questions, feel free to contact me on Twitter or
LinkedIn.