LLM Fine Tuning

Parameter Efficient Fine-Tuning (PEFT)

Full fine-tuning of LLMs is challenging: besides the trainable model weights, training must also store gradients, optimizer states, activations, and other temporary variables, which requires a lot of memory.

PEFT methods only update a small number of model parameters.

Examples of PEFT techniques:
• Freeze most model weights, and fine-tune only specific layer parameters (a minimal sketch follows this list).
• Keep existing parameters untouched; add only a few new parameters or layers, and train just these additions.
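The first technique can be sketched in a few lines of PyTorch. The base model and layer names below (GPT-2, its last transformer block, and its final layer norm) are illustrative assumptions:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # illustrative base model

# Freeze every original parameter.
for param in model.parameters():
    param.requires_grad = False

# Unfreeze only a chosen subset, here the last transformer block and the final
# layer norm (parameter names are specific to the GPT-2 implementation).
for name, param in model.named_parameters():
    if name.startswith("transformer.h.11.") or name.startswith("transformer.ln_f."):
        param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} of {total:,} ({100 * trainable / total:.1f}%)")
```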
The trained parameters can account for only 15-20% of the original LLM weights.

Main benefits:
• Decreased memory usage, often requiring just 1 GPU.
• Mitigated risk of catastrophic forgetting.
• Storage limited to only the new PEFT weights.

Multiple methods exist, with trade-offs between parameter and memory efficiency, training speed, model quality, and inference cost.

Three classes of PEFT methods from the literature:
• Selective: fine-tune only specific parts of the original LLM.
• Reparameterization: use low-rank representations to reduce the number of trainable parameters (e.g., LoRA).
• Additive: augment the pre-trained model with new parameters or layers, training only the additions (e.g., adapters, soft prompts).

LoRA

LoRA is a reparameterization method that reduces the number of trainable parameters during fine-tuning by freezing all original model parameters and injecting a pair of rank decomposition matrices alongside the original weights:

h = W0·x + B·A·x

1 - Keep the majority of the original LLM weights (W0) frozen.
2 - Introduce a pair of rank decomposition matrices A and B with a small rank r.
3 - Train the new matrices A and B.

Model weights update for inference:
1 - Matrix multiplication: B × A
2 - Add the product to the original weights: W0 + B × A
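The mechanics can be made concrete with a minimal from-scratch sketch of a LoRA-wrapped linear layer in PyTorch; the dimensions, rank, and alpha scaling below are illustrative assumptions:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer W0 and adds a trainable low-rank update B·A."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():       # 1 - keep the original weights frozen
            p.requires_grad = False

        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)   # 2 - rank decomposition
        self.B = nn.Parameter(torch.zeros(d_out, rank))         #     matrices A and B
        self.scale = alpha / rank

    def forward(self, x):
        # h = W0·x + B·A·x (only A and B receive gradients)
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    def merged_weight(self):
        # For inference, the update can be merged back: W = W0 + B·A
        return self.base.weight + self.scale * (self.B @ self.A)

layer = LoRALinear(nn.Linear(512, 512), rank=8)
out = layer(torch.randn(4, 512))               # 3 - train A and B as usual
```

Because B starts at zero, the wrapped layer initially behaves exactly like the frozen original, and the low-rank update is learned gradually during fine-tuning.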
Additional notes:
• No impact on inference latency.
• Fine-tuning only the self-attention layers with LoRA is often enough to enhance performance for a given task.
• Weights can be switched out as needed, allowing for training on many different tasks.
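In practice LoRA is usually applied through a library. A sketch with the Hugging Face peft package could look like the following; the base model, target modules, and hyperparameters are assumptions to adapt to your own setup:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # illustrative base model

lora_config = LoraConfig(
    r=8,                            # rank of the decomposition matrices
    lora_alpha=16,                  # scaling factor
    target_modules=["c_attn"],      # GPT-2's fused attention projection; model-specific
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the base model
```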
Rank choice for the LoRA matrices:
• Trade-off: a smaller rank reduces the number of trainable parameters and accelerates training, but risks lower adaptation quality due to reduced task-specific information capture.
• From the literature, a rank between 4 and 32 appears to be a good trade-off.
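A back-of-the-envelope comparison makes the trade-off tangible; the 4096 × 4096 weight matrix assumed here is typical of the attention projections in 7B-class models:

```python
d_out, d_in = 4096, 4096              # assumed dimensions of one weight matrix
full = d_out * d_in                   # parameters updated by full fine-tuning

for r in (4, 8, 32):
    lora = r * (d_in + d_out)         # parameters in A (r x d_in) and B (d_out x r)
    print(f"rank {r:>2}: {lora:>9,} trainable params "
          f"({100 * lora / full:.2f}% of the {full:,} in the full matrix)")
```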
LoRA can be combined with quantization of the base model (QLoRA).
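A common way to set QLoRA up, sketched with the transformers, bitsandbytes, and peft libraries; the model name, quantization settings, and LoRA hyperparameters are assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the frozen base model in 4-bit precision.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",        # illustrative model choice
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach full-precision LoRA adapters on top of the quantized weights.
base_model = prepare_model_for_kbit_training(base_model)
model = get_peft_model(
    base_model,
    LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
               task_type="CAUSAL_LM"),
)
```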
SOFT PROMPTS

Unlike prompt engineering, whose limits are:
• the manual effort it requires,
• the length of the context window,

prompt tuning adds trainable tensors to the model input embeddings, commonly known as "soft prompts," optimized directly through gradient descent. The tunable soft prompt (typically 20-100 tokens) is prepended to the embedded input text before it is passed to the pre-trained LLM.

Soft prompt vectors:
• Are equal in length to the embedding vectors of the input language tokens.
• Can be seen as virtual tokens which can take any value within the multidimensional embedding space.

In prompt tuning, the LLM weights are frozen:
• Over time, the embedding vectors of the soft prompt are adjusted to optimize the model's completion of the prompt.
• Only a few parameters are updated.
• A different set of soft prompts can be trained for each task and easily swapped out during inference (occupying very little space on disk).
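A minimal sketch of this in PyTorch, assuming a Hugging Face causal LM that accepts precomputed input embeddings; the model choice, prompt length, and hyperparameters are illustrative:

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Freeze every LLM weight; only the soft prompt below will be trained.
for p in model.parameters():
    p.requires_grad = False

num_virtual_tokens = 20                        # typically 20-100 virtual tokens
embed_dim = model.get_input_embeddings().weight.shape[1]
soft_prompt = nn.Parameter(torch.randn(num_virtual_tokens, embed_dim) * 0.01)

def forward_with_soft_prompt(input_ids, labels):
    token_embeds = model.get_input_embeddings()(input_ids)       # (batch, seq, dim)
    prompt = soft_prompt.unsqueeze(0).expand(input_ids.size(0), -1, -1)
    inputs_embeds = torch.cat([prompt, token_embeds], dim=1)     # prepend virtual tokens
    # Mask the virtual-token positions out of the loss with the ignore index -100.
    prompt_labels = torch.full((input_ids.size(0), num_virtual_tokens), -100)
    return model(inputs_embeds=inputs_embeds,
                 labels=torch.cat([prompt_labels, labels], dim=1)).loss

optimizer = torch.optim.AdamW([soft_prompt], lr=1e-3)   # only the soft prompt is trained
batch = tokenizer("Translate to French: cheese", return_tensors="pt")
loss = forward_with_soft_prompt(batch["input_ids"], batch["input_ids"])
loss.backward()                                # gradients flow only into soft_prompt
optimizer.step()
```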
From the literature, at around 10B parameters prompt tuning performs as well as full fine-tuning.

! Interpreting virtual tokens can pose challenges (the nearest-neighbor tokens to the soft prompt location can be used).
