LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning

Guo, Han; Greengard, Philip; Xing, Eric P.; Kim, Yoon

arXiv:2311.12023 (cs)

[Submitted on 20 Nov 2023 (v1), last revised 27 Aug 2024 (this version, v4)]

Abstract: We propose a simple approach for memory-efficient adaptation of pretrained language models. Our approach uses an iterative algorithm to decompose each pretrained matrix into a high-precision low-rank component and a memory-efficient quantized component. During finetuning, the quantized component remains fixed and only the low-rank component is updated. We present an integer linear programming formulation of the quantization component which enables dynamic configuration of quantization parameters (e.g., bit-width, block size) for each matrix given an overall target memory budget. We further explore a data-aware version of the algorithm which uses an approximation of the Fisher information matrix to weight the reconstruction objective during matrix decomposition. Experiments on finetuning RoBERTa and LLaMA-2 (7B and 70B) demonstrate that our low-rank plus quantized matrix decomposition approach (LQ-LoRA) outperforms strong QLoRA and GPTQ-LoRA baselines and enables aggressive quantization to sub-3 bits with only minor performance degradations. When finetuned on a language modeling calibration dataset, LQ-LoRA can also be used for model compression; in this setting our 2.75-bit LLaMA-2-70B model (which has 2.85 bits on average when including the low-rank components and requires 27GB of GPU memory) performs respectably compared to the 16-bit baseline.
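
The decomposition described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation. The abstract states only that the algorithm is iterative, so the alternating update used here (fit the low-rank part by truncated SVD, then quantize the residual) is an assumption. The quantizer `quantize_uniform` is a simple round-to-nearest stand-in, and the names `low_rank_approx` and `lq_decompose` are hypothetical.

```python
import numpy as np


def quantize_uniform(x: np.ndarray, bits: int = 3) -> np.ndarray:
    """Stand-in quantizer (assumption): symmetric round-to-nearest on a
    uniform grid with one scale per matrix. Returns dequantized values."""
    levels = 2 ** (bits - 1) - 1              # 3 bits -> integer grid {-3, ..., 3}
    scale = np.abs(x).max() / levels
    if scale == 0.0:
        return np.zeros_like(x)
    return np.clip(np.round(x / scale), -levels, levels) * scale


def low_rank_approx(x: np.ndarray, rank: int):
    """Best rank-`rank` approximation of x via truncated SVD, as two factors."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    root = np.sqrt(s[:rank])
    l1 = u[:, :rank] * root                   # shape (rows, rank)
    l2 = root[:, None] * vt[:rank]            # shape (rank, cols)
    return l1, l2


def lq_decompose(w: np.ndarray, rank: int = 4, bits: int = 3, steps: int = 10):
    """Alternate between a low-rank fit and quantization of the residual so
    that w is approximated by q + l1 @ l2. Keeps the best iterate seen,
    because the alternation is not guaranteed to decrease the error."""
    q = np.zeros_like(w)
    best = None
    for _ in range(steps):
        l1, l2 = low_rank_approx(w - q, rank)       # high-precision low-rank part
        q = quantize_uniform(w - l1 @ l2, bits)     # quantized part of the residual
        err = np.linalg.norm(w - (q + l1 @ l2))     # Frobenius reconstruction error
        if best is None or err < best[0]:
            best = (err, q, l1, l2)
    return best


# Toy matrix with strong rank-4 structure plus small noise (synthetic data).
rng = np.random.default_rng(0)
w = rng.standard_normal((32, 4)) @ rng.standard_normal((4, 32))
w += 0.01 * rng.standard_normal((32, 32))

err_lq, q, l1, l2 = lq_decompose(w, rank=4, bits=3)
err_plain = np.linalg.norm(w - quantize_uniform(w, bits=3))
```

In this toy setting, `q` would stay fixed during finetuning while `l1` and `l2` play the role of the trainable low-rank factors. Comparing `err_lq` with `err_plain` shows how much reconstruction error the low-rank component absorbs on a matrix of this kind.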

Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2311.12023 [cs.CL]
	(or arXiv:2311.12023v4 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2311.12023

Submission history

From: Han Guo
[v1] Mon, 20 Nov 2023 18:57:41 UTC (415 KB)
[v2] Wed, 17 Jan 2024 17:01:57 UTC (417 KB)
[v3] Sun, 30 Jun 2024 22:43:35 UTC (454 KB)
[v4] Tue, 27 Aug 2024 00:48:35 UTC (454 KB)