QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models

Liu, Jing; Gong, Ruihao; Wei, Xiuying; Dong, Zhiwei; Cai, Jianfei; Zhuang, Bohan

arXiv:2310.08041 (cs)

[Submitted on 12 Oct 2023 (v1), last revised 6 Apr 2024 (this version, v3)]

Abstract: Large Language Models (LLMs) excel in NLP, but their compute and memory demands hinder widespread deployment. While Quantization-Aware Training (QAT) offers a solution, its extensive training costs make Post-Training Quantization (PTQ) a more practical approach for LLMs. Existing studies identify activation outliers in particular channels as the bottleneck to PTQ accuracy. They propose to transfer the magnitudes from activations to weights, which, however, offers limited alleviation or suffers from unstable gradients, resulting in a severe performance drop at low bitwidths. In this paper, we propose QLLM, an accurate and efficient low-bitwidth PTQ method designed for LLMs. QLLM introduces an adaptive channel reassembly technique that reallocates the magnitude of outliers to other channels, thereby mitigating their impact on the quantization range. This is achieved in two steps: channel disassembly first breaks down the outlier channels into several sub-channels to ensure a more balanced distribution of activation magnitudes, and channel assembly then merges similar channels to maintain the original channel number for efficiency. Additionally, an adaptive strategy is designed to autonomously determine the optimal number of sub-channels for channel disassembly. To further compensate for the performance loss caused by quantization, we propose an efficient tuning method that learns only a small number of low-rank weights while freezing the pre-trained quantized model. After training, these low-rank parameters can be fused into the frozen weights without affecting inference. Extensive experiments on LLaMA-1 and LLaMA-2 show that QLLM can obtain accurate quantized models efficiently. For example, QLLM quantizes the 4-bit LLaMA-2-70B within 10 hours on a single A100-80G GPU, outperforming the previous state-of-the-art method by 7.89% in average accuracy across five zero-shot tasks.
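
Illustrative sketch (not the authors' implementation): the NumPy snippet below shows, under simplifying assumptions, why splitting an outlier channel can shrink the activation range without changing a linear layer's output, and why a low-rank update can be folded into a frozen weight. The function names, shapes, and the fixed sub-channel count are hypothetical; the adaptive choice of sub-channel count, channel assembly, and the quantization step itself are not modeled.

```python
import numpy as np

def disassemble_channel(x, w, channel, num_sub):
    """Split one input channel into `num_sub` equal sub-channels.

    x: activations, shape (tokens, in_features)
    w: weights, shape (in_features, out_features), so that y = x @ w
    """
    # Each sub-channel carries 1/num_sub of the original activation.
    sub_x = np.repeat(x[:, [channel]] / num_sub, num_sub, axis=1)
    # The matching weight row is copied, so the sum of products is unchanged.
    sub_w = np.repeat(w[[channel], :], num_sub, axis=0)
    x_new = np.concatenate([np.delete(x, channel, axis=1), sub_x], axis=1)
    w_new = np.concatenate([np.delete(w, channel, axis=0), sub_w], axis=0)
    return x_new, w_new

def fuse_low_rank(w, a, b):
    """Fold a low-rank update a @ b into the weight matrix w."""
    return w + a @ b

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
x[:, 3] = 40.0                      # synthetic outlier channel (assumption)
w = rng.normal(size=(8, 5))

# Disassembly: 8 channels become 7 + 4 = 11, and the outlier drops to 10.0.
x_new, w_new = disassemble_channel(x, w, channel=3, num_sub=4)
same_output = np.allclose(x @ w, x_new @ w_new)
range_before = np.abs(x).max()
range_after = np.abs(x_new).max()

# Low-rank fusion: a rank-2 update adds no extra matmul once it is merged.
a = rng.normal(size=(8, 2))
b = rng.normal(size=(2, 5))
fused_matches = np.allclose(x @ fuse_low_rank(w, a, b), x @ w + (x @ a) @ b)
```

Disassembly alone increases the channel count (8 to 11 in this toy case), which is why the abstract pairs it with a merging step that restores the original number of channels.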

Comments:	ICLR 2024 camera ready; Code is available at this https URL and this https URL
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2310.08041 [cs.CL]
	(or arXiv:2310.08041v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2310.08041

Submission history

From: Bohan Zhuang
[v1] Thu, 12 Oct 2023 05:25:49 UTC (1,424 KB)
[v2] Wed, 21 Feb 2024 06:40:49 UTC (1,656 KB)
[v3] Sat, 6 Apr 2024 10:22:57 UTC (1,659 KB)
