In this paper, we focus on the most critical problem of limited KV-cache storage. We propose a novel approach that enables the use of low-precision block floating point (BFP) formats.
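To give a sense of why KV-cache storage becomes the bottleneck, here is a rough back-of-envelope sketch of cache size versus element width. The model dimensions below (layers, heads, head size, sequence length, batch) are illustrative assumptions, not figures from the paper.

```python
# Rough KV-cache size estimate; the model dimensions are illustrative
# assumptions (a Llama-2-7B-like configuration), not taken from the paper.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem):
    # factor of 2 for the key and value tensors stored per layer
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

fp16 = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                      seq_len=4096, batch=8, bytes_per_elem=2)
low8 = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                      seq_len=4096, batch=8, bytes_per_elem=1)  # ~1 byte/elem for an 8-bit format, ignoring shared-exponent overhead

print(f"FP16 KV cache:   {fp16 / 2**30:.1f} GiB")
print(f"~8-bit KV cache: {low8 / 2**30:.1f} GiB")
```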
If some blocks contain outliers, their overall quantization accuracy will be poor because the smallest elements might be rounded to zero.
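A minimal sketch of this failure mode, assuming a BFP-like format in which all elements of a block share one exponent and keep only a few mantissa bits (the block size and mantissa width are illustrative, not the paper's exact format):

```python
import numpy as np

def bfp_quantize(block, mantissa_bits=4):
    """Shared-exponent (BFP-like) quantization sketch: one exponent per
    block, a few mantissa bits per element. Illustrative only."""
    shared_exp = np.ceil(np.log2(np.max(np.abs(block)) + 1e-30))
    # step size chosen so the block maximum fits in the signed mantissa range
    scale = 2.0 ** shared_exp / 2 ** (mantissa_bits - 1)
    lo, hi = -(2 ** (mantissa_bits - 1)), 2 ** (mantissa_bits - 1) - 1
    mantissas = np.clip(np.round(block / scale), lo, hi)
    return mantissas * scale

block_clean   = np.array([0.011, -0.009, 0.013, 0.008])
block_outlier = np.array([0.011, -0.009, 5.300, 0.008])  # one large outlier

print(bfp_quantize(block_clean))    # small values are preserved
print(bfp_quantize(block_outlier))  # small values round to 0.0
```

Because the outlier forces a large shared exponent, the quantization step becomes coarser than the small elements themselves, and they all collapse to zero.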
This paper proposes a novel approach that enables the use of low-precision BFP formats without compromising the resulting model accuracy by exploiting the common ...
The paper focuses on the problem of effectively quantizing large language models (LLMs) for efficient inference while preserving the accuracy ...
This study delves deeper into the intricacies of inference on heavily quantized models. The potential outcome is the possibility of utilizing ...
Post-training quantization (PTQ) of transformer language models faces significant challenges due to the existence of detrimental outliers in activations.
In this article, we will explore quantization in depth, along with some state-of-the-art quantization methods, and see how to use them.
Each block is then quantized individually to mitigate the effect of outliers and increase precision.
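A minimal numpy sketch of such block-wise quantization, assuming symmetric 8-bit integers with one scale per block (the block size and bit width are illustrative choices):

```python
import numpy as np

def quantize_blockwise(x, block_size=64, n_bits=8):
    """Symmetric integer quantization with one scale per block.
    Assumes len(x) is divisible by block_size. Illustrative sketch."""
    qmax = 2 ** (n_bits - 1) - 1
    x = x.reshape(-1, block_size)                      # split into blocks
    scales = np.abs(x).max(axis=1, keepdims=True) / qmax
    scales = np.maximum(scales, 1e-12)                 # avoid division by zero
    q = np.clip(np.round(x / scales), -qmax - 1, qmax)
    return q.astype(np.int8), scales

def dequantize_blockwise(q, scales):
    return (q.astype(np.float32) * scales).reshape(-1)

x = np.random.randn(4096).astype(np.float32) * 0.02
x[123] = 8.0                                           # inject one outlier
q, s = quantize_blockwise(x, block_size=64)
err = np.abs(dequantize_blockwise(q, s) - x).mean()
print(f"mean abs error with per-block scales: {err:.6f}")
```

With a single per-tensor scale, the injected outlier at 8.0 would force a step size of roughly 8/127 ≈ 0.063, rounding almost every other element to zero; per-block scales confine that damage to the one block containing the outlier.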
Smaller blocks yield higher accuracy; however, the trade-off is that this increases the number of parameters that need to be stored, as now there ...
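A small worked example of that trade-off, assuming the extra stored parameters are one fp16 scale per block alongside 4-bit elements (both widths are assumptions for illustration):

```python
# Effective bits per element = element bits + (scale bits / block size).
# Smaller blocks track outliers more tightly but pay more scale overhead.
elem_bits, scale_bits = 4, 16
for block_size in (256, 128, 64, 32, 16):
    bits_per_elem = elem_bits + scale_bits / block_size
    overhead = scale_bits / block_size / elem_bits * 100
    print(f"block={block_size:4d}  effective bits/element={bits_per_elem:.2f}  "
          f"overhead={overhead:.1f}%")
```

At a block size of 256 the scales add only about 1.6% overhead, while at a block size of 16 they add a full extra bit per element (25% overhead).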
Various works have been proposed to suppress these outliers to improve quantized LLMs. The two most commonly used methods are per-channel scaling (Xiao et al.) ...
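Per-channel scaling in the spirit of SmoothQuant (Xiao et al.) divides each activation channel by a scale and multiplies the corresponding weight rows by the same scale, so the layer output is unchanged while activation outliers are flattened. A minimal sketch under those assumptions (the layer shapes and the migration strength alpha are illustrative):

```python
import numpy as np

def smooth_scales(act_absmax, w_absmax, alpha=0.5):
    """Per-channel scales s_j = max|X_j|^alpha / max|W_j|^(1-alpha).
    Illustrative sketch of SmoothQuant-style outlier migration."""
    return (act_absmax ** alpha) / (w_absmax ** (1 - alpha) + 1e-12)

d_in, d_out = 8, 4
X = np.random.randn(16, d_in) * 0.1
X[:, 3] *= 50.0                       # one activation channel with outliers
W = np.random.randn(d_in, d_out) * 0.05

s = smooth_scales(np.abs(X).max(axis=0), np.abs(W).max(axis=1))
X_s, W_s = X / s, W * s[:, None]      # (X / s) @ (diag(s) W) == X @ W

print(np.allclose(X @ W, X_s @ W_s))   # True: layer output unchanged
print(np.abs(X).max(axis=0).round(2))  # before: outlier channel dominates
print(np.abs(X_s).max(axis=0).round(2))# after: dynamic range is flattened
```

The point of the transformation is that the outliers are partly migrated into the weights, which are typically much easier to quantize than activations.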