aXi: How does the accuracy degrade with ...

Accurate LoRA-Finetuning Quantization of LLMs via Information Retention

February 8, 2024

72% Match

Haotong Qin, Xudong Ma, Xingyu Zheng, Xiaoyang Li, Yang Zhang, Shouda Liu, Jie Luo, ... , Magno Michele

Machine Learning

Computation and Language

The LoRA-finetuning quantization of LLMs has been extensively studied to obtain accurate yet compact LLMs for deployment on resource-constrained hardware. However, existing methods cause the quantized LLM to severely degrade and even fail to benefit from the finetuning of LoRA. This paper proposes a novel IR-QLoRA for pushing quantized LLMs with LoRA to be highly accurate through information retention. The proposed IR-QLoRA mainly relies on two technologies derived from the p...


What Makes Quantization for Large Language Models Hard? An Empirical Study from the Lens of Perturbation

March 11, 2024

70% Match

Zhuocheng Gong, Jiahao Liu, Jingang Wang, Xunliang Cai, ... , Yan Rui

Machine Learning

Artificial Intelligence

Quantization has emerged as a promising technique for improving the memory and computational efficiency of large language models (LLMs). Though the trade-off between performance and efficiency is well-known, there is still much to be learned about the relationship between quantization and LLM performance. To shed light on this relationship, we propose a new perspective on quantization, viewing it as perturbations added to the weights and activations of LLMs. We call this appr...


QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models

October 12, 2023

69% Match

Jing Liu, Ruihao Gong, Xiuying Wei, Zhiwei Dong, ... , Zhuang Bohan

Computation and Language

Artificial Intelligence

Machine Learning

Large Language Models (LLMs) excel in NLP, but their demands hinder their widespread deployment. While Quantization-Aware Training (QAT) offers a solution, its extensive training costs make Post-Training Quantization (PTQ) a more practical approach for LLMs. In existing studies, activation outliers in particular channels are identified as the bottleneck to PTQ accuracy. They propose to transform the magnitudes from activations to weights, which however offers limited alleviat...


EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs

March 5, 2024

69% Match

Hanlin Tang, Yifu Sun, Decheng Wu, Kai Liu, ... , Kang Zhanhui

Artificial Intelligence

Machine Learning

Large language models (LLMs) have proven far superior to conventional methods in various tasks. However, their expensive computations and high memory requirements are prohibitive for deployment. Model quantization is an effective method for reducing this overhead. The problem is that in most previous works, the quantized model was calibrated using a few samples from the training data, which might affect the generalization of the quantized LLMs to unknown cases and tasks....


How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study

April 22, 2024

69% Match

Wei Huang, Xudong Ma, Haotong Qin, Xingyu Zheng, Chengtao Lv, Hong Chen, Jie Luo, Xiaojuan Qi, ... , Magno Michele

Machine Learning

Meta's LLaMA family has become one of the most powerful open-source Large Language Model (LLM) series. Notably, LLaMA3 models have recently been released and achieve impressive performance across various tasks with super-large-scale pre-training on over 15T tokens of data. Given the wide application of low-bit quantization for LLMs in resource-limited scenarios, we explore LLaMA3's capabilities when quantized to low bit-width. This exploration holds the potential to unveil new insi...


Revisiting Block-based Quantisation: What is Important for Sub-8-bit LLM Inference?

October 8, 2023

69% Match

Cheng Zhang, Jianyi Cheng, Ilia Shumailov, ... , Zhao Yiren

Machine Learning

The inference of large language models (LLMs) requires immense computation and memory resources. To curtail these costs, quantisation has emerged as a promising solution, but existing LLM quantisation mainly focuses on 8-bit. In this work, we explore the statistical and learning properties of the LLM layer and attribute the bottleneck of LLM quantisation to numerical scaling offsets. To address this, we adapt block quantisations for LLMs, a family of methods that share scaling...


Accurate Block Quantization in LLMs with Outliers

March 29, 2024

68% Match

Nikita Trukhanov, Ilya Soloveychik

Artificial Intelligence

Hardware Architecture

Numerical Analysis

The demand for inference on extremely large-scale LLMs has seen enormous growth in recent months. This has made evident the colossal shortage of dedicated hardware capable of efficient and fast processing of the involved compute and memory movement. The problem is aggravated by the explosive rise in the lengths of the sequences being processed, since those require efficient on-chip storage of the KV-cache of size proportional to the sequence length. To make the required comput...


LCQ: Low-Rank Codebook based Quantization for Large Language Models

May 31, 2024

68% Match

Wen-Pu Cai, Wu-Jun Li

Machine Learning

Computation and Language

Large language models (LLMs) have recently demonstrated promising performance in many tasks. However, the high storage and computational cost of LLMs has become a challenge for deploying LLMs. Weight quantization has been widely used for model compression, which can reduce both storage and computational cost. Most existing weight quantization methods for LLMs use a rank-one codebook for quantization, which results in substantial accuracy loss when the compression ratio is hig...


LLM-QBench: A Benchmark Towards the Best Practice for Post-training Quantization of Large Language Models

May 9, 2024

68% Match

Ruihao Gong, Yang Yong, Shiqiao Gu, Yushi Huang, Yunchen Zhang, ... , Tao Dacheng

Machine Learning

Artificial Intelligence

Computation and Language

Recent advancements in large language models (LLMs) are propelling us toward artificial general intelligence, thanks to their remarkable emergent abilities and reasoning capabilities. However, the substantial computational and memory requirements of LLMs limit their widespread adoption. Quantization, a key compression technique, offers a viable solution to mitigate these demands by compressing and accelerating LLMs, albeit with potential risks to model accuracy. Numerous ...


A Comprehensive Evaluation of Quantization Strategies for Large Language Models

February 26, 2024

68% Match

Renren Jin, Jiangcun Du, Wuwei Huang, Wei Liu, Jian Luan, ... , Xiong Deyi

Computation and Language

Artificial Intelligence

Increasing the number of parameters in large language models (LLMs) usually improves performance in downstream tasks but raises compute and memory costs, making deployment difficult in resource-limited settings. Quantization techniques, which reduce the bits needed for model weights or activations with minimal performance loss, have become popular due to the rise of LLMs. However, most quantization studies use pre-trained LLMs, and the impact of quantization on instruction-tu...
