aXi: How does the accuracy degrade with ...

Evaluating the Generalization Ability of Quantized LLMs: Benchmark, Analysis, and Toolbox

June 15, 2024

68% Match

Yijun Liu, Yuan Meng, Fang Wu, Shenhao Peng, Hang Yao, Chaoyu Guan, Chen Tang, Xinzhu Ma, ... , Wenwu Zhu

Machine Learning

Artificial Intelligence

Computation and Language

Large language models (LLMs) have exhibited exciting progress in multiple scenarios, but their huge computational demands hinder deployment in many real-world applications. Quantization is an effective means of reducing memory footprint and inference cost, yet it suffers from performance degradation at low bit-widths. Understanding the impact of quantization on LLM capabilities, especially the generalization ability, is crucial. However, the community's main focu...


LLM-QAT: Data-Free Quantization Aware Training for Large Language Models

May 29, 2023

67% Match

Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, ... , Vikas Chandra

Computation and Language

Several post-training quantization methods have been applied to large language models (LLMs), and have been shown to perform well down to 8-bits. We find that these methods break down at lower bit precision, and investigate quantization aware training for LLMs (LLM-QAT) to push quantization levels even further. We propose a data-free distillation method that leverages generations produced by the pre-trained model, which better preserves the original output distribution and al...


Integer Scale: A Free Lunch for Faster Fine-grained Quantization of LLMs

May 23, 2024

67% Match

Qingyuan Li, Ran Meng, Yiduo Li, Bo Zhang, Yifan Lu, Yerui Sun, ... , Yuchen Xie

Machine Learning

Artificial Intelligence

We introduce Integer Scale, a novel post-training quantization scheme for large language models that effectively resolves the inference bottleneck in current fine-grained quantization approaches while maintaining similar accuracies. Integer Scale is a free lunch as it requires no extra calibration or fine-tuning which will otherwise incur additional costs. It can be used plug-and-play for most fine-grained quantization methods. Its integration results in at most 1.85x end-to-...


SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

November 18, 2022

67% Match

Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, ... , Song Han

Computation and Language

Artificial Intelligence

Machine Learning

Large language models (LLMs) show excellent performance but are compute- and memory-intensive. Quantization can reduce memory and accelerate inference. However, existing methods cannot maintain accuracy and hardware efficiency at the same time. We propose SmoothQuant, a training-free, accuracy-preserving, and general-purpose post-training quantization (PTQ) solution to enable 8-bit weight, 8-bit activation (W8A8) quantization for LLMs. Based on the fact that weights are easy ...


When Quantization Affects Confidence of Large Language Models?

May 1, 2024

67% Match

Irina Proskurina, Luc Brun, ... , Julien Velcin

Computation and Language

Artificial Intelligence

Recent studies introduced effective compression techniques for Large Language Models (LLMs) via post-training quantization or low-bit weight representation. Although quantized weights offer storage efficiency and allow for faster inference, existing works have indicated that quantization might compromise performance and exacerbate biases in LLMs. This study investigates the confidence and calibration of quantized models, considering factors such as language model type and sca...


Enabling Fast 2-bit LLM on GPUs: Memory Alignment and Asynchronous Dequantization

November 28, 2023

67% Match

Jinhao Li, Shiyao Li, Jiaming Xu, Shan Huang, Yaoxiu Lian, Jun Liu, ... , Guohao Dai

Machine Learning

Distributed, Parallel, and C...

Large language models (LLMs) have demonstrated impressive abilities in various domains, but their inference is expensive. State-of-the-art methods use 2-bit quantization for mainstream LLMs. However, challenges remain: (1) Non-negligible accuracy loss for 2-bit quantization. Weights are quantized by groups, and the weight ranges are large in some groups, resulting in large quantization errors and non-negligible accuracy loss (e.g. >3% for Llama2-7b with 2-bit...


Efficient Post-training Quantization with FP8 Formats

September 26, 2023

67% Match

Haihao Shen, Naveen Mellempudi, Xin He, Qun Gao, ... , Mengni Wang

Machine Learning

Artificial Intelligence

Computation and Language

Recent advances in deep learning methods such as LLMs and Diffusion models have created a need for improved quantization methods that can meet the computational demands of these modern architectures while maintaining accuracy. Towards this goal, we study the advantages of FP8 data formats for post-training quantization across 75 unique network architectures covering a wide range of tasks, including machine translation, language modeling, text generation, image classification,...


BiLLM: Pushing the Limit of Post-Training Quantization for LLMs

February 6, 2024

66% Match

Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, ... , Xiaojuan Qi

Machine Learning

Artificial Intelligence

Computation and Language

Pretrained large language models (LLMs) exhibit exceptional general language processing capabilities but come with significant demands on memory and computational resources. As a powerful compression technology, binarization can reduce model weights to a mere 1 bit, lowering the expensive computation and memory requirements. However, existing quantization techniques fall short of maintaining LLM performance under ultra-low bit-widths. In response to this challenge, ...


FlattenQuant: Breaking Through the Inference Compute-bound for Large Language Models with Per-tensor Quantization

February 28, 2024

66% Match

Yi Zhang, Fei Yang, Shuang Peng, ... , Aimin Pan

Machine Learning

Artificial Intelligence

Computation and Language

Large language models (LLMs) have demonstrated state-of-the-art performance across various tasks. However, the latency of inference and the large GPU memory consumption of LLMs restrict their deployment performance. Recently, there have been some efficient attempts to quantize LLMs, yet inference with large batch size or long sequence still has the issue of being compute-bound. Fine-grained quantization methods have showcased their proficiency in achieving low-bit quantizatio...


Quantifying the Capabilities of LLMs across Scale and Precision

May 6, 2024

66% Match

Sher Badshah, Hassan Sajjad

Machine Learning

Artificial Intelligence

Computation and Language

Scale is often cited as one of the factors that increase the performance of LLMs, resulting in models with billions and trillions of parameters. One limitation of such large models is their high computational requirements, which limit their usage, deployment, and debugging in resource-constrained scenarios. Two commonly used alternatives to bypass these limitations are to use the smaller versions of LLMs (e.g. Llama 7B instead of Llama 70B) and lower the memor...
