R$^3$Mem: Bridging Memory Retention and Retrieval via Reversible Compression

February 21, 2025

Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference

March 14, 2024

91% Match

Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, ... , Edoardo M. Ponti

Computation and Language

Transformers have emerged as the backbone of large language models (LLMs). However, generation remains inefficient due to the need to store in memory a cache of key-value representations for past tokens, whose size scales linearly with the input sequence length and batch size. As a solution, we propose Dynamic Memory Compression (DMC), a method for on-line key-value cache compression at inference time. Most importantly, the model learns to apply different compression rates in...

Recurrent Context Compression: Efficiently Expanding the Context Window of LLM

June 10, 2024

91% Match

Chensen Huang, Guibo Zhu, Xuepeng Wang, Yifei Luo, Guojing Ge, Haoran Chen, ... , Jinqiao Wang

Computation and Language

Artificial Intelligence

To extend the context length of Transformer-based large language models (LLMs) and improve comprehension capabilities, we often face limitations due to computational resources and bounded memory storage capacity. This work introduces a method called Recurrent Context Compression (RCC), designed to efficiently expand the context window length of LLMs within constrained storage space. We also investigate the issue of poor model responses when both instructions and context are c...

LoMA: Lossless Compressed Memory Attention

January 16, 2024

91% Match

Yumeng Wang, Zhenyang Xiao

Machine Learning

Computation and Language

The ability to handle long texts is one of the most important capabilities of Large Language Models (LLMs), but as the text length increases, the consumption of resources also increases dramatically. At present, reducing resource consumption by compressing the KV cache is a common approach. Although there are many existing compression methods, they share a common drawback: the compression is not lossless. That is, information is inevitably lost during the compression process....

Flexibly Scaling Large Language Models Contexts Through Extensible Tokenization

January 15, 2024

91% Match

Ninglu Shao, Shitao Xiao, ... , Peitian Zhang

Computation and Language

Large language models (LLMs) need sufficient context to handle many critical applications, such as retrieval-augmented generation and few-shot learning. However, due to the constrained window size, LLMs can only access the information within a limited context. Although the size of the context window can be extended by fine-tuning, doing so incurs a substantial cost in both the training and inference stages. In this paper, we present Extensible Tokenization as an al...

AI-native Memory: A Pathway from LLMs Towards AGI

June 26, 2024

91% Match

Jingbo Shang, Zai Zheng, Xiang Ying, ... , Mindverse Team

Computation and Language

Artificial Intelligence

Large language models (LLMs) have shown the world sparks of artificial general intelligence (AGI). One opinion, especially from some startups working on LLMs, argues that an LLM with nearly unlimited context length can realize AGI. However, they might be too optimistic about the long-context capability of (existing) LLMs -- (1) Recent literature has shown that their effective context length is significantly smaller than their claimed context length; and (2) Ou...

Exploring the landscape of large language models: Foundations, techniques, and challenges

April 18, 2024

91% Match

Milad Moradi, Ke Yan, David Colwell, ... , Rhona Asgari

Artificial Intelligence

In this review paper, we delve into the realm of Large Language Models (LLMs), covering their foundational principles, diverse applications, and nuanced training processes. The article sheds light on the mechanics of in-context learning and a spectrum of fine-tuning approaches, with a special focus on methods that optimize efficiency in parameter usage. Additionally, it explores how LLMs can be more closely aligned with human preferences through innovative reinforcement learn...

MemoryPrompt: A Light Wrapper to Improve Context Tracking in Pre-trained Language Models

February 23, 2024

91% Match

Nathanaël Carraz Rakotonirina, Marco Baroni

Computation and Language

Artificial Intelligence

Machine Learning

Transformer-based language models (LMs) track contextual information through large, hard-coded input windows. We introduce MemoryPrompt, a leaner approach in which the LM is complemented by a small auxiliary recurrent network that passes information to the LM by prefixing its regular input with a sequence of vectors, akin to soft prompts, without requiring LM finetuning. Tested on a task designed to probe a LM's ability to keep track of multiple fact updates, a MemoryPrompt-a...

LoCoCo: Dropping In Convolutions for Long Context Compression

June 8, 2024

91% Match

Ruisi Cai, Yuandong Tian, ... , Beidi Chen

Machine Learning

Computation and Language

This paper tackles the memory hurdle of processing long context sequences in Large Language Models (LLMs), by presenting a novel approach, Dropping In Convolutions for Long Context Compression (LoCoCo). LoCoCo employs only a fixed-size Key-Value (KV) cache, and can enhance efficiency in both inference and fine-tuning stages. Diverging from prior methods that selectively drop KV pairs based on heuristics, LoCoCo leverages a data-driven adaptive fusion technique, blending previ...

A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-based Context Compression

December 23, 2024

91% Match

Chenlong Deng, Zhisong Zhang, Kelong Mao, Shuaiyi Li, Xinting Huang, ... , Zhicheng Dou

Computation and Language

In this work, we provide a thorough investigation of gist-based context compression methods to improve long-context processing in large language models. We focus on two key questions: (1) How well can these methods replace full attention models? and (2) What potential failure patterns arise due to compression? Through extensive experiments, we show that while gist-based compression can achieve near-lossless performance on tasks like retrieval-augmented generation and long-doc...

Current Limitations of Language Models: What You Need is Retrieval

September 15, 2020

91% Match

Aran Komatsuzaki

Computation and Language

Machine Learning

We classify and re-examine some of the current approaches to improving the performance-compute trade-off of language models, including (1) non-causal models (such as masked language models), (2) extension of batch length with efficient attention, (3) recurrence, (4) conditional computation and (5) retrieval. We identify some limitations that (1) - (4) suffer from. For example, (1) currently struggles with open-ended text generation with the output loosely constrained by the input a...
