R$^3$Mem: Bridging Memory Retention and Retrieval via Reversible Compression

Attention is All You Need Until You Need Retention

January 15, 2025

92% Match

M. Murat Yaslioglu

Machine Learning

Artificial Intelligence

This work introduces a novel Retention Layer mechanism for Transformer-based architectures, addressing their inherent lack of intrinsic retention capabilities. Unlike human cognition, which can encode and dynamically recall symbolic templates, Generative Pretrained Transformers rely solely on fixed pretrained weights and ephemeral context windows, limiting their adaptability. The proposed Retention Layer incorporates a persistent memory module capable of real-time data popula...

Structured Token Retention and Computational Memory Paths in Large Language Models

February 5, 2025

92% Match

Jonathan Delena, Augustin Moreau, ... , Frederick Chatterton

Computation and Language

Memory retention mechanisms play a central role in determining the efficiency of computational architectures designed for processing extended sequences. Conventional methods for token management often impose fixed retention thresholds or rely on uniform attention weight distributions, leading to inefficient memory utilization and premature information loss in extended sequence modeling. Structured Token Retention (STR) introduces a probabilistic selection framework that dynam...

MEMORY-VQ: Compression for Tractable Internet-Scale Memory

August 28, 2023

92% Match

Yury Zemlyanskiy, Michiel de Jong, Luke Vilnis, Santiago Ontañón, William W. Cohen, ... , Joshua Ainslie

Computation and Language

Retrieval augmentation is a powerful but expensive method to make language models more knowledgeable about the world. Memory-based methods like LUMEN pre-compute token representations for retrieved passages to drastically speed up inference. However, memory also leads to much greater storage requirements from storing pre-computed representations. We propose MEMORY-VQ, a new method to reduce storage requirements of memory-augmented models without sacrificing performance. Our...

Compressed Context Memory For Online Language Model Interaction

December 6, 2023

92% Match

Jang-Hyun Kim, Junyoung Yeom, ... , Hyun Oh Song

Machine Learning

Computation and Language

This paper presents a novel context compression method for Transformer language models in online scenarios such as ChatGPT, where the context continually expands. As the context lengthens, the attention process requires more memory and computational resources, which in turn reduces the throughput of the language model. To this end, we propose a compressed context memory system that continually compresses the growing context into a compact memory space. The compression process...

Dialogue Without Limits: Constant-Sized KV Caches for Extended Responses in LLMs

March 2, 2025

92% Match

Ravi Ghadia, Avinash Kumar, Gaurav Jain, ... , Poulami Das

Computation and Language

Artificial Intelligence

Machine Learning

Autoregressive Transformers rely on Key-Value (KV) caching to accelerate inference. However, the linear growth of the KV cache with context length leads to excessive memory consumption and bandwidth constraints. This bottleneck is particularly problematic in real-time applications -- such as chatbots and interactive assistants -- where low latency and high memory efficiency are critical. Existing methods drop distant tokens or compress states in a lossy manner, sacrificing ac...

Scaling Transformer to 1M tokens and beyond with RMT

April 19, 2023

92% Match

Aydar Bulatov, Yuri Kuratov, Mikhail S. Burtsev

Computation and Language

Artificial Intelligence

Machine Learning

This technical report presents the application of a recurrent memory to extend the context length of BERT, one of the most effective Transformer-based models in natural language processing. By leveraging the Recurrent Memory Transformer architecture, we have successfully increased the model's effective context length to an unprecedented two million tokens, while maintaining high memory retrieval accuracy. Our method allows for the storage and processing of both local and glob...

Context Compression for Auto-regressive Transformers with Sentinel Tokens

October 12, 2023

92% Match

Siyu Ren, Qi Jia, Kenny Q. Zhu

Computation and Language

The quadratic complexity of the attention module makes it gradually become the bulk of compute in Transformer-based LLMs during generation. Moreover, the excessive key-value cache that arises when dealing with long inputs also brings severe issues on memory footprint and inference latency. In this work, we propose a plug-and-play approach that is able to incrementally compress the intermediate activation of a specified span of tokens into compact ones, thereby reducing both m...

Human-like Episodic Memory for Infinite Context LLMs

July 12, 2024

91% Match

Zafeirios Fountas, Martin A Benfeghoul, Adnan Oomerjee, Fenia Christopoulou, Gerasimos Lampouras, ... , Jun Wang

Artificial Intelligence

Computation and Language

Machine Learning

Neurons and Cognition

Large language models (LLMs) have shown remarkable capabilities, but still struggle with processing extensive contexts, limiting their ability to maintain coherence and accuracy over long sequences. In contrast, the human brain excels at organising and retrieving episodic experiences across vast temporal scales, spanning a lifetime. In this work, we introduce EM-LLM, a novel approach that integrates key aspects of human episodic memory and event cognition into LLMs, enabling ...

Leveraging Memory Retrieval to Enhance LLM-based Generative Recommendation

December 23, 2024

91% Match

Chengbing Wang, Yang Zhang, Fengbin Zhu, Jizhi Zhang, ... , Fuli Feng

Information Retrieval

Leveraging Large Language Models (LLMs) to harness user-item interaction histories for item generation has emerged as a promising paradigm in generative recommendation. However, the limited context window of LLMs often restricts them to focusing on recent user interactions only, leading to the neglect of long-term interests involved in the longer histories. To address this challenge, we propose a novel Automatic Memory-Retrieval framework (AutoMR), which is capable of storing...

Structured Context Recomposition for Large Language Models Using Probabilistic Layer Realignment

January 29, 2025

91% Match

Jonathan Teel, Jocasta Cumberbatch, ... , Quentin Baskerville

Computation and Language

Extended sequence generation often leads to degradation in contextual consistency due to the inability of conventional self-attention mechanisms to effectively retain long-range dependencies. Existing approaches, including memory compression and retrieval-augmented conditioning, introduce computational trade-offs that either increase inference latency or impose additional storage overhead. Structured Context Recomposition (SCR) introduces a probabilistic layer realignment str...
