MemLong: Memory-Augmented Retrieval for Long Text Modeling

Scaling Transformer to 1M tokens and beyond with RMT

April 19, 2023

Aydar Bulatov, Yuri Kuratov, Mikhail S. Burtsev

Computation and Language

Artificial Intelligence

Machine Learning

This technical report presents the application of a recurrent memory to extend the context length of BERT, one of the most effective Transformer-based models in natural language processing. By leveraging the Recurrent Memory Transformer architecture, we have successfully increased the model's effective context length to an unprecedented two million tokens, while maintaining high memory retrieval accuracy. Our method allows for the storage and processing of both local and glob...

SEAL: Scaling to Emphasize Attention for Long-Context Retrieval

January 25, 2025

Changhun Lee, Jun-gyu Jin, ... , Eunhyeok Park

Computation and Language

Artificial Intelligence

Machine Learning

In this work, we introduce a novel approach called Scaling to Emphasize Attention for Long-context retrieval (SEAL), which enhances the retrieval performance of large language models (LLMs) over extended contexts. Previous studies have shown that each attention head in LLMs has a unique functionality and collectively contributes to the overall behavior of the model. Similarly, we observe that specific heads are closely tied to long-context retrieval, showing positive or negat...

CItruS: Chunked Instruction-aware State Eviction for Long Sequence Modeling

June 17, 2024

Yu Bai, Xiyuan Zou, Heyan Huang, Sanxing Chen, Marc-Antoine Rondeau, ... , Jackie Chi Kit Cheung

Computation and Language

Long sequence modeling has gained broad interest as large language models (LLMs) continue to advance. Recent research has identified that a large portion of hidden states within the key-value caches of Transformer models can be discarded (also termed evicted) without affecting the perplexity performance in generating long sequences. However, we show that these methods, despite preserving perplexity performance, often drop information that is important for solving downstream t...

ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference

October 28, 2024

Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, ... , Beidi Chen

Machine Learning

Computation and Language

With the widespread deployment of long-context large language models (LLMs), there has been a growing demand for efficient support of high-throughput inference. However, as the key-value (KV) cache expands with the sequence length, the increasing memory footprint and the need to access it for each token generation both result in low throughput when serving long-context LLMs. While various dynamic sparse attention methods have been proposed to speed up inference while maintain...

InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Memory

February 7, 2024

Chaojun Xiao, Pengle Zhang, Xu Han, Guangxuan Xiao, Yankai Lin, Zhengyan Zhang, Zhiyuan Liu, ... , Maosong Sun

Computation and Language

Artificial Intelligence

Machine Learning

Large language models (LLMs) have emerged as a cornerstone in real-world applications with lengthy streaming inputs, such as LLM-driven agents. However, existing LLMs, pre-trained on sequences with restricted maximum length, cannot generalize to longer sequences due to the out-of-domain and distraction issues. To alleviate these issues, existing efforts employ sliding attention windows and discard distant tokens to achieve the processing of extremely long sequences. Unfortuna...

Current Limitations of Language Models: What You Need is Retrieval

September 15, 2020

Aran Komatsuzaki

Computation and Language

Machine Learning

We classify and re-examine some of the current approaches to improve the performance-computes trade-off of language models, including (1) non-causal models (such as masked language models), (2) extension of batch length with efficient attention, (3) recurrence, (4) conditional computation and (5) retrieval. We identify some limitations (1) - (4) suffer from. For example, (1) currently struggles with open-ended text generation with the output loosely constrained by the input a...

A Controlled Study on Long Context Extension and Generalization in LLMs

September 18, 2024

Yi Lu, Jing Nathan Yan, Songlin Yang, Justin T. Chiu, Siyu Ren, Fei Yuan, Wenting Zhao, ... , Alexander M. Rush

Computation and Language

Machine Learning

Broad textual understanding and in-context learning require language models that utilize full document contexts. Due to the implementation challenges associated with directly training long-context models, many methods have been proposed for extending models to handle long contexts. However, owing to differences in data and model classes, it has been challenging to compare these approaches, leading to uncertainty as to how to evaluate long-context performance and whether it di...

In Search of Needles in a 11M Haystack: Recurrent Memory Finds What LLMs Miss

February 16, 2024

Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Dmitry Sorokin, ... , Mikhail Burtsev

Computation and Language

Artificial Intelligence

Machine Learning

This paper addresses the challenge of processing long documents using generative transformer models. To evaluate different approaches, we introduce BABILong, a new benchmark designed to assess model capabilities in extracting and processing distributed facts within extensive texts. Our evaluation, which includes benchmarks for GPT-4 and RAG, reveals that common methods are effective only for sequences up to $10^4$ elements. In contrast, fine-tuning GPT-2 with recurrent memory...

Finch: Prompt-guided Key-Value Cache Compression

July 31, 2024

Giulio Corallo, Paolo Papotti

Artificial Intelligence

Recent large language model applications, such as Retrieval-Augmented Generation and chatbots, have led to an increased need to process longer input contexts. However, this requirement is hampered by inherent limitations. Architecturally, models are constrained by a context window defined during training. Additionally, processing extensive texts requires substantial GPU memory. We propose a novel approach, Finch, to compress the input context by leveraging the pre-trained mod...

MeMo: Towards Language Models with Associative Memory Mechanisms

February 18, 2025

Fabio Massimo Zanzotto, Elena Sofia Ruzzetti, Giancarlo A. Xompero, Leonardo Ranaldi, Davide Venditti, Federico Ranaldi, Cristina Giannone, ... , Raniero Romagnoli

Computation and Language

Artificial Intelligence

Memorization is a fundamental ability of Transformer-based Large Language Models, achieved through learning. In this paper, we propose a paradigm shift by designing an architecture to memorize text directly, bearing in mind the principle that memorization precedes learning. We introduce MeMo, a novel architecture for language modeling that explicitly memorizes sequences of tokens in layered associative memories. By design, MeMo offers transparency and the possibility of model...
