MemLong: Memory-Augmented Retrieval for ...

Long-Context Inference with Retrieval-Augmented Speculative Decoding

February 27, 2025

92% Match

Guanzheng Chen, Qilong Feng, Jinjie Ni, ... , Michael Qizhe Shieh

Computation and Language

The emergence of long-context large language models (LLMs) offers a promising alternative to traditional retrieval-augmented generation (RAG) for processing extensive documents. However, the computational overhead of long-context inference, particularly in managing key-value (KV) caches, presents significant efficiency challenges. While Speculative Decoding (SD) traditionally accelerates inference using smaller draft models, its effectiveness diminishes substantially in long-...


Farewell to Length Extrapolation, a Training-Free Infinite Context with Finite Attention Scope

July 21, 2024

92% Match

Xiaoran Liu, Qipeng Guo, Yuerong Song, Zhigeng Liu, Kai Lv, Hang Yan, Linlin Li, ... , Xipeng Qiu

Computation and Language

Artificial Intelligence

The maximum supported context length is a critical bottleneck limiting the practical application of the Large Language Model (LLM). Although existing length extrapolation methods can extend the context of LLMs to millions of tokens, these methods all have an explicit upper bound. In this work, we propose LongCache, a training-free approach that enables LLM to support an infinite context with finite context scope, through full-context cache selection and training-free integrat...


CacheFocus: Dynamic Cache Re-Positioning for Efficient Retrieval-Augmented Generation

February 16, 2025

92% Match

Kun-Hui Lee, Eunhwan Park, ... , Seung-Hoon Na

Computation and Language

Artificial Intelligence

Large Language Models (LLMs) excel across a variety of language tasks yet are constrained by limited input lengths and high computational costs. Existing approaches, such as relative positional encodings (e.g., RoPE, ALiBi) and sliding window mechanisms, partially alleviate these issues but often require additional training or suffer from performance degradation with longer inputs. In this paper, we introduce CacheFocus, a method that enh...


Reducing Distraction in Long-Context Language Models by Focused Learning

November 8, 2024

92% Match

Zijun Wu, Bingyuan Liu, Ran Yan, ... , Thomas Delteil

Computation and Language

Recent advancements in Large Language Models (LLMs) have significantly enhanced their capacity to process long contexts. However, effectively utilizing this long context remains a challenge due to the issue of distraction, where irrelevant information dominates lengthy contexts, causing LLMs to lose focus on the most relevant segments. To address this, we propose a novel training method that enhances LLMs' ability to discern relevant information through a unique combination o...


Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads

October 2, 2024

92% Match

Yuxiang Huang, Binhang Yuan, Xu Han, ... , Zhiyuan Liu

Computation and Language

Large language models (LLMs) have shown remarkable advances in supporting long-context comprehension and processing tasks. However, scaling the generation inference of LLMs to such long contexts incurs significant additional computation load, and demands a substantial GPU memory footprint to maintain the key-value (KV) cache of transformer-based LLMs. Existing KV cache compression methods, such as quantization, face memory bottlenecks as context length increases, while static...


LaMemo: Language Modeling with Look-Ahead Memory

April 15, 2022

92% Match

Haozhe Ji, Rongsheng Zhang, Zhenyu Yang, ... , Minlie Huang

Computation and Language

Although Transformers with fully connected self-attentions are powerful to model long-term dependencies, they are struggling to scale to long texts with thousands of words in language modeling. One of the solutions is to equip the model with a recurrence memory. However, existing approaches directly reuse hidden states from the previous segment that encodes contexts in a uni-directional way. As a result, this prohibits the memory to dynamically interact with the current conte...


Human-like Episodic Memory for Infinite Context LLMs

July 12, 2024

92% Match

Zafeirios Fountas, Martin A Benfeghoul, Adnan Oomerjee, Fenia Christopoulou, Gerasimos Lampouras, ... , Jun Wang

Artificial Intelligence

Computation and Language

Machine Learning

Neurons and Cognition

Large language models (LLMs) have shown remarkable capabilities, but still struggle with processing extensive contexts, limiting their ability to maintain coherence and accuracy over long sequences. In contrast, the human brain excels at organising and retrieving episodic experiences across vast temporal scales, spanning a lifetime. In this work, we introduce EM-LLM, a novel approach that integrates key aspects of human episodic memory and event cognition into LLMs, enabling ...


More Room for Language: Investigating the Effect of Retrieval on Language Models

April 16, 2024

92% Match

David Samuel, Lucas Georges Gabriel Charpentier, Sondre Wold

Computation and Language

Retrieval-augmented language models pose a promising alternative to standard language modeling. During pretraining, these models search in a corpus of documents for contextually relevant information that could aid the language modeling objective. We introduce an 'ideal retrieval' methodology to study these models in a fully controllable setting. We conduct an extensive evaluation to examine how retrieval augmentation affects the behavior of the underlying language model. Amon...


Ltri-LLM: Streaming Long Context Inference for LLMs with Training-Free Dynamic Triangular Attention Pattern

December 6, 2024

92% Match

Hongyin Tang, Di Xiu, Lanrui Wang, Xiurui Geng, ... , Xunliang Cai

Computation and Language

Machine Learning

The quadratic computational complexity of the attention mechanism in current Large Language Models (LLMs) renders inference with long contexts prohibitively expensive. To address this challenge, various approaches aim to retain critical portions of the context to optimally approximate Full Attention (FA) through Key-Value (KV) compression or Sparse Attention (SA), enabling the processing of virtually unlimited text lengths in a streaming manner. However, these methods struggl...


Shall We Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study

April 13, 2023

92% Match

Boxin Wang, Wei Ping, Peng Xu, Lawrence McAfee, Zihan Liu, Mohammad Shoeybi, Yi Dong, Oleksii Kuchaiev, Bo Li, Chaowei Xiao, ... , Bryan Catanzaro

Computation and Language

Artificial Intelligence

Information Retrieval

Machine Learning

Large decoder-only language models (LMs) can be largely improved in terms of perplexity by retrieval (e.g., RETRO), but its impact on text generation quality and downstream task accuracy is unclear. Thus, it is still an open question: shall we pretrain large autoregressive LMs with retrieval? To answer it, we perform a comprehensive study on a scalable pre-trained retrieval-augmented LM (i.e., RETRO) compared with standard GPT and retrieval-augmented GPT incorporated at fine-...


MemLong: Memory-Augmented Retrieval for Long Text Modeling
