MemLong: Memory-Augmented Retrieval for Long Text Modeling

ACER: Automatic Language Model Context Extension via Retrieval

October 11, 2024

92% Match

Luyu Gao, Yunyi Zhang, Jamie Callan

Computation and Language

Artificial Intelligence

Information Retrieval

Machine Learning

Long-context modeling is one of the critical capabilities of language AI for digesting and reasoning over complex information pieces. In practice, long-context capabilities are typically built into a pre-trained language model (LM) through a carefully designed context extension stage, with the goal of producing generalist long-context capabilities. In our preliminary experiments, however, we discovered that the current open-weight generalist long-context models are still lack...


Memorizing Documents with Guidance in Large Language Models

June 23, 2024

92% Match

Bumjin Park, Jaesik Choi

Computation and Language

Artificial Intelligence

Training data plays a pivotal role in AI models. Large language models (LLMs) are trained with massive amounts of documents, and their parameters hold document-related contents. Recently, several studies identified content-specific locations in LLMs by examining the parameters. Instead of the post hoc interpretation, we propose another approach. We propose document-wise memory architecture to track document memories in training. The proposed architecture maps document represe...


Accelerating Inference of Retrieval-Augmented Generation via Sparse Context Selection

May 25, 2024

92% Match

Yun Zhu, Jia-Chen Gu, Caitlin Sikora, Ho Ko, Yinxiao Liu, Chu-Cheng Lin, Lei Shu, Liangchen Luo, Lei Meng, ... , Jindong Chen

Computation and Language

Large language models (LLMs) augmented with retrieval exhibit robust performance and extensive versatility by incorporating external contexts. However, the input length grows linearly in the number of retrieved documents, causing a dramatic increase in latency. In this paper, we propose a novel paradigm named Sparse RAG, which seeks to cut computation costs through sparsity. Specifically, Sparse RAG encodes retrieved documents in parallel, which eliminates latency introduced ...


Flexibly Scaling Large Language Models Contexts Through Extensible Tokenization

January 15, 2024

92% Match

Ninglu Shao, Shitao Xiao, ... , Peitian Zhang

Computation and Language

Large language models (LLMs) are in need of sufficient contexts to handle many critical applications, such as retrieval augmented generation and few-shot learning. However, due to the constrained window size, the LLMs can only access the information within a limited context. Although the size of context window can be extended by fine-tuning, it will result in a substantial cost in both training and inference stages. In this paper, we present Extensible Tokenization as an al...


UniMem: Towards a Unified View of Long-Context Large Language Models

February 5, 2024

92% Match

Junjie Fang, Likai Tang, Hongzhe Bi, Yujia Qin, Si Sun, Zhenyu Li, Haolun Li, Yongjian Li, Xin Cong, Yukun Yan, Xiaodong Shi, Sen Song, Yankai Lin, ... , Maosong Sun

Computation and Language

Artificial Intelligence

Long-context processing is a critical ability that constrains the applicability of large language models. Although there exist various methods devoted to enhancing the long-context processing ability of large language models (LLMs), they are developed in an isolated manner and lack systematic analysis and integration of their strengths, hindering further developments. In this paper, we introduce UniMem, a unified framework that reformulates existing long-context methods from ...


Parallel Context Windows for Large Language Models

December 21, 2022

92% Match

Nir Ratner, Yoav Levine, Yonatan Belinkov, Ori Ram, Inbal Magar, Omri Abend, Ehud Karpas, Amnon Shashua, ... , Yoav Shoham

Computation and Language

When applied to processing long text, Large Language Models (LLMs) are limited by their context window. Existing efforts to address this limitation involve training specialized architectures, and cannot be easily applied to off-the-shelf LLMs. We present Parallel Context Windows (PCW), a method that alleviates the context window restriction for any off-the-shelf LLM without further training. The key to the approach is to carve a long context into chunks ("windows"), restric...


Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG

October 8, 2024

92% Match

Bowen Jin, Jinsung Yoon, ... , Sercan O. Arik

Computation and Language

Artificial Intelligence

Machine Learning

Retrieval-augmented generation (RAG) empowers large language models (LLMs) to utilize external knowledge sources. The increasing capacity of LLMs to process longer input sequences opens up avenues for providing more retrieved information, to potentially enhance the quality of generated outputs. It is plausible to assume that a larger retrieval set would contain more relevant information (higher recall), that might result in improved performance. However, our empirical finding...


Retrieval Head Mechanistically Explains Long-Context Factuality

April 24, 2024

92% Match

Wenhao Wu, Yizhong Wang, Guangxuan Xiao, ... , Yao Fu

Computation and Language

Despite the recent progress in long-context language models, it remains elusive how transformer-based models exhibit the capability to retrieve relevant information from arbitrary locations within the long context. This paper aims to address this question. Our systematic investigation across a wide spectrum of models reveals that a special type of attention heads are largely responsible for retrieving information, which we dub retrieval heads. We identify intriguing propertie...


Structured Token Retention and Computational Memory Paths in Large Language Models

February 5, 2025

92% Match

Jonathan Delena, Augustin Moreau, ... , Frederick Chatterton

Computation and Language

Memory retention mechanisms play a central role in determining the efficiency of computational architectures designed for processing extended sequences. Conventional methods for token management often impose fixed retention thresholds or rely on uniform attention weight distributions, leading to inefficient memory utilization and premature information loss in extended sequence modeling. Structured Token Retention (STR) introduces a probabilistic selection framework that dynam...


MELODI: Exploring Memory Compression for Long Contexts

October 4, 2024

92% Match

Yinpeng Chen, DeLesley Hutchins, Aren Jansen, Andrey Zhmoginov, ... , Jesper Andersen

Machine Learning

Artificial Intelligence

We present MELODI, a novel memory architecture designed to efficiently process long documents using short context windows. The key principle behind MELODI is to represent short-term and long-term memory as a hierarchical compression scheme across both network layers and context windows. Specifically, the short-term memory is achieved through recurrent compression of context windows across multiple layers, ensuring smooth transitions between windows. In contrast, the long-term...

