Structured Token Retention and Computati...

Recycled Attention: Efficient inference for long-context language models

November 8, 2024

91% Match

Fangyuan Xu, Tanya Goyal, Eunsol Choi

Computation and Language

Generating long sequences of tokens given a long-context input imposes a heavy computational burden for large language models (LLMs). One of the computational bottleneck comes from computing attention over a long sequence of input at each generation step. In this paper, we propose Recycled Attention, an inference-time method which alternates between full context attention and attention over a subset of input tokens. When performing partial attention, we recycle the attention ...

Find SimilarView on arXiv

Exploring Synaptic Resonance in Large Language Models: A Novel Approach to Contextual Memory Integration

February 15, 2025

91% Match

George Applegarth, Christian Weatherstone, Maximilian Hollingsworth, ... , Irvin Marcus

Computation and Language

Artificial Intelligence

Neural and Evolutionary Comp...

Contextual memory integration remains a high challenge in the development of language models, particularly in tasks that require maintaining coherence over extended sequences. Traditional approaches, such as self-attention mechanisms and memory-augmented architectures, often prioritize short-term dependencies, leading to fragmentation and inconsistency in long-range contextual understanding. Inspired by principles of synaptic plasticity observed in biological neural systems, ...

Find SimilarView on arXiv

Retentive Network: A Successor to Transformer for Large Language Models

July 17, 2023

91% Match

Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, ... , Wei Furu

Computation and Language

Machine Learning

In this work, we propose Retentive Network (RetNet) as a foundation architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance. We theoretically derive the connection between recurrence and attention. Then we propose the retention mechanism for sequence modeling, which supports three computation paradigms, i.e., parallel, recurrent, and chunkwise recurrent. Specifically, the parallel representation allows fo...

Find SimilarView on arXiv

$\text{Memory}^3$: Language Modeling with Explicit Memory

July 1, 2024

91% Match

Hongkang Yang, Zehao Lin, Wenjin Wang, Hao Wu, Zhiyu Li, Bo Tang, Wenqiang Wei, Jinbo Wang, Zeyun Tang, Shichao Song, Chenyang Xi, Yu Yu, Kai Chen, Feiyu Xiong, ... , E Weinan

Computation and Language

Artificial Intelligence

Machine Learning

The training and inference of large language models (LLMs) are together a costly process that transports knowledge from raw data to meaningful computation. Inspired by the memory hierarchy of the human brain, we reduce this cost by equipping LLMs with explicit memory, a memory format cheaper than model parameters and text retrieval-augmented generation (RAG). Conceptually, with most of its knowledge externalized to explicit memories, the LLM can enjoy a smaller parameter size...

Find SimilarView on arXiv

AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference

February 6, 2025

91% Match

Qingyue Yang, Jie Wang, Xing Li, Zhihai Wang, Chen Chen, Lei Chen, Xianzhi Yu, Wulong Liu, Jianye Hao, ... , Li Bin

Computation and Language

Machine Learning

With the development of large language models (LLMs), efficient inference through Key-Value (KV) cache compression has attracted considerable attention, especially for long-context generation. To compress the KV cache, recent methods identify critical KV tokens through heuristic ranking with attention scores. However, these methods often struggle to accurately determine critical tokens as they neglect the \textit{temporal patterns} in attention scores, resulting in a noticeab...

Find SimilarView on arXiv

CItruS: Chunked Instruction-aware State Eviction for Long Sequence Modeling

June 17, 2024

91% Match

Yu Bai, Xiyuan Zou, Heyan Huang, Sanxing Chen, Marc-Antoine Rondeau, ... , Cheung Jackie Chi Kit

Computation and Language

Long sequence modeling has gained broad interest as large language models (LLMs) continue to advance. Recent research has identified that a large portion of hidden states within the key-value caches of Transformer models can be discarded (also termed evicted) without affecting the perplexity performance in generating long sequences. However, we show that these methods, despite preserving perplexity performance, often drop information that is important for solving downstream t...

Find SimilarView on arXiv

A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-based Context Compression

December 23, 2024

91% Match

Chenlong Deng, Zhisong Zhang, Kelong Mao, Shuaiyi Li, Xinting Huang, ... , Dou Zhicheng

Computation and Language

In this work, we provide a thorough investigation of gist-based context compression methods to improve long-context processing in large language models. We focus on two key questions: (1) How well can these methods replace full attention models? and (2) What potential failure patterns arise due to compression? Through extensive experiments, we show that while gist-based compression can achieve near-lossless performance on tasks like retrieval-augmented generation and long-doc...

Find SimilarView on arXiv

MELODI: Exploring Memory Compression for Long Contexts

October 4, 2024

91% Match

Yinpeng Chen, DeLesley Hutchins, Aren Jansen, Andrey Zhmoginov, ... , Andersen Jesper

Machine Learning

Artificial Intelligence

We present MELODI, a novel memory architecture designed to efficiently process long documents using short context windows. The key principle behind MELODI is to represent short-term and long-term memory as a hierarchical compression scheme across both network layers and context windows. Specifically, the short-term memory is achieved through recurrent compression of context windows across multiple layers, ensuring smooth transitions between windows. In contrast, the long-term...

Find SimilarView on arXiv

Context-Aware Semantic Recomposition Mechanism for Large Language Models

January 29, 2025

90% Match

Richard Katrix, Quentin Carroway, ... , Heathfield Matthias

Computation and Language

Artificial Intelligence

Context-aware processing mechanisms have increasingly become a critical area of exploration for improving the semantic and contextual capabilities of language generation models. The Context-Aware Semantic Recomposition Mechanism (CASRM) was introduced as a novel framework designed to address limitations in coherence, contextual adaptability, and error propagation in large-scale text generation tasks. Through the integration of dynamically generated context vectors and attenti...

Find SimilarView on arXiv

Latent Lexical Projection in Large Language Models: A Novel Approach to Implicit Representation Refinement

February 3, 2025

90% Match

Ziad Shaker, Brendan Ashdown, Hugo Fitzalan, ... , Huntington Jocasta

Computation and Language

Generating semantically coherent text requires a robust internal representation of linguistic structures, which traditional embedding techniques often fail to capture adequately. A novel approach, Latent Lexical Projection (LLP), is introduced to refine lexical representations through a structured transformation into a latent space, thereby enhancing the alignment between input embeddings and their contextual meanings. The method integrates an optimized projection mechanism w...

Find SimilarView on arXiv

Structured Token Retention and Computational Memory Paths in Large Language Models

Recycled Attention: Efficient inference for long-context language models

Exploring Synaptic Resonance in Large Language Models: A Novel Approach to Contextual Memory Integration

Retentive Network: A Successor to Transformer for Large Language Models

$\text{Memory}^3$: Language Modeling with Explicit Memory

AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference

CItruS: Chunked Instruction-aware State Eviction for Long Sequence Modeling

A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-based Context Compression

MELODI: Exploring Memory Compression for Long Contexts

Context-Aware Semantic Recomposition Mechanism for Large Language Models

Latent Lexical Projection in Large Language Models: A Novel Approach to Implicit Representation Refinement