February 5, 2025
Similar papers
February 1, 2025
Equipping large language models (LLMs) with latent-space memory has attracted increasing attention, as such memory can extend the context window of existing language models. However, retaining information from the distant past remains a challenge. For example, MemoryLLM (Wang et al., 2024a), a representative work with latent-space memory, compresses past information into hidden states across all layers, forming a memory pool of 1B parameters. While effective for sequence lengths u...
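The excerpt only names the idea of a fixed-size latent memory pool, so the following is a minimal numpy sketch of that general idea, not MemoryLLM's actual update rule: each layer keeps a constant number of hidden-state slots, and new states are written in by evicting a random subset of old ones (update_memory and all sizes here are illustrative).

import numpy as np

# Illustrative sketch of a fixed-size latent memory pool (not MemoryLLM's exact
# update rule): each layer keeps `pool_size` hidden-state slots; when new hidden
# states arrive, a random subset of old slots is evicted and the new states are
# written in, so the memory never grows with sequence length.

rng = np.random.default_rng(0)
num_layers, pool_size, hidden_dim = 4, 64, 32
memory = rng.normal(size=(num_layers, pool_size, hidden_dim))

def update_memory(memory, new_states):
    """new_states: (num_layers, k, hidden_dim) hidden states from the latest chunk."""
    k = new_states.shape[1]
    for layer in range(memory.shape[0]):
        keep = rng.choice(pool_size, size=pool_size - k, replace=False)
        memory[layer] = np.concatenate([memory[layer][keep], new_states[layer]], axis=0)
    return memory

new_chunk_states = rng.normal(size=(num_layers, 8, hidden_dim))
memory = update_memory(memory, new_chunk_states)
print(memory.shape)  # (4, 64, 32): the pool size stays constant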
January 28, 2025
Effective token compression remains a critical challenge for scaling models to handle increasingly complex and diverse datasets. A novel mechanism based on contextual reinforcement is introduced, dynamically adjusting token importance through interdependencies and semantic relevance. This approach enables substantial reductions in token usage while preserving the quality and coherence of information representation. Incorporating graph-based algorithms and adaptive weighting, ...
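As a loose illustration of graph-based importance scoring of the kind the excerpt gestures at, the sketch below treats tokens as nodes in a cosine-similarity graph and keeps the tokens with the highest weighted degree; compress_tokens and the keep ratio are hypothetical stand-ins, not the paper's mechanism.

import numpy as np

# Hypothetical sketch of graph-based token importance scoring: tokens are nodes,
# edge weights are cosine similarities between embeddings, and each token's
# importance is its weighted degree. Low-scoring tokens are dropped.

def compress_tokens(embeddings, keep_ratio=0.5):
    # embeddings: (num_tokens, dim)
    unit = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-8)
    sim = unit @ unit.T                          # pairwise cosine-similarity graph
    np.fill_diagonal(sim, 0.0)
    importance = sim.sum(axis=1)                 # weighted degree centrality
    k = max(1, int(len(embeddings) * keep_ratio))
    kept = np.sort(np.argsort(-importance)[:k])  # keep top-k, preserve order
    return embeddings[kept], kept

rng = np.random.default_rng(1)
emb = rng.normal(size=(16, 8))
compressed, kept_idx = compress_tokens(emb, keep_ratio=0.25)
print(compressed.shape, kept_idx)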
February 13, 2025
Token representations in high-dimensional latent spaces often exhibit redundancy, limiting computational efficiency and reducing structural coherence across model layers. Hierarchical latent space folding introduces a structured transformation mechanism that enforces a multi-scale organization within learned embeddings, refining representational compactness while preserving essential contextual distinctions. The proposed approach incorporates dynamic folding operations that i...
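The excerpt does not spell out the folding operations, so the following is only a rough stand-in for the idea of a multi-scale organization of embeddings: token embeddings are pooled over progressively larger windows, yielding a hierarchy of coarser representations (multiscale_pool is an assumption, not the proposed method).

import numpy as np

# Loose sketch of a multi-scale view over token embeddings (the excerpt does not
# specify the actual folding operations): embeddings are averaged over
# progressively larger windows, giving a hierarchy of coarser representations.

def multiscale_pool(embeddings, num_levels=3):
    # embeddings: (num_tokens, dim); returns one array per scale
    levels = [embeddings]
    current = embeddings
    for _ in range(num_levels - 1):
        n = (current.shape[0] // 2) * 2
        current = current[:n].reshape(-1, 2, current.shape[1]).mean(axis=1)
        levels.append(current)
    return levels

rng = np.random.default_rng(2)
for level in multiscale_pool(rng.normal(size=(16, 8))):
    print(level.shape)   # (16, 8) -> (8, 8) -> (4, 8)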
January 31, 2025
Structured embedding transformations offer a promising approach for enhancing the efficiency and coherence of language model inference. The introduction of Structural Embedding Projection (SEP) provides a mechanism for refining token representations through projection matrices that integrate hierarchical and relational dependencies. The mathematical formulation of SEP enables embedding spaces to capture structured contextual relationships, thereby improving semantic fidelity ...
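Since the exact SEP formulation is not given in the excerpt, the sketch below shows only the general pattern of projecting embeddings through a structure-aware transformation: each embedding is mixed with related tokens via a row-normalized relation matrix and then passed through a linear projection; the matrices R and W are illustrative placeholders.

import numpy as np

# Hedged sketch of structure-aware projection of token embeddings (not the SEP
# formulation itself): embeddings are mixed with their relational neighbours,
# then mapped through a stand-in for a learned projection matrix.

rng = np.random.default_rng(3)
num_tokens, dim = 10, 16
E = rng.normal(size=(num_tokens, dim))                           # token embeddings
R = (rng.random((num_tokens, num_tokens)) > 0.7).astype(float)   # relational links
R = R / (R.sum(axis=1, keepdims=True) + 1e-8)                    # row-normalize
W = rng.normal(size=(dim, dim)) / np.sqrt(dim)                   # learned projection stand-in

E_structured = (E + R @ E) @ W    # inject relational context, then project
print(E_structured.shape)         # (10, 16)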
May 25, 2023
Autoregressive Transformers adopted in Large Language Models (LLMs) are hard to scale to long sequences. Despite several efforts to reduce their computational cost, most LLMs still apply attention between all pairs of tokens in the sequence, incurring a quadratic cost. In this study, we present a novel approach that dynamically prunes contextual information while preserving the model's expressiveness, resulting in reduced memory and computational requireme...
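A minimal sketch of the general idea of dynamic context pruning, under the assumption (not stated in the excerpt) that importance is measured by the attention mass a past token receives: tokens falling below a threshold are dropped from the context before later steps, so attention no longer spans all previous tokens.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative pruning criterion (not the paper's exact one): past tokens whose
# attention mass from the newest query falls below a threshold are removed from
# the context, shortening the sequence seen by subsequent steps.

def prune_context(keys, values, attn_history, threshold=0.02):
    keep = attn_history >= threshold
    return keys[keep], values[keep], attn_history[keep]

rng = np.random.default_rng(4)
T, d = 32, 16
keys, values = rng.normal(size=(T, d)), rng.normal(size=(T, d))
query = rng.normal(size=(d,))
attn = softmax(keys @ query / np.sqrt(d))       # attention from the newest query
keys, values, attn = prune_context(keys, values, attn)
print(keys.shape[0], "of", T, "tokens kept")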
June 18, 2024
Efficient inference in Large Language Models (LLMs) is impeded by the growing memory demands of key-value (KV) caching, especially for longer sequences. Traditional KV cache eviction strategies, which prioritize less critical KV-pairs based on attention scores, often degrade generation quality, leading to issues such as context loss or hallucinations. To address this, we introduce Dynamic Discriminative Operations (D2O), a novel method that utilizes two-level discriminative s...
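The sketch below implements the attention-score-based eviction baseline that the excerpt says traditional methods use, not D2O's two-level discriminative strategy: when the cache exceeds a budget, the KV pairs with the least cumulative attention are evicted (evict_kv and the scores are illustrative).

import numpy as np

# Sketch of attention-score-based KV cache eviction (the baseline the excerpt
# describes): key/value pairs that have received the least cumulative attention
# are dropped once the cache exceeds its budget.

def evict_kv(keys, values, cum_attention, budget):
    if keys.shape[0] <= budget:
        return keys, values, cum_attention
    keep = np.sort(np.argsort(-cum_attention)[:budget])   # top-`budget` by score
    return keys[keep], values[keep], cum_attention[keep]

rng = np.random.default_rng(5)
keys, values = rng.normal(size=(128, 64)), rng.normal(size=(128, 64))
cum_attention = rng.random(128)    # attention each cached token has received so far
keys, values, cum_attention = evict_kv(keys, values, cum_attention, budget=32)
print(keys.shape)                  # (32, 64)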
September 16, 2024
Transformer-based Large Language Models (LLMs) have become increasingly important. However, due to the quadratic time complexity of attention computation, scaling LLMs to longer contexts incurs extremely slow inference latency and high GPU memory consumption for caching key-value (KV) vectors. This paper proposes RetrievalAttention, a training-free approach to both accelerate attention computation and reduce GPU memory consumption. By leveraging the dynamic sparsity of attent...
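As a rough illustration of attention that exploits dynamic sparsity, the sketch below attends only over the k keys most relevant to the query; an exact top-k search stands in for the approximate nearest-neighbor retrieval and CPU offloading that a system like RetrievalAttention would actually use.

import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

# Top-k sparse attention sketch: only the k keys most relevant to the query
# contribute to the output. A real system would replace the exact argsort with
# an approximate nearest-neighbor index over the KV vectors.

def topk_attention(query, keys, values, k=8):
    scores = keys @ query / np.sqrt(keys.shape[1])
    top = np.argsort(-scores)[:k]             # stand-in for ANN retrieval
    weights = softmax(scores[top])
    return weights @ values[top]

rng = np.random.default_rng(6)
keys, values = rng.normal(size=(1024, 64)), rng.normal(size=(1024, 64))
out = topk_attention(rng.normal(size=(64,)), keys, values, k=16)
print(out.shape)  # (64,)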
January 15, 2024
Large language models (LLMs) need sufficient context to handle many critical applications, such as retrieval-augmented generation and few-shot learning. However, due to the constrained window size, LLMs can only access information within a limited context. Although the context window can be extended by fine-tuning, doing so incurs substantial cost at both the training and inference stages. In this paper, we present Extensible Tokenization as an al...
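The excerpt is cut off before describing Extensible Tokenization itself, so the following only sketches the broader idea of compressing a long context into fewer compact representations before feeding a fixed-window model; mean pooling over chunks is a placeholder for whatever learned compressor the method uses.

import numpy as np

# Loose sketch of context compression: the long input is split into chunks and
# each chunk is reduced to a single compact embedding (mean pooling here stands
# in for a learned compressor), shrinking the effective sequence length.

def compress_context(token_embeddings, chunk_size=4):
    n = (token_embeddings.shape[0] // chunk_size) * chunk_size
    chunks = token_embeddings[:n].reshape(-1, chunk_size, token_embeddings.shape[1])
    return chunks.mean(axis=1)        # one compact embedding per chunk

rng = np.random.default_rng(7)
long_context = rng.normal(size=(4096, 32))      # longer than the model's window
compact = compress_context(long_context, chunk_size=8)
print(compact.shape)                             # (512, 32): 8x shorter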
December 3, 2024
The increasing context window size in Large Language Models (LLMs), such as the GPT and LLaMA series, has improved their ability to tackle complex, long-text tasks, but at the cost of inference efficiency, particularly regarding memory and computational complexity. Existing methods, including selective token retention and window-based attention, improve efficiency but risk discarding important tokens needed for future text generation. In this paper, we propose an approach tha...
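For reference, the sketch below shows the two existing strategies the excerpt names, window-based attention and selective token retention, combined into a single attention mask: each query sees its recent window plus a handful of explicitly retained positions (the retained positions here are arbitrary examples, and the excerpt's own proposal is not shown).

import numpy as np

# Sketch of a sliding-window attention mask with a few selectively retained
# tokens: each query attends to its recent window and to any retained positions
# that precede it.

def build_mask(seq_len, window, retained_positions=()):
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for q in range(seq_len):
        lo = max(0, q - window + 1)
        mask[q, lo:q + 1] = True                 # recent-window attention
        for p in retained_positions:             # explicitly retained tokens
            if p <= q:
                mask[q, p] = True
    return mask

mask = build_mask(seq_len=12, window=4, retained_positions=(0, 3))
print(mask.astype(int))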
June 23, 2023
Retrieval-augmented language models (LMs) have received much attention recently. However, the retriever is typically not trained jointly as a native component of the LM but is added to an already-pretrained LM, which limits the ability of the LM and the retriever to adapt to one another. In this work, we propose the Retrieval-Pretrained Transformer (RPT), an architecture and training procedure for jointly training a retrieval-augmented LM from scratch for the task of modeling l...
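The excerpt describes RPT only at the architectural level, so the sketch below shows just the generic chunk-retrieval step such a model relies on: the current chunk's embedding is scored against embeddings of earlier chunks and the top matches are retrieved; retrieve_chunks and the dot-product scoring are assumptions, not RPT's trained retriever.

import numpy as np

# Minimal sketch of chunk-level retrieval for a retrieval-augmented LM: score
# earlier chunks against the current chunk's embedding and return the indices
# of the best matches, which would then be provided as extra context.

def retrieve_chunks(query_chunk_emb, past_chunk_embs, k=2):
    scores = past_chunk_embs @ query_chunk_emb
    return np.argsort(-scores)[:k]               # indices of retrieved chunks

rng = np.random.default_rng(8)
past = rng.normal(size=(50, 128))                # embeddings of earlier chunks
current = rng.normal(size=(128,))
print(retrieve_chunks(current, past, k=3))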