Similar papers
December 27, 2024
Large Language Models (LLMs) have revolutionized a wide range of domains, including natural language processing, computer vision, and multi-modal tasks, owing to their ability to comprehend context and perform logical reasoning. However, the computational and memory demands of LLMs, particularly during inference, pose significant challenges when scaling them to real-world, long-context, and real-time applications. Key-Value (KV) cache management has emerged as a critical optimizati...
February 17, 2025
Large Language Models (LLMs) face computational inefficiencies and redundant processing when handling long-context inputs, prompting a focus on compression techniques. While existing semantic vector-based compression methods achieve promising performance, they fail to account for variations in intrinsic information density across context chunks, instead allocating soft tokens uniformly to every chunk. This uniform distribution inevitably diminishes allocatio...
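To make the contrast concrete, here is a minimal sketch of density-aware soft-token allocation versus the uniform scheme the abstract critiques, using token-level Shannon entropy as a crude stand-in for information density. The function names and the entropy proxy are illustrative assumptions, not the paper's actual method.

```python
import math
from collections import Counter

def chunk_entropy(chunk: str) -> float:
    """Shannon entropy of the chunk's token distribution: a crude,
    hypothetical stand-in for 'information density'."""
    counts = Counter(chunk.split())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def allocate_soft_tokens(chunks: list[str], budget: int) -> list[int]:
    """Distribute `budget` soft tokens across chunks in proportion to their
    entropy (vs. a uniform budget // len(chunks) per chunk). Assumes
    budget >= len(chunks) so every chunk keeps at least one token."""
    scores = [chunk_entropy(c) for c in chunks]
    total = sum(scores) or 1.0
    alloc = [max(1, round(budget * s / total)) for s in scores]
    while sum(alloc) > budget:          # trim rounding overshoot
        alloc[alloc.index(max(alloc))] -= 1
    while sum(alloc) < budget:          # pad rounding undershoot
        alloc[alloc.index(min(alloc))] += 1
    return alloc
```

Under this sketch, a repetitive chunk like "a a a a" draws fewer soft tokens than a lexically rich one, whereas uniform allocation would ignore the difference entirely.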
November 5, 2024
With the development of large language models (LLMs), the ability to handle longer contexts has become a key capability for Web applications such as cross-document understanding and LLM-powered search systems. However, this progress faces two major challenges: performance degradation on out-of-distribution sequence lengths, and excessively long inference times caused by the quadratic computational complexity of attention. These issues hinder the application of LLMs in lon...
March 3, 2025
Large Language Models (LLMs) use a key-value (KV) cache to reduce redundant computation in autoregressive generation. However, the KV cache size increases linearly during generation, leading to excessive memory usage, especially for long texts. Most KV cache compression methods evict unimportant KV pairs to maintain a fixed cache size, which leads to the permanent loss of tokens during generation. However, singular value decomposition shows that values do not exhib...
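As a concrete illustration of the eviction behavior this abstract critiques, below is a hedged sketch of score-based KV eviction under a fixed budget. The accumulated-attention scoring rule is a common heuristic assumed here for illustration, not necessarily this paper's approach.

```python
import torch

def evict_kv(keys, values, attn_weights, budget):
    """keys, values: [seq, d]; attn_weights: [num_queries, seq].
    Keeps the `budget` cached tokens with the highest accumulated
    attention mass; everything else is dropped."""
    scores = attn_weights.sum(dim=0)                 # importance per cached token
    k = min(budget, scores.numel())
    keep = torch.topk(scores, k).indices.sort().values
    # Evicted tokens are gone permanently: later queries can never
    # attend to them again, which is the loss the abstract highlights.
    return keys[keep], values[keep]
```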
February 8, 2025
Stochastic embedding transitions introduce a probabilistic mechanism for adjusting token representations dynamically during inference, mitigating the constraints imposed by static or deterministic embeddings. A transition framework was proposed in which each token embedding evolved through probabilistic updates, ensuring adaptability while preserving semantic integrity across linguistic contexts. Empirical evaluations demonstrated that models incorporating stochastic tra...
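A minimal sketch of what such a probabilistic update could look like, assuming a Gaussian transition with norm preservation as a proxy for "preserving semantic integrity"; the update rule and the sigma parameter are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def stochastic_transition(emb: torch.Tensor, sigma: float = 0.02) -> torch.Tensor:
    """emb: [seq, d] token embeddings. Applies a small Gaussian step to
    each embedding, then rescales back to the original norm so the
    perturbation stays bounded (assumed proxy for semantic integrity)."""
    updated = emb + sigma * torch.randn_like(emb)
    orig_norm = emb.norm(dim=-1, keepdim=True)
    new_norm = updated.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    return updated * (orig_norm / new_norm)
```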
July 12, 2024
Large language models (LLMs) have shown remarkable capabilities, but still struggle with processing extensive contexts, limiting their ability to maintain coherence and accuracy over long sequences. In contrast, the human brain excels at organising and retrieving episodic experiences across vast temporal scales, spanning a lifetime. In this work, we introduce EM-LLM, a novel approach that integrates key aspects of human episodic memory and event cognition into LLMs, enabling ...
October 7, 2024
Large language models (LLMs) have driven significant advancements across diverse NLP tasks, with long-context models gaining prominence for handling extended inputs. However, the expanding key-value (KV) cache size required by Transformer architectures intensifies the memory constraints, particularly during the decoding phase, creating a significant bottleneck. Existing sparse attention mechanisms designed to address this bottleneck have two limitations: (1) they often fail t...
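For reference, below is a generic top-k sparse-attention decode step of the kind this line of work builds on: the query attends only to the highest-scoring cached tokens rather than the full cache. It illustrates the general mechanism being critiqued, not any specific method from the paper.

```python
import torch

def sparse_decode_step(q, k_cache, v_cache, k: int = 64):
    """q: [1, d]; k_cache, v_cache: [T, d]. The query attends only to the
    k highest-scoring cached tokens instead of all T."""
    scores = q @ k_cache.T / k_cache.shape[-1] ** 0.5   # [1, T]
    k = min(k, scores.shape[-1])
    top = torch.topk(scores, k, dim=-1)                 # select k tokens
    attn = torch.softmax(top.values, dim=-1)            # softmax over the subset
    return attn @ v_cache[top.indices.squeeze(0)]       # [1, d]
```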
October 9, 2024
Recent advancements in Transformer-based large language models (LLMs) have set new standards in natural language processing. However, classical softmax attention incurs significant computational costs, leading to $O(T)$ complexity for per-token generation, where $T$ is the context length. This work explores reducing LLMs' complexity while maintaining performance by introducing Rodimus and its enhanced version, Rodimus$+$. Rodimus employs an innovative data-depen...
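The $O(T)$ cost is visible in a plain softmax-attention decode step: every newly generated token must attend over all $T$ cached keys and values. A minimal single-head sketch, with projections omitted for clarity:

```python
import torch

def decode_step(q, k_cache, v_cache):
    """q: [1, d]; k_cache, v_cache: [T, d]. One autoregressive step."""
    scores = q @ k_cache.T / k_cache.shape[-1] ** 0.5  # [1, T]: O(T*d) work
    attn = torch.softmax(scores, dim=-1)               # normalize over all T tokens
    return attn @ v_cache                              # [1, d]: O(T*d) work
```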
February 16, 2025
Self-modulating mechanisms introduce dynamic adaptation capabilities within language models through contextual realignment strategies that influence token embedding trajectories across extended sequences. Contextual Flux is explored as an approach to embedding modulation, integrating an auxiliary gating mechanism within the self-attention framework to dynamically adjust token representations based on evolving contextual dependencies. The empirical analysis evaluates entropy v...
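One plausible reading of "an auxiliary gating mechanism within the self-attention framework" is a learned gate that blends each token's original embedding with its attention-derived update. The sketch below is an assumption about the architecture for illustration, not the paper's published design.

```python
import torch
import torch.nn as nn

class GatedSelfAttention(nn.Module):
    """Self-attention followed by an auxiliary gate that decides, per token
    and per dimension, how much of the contextual update to apply."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)  # auxiliary gating branch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: [batch, seq, d_model]."""
        ctx, _ = self.attn(x, x, x)                           # contextual signal
        g = torch.sigmoid(self.gate(torch.cat([x, ctx], dim=-1)))
        return g * ctx + (1.0 - g) * x                        # gated realignment
```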
December 17, 2024
As large language models (LLMs) process increasingly long context windows, the memory usage of the KV cache has become a critical bottleneck during inference. Mainstream KV compression methods, including KV pruning and KV quantization, primarily focus on either the token or the precision dimension and seldom explore the efficiency of combining the two. In this paper, we comprehensively investigate the token-precision trade-off in KV cache compression. Experiments demonstrate that storing mo...
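The precision side of this trade-off can be illustrated with simple per-token uniform quantization of cached KV tensors: at n bits per element, a fixed memory budget holds roughly 16/n times as many tokens as FP16, ignoring the small scale/zero-point overhead. The quantizer below is a generic sketch, not the paper's method.

```python
import torch

def quantize_kv(x: torch.Tensor, n_bits: int = 4):
    """x: [seq, d]. Per-token uniform quantization to n_bits; returns the
    codes plus per-token scale and zero-point for dequantization."""
    qmax = 2 ** n_bits - 1
    lo = x.min(dim=-1, keepdim=True).values
    hi = x.max(dim=-1, keepdim=True).values
    scale = (hi - lo).clamp_min(1e-8) / qmax
    codes = ((x - lo) / scale).round().clamp(0, qmax).to(torch.uint8)
    return codes, scale, lo

def dequantize_kv(codes, scale, lo):
    """Recover an approximation of the original tensor."""
    return codes.to(scale.dtype) * scale + lo
```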