ID: 2405.06067

HMT: Hierarchical Memory Transformer for Long Context Language Processing

May 9, 2024

Similar papers

Linking In-context Learning in Transformers to Human Episodic Memory

May 23, 2024

89% Match
Li Ji-An, Corey Y. Zhou, ... , Marcelo G. Mattar
Computation and Language
Machine Learning

Understanding the connections between artificial and biological intelligent systems can reveal fundamental principles underlying general intelligence. While many artificial intelligence (AI) models have a neuroscience counterpart, such connections are largely missing in Transformer models and the self-attention mechanism. Here, we examine the relationship between attention heads and human episodic memory. We focus on the induction heads, which contribute to the in-context lea...

CLongEval: A Chinese Benchmark for Evaluating Long-Context Large Language Models

March 6, 2024

89% Match
Zexuan Qiu, Jingjing Li, Shijue Huang, ... , Irwin King
Computation and Language

Developing Large Language Models (LLMs) with robust long-context capabilities has been the recent research focus, resulting in the emergence of long-context LLMs proficient in Chinese. However, the evaluation of these models remains underdeveloped due to a lack of benchmarks. To address this gap, we present CLongEval, a comprehensive Chinese benchmark for evaluating long-context LLMs. CLongEval is characterized by three key features: (1) Sufficient data volume, comprising 7 d...

Attendre: Wait To Attend By Retrieval With Evicted Queries in Memory-Based Transformers for Long Context Processing

January 10, 2024

89% Match
Zi Yang, Nan Hua
Computation and Language

As LLMs have become capable of processing more complex types of inputs, researchers have recently studied how to efficiently and affordably process possibly arbitrarily long sequences. One effective approach is to use a FIFO memory to store keys and values of an attention sublayer from past chunks to allow subsequent queries to attend. However, this approach requires a large memory and/or takes into consideration the specific LM architecture. Moreover, due to the causal n...
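
The FIFO idea above can be made concrete with a short sketch: keys and values from past chunks sit in a bounded first-in-first-out buffer, and queries from the current chunk attend over both the buffer and the chunk itself. This is a minimal illustration under assumed names and shapes (FIFOKVMemory, attend_with_memory, a single head, no causal mask inside the chunk), not the Attendre method itself, which additionally retrieves with evicted queries before attending.

from collections import deque
import torch
import torch.nn.functional as F


class FIFOKVMemory:
    """Bounded FIFO of (key, value) tensors from previously processed chunks."""

    def __init__(self, max_chunks: int):
        self.buffer = deque(maxlen=max_chunks)          # oldest chunk is evicted first

    def append(self, k: torch.Tensor, v: torch.Tensor) -> None:
        # k, v: (chunk_len, d_model); detached so the memory holds no autograd graph
        self.buffer.append((k.detach(), v.detach()))

    def keys_values(self):
        if not self.buffer:
            return None, None
        ks = torch.cat([k for k, _ in self.buffer], dim=0)
        vs = torch.cat([v for _, v in self.buffer], dim=0)
        return ks, vs


def attend_with_memory(q, cur_k, cur_v, memory: FIFOKVMemory):
    """Single-head attention where q attends to the current chunk's k/v plus the FIFO memory."""
    mem_k, mem_v = memory.keys_values()
    k = cur_k if mem_k is None else torch.cat([mem_k, cur_k], dim=0)
    v = cur_v if mem_v is None else torch.cat([mem_v, cur_v], dim=0)
    scores = (q @ k.T) / k.shape[-1] ** 0.5             # (q_len, kv_len)
    out = F.softmax(scores, dim=-1) @ v                 # (q_len, d_model)
    memory.append(cur_k, cur_v)                         # current chunk becomes memory for later chunks
    return out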

Large Memory Layers with Product Keys

July 10, 2019

89% Match
Guillaume Lample, Alexandre Sablayrolles, Marc'Aurelio Ranzato, ... , Hervé Jégou
Computation and Language
Machine Learning

This paper introduces a structured memory which can be easily integrated into a neural network. The memory is very large by design and significantly increases the capacity of the architecture, by up to a billion parameters with a negligible computational overhead. Its design and access pattern are based on product keys, which enable fast and exact nearest neighbor search. The ability to increase the number of parameters while keeping the same computational budget lets the over...
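
To see why product keys make exact nearest-neighbor search cheap, the hedged sketch below splits the query in half, scores each half against a small table of sub-keys, and recombines the top sub-keys into candidate full keys; the function and variable names are illustrative assumptions, not the paper's code.

import torch


def product_key_lookup(q, sub_keys1, sub_keys2, values, topk=4):
    """q: (d,); sub_keys1/2: (n, d//2); values: (n*n, d_v). Returns a weighted value read."""
    d = q.shape[0]
    q1, q2 = q[: d // 2], q[d // 2:]

    s1 = sub_keys1 @ q1                                   # scores against the first half, (n,)
    s2 = sub_keys2 @ q2                                   # scores against the second half, (n,)
    top1, top2 = s1.topk(topk), s2.topk(topk)             # best sub-keys per half

    # Each candidate full key (i, j) scores as the sum of its two sub-scores.
    cand_scores = top1.values[:, None] + top2.values[None, :]            # (topk, topk)
    cand_ids = top1.indices[:, None] * sub_keys2.shape[0] + top2.indices[None, :]

    best = cand_scores.flatten().topk(topk)               # exact top-k over the n*n implicit keys
    slots = cand_ids.flatten()[best.indices]              # memory rows to read
    weights = torch.softmax(best.values, dim=0)
    return weights @ values[slots]                        # (d_v,)

With n sub-keys per half, this addresses n * n memory slots while scoring only 2n sub-keys plus topk * topk candidate combinations, which is what keeps the lookup cheap despite the very large memory.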

Memory Transformer

June 20, 2020

89% Match
Mikhail S. Burtsev, Yuri Kuratov, ... , Grigory V. Sapunov
Computation and Language
Machine Learning
Neural and Evolutionary Computing

Transformer-based models have achieved state-of-the-art results in many natural language processing tasks. The self-attention architecture allows the transformer to combine information from all elements of a sequence into context-aware representations. However, information about the context is stored mostly in the same element-wise representations. This might make the processing of properties related to the sequence as a whole more difficult. Adding trainable memory to selective...
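
A minimal sketch of the trainable-memory idea, assuming the common formulation in which a few learned memory embeddings are prepended to the token embeddings so that self-attention can read from and write to them as dedicated storage; the module and parameter names below are illustrative, not the paper's code.

import torch
import torch.nn as nn


class MemoryAugmentedEncoder(nn.Module):
    """Transformer encoder with a handful of learned memory tokens prepended to the input."""

    def __init__(self, d_model: int = 256, n_mem: int = 10, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.mem = nn.Parameter(torch.randn(n_mem, d_model) * 0.02)   # trainable [mem] embeddings
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, token_emb: torch.Tensor):
        # token_emb: (batch, seq_len, d_model)
        b = token_emb.shape[0]
        mem = self.mem.unsqueeze(0).expand(b, -1, -1)                 # copy memory for each example
        h = self.encoder(torch.cat([mem, token_emb], dim=1))
        n_mem = self.mem.shape[0]
        return h[:, :n_mem], h[:, n_mem:]     # memory-slot outputs, token outputs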

CAMELoT: Towards Large Language Models with Training-Free Consolidated Associative Memory

February 21, 2024

89% Match
Zexue He, Leonid Karlinsky, Donghyun Kim, Julian McAuley, ... , Rogerio Feris
Computation and Language

Large Language Models (LLMs) struggle to handle long input sequences due to high memory and runtime costs. Memory-augmented models have emerged as a promising solution to this problem, but current methods are hindered by limited memory capacity and require costly re-training to integrate with a new LLM. In this work, we introduce an associative memory module which can be coupled to any pre-trained (frozen) attention-based LLM without re-training, enabling it to handle arbitra...
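
As a loose illustration of what a consolidated associative memory sitting beside a frozen attention layer could look like, the sketch below merges an incoming key/value pair into its most similar slot when the similarity is high enough and otherwise overwrites the least-recently-used slot; the update rule, threshold, and names are assumptions for illustration, not CAMELoT's actual mechanism.

import torch


class AssociativeKVMemory:
    """Fixed-size key/value store that consolidates similar writes (illustrative only)."""

    def __init__(self, n_slots: int, d: int, sim_threshold: float = 0.8):
        self.keys = torch.zeros(n_slots, d)
        self.values = torch.zeros(n_slots, d)
        self.age = torch.zeros(n_slots)               # higher = written to less recently
        self.sim_threshold = sim_threshold

    def write(self, k: torch.Tensor, v: torch.Tensor) -> None:
        sims = torch.cosine_similarity(self.keys, k.unsqueeze(0), dim=-1)
        best = int(sims.argmax())
        self.age += 1
        if sims[best] > self.sim_threshold:           # consolidate into the closest slot
            self.keys[best] = 0.5 * (self.keys[best] + k)
            self.values[best] = 0.5 * (self.values[best] + v)
            self.age[best] = 0
        else:                                         # otherwise evict the stalest slot
            lru = int(self.age.argmax())
            self.keys[lru], self.values[lru], self.age[lru] = k, v, 0

    def read(self, q: torch.Tensor, topk: int = 8):
        # The returned k/v can be concatenated to a frozen layer's own k/v at attention time.
        sims = torch.cosine_similarity(self.keys, q.unsqueeze(0), dim=-1)
        idx = sims.topk(topk).indices
        return self.keys[idx], self.values[idx]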

Retrieval Head Mechanistically Explains Long-Context Factuality

April 24, 2024

89% Match
Wenhao Wu, Yizhong Wang, Guangxuan Xiao, ... , Yao Fu
Computation and Language

Despite the recent progress in long-context language models, it remains elusive how transformer-based models exhibit the capability to retrieve relevant information from arbitrary locations within the long context. This paper aims to address this question. Our systematic investigation across a wide spectrum of models reveals that a special type of attention heads are largely responsible for retrieving information, which we dub retrieval heads. We identify intriguing propertie...

Long-range Language Modeling with Self-retrieval

June 23, 2023

89% Match
Ohad Rubin, Jonathan Berant
Computation and Language

Retrieval-augmented language models (LMs) have received much attention recently. However, typically the retriever is not trained jointly as a native component of the LM, but added to an already-pretrained LM, which limits the ability of the LM and the retriever to adapt to one another. In this work, we propose the Retrieval-Pretrained Transformer (RPT), an architecture and training procedure for jointly training a retrieval-augmented LM from scratch for the task of modeling l...

Uncertainty Guided Global Memory Improves Multi-Hop Question Answering

November 29, 2023

89% Match
Alsu Sagirova, Mikhail Burtsev
Computation and Language

Transformers have become the gold standard for many natural language processing tasks and, in particular, for multi-hop question answering (MHQA). This task includes processing a long document and reasoning over the multiple parts of it. The landscape of MHQA approaches can be classified into two primary categories. The first group focuses on extracting supporting evidence, thereby constraining the QA model's context to predicted facts. Conversely, the second group relies on ...

Enhancing Long-Term Memory using Hierarchical Aggregate Tree for Retrieval Augmented Generation

June 10, 2024

89% Match
Aadharsh Aadhithya A, Sachin Kumar S, K. P. Soman
Computation and Language
Artificial Intelligence

Large language models have limited context capacity, hindering reasoning over long conversations. We propose the Hierarchical Aggregate Tree memory structure to recursively aggregate relevant dialogue context through conditional tree traversals. HAT encapsulates information from children nodes, enabling broad coverage with depth control. We formulate finding best context as optimal tree traversal. Experiments show HAT improves dialog coherence and summary quality over baselin...
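
The hierarchical aggregation and conditional traversal described above can be pictured with a small sketch: leaves hold dialogue turns, each parent stores an aggregate (for example a summary) of its children, and retrieval descends only into branches whose aggregates look relevant to the query. summarize and relevance are placeholder callables, and the whole snippet is an illustration rather than the paper's implementation.

from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class HATNode:
    aggregate: str                                  # summary of everything below this node
    children: List["HATNode"] = field(default_factory=list)


def build_tree(turns: List[str], branching: int,
               summarize: Callable[[List[str]], str]) -> HATNode:
    """Bottom-up aggregation: group `branching` nodes at a time until one root remains."""
    assert branching >= 2 and turns
    nodes = [HATNode(aggregate=t) for t in turns]
    while len(nodes) > 1:
        nodes = [HATNode(summarize([c.aggregate for c in nodes[i:i + branching]]),
                         nodes[i:i + branching])
                 for i in range(0, len(nodes), branching)]
    return nodes[0]


def retrieve(node: HATNode, query: str, relevance: Callable[[str, str], float],
             budget: int, threshold: float = 0.5) -> List[str]:
    """Conditional traversal: expand a child only while it looks relevant and budget remains."""
    if not node.children:                           # leaf: return the raw dialogue turn
        return [node.aggregate]
    ranked = sorted(node.children,
                    key=lambda c: relevance(query, c.aggregate), reverse=True)
    out: List[str] = []
    for child in ranked:
        if len(out) >= budget or relevance(query, child.aggregate) < threshold:
            break
        out.extend(retrieve(child, query, relevance, budget - len(out), threshold))
    return out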
