Associative Recurrent Memory Transformer

July 5, 2024

Ivan Rodkin, Yuri Kuratov, Aydar Bulatov, Mikhail Burtsev

Computer Science

Computation and Language

Artificial Intelligence

Machine Learning

This paper addresses the challenge of creating a neural architecture for very long sequences that requires constant time for processing new information at each time step. Our approach, Associative Recurrent Memory Transformer (ARMT), is based on transformer self-attention for local context and segment-level recurrence for storage of task-specific information distributed over a long context. We demonstrate that ARMT outperforms existing alternatives in associative retrieval tasks and sets a new performance record in the recent BABILong multi-task long-context benchmark by answering single-fact questions over 50 million tokens with an accuracy of 79.9%. The source code for training and evaluation is available on GitHub.
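The combination of segment-level processing with a fixed-size associative store can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`process_stream`, `write`, `read`, `recall`), the one-hot keys and values, and the plain additive outer-product update are all assumptions chosen for brevity, and the local self-attention step is replaced by reading key/value pairs directly from each segment. The point it shows is only that the memory stays the same size no matter how many segments are processed.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def one_hot(index: int, dim: int) -> List[float]:
    """Return a one-hot vector of length dim (hypothetical toy encoding)."""
    return [1.0 if i == index else 0.0 for i in range(dim)]


def write(memory: Matrix, key: List[float], value: List[float]) -> None:
    """Add the outer product key * value^T to the memory matrix in place."""
    for i, k in enumerate(key):
        for j, v in enumerate(value):
            memory[i][j] += k * v


def read(memory: Matrix, key: List[float]) -> List[float]:
    """Return key^T * memory, the value vector associated with the key."""
    return [
        sum(k * row[j] for k, row in zip(key, memory))
        for j in range(len(memory[0]))
    ]


def process_stream(pairs: List[Tuple[int, int]], segment_len: int, dim: int) -> Matrix:
    """Consume the stream segment by segment, carrying only a dim x dim memory."""
    memory: Matrix = [[0.0] * dim for _ in range(dim)]
    for start in range(0, len(pairs), segment_len):
        segment = pairs[start:start + segment_len]
        # Assumption: a real model would run self-attention over the segment
        # here to produce keys and values; the toy uses the pairs directly.
        for key_id, value_id in segment:
            write(memory, one_hot(key_id, dim), one_hot(value_id, dim))
    return memory


def recall(memory: Matrix, key_id: int) -> int:
    """Return the index of the strongest value stored under key_id."""
    dim = len(memory)
    scores = read(memory, one_hot(key_id, dim))
    return max(range(dim), key=lambda j: scores[j])


stream = [(0, 3), (1, 2), (2, 0), (3, 1), (4, 4)]
memory = process_stream(stream, segment_len=2, dim=5)
retrieved = recall(memory, 1)
```

Because the keys here are orthogonal, each lookup returns exactly the value written under that key. Writing the same key twice would sum the values rather than overwrite them, which is one reason a practical associative memory needs a more careful update rule than this sketch uses.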

Scaling Transformer to 1M tokens and beyond with RMT

April 19, 2023

92% Match

Aydar Bulatov, Yuri Kuratov, Mikhail S. Burtsev

Computation and Language

Artificial Intelligence

Machine Learning

This technical report presents the application of a recurrent memory to extend the context length of BERT, one of the most effective Transformer-based models in natural language processing. By leveraging the Recurrent Memory Transformer architecture, we have successfully increased the model's effective context length to an unprecedented two million tokens, while maintaining high memory retrieval accuracy. Our method allows for the storage and processing of both local and glob...


In Search of Needles in a 11M Haystack: Recurrent Memory Finds What LLMs Miss

February 16, 2024

91% Match

Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Dmitry Sorokin, ... , Mikhail Burtsev

Computation and Language

Artificial Intelligence

Machine Learning

This paper addresses the challenge of processing long documents using generative transformer models. To evaluate different approaches, we introduce BABILong, a new benchmark designed to assess model capabilities in extracting and processing distributed facts within extensive texts. Our evaluation, which includes benchmarks for GPT-4 and RAG, reveals that common methods are effective only for sequences up to $10^4$ elements. In contrast, fine-tuning GPT-2 with recurrent memory...


BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack

June 14, 2024

91% Match

Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Sorokin, ... , Mikhail Burtsev

Computation and Language

Artificial Intelligence

In recent years, the input context sizes of large language models (LLMs) have increased dramatically. However, existing evaluation methods have not kept pace, failing to comprehensively assess the efficiency of models in handling long contexts. To bridge this gap, we introduce the BABILong benchmark, designed to test language models' ability to reason across facts distributed in extremely long documents. BABILong includes a diverse set of 20 reasoning tasks, including fact ch...


HMT: Hierarchical Memory Transformer for Long Context Language Processing

May 9, 2024

90% Match

Zifan He, Zongyue Qin, Neha Prakriya, ... , Jason Cong

Computation and Language

Machine Learning

Transformer-based large language models (LLMs) have been widely used in language processing applications. However, most of them restrict the context window that permits the model to attend to every token in the inputs. Previous works in recurrent models can memorize past tokens to enable unlimited context and maintain effectiveness. However, they have "flat" memory architectures, which have limitations in selecting and filtering information. Since humans are good at learning a...


Recurrent Memory Transformer

July 14, 2022

90% Match

Aydar Bulatov, Yuri Kuratov, Mikhail S. Burtsev

Computation and Language

Machine Learning

Transformer-based models show their effectiveness across multiple domains and tasks. Self-attention allows information from all sequence elements to be combined into context-aware representations. However, global and local information has to be stored mostly in the same element-wise representations. Moreover, the length of an input sequence is limited by the quadratic computational complexity of self-attention. In this work, we propose and study a memory-augmented segment-level...


CAMELoT: Towards Large Language Models with Training-Free Consolidated Associative Memory

February 21, 2024

90% Match

Zexue He, Leonid Karlinsky, Donghyun Kim, Julian McAuley, ... , Rogerio Feris

Computation and Language

Large Language Models (LLMs) struggle to handle long input sequences due to high memory and runtime costs. Memory-augmented models have emerged as a promising solution to this problem, but current methods are hindered by limited memory capacity and require costly re-training to integrate with a new LLM. In this work, we introduce an associative memory module which can be coupled to any pre-trained (frozen) attention-based LLM without re-training, enabling it to handle arbitra...


Large Memory Layers with Product Keys

July 10, 2019

89% Match

Guillaume Lample, Alexandre Sablayrolles, Marc'Aurelio Ranzato, ... , Hervé Jégou

Computation and Language

Machine Learning

This paper introduces a structured memory which can be easily integrated into a neural network. The memory is very large by design and significantly increases the capacity of the architecture, by up to a billion parameters with a negligible computational overhead. Its design and access pattern are based on product keys, which enable fast and exact nearest neighbor search. The ability to increase the number of parameters while keeping the same computational budget lets the over...


ReadTwice: Reading Very Large Documents with Memories

May 10, 2021

89% Match

Yury Zemlyanskiy, Joshua Ainslie, Michiel de Jong, Philip Pham, ... , Fei Sha

Computation and Language

Machine Learning

Knowledge-intensive tasks such as question answering often require assimilating information from different sections of large inputs such as books or article collections. We propose ReadTwice, a simple and effective technique that combines several strengths of prior approaches to model long-range dependencies with Transformers. The main idea is to read text in small segments, in parallel, summarizing each segment into a memory table to be used in a second read of the text. We ...


Attendre: Wait To Attend By Retrieval With Evicted Queries in Memory-Based Transformers for Long Context Processing

January 10, 2024

89% Match

Zi Yang, Nan Hua

Computation and Language

As LLMs have become capable of processing more complex types of inputs, researchers have recently studied how to efficiently and affordably process possibly arbitrarily long sequences. One effective approach is to use a FIFO memory to store keys and values of an attention sublayer from past chunks to allow subsequent queries to attend. However, this approach requires a large memory and/or takes into consideration the specific LM architecture. Moreover, due to the causal n...


Needle in the Haystack for Memory Based Large Language Models

July 1, 2024

89% Match

Subhajit Chaudhury, Soham Dan, Payel Das, ... , Elliot Nelson

Computation and Language

Artificial Intelligence

Machine Learning

In this paper, we demonstrate the benefits of using a memory-augmented Large Language Model (LLM) architecture in improving the recall abilities of facts from a potentially long context. As a case study we test LARIMAR, a recently proposed LLM architecture which augments an LLM decoder with an external associative memory, on several long-context recall tasks, including passkey and needle-in-the-haystack tests. We demonstrate that the external memory can be adapted at test time t...
